Wenjun Bao, Chief Scientist, Sr. Manager, JMP Life Sciences, SAS Institute Inc Fang Hong, Dr., National Center for Toxicological Research, FDA Zhichao Liu, Dr., National Center for Toxicological Research, FDA Weida Tong, Dr., National Center for Toxicological Research, FDA Russ Wolfinger, Director of Scientific Discovery and Genomics, JMP Life Sciences, SAS   Monitoring the post-marketing safety of drug and therapeutic biologic products is very important to the protection of public health. To help facilitate the safety monitoring process, the FDA has established several database systems including the FDA Online Label Repository (FOLP). FOLP collects the most recent drug listing information companies have submitted to the FDA. However, navigating through hundreds of drug labels and extracting meaningful information is a challenge; an easy-to-use software solution could help.   The most frequent single cause of safety-related drug withdrawals from the market during the past 50 years has been drug-induced liver injury (DILI). In this presentation we analyze 462 drug labels with DILI indicators using JMP Text Explorer. Terms and phrases from the Warnings and Precautions section of the drug labels are matched to DILI keywords and MedDRA terms. The XGBoost add-in for JMP Pro is utilized to predict DILI indicators through cross validation of XGBoost predictive models by the term matrix. The results demonstrate that a similar approach can be readily used to analyze other drug safety concerns.        Auto-generated transcript...   Speaker Transcript Olivia Lippincott What wenjba It's my pleasure to talk about this, Obtain high quality information from FDA drug labeling system and in the JMP discovery. And today I'm going to talk about four portions. The first, I'll give some background information about drug post marketing monitoring and what is the effort from the FDA regulatory agency and industry. And also, I'm going to use a drug label data set to analyze the text and using the Text Explorer in JMP and also use the add in JMP add in XGBoost to analyze this DILI information and then give the conclusion and also the XGBoost tutorial by Dr. Russ Wolfinger is also present in this JMP Discovery Summit so please go to his tutorial if you're interested in XGBoost. So the drug development, according to FDA description for the drug processing, it can be divided by five stages and the first two stages, discover and research preclinical, many in the in the for the animal study and chemical screen, and then later three stages involve the human. And JMP has three products, including JMP Genomics, JMP Clinical, JMP and JMP Pro, that have covered every stage of the drug discovery. And JMP Genomics is the omics system that can be used for omics and clinical biomarkers selection and JMP Clinical is specific for the clinical trial and post marketing monitoring for the drug safety and efficacy. And also for the JMP Pro can be used for drug data cleaning, mining, target identification, formulation development, DOE, QbD, bioassay, etc. So it can be used every stage of the drug development. So in this drug development, there's most frequent single cause, called a DILI, actually can be stopped for the clinical trial. The drug can be rejected for approval by the FDA or the other regulatory agency, or be recalled once the drug is on market. So this is the most frequent the single cause called the DILI and can be found the information in the FDA guide and also other scientific publications. So what is DILI? 
This actually is drug-induced liver injury, called DILI and you have FDA, back almost more than 10 years ago in 2009, they published a guide for DILI, how to evaluation and follow up, FDA offers multiple years of the DILI training for the clinical investigator and those information can still find online today. And they have the conferences, also organized by FDA, just last year. And of course for the DILI, how you define the subject or patient could have a DILI case, they have Hy's Law that's included in the FDA guidance. So here's an example for the DILI evaluation for the clinical trial, here in the clinical trial, in the JMP Clinical by Hy's Law. So the Hy's Law is the combination condition for the several liver enzymes when they elevate to the certain level, then you would think it would be the possible Hy's Law cases. So you have potentially that liver damages. So here we use the color to identify the possible Hy's Law cases, the red one is a yes, blue one is a no. And also the different round and the triangle were from different treatment groups. We also use a JMP bubble plot to to show the the enzymes elevations through the time...timing...during the clinical trial period time. So this is typical. This is 15 days. Then you have the subject, starting was pretty normal. Then they go kind of crazy high level of the liver enzyme indicate they are potentially DILI possible cases. So, the FDA has two major databases, actually can deal with the post-marketing monitoring for the drug safety. One is a drug label and which we will get the data from this database. Another one is FDA Adverse Event Reporting System, they then they have from the NIH and and NCBI, they have very actively built this LiverTox and have lots of information, deal with the DILI. And the FDA have another database called Liver Toxic Knowledge Base and there was a leading by Dr. Tong, who is our co are so in this presentation. They have a lot of knowledge about the DILI and built this specific database for public information. So drug label. Everybody have probably seen this when you get prescription drug. You got those wordy thing paper come with your drug. So they come with also come with a teeny tiny font words in that even though it's sometimes it's too small to read, but they do contain many useful scientific information about this drug Then two potions will be related to my presentation today, would be the sections called warnings and precautions. So basically, all the information about the drug adverse event and anything need be be warned in these two sections. And this this drug actually have over 2000 words describe about the warnings and precautions. And fortunately, not every drug has that many side effects or adverse events. Some drugs like this, one of the Metformin, actually have a small section for the warning and precautions. So the older version of the drug label has warnings and precautions in the separate sections, and new version has them put together. So this one is in the new version they put...they have those two sections together. But this one has much less side effects. So JMP and the JMP clinical have made use by the FDA to perform the safety analysis and we actually help to finalize every adverse event listed in the drug labels. So this is data that got to present today. So we are using the warning and precaution section in the 462 drug labels that extracted by the FDA researchers and I just got from them. And the DILI indicator was assigned to each drug. 1 is yes and the zero is no. 
So from this distribution, you can see there are about 164 drugs with potential DILI cases and 298 without, and the original format for the drug label data is XML, which JMP can import, multiple files at once. As for the DILI keywords, it was a multi-year effort by the FDA to come up with this keyword list. Experts read hundreds of drug labels and then decided what could potentially indicate DILI cases. So they came up with about 44 words or terms to serve as keywords indicating that a drug could be a DILI case. And you may also have heard about MedDRA, which is the Medical Dictionary for Regulatory Activities. It has different levels of standardized terms, and the most popular one, the preferred term, is what I'm going to use today. So in the warnings and precautions, if we pool everything together, you have over 12,000 terms in the warnings and precautions section. And you can see that "patients" and "may" are dominant, which should not be related to the medical cases and the medical information here. So we can remove those, and then no other words are so dominant in this word cloud, but it still has many medically unrelated words, like "use" and "reported," that we could remove from our analysis list. So in the Text Explorer, we can put them into the stop words, and we normally use the different Text Explorer techniques (stemming, tokenizing, regex, recoding and deleting manually) to clean up the list. But with 12,000 terms, that could be very time consuming. Since we already have the list we are interested in, we want to take advantage of knowing which terms matter in this system. So what we're going to do, and I'm going to show you in the demo, is use only the DILI keywords plus the preferred terms from MedDRA to generate the interesting terms and phrases to do the prediction. So here is the example using only the DILI keywords. Then you see everything over here; you can see it in the list. You have a count number shown at the side for each of the terms, how many times they are repeated in the warnings and precautions section, and you also get a more colorful, more graphic view in the word cloud to recognize a pattern. And then we add the medically related terms. So it comes down from the 12,000 terms to 1,190 terms, which include the DILI keywords and the medical preferred terms. So we think this is a good term list to start our analysis with. What we do in the JMP Text Explorer is save the term... the document term matrix. That means if you see a 1, this document contains this term; if it is 0, this document does not contain this word. Then, for XGBoost, we make k-fold columns, three k-fold columns, each with five folds. So we use machine learning with the XGBoost tree model, which is an add-in for JMP Pro, using the DILI indicator as the target variable and the DILI keywords plus the MedDRA preferred terms that have shown up more than 20 times as the predictors. Then we run cross-validated XGBoost with 300 iterations. 
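The workflow above is carried out entirely in JMP (Text Explorer plus the XGBoost add-in). For readers who want to experiment with the same idea outside JMP, the sketch below shows one possible Python analogue: build a binary document-term matrix restricted to a curated keyword list, then cross-validate an XGBoost classifier on it. File names, column names, and parameter values here are illustrative assumptions, not the presenters' actual settings.

# Hypothetical sketch: predict a 0/1 DILI indicator from Warnings & Precautions text
# using a keyword-restricted binary document-term matrix and cross-validated XGBoost.
# Illustrative only; the talk uses JMP Text Explorer and the XGBoost add-in for JMP Pro.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Assumed inputs: a table with the label text and a 0/1 DILI indicator,
# plus a curated list of DILI keywords / MedDRA preferred terms (one per line).
labels = pd.read_csv("drug_labels.csv")          # columns: DrugID, WarningsText, DILI
keywords = [w.strip().lower() for w in open("dili_keywords.txt") if w.strip()]

# Binary document-term matrix: 1 if the term appears in a label, 0 otherwise.
vectorizer = CountVectorizer(vocabulary=keywords, binary=True,
                             ngram_range=(1, 3), lowercase=True)
X = vectorizer.fit_transform(labels["WarningsText"])
y = labels["DILI"]

# Gradient-boosted trees with stratified 5-fold cross validation,
# mirroring the 300 boosting iterations mentioned in the talk.
model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                      eval_metric="logloss")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("Cross-validated AUC per fold:", scores.round(3))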
Now we got statistical performance metrics, we get term importance to DILI, and we get, we can use the prediction profiler for interactions and also we can generate and the save the prediction formula for new drug prediction. So I'm going to the demo. So this is a sample table we got in the in JMP. So you have a three columns. Basically you have the index, which is a drug ID. Then you have the warnign and precaution, it could have contain much more words that it's appeared, so basically have all the information for each drug. Now you have a DILI indicator. So we do the Text Explorer first. We have analysis, you go to the Text Explorer, you can use this input, which is a warning and precaution text and you would you...normally you can do different things over here, you can minimize characters, normally people go to 2 or do other things. Or you could use the stemming or you could use the regex and to do all kind of formula and in our limitation can be limited. For example, you can use a customize regex to get the all the numbers removed. That's if only number, you can remove those, but since we're going to use a list, we'll not touch any of those, we can just go here simply say, okay, So it come up the whole list of this, everything. So now I'm going to say, I only care about oh, for this one, you can do...you can show the word cloud. And we want to say I want to center it and also I want to the color. So you see this one, you see the patient is so dominant, then you can say, okay this definitely...not the... should not be in the in analysis. So I just select and right click add stop word. So you will see those being removed and no longer showed in your list and no longer show in the word cloud. So now I want to show you something I think that would speed up the clean up, because there's so many other words that could be in the system that I don't need. So I actually select and put everything into the stop word. So I removed everything, except I don't know why the "action" cannot be removed. And but it's fine if there's only one. So what I do is I go here. I said manage phrase, I want to import my keywords. Keyword just have a... very simple. The title, one column data just have all the name list. So I import that, I paste that into local. This will be my local library. And I said, Okay. So now I got only the keyword I have. OK, so now this one will be...I want to do the analysis later. And I want to use all of them to be included in my analysis because they are the keywords. So I go here, the red triangle, everything in the Text Explorer, hidden functions, hidden in this red triangle. So I say save matrix. So I want to have one and I want 44 up in my analysis. I say okay. So you will see, everything will get saved to my... to the column, the matrix. So now I want to what I want to add, I want to have the phrase, one more time. I also want to import those preferred terms. into the my database, my local data. Then also, I want to actually, I want to locally to so I say, okay. So now I have the mix, both of the the preferred terms from the MedDRA and also my keywords. So you can see now the phrases have changed. So that I can add them to my list. The same thing to my safe term matrix list and get the, the, all the numbers...all the terms I want to be included. And the one thing I want to point out here is for these terms and they are...we need to change the one model format. This is model type is continuing. I want to change them to nominal. I will tell you why I do that later. 
So now I have, I can go to the XGBoost, which is in the add in. We can make...k fold the columns that make sure I can do the cross validation. I can use just use index and by default is number of k fold column to create is three and the number of folds (k) is within each column is five, we just go with the default. Say, okay, it will generate three columns really quickly. And at the end, you are seeing fold A, B, C, three of them. So we got that, then we have... Another thing I wanted to do is in the... So we can We can create another phrase which has everything...that have have everything in...this phrase have everything, including the keywords and PT, but I want to create one that only have the only have only have the the preferred term, but not have the keyword, so I can add those keywords into the local exception and say, Okay. So those words will be only have preferred terms, but not have the keywords. So this way I can create another list, save another list of the documentation words than this one I want to have. So have 1000, but this term has just 20. So what they will do is they were saved terms either meet... have at least show up more than 20 times or they reach to 1000, which one of them, they will show up in the my list. So now I have table complete, which has the keywords and also have the MedDRA terms which have more than 20, show more than 20 times, now also have ??? column that ready for the analysis for the XGBoost. So now what I can do is go to the XGBoost. I can go for the analysis now. So what I'm going to do show you is I can use this DILI indicator, then the X response is all my terms that I just had for the keyword and the preferred words. Now, I use the three validation then click OK to run. It will take about five minutes to run. So I already got a result I want to show you. So you have... This is what look like. The tuning design. And we'll check this. You have the actual will find a good condition for you to to to do so. You can also, if you have as much as experience like Ross Wolfinger has, he will go in here, manually change some conditions, then you probably get the best result. But for the many people like myself don't have many experienced in XGBoost, I would rather use this tuning design than just have machine to select for me first, then I can go in, we can adjust a little bit, it depend on what I need to do here. So this is a result we got. You can use the...you can see here is different statistic metrics for performance metrics for this models and the default is showed only have accuracy and you can use sorting them by to click the column. You can sorting them and also it has much more other popular performance metrics like MCC, AUC, RMSE, correlation. They all show up if you click them. They will show up here. So whatever you need, whatever measurement you want to do, you can always find here. So now I'm going to use, say I trust the validation accuracy, more than anything else for this case. So I want to do is I want to see just top model, say five models. So what here is I choose five models. Then I go here, say I want to remove all the show models. So you will see the five models over here and then you can see some model, even though the, like this 19 is green, it doesn't the finish to the halfway. So something wrong, something is not appropriate for this model. I definitely don't want to use that one, so others I can choose. Say I want to choose this 19, I want to remove that one. So I can say I want to remove the hidden one. 
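For readers following along outside JMP, the add-in's tuning design (which proposes hyperparameter settings automatically and lets you rank the resulting models) can be loosely imitated with a randomized hyperparameter search. The sketch below is a hedged illustration only: the parameter ranges and iteration count are assumptions, and X and y are taken to be the document-term matrix and DILI indicator from the previous sketch, not the presenter's data.

# Rough analogue of the tuning design: search common XGBoost hyperparameters and
# rank candidate models by cross-validated accuracy before picking one to profile.
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_space = {
    "max_depth":        randint(2, 8),
    "learning_rate":    uniform(0.01, 0.3),
    "subsample":        uniform(0.6, 0.4),
    "colsample_bytree": uniform(0.6, 0.4),
    "min_child_weight": randint(1, 10),
}

search = RandomizedSearchCV(
    XGBClassifier(n_estimators=300, eval_metric="logloss"),
    param_distributions=param_space,
    n_iter=25, cv=5, scoring="accuracy", random_state=1)
search.fit(X, y)   # X, y: document-term matrix and DILI indicator from the sketch above

# Inspect the top candidates, analogous to sorting the add-in's model table
# by validation accuracy.
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score"]].head(5))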
So basically just whatever you need to do. So if you compare, see this metrics, they're actually not much, not much different. So I want to rely on this graphic to help me to choose the best one to do the performance. So then you choose the good one. You can go here to say, I like the model 12 so I can go here, say I want to do the profiler. So this is a very powerful tool, I think quite unique to JMP. Not many tools have this function. So this gives you an opportunity to look at individual parameters in the in the active way and see how they how they change the result. For example those two was most frequently show up in the DILI cases. And you can see the slope is quite steep and that means if you change them, they will affect the final result predictions quite a bit. So you can see when the hepatitis and jaundice both says zero, you actually have very low possibility to get the DILI as one. So is low case for the possible DILI cases. But if you change this line, to the 1, you can see the chance you get is higher. And if you move those even higher. So you have, you will have a way to analyze, if they are the what is the key parameters or predictor to affect your result. And for this, some of them, even their keyword, they're pretty flat. So that means if you change that, it will not affect the result that much. So So this is and also we here, we gave the list you can get to to see what is the most important features to the calculate variables prediction. So you can see over here is jaundice and others are quite important. And for the for the feature result, once you get the data in, this is all the results that we we have. And you can say, well, what...how about the new things coming? Yes, we have here, you can say, I want to save prediction formula. And you can see it's actively working on that. And then in the table, by the end of table, you will see the prediction. So remember we had one...this was, say, well, the first drug, second was pretty much predict it will be the DILI cases and the next two, third, and the fourth, and the fifth was close to zero. So we go back to this DILI indicator and we found out they actually list. The first five was right one. So, in case you have...don't have this indicator when you have the new data come in, you don't have to read all the label. You run the model. You can see the prediction. Pretty much you knew if it is it is DILI cases or not. So my deomo would be end here, and now I'm going to give a conclusion. So we are using the Text Explorer to extract the data keyword and MedDRA terms using Stop Words and Phrase Management without manually selection, deletion and recoding. So we use a visualization and we created a document term matrix for prediction. And also we use machine learning for the using the XGBoost modeling and we want to quickly to run the XGBoost to find the best model and perform predict profile. And also we can save the predict formula to predict the new cases. Thank you. And I stop here.  
Michael Crotty, JMP Senior Statistical Writer, SAS Marie Gaudard, Statistical Consultant, Statistical Consultant Colleen McKendry, JMP Technical Writer, JMP   The need to model data sets involving imbalanced binary response data arises in many situations. These data sets often require different handling than those where the binary response is more balanced. Approaches include comparing different models and sampling methods using various evaluation techniques. In this talk, we introduce the Imbalanced Binary Response add-in for JMP Pro that facilitates comparing a set of modeling techniques and sampling approaches. The Imbalanced Classification add-in for JMP Pro enables you to quickly generate multiple sampling schemes and to fit a variety of models in JMP Pro. It also enables you to compare the various combinations of sampling methods and model fits on a test set using Precision-Recall ROC, and Gains curves, as well as other measures of model fit. The sampling methods range from relatively simple to complex methods, such as the synthetic minority oversampling technique (SMOTE), Tomek links, and a combination of the two. We discuss the sampling methods and demonstrate the use of the add-in during the talk.   The add-in is available here: Imbalanced Classification Add-In - JMP User Community.     Auto-generated transcript...   Speaker Transcript Michael Crotty Hello. Thank you for tuning into   our talk about the imbalanced classification add in that allows you to compare sampling techniques and models in JMP Pro.   I'm Michael Crotty. I'm one of these statistical writers in the documentation team at JMP and my co-presenter today is Colleen McKendry, also in the stat doc team. And this is work that we've collaborated on with Marie Gaudard.   So here's a quick outline of our talk today. We will look at the purpose of the add in that we created, some background on the imbalanced classification problem and how you obtain a classification model in that situation.   We'll look at some sampling methods that we've included in the add in and that are popular for the imbalanced classification problem.   We'll look at options that are available in the add in   and talk about how to obtain the add in, and then Colleen will show an example and a demo of the add in.   In the slides that are available on the Community, there's also references and an appendix that has additional background information.   So the purpose of our add in, the imbalanced classification add in, it lets you apply a variety of sampling techniques that are designed for imbalanced data.   You can compare the results of applying these techniques, along with various predictive models that are available in JMP Pro.   And you can compare those models and sampling technique fits using precision recall curves, ROC curves, and Gains curves, as well as other measures.   This allows you to choose a threshold for classification using the curves.   And you can also apply the Tomek, SMOTE and SMOTE plus Tomek sampling techniques directly to your data, which enables you to then use existing JMP platforms and   on on that newly sampled data and fine tune the modeling options, if you don't like the mostly default method options that we've chosen.   And just one note, the Tomek, SMOTE and SMOTE plus Tomek sampling techniques can be used with nominal and ordinal, as well as continuous predictor variables.   So some background on the imbalanced data problem.   
So in general, you could have a multinomial response, but we will focus on the response variable being binary, and the key point is that the number of observations at one response level is much greater than the number of observations had the other response level.   And we'll call these response levels the majority and minority class levels, respectively. So the minority level, most of the time, is the level of interest that you're interested in predicting and detecting. This could be like detecting fraud or the presence of a disease or credit risk.   And we want to predict class membership based on regression variables.   So to do that we developed a predictive model that assigns probabilities of membership into the minority class and then we choose a threshold value that optimizes   various criteria. This could be misclassification rate, true positive rate, false positive rate, you name it. And then we classify an observation, who's into the minority class, if the predicted probability of membership to the minority class exceeds the chosen threshold value.   So how do we obtain a classification model?   We have lots of different platforms in JMP that can make a prediction for a binary variable, binary outcome   when in the presence of regression variables, and we need a way to compare those models. Well, there are some traditional measures, like classification accuracy, are not all that appropriate for imbalanced data. And just as a extreme example, you could consider the case of a 2% minority class.   I could give you 98% accuracy, just by classifying all the observations as majority cases. Now this would not be a useful model and you wouldn't want to use it,   because you're not predicting...you're not correctly predicting any of your target cases to minority cases but just overall accuracy, you'd be at 98%, which sounds pretty good.   So this led people to explore other ways to measure classification accuracy in a imbalanced classification model. One of those is the precision recall curve.   They're often used with imbalanced data and they plot the positive predictive value or precision against the true positive rate recall.   And because the precision takes majority instances into account, the PR curve is more sensitive to class imbalance than an ROC curve.   As such, a PR curve is better able to highlight differences in models for the imbalanced data. So the PR curve is what shows up first in our report for our add in.   Another way to handle imbalanced classification data is to use sampling methods that help to model the minority class.   And in general, these are just different ways to impose more balance on the distribution of the response, and in turn, that helps to better delineate the boundaries between the majority and minority class observations. So in in our add in we have seven different sampling techniques.   We won't talk too much about the first four and we'll focus on the last three, but very quickly, no weighting means what it sounds like. We won't do any...won't make any changes and that's   essentially in there to provide a baseline to what you would do if you didn't do any type of sampling method to account for the imbalance.   Weighting will overweight the minority cases so that the sum of the weights of the majority class and the minority class are the same.   Random undersampling will randomly exclude majority cases to get to a balanced case and random oversampling will randomly replicate   minority cases again to get to a balanced state.   
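As a concrete, hypothetical illustration of the accuracy paradox described above, and of why precision-recall curves are preferred for imbalanced data, the short Python sketch below builds a synthetic data set with roughly a 2% minority class. The data and model choices are illustrative assumptions, not the mammography example used later in the talk.

# Why plain accuracy misleads on a ~2% minority class, and how a PR curve is computed.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=6, n_informative=4,
                           weights=[0.98], random_state=7)   # ~2% minority class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

# "Classify everything as majority" already scores about 98% accuracy...
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print("Majority-only accuracy:", accuracy_score(y_te, dummy.predict(X_te)))

# ...so judge a real model with precision-recall instead.
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, probs)
print("PR AUC (average precision):", round(average_precision_score(y_te, probs), 3))
print("ROC AUC:", round(roc_auc_score(y_te, probs), 3))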
And then we'll talk more about the next three more advanced methods in the following slides.   So first of the advanced methods is SMOTE, which stands for synthetic minority oversampling technique.   And this is basically a more sophisticated form of oversampling, because we are adding more minority cases to our data.   We do that by generating new observations that are similar to the existing minority class observations, but we're not simply replicating them like in oversampling.   So we use the Gower distance function and perform K nearest neighbors on each minority class observation and then observations are generated to fill in the space that are defined by those neighbors.   And in this graphic, you can see if we've got this minority case here in red. We've chosen the three nearest neighbors.   And we'll randomly choose one of those. It happens to be this one down here, and then we generate a case, another minority case that is somewhere in this little shaded box. And that's in two dimensions. If you had   n dimensions of your predictors, then that shaded area would be an n dimensional space.   But one key thing to point out is that you can choose the number of nearest neighbors that you   randomly choose between, and you can also choose how many times you'll perform this   this algorithm per minority case.   The next sampling method is Tomek links. And what this method does is it tries to better define the boundary between the minority and majority classes. To do that, it removes observations from the majority class that are close to minority class observations.   Again, we use to Gower distance to find Tomek links and Tomek link is a pair of nearest neighbors that fall into different classes. So one majority and one minority that are nearest neighbors to each other.   And to reduce the overlapping of these instances, one or both members of the pair can be removed. In the main option of our add in, the evaluate models option, we remove only the majority instance. However, in the Tomek option, you can use either form of removal.   And finally, the last sampling method is SMOTE plus Tomek. This combines the previous two sampling methods.   And the way it combines them is it applies this mode algorithm to generate new minority observations and then once you've got your original data, plus a bunch of generated new minority cases,   tt applies to Tomek algorithm to find pairs of nearest neighbors that fall into different classes. And in this method both observations in the Tomek pair are removed.   So the imbalanced classification add in has four options when you install it that all show up as submenu items under the add ins menu.   The first one is the evaluate models option, that allows you to fit a variety of models using a variety of sampling techniques. The next three are just standalone dialogues to just do those three sampling techniques that we just talked about.   So in the evaluate models option of the add in, it provides an imbalanced classification report that facilitates comparison of the model and sampling technique combinations.   It shows the PR curve and ROC curves, as well as the Gains curves, and for the PR and ROC curves, it shows the area under the curve, which generally, the more area under each of those curves, the better a model is fitting.   It provides the plot of predicted probabilities by class that helps you get a picture of how each model is fitting.   
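The add-in implements these methods in JMP using the Gower distance, so nominal, ordinal, and continuous predictors can be mixed. As a rough external reference point only, the sketch below shows the same three resampling ideas with the open-source imbalanced-learn package; in this basic form it assumes continuous features (its SMOTENC variant is the mixed-type analogue), and X_tr and y_tr are assumed to be the training data from the previous sketch.

# Hedged sketch of SMOTE, Tomek links, and SMOTE + Tomek with imbalanced-learn.
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

print("Original class counts:", Counter(y_tr))

# SMOTE: synthesize new minority cases between each minority point and one of
# its k nearest minority neighbors.
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=7).fit_resample(X_tr, y_tr)
print("After SMOTE:", Counter(y_sm))

# Tomek links: drop majority members of cross-class nearest-neighbor pairs to
# sharpen the boundary between the classes.
X_tk, y_tk = TomekLinks(sampling_strategy="majority").fit_resample(X_tr, y_tr)
print("After Tomek links:", Counter(y_tk))

# SMOTE + Tomek: oversample first, then remove Tomek pairs.
X_st, y_st = SMOTETomek(random_state=7).fit_resample(X_tr, y_tr)
print("After SMOTE + Tomek:", Counter(y_st))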
And it also provides a techniques and thresholds data table, and that table contains a script that allows you to reproduce the report   that is produced the first time you run the add in. And we want to emphasize that if you run this and you want to save your results without rewriting the entire   modeling and sampling methods algorithm, you can save this techniques and thresholds table and that will allow you to save your results and reproduce the report.   So now we'll look at the dialogue for the evaluating models option. It allows you to choose from a number of models and sampling techniques.   You can put in what your binary class variable is and all your X predictors, and then   we, in order to fit all the models and and   evaluate them on the on a test set, we randomly divide the data into training validation and test sets. You can provide up...you can set the proportions that will go into each of those sets.   There's a random seed option if you'd like to reproduce the results. And then there are SMOTE options   that I alluded to before, where you can choose the number of nearest neighbors, from which you select one to be the nearest neighbor used to generate a new case, and replication of each minority case is how many times you repeat the algorithm for each minority observation.   Again, there are three other sampling option   options in the add in and those correspond to Tomek, SMOTE and SMOTE plus Tomek. In the Tomek sampling option, it's going to add two columns to your data table that can be used as weights for the predict...for any predictive model that you want to do.   The first column removes only the majority nearest neighbor in the link and the other removes both members of the Tomek link, so you have that option.   SMOTE observations will add synthetic observations to your data table.   And it will also it will provide a source column so that you can identify which   observations were added. And SMOTE plus Tomek add synthetic observations and the weighting column that removes both members of the Tomek link.   And the weighting column from the Tomek sampling and SMOTE plus Tomek,   it's just an indicator column that you can use as a weight in a JMP modeling platform. It's just a 1 if it's included, and a 0 if it should be excluded.   Most of the three other sampling option dialogues look basically the same.   One option that's on them and not on the evaluate models option dialogue is show intermediate tables. This option appears for SMOTE and SMOTE plus Tomek.   And basically, it allows you to see data tables that were used in the construction of the SMOTE observations. In general, you don't need to see it, but if you want to better understand how those observations are being generated, you can take a look at those intermediate tables.   And they're all explained in the documentation.   Again, you can obtain the add in through the Community,   through this the page for this talk on the Discovery Summit Americas 2020   part of the Community. And as I mentioned just a second ago, there's documentation available within the add in. Just click the Help button.   And now it is time for Colleen to show an example in a demo of the add in. Colleen McKendry Thanks Michael. I'm going to demo the add in now, and to do the demo, I'm going to use this mammography demo data.   And so the mammography data set is based on a set of digitized film mammograms used in a study of microcalcifications in mammographic images.   And in this data, each record is classified as either a 1 or a 0. 
1 represents that there's calcification seen in the image, and a 0 represents that there is not.   In this data set, the images where you see calcification, those are the ones you're interested in predicting and so the class level one is the class level that you're interested in.   In the full data set, there are six continuous predictors and about 11,000 observations.   But in order to reduce the runtime in this demo, we're only going to use a subset of the full data set. And so it's going to be about half the observations. So about 5500 observations.   And the observations that are classified as 1, the things that you're interested in, they represent about 2.31% of the total observations, both in the full data set and in the demo data set that we're using. And so we have about a 2% minority proportion.   And now I'm going to switch over to JMP   to   So I have the mammography demo data.   And we're going to open and I've already installed the add in. So I have the imbalanced classification add in in my drop down menu and I'm going to use the evaluate models option.   And so here we have the launch window, and we're going to specify the binary class variable, your predictor variables, we're going to select all the models and all the techniques and we're going to specify   a random seed.   And click OK.   And so while this is running, I'm going to explain what's actually happening in the background. So the first thing that the add in does is that it splits the data table into a training data set and a test data set.   And so you have two separate data tables and then within the training data table those observations are further split into training and validation observations and the validation is used in the model fitting.   And so once you have those two data sets,   there are indicator variables...indicator columns that are added to the training data table for each of the sampling techniques that you specify, except for those that have involve SMOTE.   And so those columns are added and are used as weighting columns and they just specify whether the observation is to be included in the analysis or not.   If you specify a sampling technique with SMOTE, then there are additional rows that are added to the data table. Those are the generated observations.   So once your columns and your rows are generated then for each model, each model is fit to each sampling technique. And so if you select all of them   like we just did here, there are a total of 42 different models that are being fit. And so, that's all what's happening right now. In   the current demo, we have 42 models being fit and once the models are fit, then the relevant information is gathered and put together in a results report. And that report,   which will hopefully pop up soon, here it is, that report is shown here. And you also get a techniques and thresholds table and a summary table.   And so we're going to take a look at what you get when you run the add in. So first we have the training set. And you can see that here are the weighting columns, the weight columns that are added. And these are the columns that are added for the predicted probabilities for those observations.   Then we have the test set. This doesn't contain any of those weighting columns, but it does have the predicted probabilities for the test set observations.   We have the results report   and the techniques and thresholds data table. 
And so Michael mentioned this in   the talk earlier, but this is important because this is the thing that you would like to save if you want to save your results and view your results again   without having to rerun the whole script. And so this data table is what you would save and it contains scripts that will reproduce   the results report and the summary table, which is the last thing that I have to show. And so this is just contains summaries for each sampling technique and model combination and their AUC values.   So now to look at the actual results window, at the top we have dialogue specifications. And so this contains the information that you specified in the launch window.   So if you forget anything that you specified, like your random seed or what proportions you assign, you can just open that and take a look.   And we also have the binary class distribution. So, the distribution of the class variable across the training and the test set. And this is designed so that the proportion should be the same, which they are in this case at 2.3.   And then we also have missing threshold. So this isn't super important, but it just gives   an indication of if a value of the class variable has a missing prediction value, then that's shown here.   For the actual results, we have these tabbed graphs. And so we have the precision recall curves, the ROC curves, and the cummulative Gains curves. And for the PR curves and the ROC curves, we have the corresponding AUC values as well.   We also have these graphs of the predicted probabilities by class. And those are more useful when you're only viewing a few of them at a time, which we will later on.   And then we have a data filter that connects all these graphs and results together.   So for our actual results for the analysis, we can take a look now. So first I'm going to sort these.   So you can already see that the ROC curve and the PR curve, there's a lot more differentiation between the curves in the PR curve than there is in the ROC curve.   And if we select the top, say, five, these all actually have an AUC of .97.   And you can see that they're all really close to each other. They're basically on top of each other. It would be really hard to determine which model is actually better, which one you should pick   And so that's where, particularly with imbalanced data, the precision recall curves are really important. So if we switch back over, we can see that these models that had the highest AUC values for the ROC curves,   they're really spread out in the precision recall curve. And they're actually not...they don't have the highest AUC values for the PR curve.   So maybe that there...maybe there's a better model that we can pick.   So now I'm going to look and focus on the top two, which are boosted tree Tomek and SVM Tomek, and I'm going to do that using the data filter.   And then we just want to look at those are going to show and include.   So now we have the curves for just these two models and the blue curve is the boosted tree and the red curve is SVM.   And so you can see in these curves that they kind of overlap each other across different values of the true positive rate. And so you could use these curves   to choose which model you want to use in your analysis, based on maybe what an acceptable true positive rate would be. So we can see this if I add some reference lines. Excuse my hands that you will see as I type this.   Okay, so say that these are some different true positive rates that you might be interested in. 
So if, for example, for whatever data set you have, you wanted a true positive rate of .55.   You could pick your threshold to make that the true positive rate. And then in this case,   for that true positive rate, the boosted tree Tomek model has a higher precision. And so you could you could pick that model.   However, if you wanted your true positive rate to be something like .85, then the SVM model might be a better pick because it has a higher precision for that specific true positive rate.   And then if you had a higher true positive rate of .95, you would flip again and maybe you would want to pick the boosted tree model.   So that's how you can use these curves to pick which model is best for your data.   And now we're going to look at these graphs again, now that there are only a few of them. And this just shows the distribution of predicted probabilities for each class for the models that we selected. So in this particular case, you can see that in SVM there are majority   probabilities throughout the kind of the whole range of predicted probabilities, where boosted tree does kind of a better job of keeping them at the lower end.   And so that's it for this particular demo, but before we're done, I just wanted to show one more thing. And so that was an example of how you would use the evaluate   models option. But say you just wanted to use a particular sampling technique. And you can do that here. So the launch window looks much the same. And you can assign your binary class, your predictors, and click OK.   And this generates a new data table and you have your   indicator column.   Your indicator column, which just shows whether the observation should be included in the analysis or not.   And then because it was SMOTE plus Tomek you also have all these SMOTE generated observations.   So now you have this new data table and you can use any type of model with any type of options that you may want and just use this column as your weight or frequency column and go from there. And that is the end of our demo for the imbalanced classification add in. Thanks for watching.
Hadley Myers, JMP Systems Engineer, SAS Chris Gotwalt, JMP Director of Statistical Research and Development, SAS   The need to determine confidence intervals for linear combinations of random mixed-model variance components, especially critical in pharmaceutical and life science applications, was addressed with the creation of a JMP add-in, demonstrated at the JMP Discovery Summit Europe 2020 and available at the JMP User Community. The add-in used parametric bootstrapping of the sample variance components to generate a table of simulated values and calculated "bias-corrected" (BC) percentile intervals on those values. BC percentile intervals are better at accounting for asymmetry in simulated distributions than standard percentile intervals, and a simulation study using a sample data set at the time showed closer-to-true α-values with the former. This work reports on the release of Version 2 of the add-in, which calculates both sets of confidence intervals (standard and BC percentiles), as well as a third set, the "bias-corrected and accelerated" (BCa) confidence interval, which has the advantage of adjusting for underlying higher-order effects. Users will therefore have the flexibility to decide for themselves the appropriate method for their data. The new version of the add-in will be demonstrated, and an overview of the advantages/disadvantages of each method will be addressed.     Auto-generated transcript (fragmentary)...   Speaker Transcript Chris Gotwalt Hello, my name is Chris Gotwalt ... has been developed for variance components models, we think ... statistical process control program, one has to understand ... ascertain how much measurement error is attributable to testing ... there might be five or 10 units or parts tested per operator, ... different measuring tools is small enough that differences in ... measurement to measurement, repeatability variation, or ... measurement systems analyses, as well as a confidence interval on ... interval estimates in the report and obtain a valid 95% interval ... calculate confidence intervals, because we believed it would be ... and the sum of the variance components. Unfortunately, the ... R&R study. So because variance components explicitly violate ... you were to use the one-click bootstrap on variance components ... less. So when we were designing Fit Mixed, and the REML ... independent. So back to the drawing board. So it turns out ... in JMP. One approach is called the parametric bootstrap that ... comparison of the two kind of families of bootstrap. So the ... they're not assuming any underlying model. And it's ... the rows in the data table are independent from one another. ... values, it has the advantage that we don't have to make this ... bootstrap simulation. The downside to this is that you ... do a quick introduction to what the bootstrap... the parametric ... to identify or wanted to estimate the crossing time of a ... 162.8. Now, we want to use a parametric bootstrap to go ... has the ability to save the simulation formula back to the ... that uses the estimates in the report as inputs into a random ... And we take our estimates and pull them out into a separate ... And then what we have can be seen as a random sample from the ... formula column for the crossing time.
And that is automatically ... those... on the crossing time, or any quantity of interest. When ... simulation, create a formula column of whatever function of ... derive quantity of interest and obtain confidence intervals ... the add-in so that you're able to do this quite easily for ... we'll start by showing you how to run the add-in yourself once ... first version was presented at the JMP 2020 Discovery Summit ... overview, but we'll show you the references where you can dive in ... perfectly fine as well. So I'm going to go ahead and start with ... makes use of the Fit Mixed platform, right, created from ... the add-in will only work with JMP Pro. So someone might ... want some measure like reproducibility. So that would ... as we said, to calculate the estimate for these, there's no ... columns here. The reality is much, much, much more ... of the estimate without considering the worst case ... production that the actual variance is higher than they have ... don't risk being out of spec in production. So to run the add-in ... From here, I can select the linear combination of confidence ... simulations, you get a better estimate of the confidence ... 2500. I'm going to leave it as 1000 here just for demonstration ... operator or the batch variable, and then press perform analysis. ... purpose of this demonstration, I think I will stop it early. ... calculated confidence limits, the bootstrap quantiles, which are ... these two tabs. But if you'd like to see how those compare, ... so what does enough mean, enough for your confidence limits to ... stopped it before a thousand. So that's how the add-in works. And ... distributed around the original estimate, they are in fact ... relaunch this analysis. So you'll see that when the ... European Discovery, required bounded variance confidence ... that, if that happens for some of the bootstrap samples or for ... early, again, I'll just let it run a little bit. Yeah, so I, as ... the samples are allowed, in some cases, to be below zero. So in ... simulation column here, this column of simulated ... see them both at the same time. It's a bit tricky, ... right components, it's a good idea to run the add-in directly ... that column is then deleted. So one thing to mention, before ... accounts for the skewness of the bootstrap distributions, right, ... that. And then the accelerated takes that even further. So here ... thing to mention is that the alpha in this represents the ... value for which it's been calculated? And what can we do to ... up to investigate the four different kinds of the variance ... method, the bias-corrected method and the BCa. We also ... study. So for all 16 combinations of these three ... combinations of confidence intervals, and kept track of how ... coverage as we're varying these three variables, and we see here ... techniques. And the second best is the bias-corrected and ... the best one. Now, if you turn no bounds on, which means that ... variance components with a pretty close to 95% coverage.
Intervals are performing similarly at about 93%. But ... to what a master's thesis paper's research would have, would ... potentially more work to be done. There's other interval ... things like generalized confidence intervals. General ... intervals might also do the trick for us as well. Hadley's ... so that you can now do parametric bootstrap simulations ... 16. When you bring that up, you can enter the linear combination 
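To make the three interval types concrete, the Python sketch below applies the standard percentile, bias-corrected (BC), and bias-corrected-and-accelerated (BCa) formulas to a toy parametric bootstrap of a variance estimate, with the acceleration term estimated by a jackknife. This is not the add-in's implementation; the data, sample size, and number of bootstrap draws are illustrative assumptions.

# Percentile, BC, and BCa intervals on a simple parametric bootstrap of a variance.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=2.0, size=30)       # toy sample
theta_hat = data.var(ddof=1)                           # statistic of interest

# Parametric bootstrap: simulate from the fitted model, recompute the statistic.
B = 2500
boot = np.array([rng.normal(data.mean(), data.std(ddof=1), size=data.size).var(ddof=1)
                 for _ in range(B)])

alpha = 0.05
z_lo, z_hi = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)

# 1. Plain percentile interval.
pct = np.quantile(boot, [alpha / 2, 1 - alpha / 2])

# 2. Bias-corrected: shift the percentiles by z0, the normal quantile of the
#    proportion of bootstrap values below the original estimate.
z0 = norm.ppf((boot < theta_hat).mean())
bc = np.quantile(boot, [norm.cdf(2 * z0 + z_lo), norm.cdf(2 * z0 + z_hi)])

# 3. BCa: additionally estimate the acceleration a from a jackknife of the data.
jack = np.array([np.delete(data, i).var(ddof=1) for i in range(data.size)])
d = jack.mean() - jack
a = (d ** 3).sum() / (6 * (d ** 2).sum() ** 1.5)
adj = lambda z: norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
bca = np.quantile(boot, [adj(z_lo), adj(z_hi)])

print("Percentile:", pct.round(3))
print("BC        :", bc.round(3))
print("BCa       :", bca.round(3))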
Kamal Kannan Krishnan, Graduate Student, University of Connecticut Ayush Kumar, Graduate Student, University of Connecticut Namita Singh, Graduate Student, University of Connecticut Jimmy Joseph, Graduate Student, University of Connecticut   Today all service industries, including the telecom face a major challenge with customer churn, as customers switch to alternate providers due to various reasons such as competitors offering lower cost, combo services and marketing promotions. With the power of existing data and previous history of churned customers, if company can predict in advance the likely customers who may churn voluntarily, it can proactively take action to retain them by offering discounts, combo offers etc, as the cost of retaining an existing customer is less than acquiring a new one.  The company can also internally study any possible operational issues and upgrade their technology and service offering. Such actions will prevent the loss of revenue and will improve the ranking among the industry peers in terms of number of active customers. Analysis is done on the available dataset to identify important variables needed to predict customer churn and individual models are built. The different combination of models is ensembled, to average and eliminate the shortcomings of individual models.  The cost of misclassified prediction (for False Positive and False Negative) is estimated by putting a dollar value based on Revenue Per User information and cost of discount provided to retain the customer.     Auto-generated transcript...   Speaker Transcript Namita Hello everyone I'm Namita, and I'm here with my teammates Ayush, Jimmy and Kamal from University of Connecticut to present our analysis on predicting telecom churn using JMP. The data we have chosen is from industry that keeps us all connected, that is the telecom and internet service industry. So let's begin with a brief on the background. The US telecom industry continues to witness intense competition and low customer stickiness due multiple reasons like lower cost, combo promotional offers, and service quality. So to align to the main objective of preventing churn, telecom companies often use customer attrition analysis as their key business insights. This is due to the fact that cost of retaining an existing customer is far less than acquiring a new one. Moving on to the objective, the main goal here is to predict in advance the potential customers who may attrite. And then based on analysis of that data ,recommend customized product strategies to business. We have followed the standard SEMMA approach here. Now let's get an overview of the data set. It consists of total 7,043 rows of customers belonging to different demographics (single, with dependents, and senior) and subscribing to different product offerings like internet service, phone lines, streaming TV, streaming movies and online security. There are about 20 independent variables; out of it, 17 are categorical and three are continuous. The dependent target variable for classification is customer churn. And the churn rate for baseline model is around 26.5%. Goal is now to pre process this data and model it for future analysis. That's it from my end over to you, Ayush. Ayush Kumar Thanks, Namita. I'm Ayush. In this section, I'll be talking about the data exploration and pre processing. In data exploration, we discovered interesting relationships, for instance, variables tenure and monthly charges both were positively correlated to total charges. 
These three variables we analyzed using a scatterplot matrix in JMP, which validated the relationship. Moreover, by using the Explore Missing Values functionality, we observed that the total charges column had 11 missing values. The missing values were taken care of, as the total charges column was excluded due to multicollinearity. After observing the histograms of the variables using the outlier exploration functionality, we concluded that the data set had no outliers. The variable called Customer ID had 7,043 unique values, which would not add any significance to the target variable, so Customer ID was excluded. We were also able to find interesting patterns among the variables. Variables such as streaming TV and streaming movies convey the same information about streaming behavior. These variables were grouped into a single streaming column by using a formula in JMP. The same course of action was taken for the variables online backup and online security. We ran logistic regression and decision tree in JMP to find out the important variables. From the effects summary, it was observed that tenure, contract type, monthly charges, the combined streaming variable, multiple line service, and payment method showed significant LogWorth and were very important variables in determining the target. The effects summary also helped us to narrow down the variable count to 12 statistically significant variables, which formed the basis for further modeling. We used the value ordering functionality and moved the Yes level of our target variable upwards. Finally, the data was split into training, validation and test in a 60/20/20 ratio using the formula random method. Over to you now, Kamal. Kamal Krishnan Sorry, I am Kamal. I will explain more about the different models built in JMP using the data set. In total we built eight different types of models. On each type of model, we tried various input configurations and settings to improve the results, mainly sensitivity, as our target was to reduce the number of false negatives in the classification. JMP is very user friendly for redoing the models by changing the configurations. It was easy to store the results whenever a new iteration of the model was done in JMP and then compare outputs in order to select the optimized model from each type. JMP allowed us to even change the cutoff values from the default 0.5 to others and observe the prediction results. This slide shows the results of the selected model from each of the eight different types of models. First, as our target variable, churn, is categorical, we built logistic regression. Then we built decision tree, KNN, and ensemble models like bootstrap forest and boosted tree. Then we built machine learning models like neural networks. JMP allowed us to set the random seed in models like neural networks and KNN. This helped us to get the same outputs we needed. Then we built a naive Bayes model. JMP allowed us to study the impact of various variables through the prediction profiler. We can point and click to change the values in the range and see how it impacts the target variable. By changing the prediction profiler in naive Bayes, we observed that an increase in tenure period helps in reducing the churn rate. On the contrary, an increase in monthly charges increases the churn rate. Finally, we did ensembling of different combinations of models to average and eliminate the shortcomings of individual models. We found that ensembling neural network and naive Bayes has higher sensitivity among ???. This ends the model description. Over to you, Jimmy. JJoseph Thank you, Kamal. 
In this section we will be comparing the models and taking a deeper dive into each model's details. The major parameters used to compare the models are the cost of misclassification in dollars, the sensitivity versus accuracy chart, the lift ratio, and the area under the curve values. The cost of misclassification data is depicted in the top right corner of the slide. The costs of false positives and false negatives were determined using average monthly charges. The cost of a false negative, where the model predicts no churn for a customer who is actually leaving, was calculated at $85, and the cost of a false positive at $14, based on a 20% discount provided to accommodate additional benefits. The cost comparison chart clearly indicates that naive Bayes has the lowest cost. Moving on to the total accuracy rate chart, accuracy is between 74% and 81%, with not much variation across most of the models. Lift, a measure of the probability of finding a success record compared to the baseline model, varies between 1.99 and 3.11. The AUC from the ROC curve is another measure used to determine the strength of a model across different cutoff values; as the chart indicates, all the models did equally well in this category. The sensitivity and accuracy chart measures the models' success in predicting customer churn accurately. The chart indicates two facts: how many churning customers the model can correctly identify, and how often those predictions are accurate. This measure is used as the major parameter to decide the best performing model, and naive Bayes did well in this category. Based on the various metrics, and considering the cost of failed predictions, naive Bayes came out as the best and most parsimonious model to predict customer churn for the given data set. It has the lowest misclassification cost, high sensitivity, and reasonably good total accuracy. If you discount some of its inherent drawbacks, such as the lack of an underlying statistical model, it is completely data driven and easily explainable. Moving on to the conclusions drawn, the significant variables in the data set are the contract type and the tenure of the enrolled customer. From modeling, we observed that churn is high for 1) those without dependents in their demographic; 2) those who pay a high price for their phone services and have a low satisfaction rate with high-end services; 3) customers who stick to a single line of service and can easily switch over to competitors. Based on those findings, the recommendations are 1) targeted customer promotions focused on income generation; 2) pushing long-term contracts with additional incentives; 3) building product-line combos focused on customer needs. In conclusion, we used JMP to do the analysis and build predictive models on a limited data set. It is very effective and powerful for doing these analyses. Please reach out to us if you have any further questions. Thank you.
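To make the dollar comparison concrete, here is a small JSL sketch of the expected-cost calculation. The $85 false-negative and $14 false-positive figures follow the reading of the talk above; the confusion-matrix counts are placeholders, not results from the team's models.

// Hypothetical expected-cost calculation for one model's validation results
costFN = 85;   // predicted no churn, but the customer actually leaves
costFP = 14;   // retention discount offered to a customer who would have stayed
nFN = 120;     // placeholder count of false negatives
nFP = 310;     // placeholder count of false positives
totalCost = nFN * costFN + nFP * costFP;
Show( totalCost );  // 120*85 + 310*14 = 14540 dollars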
Steve Hampton, Process Control Manager, PCC Structurals Jordan Hiller, JMP Senior Systems Engineer, JMP   Many manufacturing processes produce streams of sensor data that reflect the health of the process. In our business case, thermocouple curves are key process variables in a manufacturing plant. The process produces a series of sensor measurements over time, forming a functional curve for each manufacturing run. These curves have complex shapes, and blunt univariate summary statistics do not capture key shifts in the process. Traditional SPC methods can only use point measures, missing much of the richness and nuance present in the sensor streams. Forcing functional sensor streams into traditional SPC methods leaves valuable data on the table, reducing the business value of collecting this data in the first place. This discrepancy was the motivator for us to explore new techniques for SPC with sensor stream data. In this presentation, we discuss two tools in JMP — the Functional Data Explorer and the Model Driven Multivariate Control Chart — and how together they can be used to apply SPC methods to the complex functional curves that are produced by sensors over time. Using the business case data, we explore different approaches and suggest best practices, areas for future work and software development.     Auto-generated transcript...   Speaker Transcript Jordan Hiller Hi everybody. I'm Jordan Hiller, senior systems engineer at JMP, and I'm presenting with Steve Hampton, process control manager at PCC Structurals. Today we're talking about statistical process control for process variables that have a functional form.   And that's a nice picture right there on the title   slide. We're talking about statistical process control, when it's not a single number, a point measure, but instead, the thing that we're trying to control has the shape of a functional curve.   Steve's going to talk through the business case, why we're interested in that in a few minutes. I'm just going to say a few words about methodology.   We reviewed the literature in this area for the last 20 years or so. There are many, many papers on this topic. However, there doesn't really appear to be a clear consensus about the best way to approach this statistical   process control   when your variables take the form of a curve. So we were inspired by some recent developments in JMP, specifically the model driven multivariate control chart introduced in JMP 15 and the functional data explorer introduced in JMP 14.   Multivariate control charts are not really a new technique they've been around for a long time. They just got a facelift in JMP recently.   And they use either principal components or partial least squares to reduce data, to model and reduce many, many process variables so that you can look at them with a single chart. We're going to focus on the on the PCA case, we're not really going to talk about partial   the   partial least squares here.   Functional Data Explorer is the method we use in JMP in order to work with data in the shape of a curve, functional   data. And it uses a form of principal components analysis, an extension of principal components analysis for functional data.   So it was a very natural kind of idea to say what if we take our functional curves, reduce and model that using the functional data explorer.   
The result of that is functional principal components and just as you you would add regular principal components and push that through a model driven multivariate control chart,   what if we could do that with a functional principal components? Would that be feasible and would that be useful?   So with that, I'll turn things over to Steve and he will introduce the business case that we're going to discuss today. 1253****529 All right. Thank you very much. Jordan.   Since I do not have video, I decided to let you guys know what I look like.   There's me with my wife Megan and my son Ethan   with last year's pumpkin patch. So I wanted to step into the case study with a little background on   what I do, and so you have an idea of where this information is coming from. I work in investment casting for precision casting...   Investment Casting Division.   Investment casting involves making a wax replicate of what you want to sell, putting it into a pattern assembly,   dipping it multiple times in proprietary concrete until you get enough strength to be able to dewax that mold.   And we fire it to have enough strength to be able to pour metal into it. Then we knock off our concrete, we take off the excessive metal use for the casting process. We do our non destructive testing and we ship the part.   The drive for looking at improved process control methods is the fact that   Steps 7, 8, and 9 take up 75% of the standing costs because of process variability in Steps 1-6. So if we can tighten up 1-6,   most of ??? and cost go there, which is much cheaper, much shorter, then there is a large value add for the company and for our customers in making 7, 8, and 9 much smaller.   So PCC Structurals. My plant, Titanium Plant, makes mostly aerospace components. On the left there you can see a fan ??? that is glowing green from some ??? developer.   And then we have our land based products, which right there's a N155 howitzer stabilizer leg.   And just to kind of get an idea where it goes. Because every single airplane up in the sky basically has a part we make or multiple parts, this is an engine sections ???, it's about six feet in diameter, it's a one piece casting   that goes into the very front of the core of a gas turbine engine. This one in particular is for the Trent XWB that powers the Airbus A350   jets.   So let's get into JMP. So the big driver here is, as you can imagine, with something that is a complex as an investment casting process for a large part, there is tons of   data coming our way. And more and more, it's becoming functional as we increase the number of centers, we have and we increase the number of machines that we use. So in this case study, we are looking at   data that comes with a timestamp. We have 145 batches. We have our variable interest which is X1.   We have our counter, which is a way that I've normalized that timestamp, so it's easier to overlay the run in Graph Builder and also it has a little bit of added   niceness in the FTP platform. We have our period, which allows us to have that historic period and a current period that lines up with the model driven multivariate control chart platform,   so that we can have our FDE   only be looking at the historic so it's not changing as we add more current data. So this is kind of looking at this if you were in using this in practice, and then the test type is my own validation   attempts. And what you'll see here is I've mainly gone in and tagged thing as bad, marginal or good. 
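The overlay view being described can be reproduced with a short Graph Builder script. The column names (X1, counter, batch number, test type) come from the slide just shown; the role assignments and element syntax are an approximation, not the presenter's saved script.

// Sketch: overlay the 145 sensor curves, colored by the bad/marginal/good tags
dt = Current Data Table();
dt << Graph Builder(
	Variables(
		X( :counter ),
		Y( :X1 ),
		Overlay( :batch number ),
		Color( :test type )
	),
	Elements( Line( X, Y ) )
);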
So red is bad, marginal is purple, and green is good and you can see how they overlay.   Off the bat, you can see that we have some curvey   ??? curves from mean. These are obviously what we will call out of control or bad.   This would be what manufacturing called a disaster because, like, that would be discrepant product. So we want to be able to identify those   earlier, so that we can go look at what's going on the process and fix it. This is what it looks like   breaking out so you can see that the bad has some major deviation, sometimes of mean curve and a lot of character towards the end.   The marginal ones are not quite as deviant from the mean curves but have more bouncing towards the tail and then good one is pretty tight. You can see there's still some bouncing. So this is where the   the marginal and the good is really based upon my judgment, and I would probably fail an attribute Gage R&R based on just visually looking at this. So   we have a total of 33 bad curves, 45 marginal and 67. And manually, you can just see about 10 of them are out. So you would have an option if you didn't want to use a point estimate, which I'll show a little bit later that doesn't work that great, of maybe making...   control them by points using the counter. And how you do that would be to split the bad table by counter, put it into an individual moving range control chart through control chart building and then you would get out,   like 3500 control charts in this case, which you can use the awesome ability to make combined data tables to turn that that list summary from each one into its own data table that you can then link back to your main data table and you get a pretty cool looking   analysis that looks like this, where you have control limits based upon the counters and historic data and you can overlay your curves. So if you had an algorithm that would tag whenever it went outside the control limits, you know, that would be an option of trying to   have a control....   a control chart functionality with functional data. But you can see, especially I highlighted 38 here, that you can have some major deviation and stay within the control limits. So that's where this FDE   platform really can shine, in that it can identify an FPC that corresponds with some of these major deviations. And so we can tag the curves based upon those at FPCs.   And we'll see that little later on. So,   using the FDE platform, it's really straightforward. Here for this demonstration, we're going to focus on a step function with 100 knots.   And you can see how the FPCs capture the variability. So the main FPC is saying, you know, beginning of the curve, there's...that's what's driving the most variability, this deviation from the mean.   And setup is X1 and their output, counters. Our input, batch number and then I added test type. So we can use that as some of our validation in FPC table and the model driven multivariate control chart and the period so that only our historic is what's driving the FDE fit.   And so   just looking at the fit is actually a pretty important part of making sure you get correct   control charting later on, is I'm using this P Step   Function 100 knots model. You can see, actually, if I use a B spline and so with Cubic 20 knots, it actually looks pretty close to my P spline.   
But from the BIC you can actually see that I should be going to more knots. If I do that, now we start to see overfitting, really focusing on the isolated peaks, and it will cause you to have an FDE model that doesn't look right and causes you to not be as sensitive in your model driven multivariate control chart.
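For reference, a heavily hedged sketch of the two-step workflow described in this talk, fitting the curves in the Functional Data Explorer and then monitoring the functional principal component scores with a Model Driven Multivariate Control Chart, might look like the following. The role and platform message names are approximations of what the generated scripts contain, not code from the presentation, and the FPC column names are invented.

// Step 1: launch FDE on the sensor curves (column names from the talk)
dt = Current Data Table();
fde = dt << Functional Data Explorer(
	Y( :X1 ),            // sensor reading
	X( :counter ),       // normalized time stamp
	ID( :batch number )  // one curve per manufacturing run
);
// ...fit a basis model by point and click (e.g., the step function P-spline
// with 100 knots discussed above) and save the FPC scores to a summary table.

// Step 2: monitor the saved FPC score columns (names assumed) with MDMCC
fpcTable = Current Data Table();
fpcTable << Model Driven Multivariate Control Chart(
	Process( :FPC 1, :FPC 2, :FPC 3, :FPC 4, :FPC 5 )
);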
Jordan Hiller, JMP Senior Systems Engineer, JMP Mia Stephens, JMP Principal Product Manager, JMP   For most data analysis tasks, a lot of time is spent up front — importing data and preparing it for analysis. Because we often work with data sets that are regularly updated, automating our work using scripted repeatable workflows can be a real time saver. There are three general sections in an automation script: data import, data curation, and analysis/reporting. While the tasks in the first and third sections are relatively straightforward — point-and click to achieve the desired result and capture the resulting script — data curation can be more challenging for those just starting out with scripting. In this talk we review common data preparation activities, discuss the JSL code necessary to automate the process, and provide advice for generating JSL code for data curation via point-and-click.     The Data Cleaning Script Assistant Add-in discussed in this talk can be found in the JMP File Exchange.     Auto-generated transcript...   Speaker Transcript mistep Welcome to JMP Discovery summit. I'm Mia Stephens and I'm a JMP product manager and I'm here with Jordan Hiller, who is a JMP systems engineer. And today we're going to talk about automating the data curation workflow. And we're going to split our talk into two parts. I'm going to kick us off and set the stage by talking about the analytic workflow and where data curation fits into this workflow. And then I'm going to turn it over to Jordan for the meat, the heart of this talk. We're going to talk about the need for reproducible data curation. We're going to see how to do this in JMP 15. And then you're going to get a sneak peek at some new functionality in JMP 16 for recording data curation steps and the actions that you take to prepare your data for analysis. So let's think about the analytic workflow. And here's one popular workflow. And of course, it all starts with defining what your business problem is, understanding the problem that you're trying to solve. Then you need to compile data. And of course, you can compile data from a number of different sources and pull these data in JMP. And at the end, we need to be able to share results and communicate our findings with others. Probably the most time-consuming part of this process is preparing our data for analysis or curating our data. So what exactly is data curation? Well, data curation is all about ensuring that our data are useful in driving analytics discoveries. Fundamentally, we want to be able to solve a problem with the day that we have. This is largely about data organization, data structure, and cleaning up data quality issues. If you think about problems or common problems with data, it generally falls within four buckets. We might have incorrect formatting, incomplete data, missing data, or dirty or messy data. And to talk about these types of issues and to illustrate how we identify these issues within our data, we're going to borrow from our course, STIPS And if you're not familiar with STIPS, STIPS is our free online course, Statistical Thinking for Industrial Problem Solving, and it's set up in seven discrete modules. Module 2 is all about exploratory data analysis. And because of the interactive and iterative nature of exploratory data analysis and data curation, the last lesson in this module is data preparation for analysis. And this is all about identifying quality issues within your data and steps you might take to curate your data. 
So let's talk a little bit more about the common issues. Incorrect formatting. So what do we mean by incorrect formatting? Well, this is when your data are in the wrong form or the wrong format for analysis. This can apply your data table as a whole. So, for example, you might have your data in separate columns, but for analysis, you need your data stacked in one column. This can apply to individual variables. You might have the wrong modeling type or data type or you might have date data, data on dates or times that's not formatted that way in JMP. It can also be cosmeti. You might choose to remove response variables to the beginning of the data table, rename your variables, group factors together to make it easier to find them with the data table. Incomplete data is about having a lack of data. And this can be on important variables, so you might not be capturing data on variables that can ultimately help you solve your problem or on combinations of variables. Or it could mean that you simply don't have enough observations, you don't have enough data in your data table. Missing data is when values for variables are not available. And this can take on a variety of different forms. And then finally, dirty or messy data is when you have issues with observations or variables. So your data might be incorrect. The values are simply wrong. You might have inconsistencies in terms of how people were recording data or entering data into the system. Your data might be inaccurate, might not have a capable measurement system, there might be errors or typos. The data might be obsolete. So you might have collected the information on a facility or machine that is no longer in service. It might be outdated. So the process might have changed so much since you collected the data that the data are no longer useful. The data might be censored or truncated. You might have columns that are redundant to one another. They have the same basic information content or rows that are duplicated. So dirty and messy data can take on a lot of different forms. So how do you identify potential issues? Well, when you take a look at your data, you start to identify issues. And in fact, this process is iterative and when you start to explore your data graphically, numerically, you start to see things that might be issues that you might want to fix or resolve. So a nice starting point is to start by just scanning the data table. When you scan your data table, you can see oftentimes some obvious issues. And for this example, we're going to use some data from the STIPS course called Components, and the scenario is that a company manufactures small components and they're trying to improve yield. And they've collected data on 369 batches of parts with 15 columns. So when we take a look at the data, we can see some pretty obvious issues right off the bat. If we look at the top of the data table, we look at these nice little graphs, we can see the shapes of distributions. We can see the values. So, for example, batch number, you see a histogram. And batch number is something you would think of being an identifier, rather than something that's continuous. So this can tell us that the data coded incorrectly. When we look at number scrapped, we can see the shape of the distribution. We can also see that there's a negative value there, which might not be possible. we see a histogram for process with two values, and this can tell us that we need to change the modeling type for process from continuous to nominal. 
You can see more when you when you take a look at the column panel. So, for example, batch number and part number are both coded as continuous. These are probably nominal And if you look at the data itself, you can see other issues. So, for example, humidity is something we would think of as being continuous, but you see a couple of observations that have value N/A. And because JMP see text, the column is coded as nominal, so this is something that you might want to fix. we can see some issues with supplier. There's a couple of missing values, some typographical errors. And notice, temperature, all of the dots indicate that we're missing values for temperature in these in these rows. So this is an issue that we might want to investigate further. So you identify a lot of issues just by scanning the data table, and you can identify even more potential issues when you when you visualize the data one variable at a time. A really nice starting point, and and I really like this tool, is the column viewer. The column viewer gives you numeric summaries for all of the variables that you've selected. So for example, here I'm missing some values. And you can see for temperature that we're missing 265 of the 369 values. So this is potentially a problem if we think the temperature is an important factor. We can also see potential issues with values that are recorded in the data table. So, for example, scrap rate and number scrap both have negative values. And if this isn't isn't physically possible, this is something that we might want to investigate back in the system that we collected the data in. Looking at some of the calculated statistics, we can also see other issues. So, for example, batch number and part number really should be categorical. It doesn't make sense to have the average batch number or the average part number. So this tells you you should probably go back to the data table and change your modeling type. Distributions tell us a lot about our data and potential issues. We can see the shapes of distributions, the centering, the spread. We can also see typos. Customer number here, the particular problem here is that there are four or five major customers and some smaller customers. If you're going to use customer number and and analysis, you might want to use recode to group some of those smaller customers together into maybe an other category. we have a bar chart for humidity, and this is because we have that N/A value in the column. And we might not have seen that when we scan the data table, but we can see it pretty clearly here when we look at the distribution. We can clearly see the typographical errors for supplier. And when we look at continuous variables, again, you can look at the shape, centering, and spread, but you can also see some unusual observations within these variables. So, after looking at the data one variable at a time, a natural, natural progression is to explore the data two or more variables at a time. So for example, if we look at scrap rate versus number scrap in the Graph Builder. We see an interest in pattern. So we see these these bands and it could be that there's something in our data table that helps us to explain why we're seeing this pattern. In fact, if we color by batch size, it makes sense to us. So where we have batches with 5000 parts, there's more of an opportunity for scrap parts than for batches of only 200. We can also see that there's some strange observations at the bottom. 
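The one-variable screening just described can be launched from a couple of lines of JSL. The column names follow the Components example discussed above; this is a sketch, not the course's own script.

// Histograms and frequency tables surface the N/A text in humidity, the
// supplier typos and missing cells, and the negative scrap counts
dt = Current Data Table();
dt << Distribution(
	Column( :humidity, :supplier, :Number Scrapped, :Scrap Rate, :temperature )
);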
In fact, these are the observations that had negative values for the number of scrap and these really stand out here in this graph. And when you add a column switcher or data filter, you can add some additional dimensionality to these graphs. So I can look at pressure, for example, instead of... Well, I can look at pressure or switch to dwell. What I'm looking for here is I'm getting a sense for the general relationship between these variables and the response. And I can see that pressure looks like it has a positive relationship with scrap rate. And if I switch to dwell, I can see there's probably not much of a relationship between dwell and scrap rate or temperature. So these variables might not be as informative in solving the problem. But look at speed, speed has a negative relationship. And I've also got some unusual observations at the top that I might want to investigate. So you can learn a lot about your data just by looking at it. And of course, there are more advanced tools for exploring outliers and missing values that are really beyond the scope of this discussion. And as you get into the analyze phase, when you start analyzing your data or building models, you'll learn much much more about potential issues that you have to deal with. And the key is that as you are taking a look at your data and identifying these issues, you want to make notes of these issues. Some of them can be resolved as you're going along. So you might be able to reshape and clean your data as you proceed through the process. But you really want to make sure that you capture the steps that you take so that you can repeat the steps later if you have to repeat the analysis or if you want to repeat the analysis on new data or other data. And at this point is where I'm going to turn it over to to Jordan to talk about reproducible data curation and what this is all about. Jordan Hiller Alright thanks, Mia. That was great. And we learned what you do in JMP to accomplish data curation by point and click. Let's talk now about making that reproducible. The reason we worry about reproducibility is that your data sets get updated regularly with new data. If this was a one-time activity, we wouldn't worry too much about the point and click. But when data gets updated over and over, it is too labor-intensive to repeat the data curation by point and click each time. So it's more efficient to generate a script that performs all of your data curation steps, and you can execute that script with one click of a button and do the whole thing at once. So in addition to efficiency, it documents your process. It serves as a record of what you did. So you can refer to that later for yourself and remind yourself what you did, or for people who come after you and are responsible for this process, it's a record for them as well. For the rest of this presentation, my goal is to show you how to generate a data curation script with point and click only. We're hoping that you don't need to do any programming in order to get this done. That program code is going to be extracted and saved for you, and we'll talk a little bit about how that happens. So there are two different sections. There's what you can do now in JMP 15 to obtain a data curation script, and what you'll be doing once we release JMP 16 next year. In JMP 15 there are some data curation tasks that generate their own reusable JSL scripting code. You just execute your point and click, and then there's a technique to grab the code. I'm going to demonstrate that. 
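Before the demo, here is a hypothetical script for the two-variable Graph Builder view Mia walked through a moment ago, with a local data filter standing in for the column switcher used on screen. Column names are assumed from the Components example.

// Scrap rate against number scrapped, colored by batch size
dt = Current Data Table();
gb = dt << Graph Builder(
	Variables(
		X( :Number Scrapped ),
		Y( :Scrap Rate ),
		Color( :batch size )
	),
	Elements( Points( X, Y ) )
);
gb << Local Data Filter( Add Filter( Columns( :vacuum ) ) );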
So tools like recode, generating a new formula column with the calculation, reshaping data tables, these tools are in the tables menu. There's stack, split, join, concatenate, and update. All of these tools in JMP 15 generate their own script after you execute them by point and click. There are other common tasks that do not generate their own JSL script and in order to make it easier to accomplish these tasks and make them reproducible, And it helps with the following tasks, mostly column stuff, changing the data types of columns, the modeling types, changing the display format, renaming, reordering, and deleting columns from your data table, also setting column properties such as spec limits or value labels. So the Data Cleaning Script Assistant is what you'll use to assist you with those tasks in JMP 15. We are going to give you a sneak preview of JMP 16 and we're very excited about new features in the log in JMP 16, I think it's going to be called the enhanced log mode. The basic idea is that in JMP 16 you can just point and click your way through your data curation steps as usual. The JSL code that you need is generated and logged automatically. All you need to do is grab it and save it off. So super simple and really useful, excited to show that to you. Here's a cheat sheet for your reference. In JMP 15 these are the the tasks on the left, common data curation tasks; it's not an exhaustive list. And the middle column shows how you accomplish them by point and click in JMP. The method for extracting the reusable script is listed on the right. So I'm not going to cover everything in here. But yeah, this is for you for your reference later. Let's get into a demo. And I'll show how to address some of those issues that Mia identified with the components data table. I'm going to start in JMP 15. And the first thing that we're going to talk about are some of those column problems, changing changing the data types, the modeling types, that kind of thing. Now, if you were just concerned with point and click in JMP, what you would ordinarily do is, for for let's say for humidity. This is the column you'll remember that has some text in that and it's coming in mistakenly as a character column. So to fix that by point and click, you would ordinarily right click, get into the column info, and address those changes here. This is one of those JMP tasks that doesn't leave behind usable script in in JMP 15. So for this, we're going to use the data cleaning script assistant instead. So here we go. It's in the add ins menu, because I've installed it, you can install it too. Data cleaning script assistant, the tool that we need for this is Victor the cleaner. This is a graphical user interface for making changes to columns, so we can address data types and modeling types here. We can rename columns, we can change the order of columns, and delete columns, and then save off the script. So let's make some changes here. For humidity, that's the one with the the N/A values that caused it to come in as text. We're going to change it from a character variable to a numeric variable. And we're going to change it from nominal to continuous. We also identified batch number needs to come...needs to get changed to to nominal; part number as well needs to get changed to nominal and the process, which is a number right now, that should also be nominal. fab tech. So that's not useful for me. Let's delete the facility column. I'm going to select it here by clicking on its name and click Delete. 
Here are a couple of those cosmetic changes that Mia mentioned. Scrap rate is at the end of my table. I want to move it earlier. I'm going to move it to the fourth position after customer number. So we select it and use the arrows to move it up in the order to directly after customer number. Last change that I'm going to make is I'm going to take the pressure variable and I'm going to rename it. My engineers in my organization called this column psi. So that's the name that I want to give that column. Alright, so that's all the changes that I want to make here. I have some choices to make. I get to decide whether the script gets saved to the data table itself. That would make a little script section over here in the upper left panel. Where to save it to its own window, let's save it to a script window. You can also choose whether or not the cleaning actions you specified are executed when you click ok. Let's let's keep the execution and click OK. So now you'll see all those changes are made. Things have been rearrange, column properties have changed, etc. And we have a script. We have a script to accomplish that. It's in its own window and this little program will be the basis. We're going to build our data curation script around it. Let's let's save this. I'm going to save it to my desktop. And I'm going to call this v15 curation script. changing modeling types, changing data types, renaming things, reordering things. These all came from Victor. I'm going to document this in my code. It's a good idea to leave little comments in your code so that you can read it later. I'm going to leave a note that says this is from the Victor tool. And let's say from DCSA, for data cleaning script assistant Victor. So that's a comment. The two slashes make a line in your program; that's a comment. That means that the program interpreter won't try to execute that as program code. It's recognized as just a little note and you can see it in green up there. Good idea to leave yourself little comments in your script. All right, let's move on. The next curation task that I'm going to address is a this supplier column. Mia told us how there were some problems in here that need to be addressed. We'll use the recode tool for this. Recode is one of the tools in JMP 15 that leaves behind its own script, just have to know where to get it. So let's do our recode and grab the script, right click recode. And we're going to fix these data values. I'm going to start from the red triangle. Let's start by converting all of that text to title case, that cleaned up this lower case Hersch value down here. Let's also trim extra white space, extra space characters. That cleaned up that that leading space in this Anderson. Okay. And so all the changes that you make in the recode tool are recorded in this list and you can cycle through and undo them and redo them and cycle through that history, if you like. All right, I have just a few more changes to make. I'll make the manually. Let's group together the Hershes, group together the Coxes, group together all the Andersons. Trutna and Worley are already correct. The last thing I'm going to do is address these missing values. We'll assign them to their own category of missing. That is my recode process. I'm done with what I need to do. If I were just point and clicking, I would go ahead and click recode and I'd be done. But remember, I need to get this script. So to do that, I'm going to go to the red triangle. 
Down to the script section and let's save this script to a script window. Here it is saved to its own script window and I'm just going to paste that section to the bottom of my curation script in process. So let's see. I'm just going to grab everything from here. I don't even really have to look at it. Right. I don't have to be a programmer, Control C, and just paste it at the bottom. And let's leave ourselves a note that this is from the recode red triangle. Alright, and I can close this window. I no longer need it. And save these updates to my curation scripts. So that was recode and the way that you get the code for it. All right, then the next task that we're going to address is calculating a yield. Oh, I'm sorry. What I'm going to do is I'm going to actually execute that recode. Now that I've saved the script, let's execute the recode. And there it is, the recoded supplier column. Perfect. All right, let's calculate a yield column. This is a little bit redundant, I realize we already have the scrap rate, but for purposes of discussion, let's show you how you would calculate a new column and extract its script. This is another place in JMP 15 where you can easily get the script if you know where to look. So making our yield column. New column, double click up here, rename it from column 16 to yield. And let's assign it a formula. To calculate the yield, I need to find how many good units I have in each batch, so that's going to be the batch size minus the number scrapped. So that's the number of good units I have in every batch. I'm going to divide that by the total batch size and here is my yield column. Yes, you can see that yield here is .926. Scrap rate is .074, 1 minus yield. So good. The calculation is correct. Now that I've created that yield column, let's grab its script. And here's the trick, right click, copy columns. from right click, copy columns. Paste. And there it is. Add a new column to the data table. It's called yield and here's its formula. Now, I said, you don't need to know any programming, I guess here's a very small exception. You've probably noticed that there are semicolons at the end of every step in JSL. That separates different JSL expressions and if you add something new to the bottom of your script, you're going to want to make sure that there's a semicolon in between. So I'm just typing a semicolon. The copy columns function did not add the semicolon so I have to add it manually. All right, good. So that's our yield column. The next thing I'd like to address is this. My processes are labeled 1 and 2. That's not very friendly. I want to give them more descriptive labels. We're going to call Process Number 1, production; and Process Number 2, experimental. We'll do that with value labels. Value labels are an example of column properties. There's an entire list of different column properties that you can add to a column. This is things like the units of measurement. This is like if you want to change the order of display in a graph, you can use value ordering. If you want to add control limits or spec limits or a historical sigma for your quality analysis, you can do that here as well. Alright. So all of these are column properties that we add, metadata that we add to the columns. And we're going to need to use the Data Cleaning Script Assistant to access the JSL script for adding these column properties. So here's how we do it. At first, we add the column properties, as usual, by point and click. I'm going to add my value labels. 
Process Number 1, we're going to call production. Add. Process Number 2, we're going to call experimental. And by adding that value label column property, I now get nice labels in my data table. Instead of seeing Process 1 and Process 2, I see production and experimental. Data Cleaning Script Assistant. We will choose the property copier. A little message has popped up saying that the column property script has been copied to the clipboard and then we'll go back to our script in process. from the DCSA property copier. And then paste, Control V to paste. There is the script that we need to assign those two value labels. It's done. Very good. Okay, I have one more data curation step to go through, something else that we'll need the Data Cleaning Script Assistant for. We want to consider only, let's say, the rows in this data table where vacuum is off. Right. So there are 313 of those rows. And I just want to get rid of the rows in this data table where vacuum is on. So the way you do it by point and click is is selecting those, much as I did right now, and then running the table subset command. In order to get usable code, we're going to have to use the Data Cleaning Script Assistant once again. So here's how to subset this data table to only the rows were vacuum is off. First, I'm going to use, under the row menu, under the row selection submenu, we'll use this Select Where command in order to get some reusable script for the selection. We're going to select the rows were vacuum is off. And before clicking okay to execute that selection, again, I will go to the red triangle, save script to the script window. Control A. Control C to copy that and let's paste that at again From rows. Select Where Control V. So there's the JSL code that selects the rows where vacuum is off. Now I need, one more time, need to use the Data Cleaning Script Assistant to get the selected rows. Oh, let us first actually execute the selection. There it is. Now with the row selected, we'll go up to again add ins, Data Cleaning Script Assistant, subset selected rows. I'm being prompted to name my new data table that has the subset of the data. Let's call it a vacuum, vacuum off. That's my new data table name. Click OK, another message that the subset script has been copied to the clipboard. And so we paste it to the bottom. There it is. And this is now our complete data curation script to use in JMP 15 and let's just run through what it's like to use it in practice. I'm going to close the data table that we've been working on and making corrections to doing our curation on. Let's close it and revert back to the messy state. Make sure I'm in the right version of JMP. All right. Yes, here it is, the messy data. And let's say some new rows have come in because it's a production environment and new data is coming in all the time. I need to replay my data curation workflow. run script. It performed all of those operations. Note the value labels. Note that humidity is continuous. Note that we've subset to only the rows where vacuum is off. The entire workflow is now reproducible with a JSL script. So that's what you need to keep in mind for JMP 15. Some tools you can extract the JSL script from directly; for others, you'll use my add in, the Data Cleaning Script Assistant. And now we're going to show you just how much fun and how easy this is in JMP 16. I'm not going to work through the entire workflow over again, because it would be somewhat redundant, but let's just go through some of what we went through. 
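For reference, a condensed sketch of the kind of JMP 15 curation script assembled above is shown here. The message names approximate what the built-in tools and the Data Cleaning Script Assistant generate; the file path, some option names, and the column data values are assumptions, and the recode step is omitted.

// From the Victor tool: data types, modeling types, delete, reorder, rename
dt = Open( "$DESKTOP/components messy.jmp" );
Column( dt, "humidity" ) << Data Type( Numeric ) << Set Modeling Type( "Continuous" );
Column( dt, "batch number" ) << Set Modeling Type( "Nominal" );
Column( dt, "part number" ) << Set Modeling Type( "Nominal" );
Column( dt, "process" ) << Set Modeling Type( "Nominal" );
dt << Delete Columns( "facility" );
dt << Move Selected Columns( {"scrap rate"}, After( "customer number" ) );
Column( dt, "pressure" ) << Set Name( "psi" );

// New yield column (from right-click > Copy Columns)
dt << New Column( "yield", Numeric, Continuous,
	Formula( (:batch size - :Number Scrapped) / :batch size )
);

// Value labels for process (from the DCSA property copier)
Column( dt, "process" ) << Set Property( "Value Labels", {1 = "production", 2 = "experimental"} );

// Keep only the rows where vacuum is off (Select Where plus DCSA subset)
dt << Select Where( :vacuum == "Off" );
vacOff = dt << Subset( Selected Rows( 1 ), Output Table( "vacuum off" ) );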
Here we are in JMP 16 and I'm going to open the log. The log looks different in JMP 16 and you're going to see some of those differences presently. Let's open the the messy components data. Here it is. And you'll notice in the log that it has a section that says I've opened the messy data table. And down here. Here is that JSL script that accomplishes what we just did. So this is like a running log that that automatically captures all of the script that you need. It's not complete yet. There are new features still being added to it. And I, and I assume that will be ongoing. But already this this enhanced log feature is very, very useful and it covers most of your data curation activities. I should also mention that, right now, what I'm showing to you is the early adopter version of JMP. It's early adopter version 5. So when we fully release the production version of JMP 16 early next year, it's probably going to look a little bit different from what you're seeing right now. Alright, so let's just continue and go through some of those data curation steps again. I won't go through the whole workflow, because it would be redundant. Let's just do some of them. I'll go through some of the things we used to need Victor for. In JMP 16 we will not need the Data Cleaning Script Assistant. We just do our point and click as usual. So, humidity, we're going to change from character to numeric and from nominal to continuous and click OK. Here's what that looks like in the structured log. It has captured that JSL. All right, back to the data table. We are going to change the modeling type of batch number and part number and process from continuous to nominal. That's done. That has also been captured in the log. We're going to delete the facility column, which has only one value, right click Delete columns. That's gone. PSI. OK, so those were all of the tool...all of the things that we did in Victor in JMP 15. Here in JMP 16, all of those are leaving behind JMP script that we can just copy and reuse down here. Beautiful. All right. Just one more step I will show you. Let's show the subset to vacuum is off. Much, much simpler here in JMP 16. All we need to do is select all the off vacuums; I don't even need to use the rows menu, I can just right click one of those offs and select matching cells, that selects the 313 rows where vacuum is off. And then, as usual, to perform the subset, to select to subset to only the selected rows, table subset and we're going to create a new table called vacuum off that has only our selected rows and it's going to keep all the columns. Here we go. That's it. We just performed all of those data curation steps. Here's what it looks like in the structured log. And now to make this a reusable, reproducible data curation script, all that we need to do is come up to the red triangle, save the script to a script window. I'm going to save this to my data...to my desktop as a v16 curation script. And here it is. Here's the whole script. So let's uh let's close all the data in JMP 16 and just show you what it's like to rerun that script. Here I am back in the home window for JMP 16. Here's my curation script. You'll notice that the first line is that open command, so I don't even need to open the data table. It's going to happen in line right here. All I need to do is, when there's new data that comes in and and this file has been updated, all that I need to do to do my data curation steps is run the script. And there it is. 
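As a side note, rerunning a saved curation script like this can itself be automated; the JSL Include() function runs a .jsl file as if its contents were typed in, so a scheduled job or a toolbar button could trigger the whole refresh. The path below is illustrative only.

// Rerun the saved curation script against the updated source data
Include( "$DESKTOP/v16 curation script.jsl" );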
All the curation steps and the subset to the to the 313 rows. So that is using the enhanced log in JMP 16 to capture all your data curation work and change it into a reproducible script. Alright, here's that JMP 15 cheat sheet to remind you once again, these, this is what you need to know in order to extract the reusable code when you're in JMP 15 right now, and you won't have to worry about this so much once we release JMP 16 in early 2021. So to conclude, Mia showed you how you achieve data curation in JMP. It's an exploratory and iterative process where you identify problems and fix them by point and click. When your data gets updated regularly with new data, you need to automate that workflow in order to save time And also to document your process and to leave yourself a trail of breadcrumbs when you when you come back later and look at what you did. The process of automation is translating your point and click activities into a reusable JSL script. We discussed how in JMP 15 you're going to use a combination of both built in tools and tools from the Data Cleaning Script Assistant to achieve these ends. And we also gave you a sneak preview of JMP 16 and how you can use the enhanced log to just automatically passively capture your point and click data curation activities and leave behind a beautiful reusable reproducible data curation script. All right. That is our presentation, thanks very much for your time.  
Sports analytics tools are being used more and more to help athletes enhance their skills and body strength, perform better, and prevent injury. An ACL tear is one of the most common and dangerous injuries in basketball. This injury occurs most frequently in jumping, landing, and pivoting due to the rapid changes of direction and/or sudden decelerations in basketball. Recovering from an ACL injury is a brutal process that can take months, even years, and can significantly decrease the player's performance after recovery. The goal of this project is to find the relationship between fatigue and different angle measurements in the hips, knees, and back, as well as the force applied to the ground, to minimize ACL injury risk. Seven different sensors were attached to a test subject while he performed the countermovement jump for 10 trials on each leg before and after 2 hours of vigorous exercise. The countermovement jump was chosen because it assesses ACL injury risk quite well through the force and flexion of different body parts. Several statistical tools such as the Control Chart Builder, multivariate correlation, and variable clustering were utilized to discover general differences between the before- and after-fatigue states for each exercise (which relate to an increased ACL injury risk). The JMP Multivariate SPC platform provided further biomechanical, time-specific information about how joint flexions differ before and after fatigue at specific time points, giving a more in-depth understanding of how the different joint contributions change when fatigued. The end-to-end experimental and analysis approach can be extended across different sports to prevent injury.
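No transcript accompanies this abstract, but a heavily hedged JSL sketch of the kind of comparison it describes might look like the following. The column names (hip flexion, knee flexion, back angle, ground force) are invented for illustration, and the Model Driven Multivariate Control Chart role name is an assumption.

// Monitor the joint-angle and force signals jointly; the pre-fatigue trials
// would be designated as the historical period by point and click
dt = Current Data Table();
dt << Model Driven Multivariate Control Chart(
	Process( :hip flexion, :knee flexion, :back angle, :ground force )
);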
Anderson Mayfield, Dr. (assistant scientist), University of Miami   Coral reefs around the globe are threatened by the changing climate, particularly the ever-rising temperature of the oceans. As marine biologists, we normally document death, carrying out surveys on degraded reefs and quantifying the percentage of corals that have succumbed to "bleaching" (the breakdown of the anthozoan-dinoflagellate endosymbiosis upon which reefs are based) or disease. Although these data are critical for managing coral reefs, they come too late to benefit the resident corals. Ideally, we should instead seek to assess the health of corals before they display more visible, late-stage manifestations of severe health decline. Through a series of laboratory and field studies carried out over the past 20 years, we have now developed a better understanding of the cellular cascades involved in the coral stress response; this has resulted in a series of putative molecular biomarkers that could be used to assess reef corals health on a proactive, pre-death timescale. In this presentation, I will review progress in reef coral diagnostics and show how I have used JMP Pro to develop models with the predictive capacity to forecast which corals are most susceptible to environmental change. A similar approach for instead identifying resilient corals will also be presented.     Auto-generated transcript...   Speaker Transcript Anderson Mayfield Alright everybody, thanks for tuning in. My name is Anderson Mayfield and I'm an assistant scientist at NOAA as well as the University of Miami. And today I'm going to be talking to you about some of the research I've been doing over the past 20 or so years on towards predicting the fate of reef corals. And because I'm going to try to cover a lot of material in the next 35-40 minutes, I want to go ahead and acknowledge my funding agencies right here from the get go, particularly NSF, NOAA, Living Oceans Foundation. A lot of the work I've been fortunate enough to be able to do over my career has taken place in remote locations using some pretty new technologies that tend to be expensive, so I've been really fortunate to partner with these excellent agencies that have supported this work. So I'm actually interested in all facets of coral biology from the basic cell biology. How do the cells function on a day in, day out basis. Cells are really crowded as I'm going to show you. I did a pretty long stint where I was basically doing the aquarium studies where I challenged corals with different stressors like temperatures and ocean acidification, PCO2 levels. and we were doing this in microcosms and mesocosms. We call this environmental physiology and...I'm actually going to show you some data on this today. What I've really gotten into the last few years, which is kind of the heart of this presentation is predictions. Here I'm referring to it is molecular diagnostics. How do we know if a coral is sick? We usually wait until it's almost dead to diagnose it. What I want to do is kind of try to push... push it from this retroactive diagnostic approach to a more proactive one. Can we detect subtle decreases in coral health before they start to get sick? Can we make predictions about which ones will fare best in the coming decades and which ones will not? So this is kind of the heart of what I do on my day in and day out basis, as well as one I'm going to talk about today. 
So I apologize to those of you have heard some of my JMP talks before; you're going to see a lot of repetition, but I basically want to get everybody on the same page with some of their coral reef biology that will be critical to understanding why I did things the way I did them. I'm going to talk about some problems facing coral reefs. Unfortunately, most of us now working in the coral reef field, not only just studying coral reefs because they're beautiful and fascinating, but also because they're dying at such rapid rates. Then I'm going to talk about two completely different, but potentially complimentary approaches for predicting coral fate. One of them's looking at basically global temperature trends from satellite data. This is what most people rely on as we speak right now. Then I'll talk to you about some new data and I'm going to use JMP to show you how quickly you can go from getting some some molecular analyte data right off the machine and go directly into making predictions using a cell-based approach. So instead of looking at global temperature patterns to make predictions about coral health, we're actually looking inside of the coral cells to see if we can diagnose anything that might be going awry. Because I'm going to try to cover so much in a relatively compressed amount of time, I'm inherently going to gloss over things. So if you have any questions, I'm uploading all the data I'm going to show you today. There's hidden slides. So there's some things that you may see me make in JMP and you don't understand what happened. They're actually within the PowerPoint. So if you go to download it, you can see the nitty gritty, step by step, analytical details. You can either do that. You can, you can, I'm making all the data open access, the PowerPoint, so email me. You can check out my personal website that I've listed here. I'm definitely happy to answer any questions. And I think these data are important. So, you know, they're not published yet. But, you know, feel free to take a look at them at you at your leisure. So I think most people have a general idea of what a coral reef is. I'm I just want to throw out the definition right from the get go, it's a calcium carbonate based structure. It's been built by an animal-plant symbiosis. They like warmer water. But as you'll see soon, not too warm. They tend to be near the equator and the tropics, tend to have very high biodiversity, of fish, invertebrates. As you can see from this picture, there's tons of hiding places for small organisms. This means they're important nurseries for commercially important fisheries and literally buffer coastline from...coastlines from storm and wave damage that provide a number of ecosystem services to humans. Frankly, that's not why I studied coral reefs. I study them because I think they're beautiful. I love to scuba dive. And I think the underlying framework-building organisms are really fascinating in and of themselves and thus, the reef coral. A reef coral, actually, it's not just an animal. It has plant cells specifically dinoflagellate algae within about half of its tissues. These dinoflagellates, which I've shown you here, this tongue twisting family name is Symbiodiniaceae, these dinoflagellates are photosynthetically active. They fix carbon from some sunlight, just like a plant would and they translocate the carbon they fix into their host Yes. 
Corals can feed; they've got specialized stinging cells called nematocysts that all cnidarians have, but frankly they rely on their endosymbiots for most of their nutritional needs. The endosymbiots are also getting something, they're getting the waste products from the coral for their metabolism. They're getting shelter. It's a really intricate mutualistic association that's been the focus of study for over 50 years, but there's still many key facets that we don't understand. Fact of the matter is, what you need to know for this talk is coral is about a two-third animal, one-third plant from a physiological physiologist perspective. So it's basically the physiological oxymoron, in the sense that it's an animal that can photosynthesize via this mutualistic symbiosis with the dinoflagellate algae. You want to understand the health of a coral, you're going to need to focus not just on the coral itself, but on its dinoflagellate endosymbiots. And it's actually goes beyond that. Corals host an entire mini ecosystem that we call the microbiome. There's a plethora of bacteria, viruses, fungi that live on, within, or even inside of the coral skeleton. I'm only really going to talk about the dinoflagellates and the coral host today, but suffice to say if you really want to have a holistic understanding you need to consider the, the whole what we call a quote unquote holobiont, which is an amalgamation of host and symbiot. So I think this is probably something that's already known to most to you. There's numerous global threats to coral reefs. There's things like coastal development, leading to seawater pollution. There's overfishing. In Florida, we are particularly concerned with disease. Right now there's one called stony coral tissue loss diseases ravaging Florida's reefs. But on a global scale, what I would, what I would hazard to say most coral biologists would put as the major threat to coral reefs is climate change. The reason these corals are basically already living near the upper threshold of their thermo tolerance. If you push that temperature a little bit higher than what they can take, they're going to bleach. So you're going to see what you see in this picture here. And what that means is they either lost their dinoflagellates. They've either expelled them, they've digested them. The dinoflagellants might still be in the coral, but they've lost their these corals can no longer photosynthesize, which means they can no longer nourish themselves. If they're not able to rapidly reacquire symbionts, or perhaps get a flush of ...a food rich seawater that goes past them, they're going to starve to death and die. So usually what you see is when you go about one degree Celsius above the trailing mean high temperature of the year, that's when you see bleaching. And this has been known for decades. So what NOAA did about 20 years ago, they started this program called Coral Reef Watch. It's really nice website. I recommend you to check it out. And it's based on a really simple idea. You expose corals to water that's hotter than normal for them for long periods of time, that's universally bad. The idea is known as the degree heating week, so I'll try to explain it. This is...I apologize for the low resolution figure, but it's it's their instructional one they include on their website. So let's walk through this figure, because this is important. So on the left y axis, you have SST that stands for sea surface temperature, that's satellite inferred temperature data. 
On the right we have a degree heating week, which is basically an integral of the degrees in Celsius above normal multiplied by the time. So in this case, that the hatched blue line at about 29 degrees is the trailing mean of the hottest month of the year, which in this case looks to be perhaps I guess in the summer. So you start accruing stress at one degree above this trailing monthly mean. So that would be 29 plus one is 30. So the time these corals spend above 30 degrees, and this example is from Palmyra Atoll, is when you're accruing these degree heating weeks. So if you have one degrees Celsius above the mean monthly high for four weeks, that would be four degree heating weeks. You had two degrees Celsius above the mean monthly high, so in this case 31, for two weeks, that would also be four degree heating weeks. So, this... their kind of underlying hypothesis is, you know, two degrees C above the mean monthly high for two weeks is going to result in kind of the same bleaching endpoint as the one degree for four weeks. From a physiologist physiologist perspective there might be some issues with this, but I think overall, this is a pretty good metric. And what their kind of underlying benchmark is, once you get above about four degree heating weeks, once you get into that four to eight window, you're probably going to start to see bleaching. So, they've got some really nice gifs on the website that are showing you past data. So this is looking at the trailing, I think, three or four months. So this is looking at the Caribbean and there, basically where you're seeing red, this is when you're having the degree heating weeks of four to eight. This is where they're predicting bleaching is actually going to happen. They don't do...NOAA itself doesn't do really any ground truthing so they rely on individuals to, you know, email and phone in saying, "Hey you know your prediction in this particular reef that I like is wrong. Or this was right." I think, you know, and these days for a lot of us are sitting at home. We can't dive. This kind of in silico science is, you know, maybe it's going to be the best we can do. And this summer is just, is watch how the temperatures progress and then hope the corals can weather the storm. And there's a totally different predictive, I would say, approach for looking at coral health and this is actually not looking at...so the Coral Reef Watch is looking kind of within a year, showing you the progression of bleaching over the hottest part of the year. And they actually extend four months into the future. So their algorithms will try to project as you would with weather, what's going to happen on those reefs in the coming four months. In this model, which is kind of a combination of the NOAA's Coral Watch with some some UN models based on IBCC climate change production...projections, it's looking at something a little bit different. So here, instead of seeing temperatures in the cells, you're actually seeing year of onset of annual severe bleaching. What does that mean? This is the year they predict, based on temperature records and protect...projected degree heating weeks, you're going to start seeing bleaching every year. And that's scary because if corals bleach for one year, in one hot summer, for instance, you might see some death but corals are going to recover. Some are going to acclimatize. 
You might even see adaptation When you start having repeat years of bleaching, where there's basically no respite, this is when corals will, coral reefs really start to degrade. And you can see this is scary, because if you look at the scale, 2015, that's obviously already past. We've got reefs (this is a map of South Florida here) we've got reefs that are already predicted to be bleaching in the entirety of these pixels which are about 25 by 25 kilometers, to already be bleaching every year. So what I'm interested in, I'm a a physiologist. I want...I don't want to try to contradict these models, what I want to try to do is delve into each of these 25 by 25 kilometers squares, because we know these are global scale models. They're not telling you that every single coral and in these pixels is going to bleach. We know there's going to be heterogeneity. We know there's going to be some stronger species. Even within species, there's going to be stronger genotypes. This is what people like me are interested in. Can we improve the spatial resolution of these models and maybe even the temporal resolution by factoring in data from the coral themselves? This is just looking at temperature. How could we merge this with the health of the actual corals? So prior to my coming to NOAA, a year and a half ago, there's actually been a great body of work done in the upper Florida Keys, which is excellent for me. We've already got coral reefs, we know everything about their oceanography, their ecology for the past 20 or 30 years. For somebody like me who wants to go take samples, do some analyses, make predictions, this is a really great setup to have. So I'm going to be talking about four sites in particular, really. I'm just going to be talking about Cheeca Rocks, but I just want to show you right here is we've got this The Rocks and Cheeca Rocks is our kind of our preferred inshore reef sites. Little Conch and Crocker Reef are offshore, there's actually many more than this. But we know from our large data set from the past 20 or 30 years, that these inshore reefs are more resilient. Corals are stronger, coral covers higher, they resist disease and high temperatures better. And this is because they've been stress hardened These environments are abysmal. So it's kind of a paradox, they are living in dirty turbid water that corals..high-nutrient water the corals normally wouldn't thrive in but they've adapted or perhaps acclimatized to living there. So now when there's future stressors that come at them, they're better able to resist them. So while the offshore reefs might be the ones you want to spend your vacation at, the water has that beautiful Caribbean look, but the corals look like you see on the right. They're just little patches of coral. Most of it has been killed by bleaching and disease. Inshore reefs, you can still see these large massive coral colonies. This species you see here is Orbicella faveolata, and that's going to be the one I feature in this talk. You actually can get some really nice images from Cheeca Rocks, in particular. Some National Marine Sanctuary on some NOAA websites, particularly this one, you can do a virtual dive. So let's look at the temperature data from Cheeco Rocks and kind of consider it in the context of what I was telling you earlier about those degree heating weeks and NOAA's coral watch predictions. So actually I made this in Graph Builder, I should mention that when I go to JMP later I'm actually using a beta version of 16. 
But as I'm not going to feature any tools that aren't already available to JMP 15 users. It's going to be very basic things. So there's nothing inherently JMP 16 specific that I'm going to show you in this talk. So here I use Graph Builder, what I've done, I've tried to kind of emulate NOAA's coral watch graph, but just focusing on Cheeca Rocks. So we've got the sea surface temperature on the left y axis, the degree heating weeks on the right. So the trailing mean temperature at Cheeca Rocks, so the highest temperature of the year is in July and August. And it's 31 degrees, which is already really hot for a coral reef. If you use the kind of UN NOAA recommendation of the mean monthly maximum plus one, that would be 32 degrees. So you would expect corals to start accruing stress above 32 degrees. You would calculate your degree heating weeks based on the time the temperature was above 32. However, we know that that's actually too hot. If you did your degree heating weeks based on 32, you would never reach that four degree heating week threshold, which you can see is the bottom-most hatched red line here. Instead, what I've done in this plot is I reset the degree heating week calculation in JMP to 31.3. That's kind of our bleaching threshold. We know since we go to these sites so often, 32 is too high. Once these corals are above 31.3 for four to eight weeks, that's when they start to bleach. So you know it's not to say that the NOAA coral watch models are bad, it's just in particular locations, they might need some tweaking with your own field observations. So this is actually looking at the years we've been studying Cheeca Rocks in detail. And you can see, we first started seeing bleaching in 2014 and 15 and the degree heating weeks in those years, based on our 31.3 threshold, which is the solid red line, was over 10, which is a lot. So we would definitely expect to see bleaching in those years. The anomaly is 2016. 2016 is kind of obscured in the graph, but you can see the top of that, the top-most red column is pretty much in that nine to 10 degree heating week ballpark. The reefs should have bleached that year, but they didn't. So that could be, you know, acclimatization or just recovery in general, but that's the anomalous year. 2017 was a cooler year even though we saw degree heating weeks of eight, the reefs didn't bleach, but then 2018 bleaching, 2019 bleaching. And then I just this morning updated this to zoom in to 2020 and sadly, when I first made this presentation a few weeks ago, you know, we didn't have any bleaching threats, but now you can see the red bars creeping above four as the temperature's been above 31.3 these past few weeks, so unfortunately we're probably going to have another bleaching year. So we have...the good thing about working with corals is they're sessile, so they don't move, except for when their larvae. So within these sites, not only can we go look at, you know, bleaching on a year in, year out basis, we've got tagged colonies. We can sample them at different times, we can see how their physiology changes in concert with these temperature changes. And then we could actually, we actually have the luxury, since we know what ended up happening each year with respect to bleaching, you know, we can hindcast. We know, hey, I took the sample in March; that coral bleached in August. Maybe we could look into those biopsies that we took earlier and see if we can detect any stress. 
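To make the recalculation the speaker describes concrete, here is a minimal JSL sketch that accumulates degree heating weeks against a site-specific bleaching threshold such as 31.3 degrees, following the NOAA convention of a trailing 84-day (12-week) window with the accumulated degree-days divided by 7 to give degree-weeks. The data table and its :SST column are hypothetical, and this is an illustration of the idea rather than the script behind the Graph Builder figure; NOAA's own product additionally only accumulates anomalies of at least 1 degree above the monthly mean climatology, which is not reproduced here.

// Sketch: degree heating weeks against a custom 31.3 C bleaching threshold.
// Assumes a table of daily rows with a numeric :SST column (hypothetical names).
dt = Current Data Table();
threshold = 31.3;
dt << New Column( "DHW_custom", Numeric, Continuous );
For( i = 1, i <= N Rows( dt ), i++,
	total = 0;
	For( j = Max( 1, i - 83 ), j <= i, j++,   // trailing 84 days = 12 weeks
		hotspot = dt:SST[j] - threshold;
		If( hotspot > 0, total += hotspot );
	);
	dt:DHW_custom[i] = total / 7;             // convert C-days to C-weeks
);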
What I really want to do is not so much look at the timeline of the bleaching process, because I think Coral Reef Watch is doing that well enough; there is more bleaching in the summer, and we're not going to be able to improve upon that, although maybe you can get the timing a little bit better. What I really want to know is why, in this particular instance, you see partially bleached (that's the PB), bleached (that's the BL), or NB (not bleached) colonies. Why do you see this heterogeneity? It's not as obvious except maybe if you look in the bottom right: you see paling colonies, and you see colonies that aren't paling. Why are some corals stronger than others? Can we use information from these colonies to basically make a test that would create a signature, a proteomic or molecular signature, of a stressed colony versus a resilient one? This is what I think would actually be more useful. We might not be able to know ahead of time which corals are going to bleach first or bleach second, but we might be able to say, hey, based on my biopsy, based on my molecular analysis, and based on my resulting predictive model, this coral, I'm predicting, is going to be able to resist bleaching better than this one over here. So that's kind of my goal. And I've been trying to tackle this since I was a PhD student in Hawaii. It was mainly gene expression work, looking to see if we can find any genes that shoot up in expression in corals that are more stress-susceptible. Maybe they are not used at all in resilient corals, or maybe it's vice versa; maybe there are genes that are only used by the hardy corals. These are things we can measure. I later found out that there's really no correlation between gene expression and protein concentrations in corals. So yes, you could still use the gene data for biomarker analysis. Gene expression data might still be really useful in making predictions about coral health, but if you want to know what's going on in the cells, if you want to know about the cellular behavior, you actually need to look at the proteins. You can't infer the protein levels from the gene expression levels, because the R squared for corals, and for their symbionts, is essentially zero. So I made a kind of dramatic shift in my research into proteomics about two or three years ago. What I originally wanted to show you today, when I first wrote this abstract, was field data from our tagged colonies to show how their protein signatures move over time for the stress-susceptible and the resilient corals. And that's going to be the number one thing I do when our labs reopen. Unfortunately, we're not allowed to do lab work right now because of COVID, and we're not actually allowed to do field work either. So instead, what I'm going to show you is some experimental data that I think still might be useful for getting at some of these questions. So basically we have our favorite coral, Orbicella faveolata, here. This is a paling one from the upper Florida Keys. We took corals from those four field sites I showed you earlier, and we did a simple experiment. We did a five-day study at a control temperature of 30 degrees versus a very high temperature of 33; these are going to be coded red in the figures, or maybe with a V, which is for very high. And then we did a kind of more realistic study with a 31-day exposure to the same control temperature, but to a more environmentally relevant high temperature of 32.
You remember 32 is getting into that area where most corals can't resist exposure to 32 over prolonged period. So what I did...well, what I wanted to do was look into the cells of these corals and see if there any proteins that are only found in corals that proceed to bleach. Are there any that are only found in the resilient ones that are able to acclimate? So I took a shotgun proteomics approaches. It's a fancy way of just saying I just sequenced all the proteins that were in that sample with mass spectrometry. It's actually not simple at all if you get into the nitty gritty of the molecular analyses and the mass spectrometer. Not going to go into that today, just want to mention that we use this Thermo Fisher mass spectrometer called the Q Exactive. It's one of the best analytical instruments ever developed. It's amazing what it can do and its sensitivity. But I do want to reemphasize that when you do these types of analyses with coral, we're getting the coral proteins and you're getting dinoflagellate proteins. It's very important to look at both. You might have ....  
Serkay Ölmez, Sr Staff Data Scientist, Seagate Technology Fred Zellinger, Sr Staff Engineer, Seagate   With many users and multiple developers, it becomes crucial to manage and source control JSL scripts. This talk outlines how to set up an open source system that integrates JSL scripts with GIT for source control and remote access. The system can also monitor the usage of scripts as well as crash log collection for debugging. Other features such as VBA scripting for PPT generation, unit testing, and user customization are also integrated to create high quality JSL scripts for a wide range of user base.     Auto-generated transcript...   Speaker Transcript Serkay Olmez Hello this is Serkay and Fred from Seagate and today we are going to talk about how we built a JSL ecosystem to manage our JSL scripts. So we are using Git to manage the source controls, the source of our JSL scripts, as well as to distribute them. So that will be the main point of the talk today. But in addition to that, I will be also talking about VBA for PowerPoint integration and also crash log collection, as well as unit tests. I will jump to the outline quickly. So what I want to first talk about is about our history which JSL. I've been scripting in JSL for about 10 years or so. And we are very happy with where we are at now, but it took us quite a bit of time to get here and I want to talk about the milestones of our experience, what we did so far and why we did so. And then we'll be talking about Git and how we enabled Git to source control our JSL scripts and how it enabled us to build further features. For example, once you enable Git, you can add more features such as monitoring your scripts. So you know that the developers can know that their scripts are in use, so they can monitor the usage. And probably more importantly, they can also collect logs. If the script crashes, the developer know and he or she can go back and fix those bugs. And you can... you can add one more feature going one step further. You can even create automated bug tickets you already know that your script crashed. So you can automatically create a bug ticket for that. And you can track those tickets using a tracking software such as JIRA or Atlassian. And I will end up with our best practices and lessons learned so far. So, And in the appendix I also have a manual for a script I will be talking about. It's a script that can push images to PowerPoint. And I have a very detailed manual for that, and I should just let you know that everything I talk about here, the data and the scripts will be available. And they are posted in a public repository in the references section here and you can just go there and grab those files if you choose to do so. So let me start with with the brief history here. So 10...I've been working with JSL for about 10 years or so, and started, I started with very basics and I didn't really know much about JSL scripting. And then what once you start doing that you realize that you have to have some proper source control and you you need proper ways of distributing your scripts. So the thing about JMP scripting is that it has zero barrier to entry. So you can literally do a plot manually and then go grab the script. It's written for you. And you can also use Community to JMP in JMP.com to ask questions and get answers. And you...the scripting's so efficient that you're doing your job so efficiently and people will notice. They will ask you how you do things and you'll say I have a script for that. 
And they will ask, "Can you share that with me?" And all of a sudden you become a developer, although you didn't intend to do so. And then you have to deal with distributing your scripts. The first idea that comes to mind is just attach them, which is a horrible idea, and I've done that for quite a while. And it's kind of obvious why that's not a good idea, because you attach a script and then you send it out and then the next day you make a revision and then you have to send it again, and you don't even know if the user will go with this next one. So I'm just illustrating the point here. Recently, I got an email from a colleague and he was referring to a script I created in 2017 and I was just numbering my scripts with these version numbers, which which is not a good idea. So it comes back after three, four years and you realize that people are still using three year old script, because they didn't update. One way of solving this problem is to use shared drive and based on the interaction I had with people in in them in the Discovery Summits, is that many people, many companies are using this one. So what what developers do is to dump their scripts into a shared drive and users will be pulling directly from the shared drive, which solves half of the problem that distribution problem, but it doesn't do anything about source controlling. You don't...you cannot trace the changes you did in the code. And that's why we actually moved to Git. And that was a breakthrough for us and enabled lots of features. So what do you do with Git is that developers will push their scripts to a repository and users will be pulling their scripts directly from Git. So that was a big improvement for us and it enabled us to collect crash logs and usage of the scripts, etc. And I just want to talk about a couple more things I learned from the Summits as well. They have been quite useful to improve my scripting skills and I attended a summit in 2018 and I learned quite a bit about expressions, etc. So people may want to go back and listen to those presentations, because they do help with the scripting skills. And one other milestone for us was about the testing. So I was inspired by this talk in the Summit last year, which was about unit testing and that enables you to automatically test your scripts before you publish them. Today I will be mostly talking about integration tests because unit tests are...unit tests are required, but not sufficient. Because you can do a unit test you can test all of your modules and they check out fine. But when you put them together, they won't work. They will crash, as illustrated here. Each drawer is tested, probably, but when you put them together, they won't operate. One nice feature that helps the developer quite a bit is about log collection. It is so helpful to know that your script crashed and, you know, how it crashed. And you can do that by collecting the logs from users and you can go back and fix your script and push the changes, and users will have their fixed script right away. And the final feature we are rolling out rather recently is about automated bug reporting. Since you already have the crash report, why not act upon that information? And so we create an automated monitoring system, which will track those crash logs and it will create bug reports automatically so that the developer can work on those. And on top of it, you can add the user as a watcher so that the user will know that somebody is working on the problem. 
So that solves the information gap between the user and the developer. So this shows the timeline, and I'll be spending some time on the individual items, but we'll start with Git, and I will go through this quickly and let Fred talk to this. Fred Zellinger Okay. So, Git is just a version control system, though I don't want to use only the words "version control." Actually, I want to say that Git helps you by giving you, effectively, unlimited redo, even if you use Git on your own local computer only. It gives you the ability to track revisions of your files and go back to old ones, in case you decide that some work you did over the last few weeks was incorrect and you want to go back to something from several weeks ago. So Git is just software that you install on your local computer. It creates a repository, and that repository can then be pushed up to other repositories, such as GitLab or GitHub. TortoiseGit is a GUI interface to Git on your local machine that makes things easier. So, Serkay, if you could open the next slide. The model that I've tried to get developers to use is this: on their local computer, have them install the Git client and start version controlling the files on their local PC. Once they get in the habit of doing that, then we can connect them to a remote repository; Git connects to remote repositories over SSH generally, or by other methods, and they can push copies of their Git database up to the remote repository. Once it's up on the remote repository, then it's available for an HTTP web server to share back out, and JMP can then point to the URLs on that HTTP web server and load scripts from it. So instead of using a shared drive, we're using an HTTP web server that was populated by a push to a Git repository. And then the bullet items down below just point out the benefits of that and how exactly it works. We can go to the next slide. Serkay Olmez So I will take over here, Fred, and I will show a very basic illustration of the implementation; the scripts are available in the References. So, this will be very basic code. What I'm showing here is the code that the developer is developing. It's the bare minimum, right; it's just a dialog box that says hello world. And assume the developer wants to pass this code to users. So instead of giving them this code, what the developer passes out is this code, which is a link to the repository. So this is the URL of the hello JSL script, which lives in the remote repository. What the user does is just grab the script from the URL. So this is static code. It doesn't need any change, so the developer can change the script whenever he or she wants, but the user doesn't have to do anything. And I will just show you an illustration here; I just want to go to the full screen. This just illustrates how this thing works with a particular GUI, which is GitHub Desktop. So what you do is you install this software on your computer and create a local repository, and it will keep track of the changes you have made. For example, in this case, I just created this hello JSL script, and this software will know that there is a change, and it will be highlighted automatically. And what you do is, you first commit it to your local repository, which pushes those changes into your local repo, and then you push it up to the origin, which is the remote repository.
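Stripped to its essence, the static stub handed to users can look like the sketch below. The URL is a placeholder for wherever your Git-backed web server exposes the raw .jsl file, it assumes (as in recent JMP versions) that the Send message of HTTP Request hands back the response body as text, and it is an illustration of the pattern rather than the actual Seagate stub.

// User-side stub: always pulls and runs the latest script from the repository.
// The URL is a placeholder for your own Git-backed web server.
req = New HTTP Request(
	URL( "https://git.example.com/jsl/hello.jsl" ),
	Method( "GET" )
);
scriptText = req << Send;          // assumed to return the script body as text
Eval( Parse( scriptText ) );       // run whatever the developer last pushed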
Now I will push it to the remote, which will make it available to the users. So once you do that, the users will be able to pull this new code. I'm now switching back to the user role here, and this is the script the user runs; once you do that, the dialog box will show up. So you are running the script that you pulled from the repository. Now I will illustrate this a little better with more advanced code, which will also be related to PowerPoint. People do lots of analysis in JMP, and many people still, at the end of the day, want to push their results into PowerPoint. I know JMP has some capabilities to push images directly to PowerPoint, but we wanted something a little more sophisticated. We want to manage the template of the PowerPoint. We want to do some more stuff within the PowerPoint, decide how many images we want to put per slide, etc. In order to do that, you first need to connect JMP with PowerPoint, and you can do it in a very clean way, so you don't even have to leave JMP. What you can do is locate where your PowerPoint executable file is, and it's typically under Program Files. Then you create a batch file that will trigger this PowerPoint executable, and it will go and grab the PowerPoint file you want to run and just run it. So you can do all of this in JMP. Basically, what this does is it searches for the PowerPoint executable file. Once it locates it, it bundles it into this batch file, and it also includes the path to the PowerPoint you want to run and the macro; the PowerPoint will have a macro in it to do the management within PowerPoint. So this is how you tie JMP to PowerPoint, and you don't even have to leave JMP to do that. But the question is, how do you get this PowerPoint file to your users to begin with? You could ask them to go and download it, which is not ideal, because you want to manage this automatically, and at the same time you want to source control your PowerPoint as well. It needs to be a proper part of the whole ecosystem; you don't want to put it outside. And one other requirement is that you don't want to download the PowerPoint every time the script runs. You want to download the PowerPoint only if it has changed in the repository, only if there is something new with the PowerPoint. A way of doing this is to use a little JMP command, which is Creation Date. What I'm doing here is checking the date of the local file. If the local PowerPoint is older than what I have in the repository, then I will go and download it using HTTP Request, and it looks something like this. What you do is check for the PowerPoint on your local computer. If it doesn't exist, you just go and pull it from the repository using HTTP Request. If it exists, you look at its date, and if it is old, you still go and pull the new one. So the bottom line here is that you can integrate PowerPoint seamlessly into the JMP environment, so you can push your results into PowerPoint, including tables and images. And you can do all of this without breaking source control, and I will show a demonstration here. So this will be the code, for example, you would give to your users. Again, this is static code. It really has nothing except for a URL, which links the script to the remote repository.
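Here is a hedged sketch of the two pieces described a moment ago: refresh the local macro-enabled template only when it is missing or stale, then launch PowerPoint on it through a generated batch file without leaving JMP. All paths and URLs are placeholders, the staleness cutoff stands in for however your repository exposes "newer than mine", the assumption that Send returns file contents that Save Text File can write out is exactly that, an assumption to verify in your JMP version, and the real script also passes the macro to run, which is left out here.

// 1) Refresh the local .pptm template only if it is missing or looks stale.
localPath = "C:\Temp\report_template.pptm";                             // placeholder path
repoURL   = "https://git.example.com/jsl/report_template.pptm";         // placeholder URL
cutoff    = Today() - 7 * 24 * 60 * 60;                                 // e.g. older than a week
If( !File Exists( localPath ) | Creation Date( localPath ) < cutoff,
	req = New HTTP Request( URL( repoURL ), Method( "GET" ) );
	contents = req << Send;                      // assumed to hand back the file contents
	Save Text File( localPath, contents )        // assumed to write them out as-is
);

// 2) Build a one-line batch file that opens the template in PowerPoint, then run it.
//    The POWERPNT.EXE path is the usual default install location, not auto-discovered here.
batch = "\!"C:\Program Files\Microsoft Office\root\Office16\POWERPNT.EXE\!" \!"" || localPath || "\!"";
Save Text File( "C:\Temp\run_ppt.bat", batch );
Run Program( Executable( "cmd.exe" ), Options( {"/c", "C:\Temp\run_ppt.bat"} ) );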
So it will go and grab the JMP script from the repository, and those scripts are again available; you can just pull them from the References. From the developer side, this is the actual script, right; this is the script the developer has developed, and he or she pushes it to the remote repository. I just want to point out a couple of nice features of this script quickly. For example, look at this: it is a standalone text script, and you don't have to distribute PowerPoint files or additional scripts separately. What you do is link them in your script, and they are linked to the repository here, including the PowerPoint. So this script will manage all the distribution. It will go and grab the PowerPoint; it will go and grab other additional files if needed. Everything is bundled together into the script, so it does all the management for you. Let me show this quickly. I will pull this back and put it into full screen so you can see clearly. This will be a demonstration of triggering PowerPoint automatically for a particular case. What this script does is get the paths from the table; this is a JMP table that includes lots of image paths, and those refer to a particular server. It will take those paths, just the highlighted ones here, and push them into PowerPoint. I will just run this, and this script is available to you if you want to give it a try; it's a fully functional, useful script. So I am running this script here. Again, it refers to the repository, so it doesn't have the raw script; it's just pulling it from the repository. You run it, it retrieves the code, and it runs a dialog. I will simply run this. I won't go into the details, and once you run it, it will just trigger the PowerPoint automatically. PowerPoint will launch, and then it starts building the slides here. It will pull 20 images, this will take 10 or 20 seconds, and then you will have the PowerPoint done. So this is the PowerPoint you get; everything is done automatically, and everything was pulled from the repository. So what is next? How would you revise this PowerPoint? Assume you want to make some changes to your template. You can see more details about this in the appendix, but what I want to show is how you change the script in the PowerPoint. I'll go to the full screen and just run this. What we are starting from is what we were left with in the previous slide, right, so you had these images. What I want to do is change the template in a trivial way, just to illustrate the point. So you go to the macro, you scroll down and find the thing you want to change, and I will be doing a simple change here: I will be changing the header color, just to make a point. So you change this and save it. And this is going in real time. You save this, and you can delete the slides so that it goes to the repository faster, because you will be pushing this to the repository. Then you push this out using the Git GUI. It will go into the repository, and it will be available to your users. So all of your users will get the modification instantly; you don't have to ask them to go and update the PowerPoint or anything. This just goes through the repository. And I will switch back to JMP in the user role and then run the script again, which will retrieve the new PowerPoint.
And if you run it again, it will go through the slides, it will create the slide. And what you will notice is that you do have the changes, which was about changing the color of the header line. So, what else do we have? Script monitoring. I think this was one of the best features we developed, because this gives the ability to the developer to see whether their script...his, his script is appreciated or not. Right. So is it run? Is somebody running it on a regular basis? So that's one benefit. The other benefit is about log...crash log collection. So if the script fails, you can capture the failure by using this log capture functionality of JMP. So you sandwich your subroutines into log capture. If it fails...the failing...the log of the failure will be returned into this log returned text. And then what you also do is you enclose it in a try, so that the script still survives to to do the reporting. And then you check whether it's empty. If it's empty, that means there was no crash at all, so your scripts survived. But if not, if it's not empty, that means there was a crash and you contained it. You can you you grab the log and now you can report it. And the way to do it is to use HTTP user ID, the script name, and the lognote, whether it's a crash... whether is was a crash or just a regular, run, maybe some performance metrics. So the bottom line is you can transmit some some some data metadata about your script back to the developer. And what the server will do is, is to log them and you will have a set of files stored in the server and you can monitor them. And I'm just showing a sample table out of these. It's a crash log, which has...which has the date stamp, it has a script name, so this script has failed at the date with this particular crash. So this is extremely useful for the developer because he or she can go back and fix the issue. So I will collapse this and this and I will do a quick quick demonstration here that shows how you how you contain the crash and how you report it to the user. So what I will do here is to make a subroutine crash. I will create an undefined parameter and JMP will complain. It will it will say this thing is not defined, so the script is is crashing. But what I will do is I will call that subroutine in a log capture functionality and everything will be sandwiched in under try. So although the script crashed with the subroutine, it will survive overall, and it will it will be able to report a crash report. So do you give it nice nicely formatted notification to the user and say that. So the scripts crash with this particular error and we created a log for it and we are working on it. So that's, that's the notification you give to them...to the user. So one one key thing I learned in the Summits was about automated testing. So I, I used to do my testings manually, which is which is very frustrating. Because it takes time and you cannot capture each and every corner of your script. It's, it's impossible to do hours of testing when you when you do a small change. So I started getting into the automated testing and it is it's very important to do unit testings. So, you know, testing refers to the testing of individual subroutines. So you have a function and you want to test it multiple times before you put it into into your overall system, so you can do these unit tests for individual modules, but it won't be enough. And I can show you a couple of examples of that. 
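Before those examples, and circling back to the crash collection just described, here is a minimal sketch of the capture-and-report wrapper, under a few stated assumptions: the worker script, the collection URL, and the field names are placeholders; the JSON string is built naively (a real script would escape quotes and newlines in the log); and the JSON() and Headers() arguments to HTTP Request are assumed to behave as in recent JMP versions.

// Contain a crash in the worker script, then report whatever hit the log.
logReturned = "";
Try(
	logReturned = Log Capture( Include( "my_tool_core.jsl" ) ),   // placeholder worker script
	logReturned = "Unhandled exception: " || Char( exception_msg )
);
If( Length( logReturned ) > 0,
	payload = "{\!"user\!":\!"" || Get Environment Variable( "USERNAME" ) ||
		"\!",\!"script\!":\!"my_tool_core.jsl\!",\!"log\!":\!"" || logReturned || "\!"}";
	req = New HTTP Request(
		URL( "https://logs.example.com/jsl-crash" ),              // placeholder endpoint
		Method( "POST" ),
		Headers( {"Content-Type: application/json"} ),
		JSON( payload )
	);
	req << Send;
	New Window( "Script error", << Modal,
		Text Box( "The script hit an error; a crash log was sent to the developer." ),
		Button Box( "OK" )
	);
);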
For example, NASA lost their Mars Climate Orbiter for a very strange...because of a very strange error because there were two software teams, one in Europe, one in the US. And one of them was working with the units of pounds and the one in the, yeah that was the one in the US, and the one in Europe was using Newtons. So they they certainly did their unit tests. But when they put together their code, it didn't work, because one was expecting Newtons and the other one was getting pounds. So they they literally lost the Orbiter, just because they they forgot to convert from pounds to Newtons. And more recently as Starliner, Boeing lost their Starliner and then they admitted that they could have caught the error if they had done a rigorous integration test. And at the end of day, what counts is the integration test because modules don't live alone in the script, they talk together. And this is a funny illustration of the problem here. So you can have two objects which are tested thoroughly. It's a window, right, what can go wrong with a window? But if you have two of them put together, they won't even open, so you have to do some integration tests to see if they are combined, if they work okay together. And I want to illustrate a quick point here that shows how I started doing automated testing. And this will refer to a particular case, which is a difficult case, to be fair, because this will be testing of a modal window and modal windows won't go away unless you click on click on a button. So you have to create a JMP script that clicks on a button, that kind of mimics a human behavio,r and what I'm doing here is to is to inject a tester into into the modal window here. And the nice thing about this approach, I think, is that you can still distribute this thing, this script to your users and the users won't even realize that there's actually some testing routine in it. So it has the hooks for a tester, but then you run this alone, what it does is it looks for this test mode parameter and it's not set. So if it's not set the script will set it to zero that that will disable all the hooks in the table. So you can give this to your users and they can run it as this. However, if you want to test your script, what you do is you build a tester code on top of it. So the tester code will set the test mode to one and it will also create the tester... tester object you want to inject into your script. For example, this particular case, what it does is, it's selecting a particular column and then it's assigning it into a role by clicking a button and then it's running the script, right. So that's literally mimicking a human behavior. So then it will load the actual script. So it's basically injecting those parameters into the script that you want to test. Now you're running the script automatically, so it will load the script and run it, and it will close the UI after clicking this button and that button. And then what you do is you run it again in a log capture functionality, so that you know if something goes wrong, you will be capturing the failure. And you also put it in the enter try so that overall, the script survives to report the log. And then if nothing was wrong, you will have empty log return and that means your script did not fail. If something has gone wrong, you will know it in the log capture. So that's the idea. So, We have the ability to monitor the scripts. We have the ability to capture logs of crashes. So we thought why, why don't we act on the crash log? 
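Before getting to acting on those crash logs, here is one way to sketch in JSL the dormant test hook the speaker just demonstrated. The distributed script checks a testMode flag; since users never set it, Is Empty() leaves the hooks off for them. The tester script sets the flag, defines the actions to inject, and then includes the very same script. Everything here is illustrative: the sample table, the button, and the XPath element names (as reported by << Get XML) are assumptions to check against your own display tree, the real tester would also assign column roles, and modal windows need the hook wired in before the window blocks, which is fiddlier than shown.

// ---- distributed script (my_tool.jsl), with a dormant test hook ----
If( Is Empty( testMode ), testMode = 0 );            // users never set this, so hooks stay off
Open( "$SAMPLE_DATA/Big Class.jmp" );                // sample table so the sketch runs standalone
win = New Window( "My Tool",
	Col List Box( All ),
	Button Box( "Run", Print( "running the analysis..." ) )
);
If( testMode == 1, Eval( testerActions ) );          // let an injected tester drive the UI

// ---- tester script, kept separate and never distributed ----
testMode = 1;
testerActions = Expr(
	btn = ( win << XPath( "//ButtonBox" ) )[1];      // element name per << Get XML; adjust if needed
	btn << Click;                                    // press Run exactly as a user would
);
failureLog = "";
Try(
	failureLog = Log Capture( Include( "my_tool.jsl" ) ),
	failureLog = Char( exception_msg )
);
Show( failureLog );                                  // empty means the automated run survived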
If you see a crash log that means something has gone wrong, and script that...the user already knows that because the script crashed obviously, and the developer knows that the script crashed because the logs are stored in a server. So they are there for you to act on. But the thing is, this is not a closed loop yet, because the user doesn't know that the developer knows. And you can close that loop by creating automated tickets. So since you have the information already, you can use a bug tracking software such as the JIRA Atlassian and then you can use the REST functionality to collect the information from the server, create a ticket, assign it to the developer, and you can also assign the user as a watcher. The watcher means that whenever the developer does anything about the bug and enters that information into the, into JIRA ,that will be looped back to the to the user so he or she will be notified and will know that somebody is working on the bug. And there are multiple ways of doing it, depending on the flavor of JIRA you have, and I am just giving you the basic code here. It's it's a curl code that you can create and you collect the metadata that you have from the crash logs and you embed that into a JSL file and then it will be transmitted through REST and it will go into the JIRA, which is...which will track it for you. And then JIRA will, if you set up properly, JIRA will notify it...JIRA will notify the developer and possibly the user for which...for whom the script crashed. And so this is all tied together and and the developer will know there's a bug that he or she needs to work on and the user will know that somebody will be working on the bug. Okay. I think this takes me to my closing notes. Just the takeaways you can get out of this presentation. Git has been the cornerstone of our system and it has it has enabled us to do lots of nice features and I didn't even mention the basic ones, which which are kind of obvious, because that gives you the ability to collaborate as well. So if you're multiple people working on the same script. You can use all the functionalities of the  
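For completeness, this is roughly what the automated ticket step looks like if you drive Jira straight from JSL rather than shelling out to curl. The Jira URL, project key, and service credentials are placeholders; POST to /rest/api/2/issue is Jira's standard issue-creation endpoint; and the Username, Password, and JSON arguments to HTTP Request are assumed to behave as in recent JMP versions. As before, the description is built naively from the captured log text, so a real script would escape it properly.

// Turn a collected crash log into a Jira bug ticket (sketch; identifiers are placeholders).
crashSummary = "JSL crash: my_tool_core.jsl";
crashBody    = "Captured " || Format( Today(), "yyyy-mm-dd" ) || ": " || logReturned;
payload = "{\!"fields\!":{" ||
	"\!"project\!":{\!"key\!":\!"JSL\!"}," ||
	"\!"summary\!":\!"" || crashSummary || "\!"," ||
	"\!"description\!":\!"" || crashBody || "\!"," ||
	"\!"issuetype\!":{\!"name\!":\!"Bug\!"}}}";
req = New HTTP Request(
	URL( "https://jira.example.com/rest/api/2/issue" ),
	Method( "POST" ),
	Headers( {"Content-Type: application/json"} ),
	JSON( payload ),
	Username( "svc-jsl-bot" ),        // placeholder service account
	Password( "********" )
);
response = req << Send;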
Steve Figard, Director of Cancer Research Lab, Bob Jones University   Teaching introductory biostatistics to sophomore level biology majors presents some interesting challenges, most notably, the inherent fear of statistics common to those who aren't really interested in the topic but are being forced to take the course.  Many such students are, at best, uncomfortable with, or worse, totally petrified of, this topic.  Some insights into how to go about this without causing panic attacks in such students will be shared.     Auto-generated transcript...   Speaker Transcript Steve OK, so this is teaching introductory biostatistics using JMP: overcoming the fear factor. My name is Steve Figard; I'm the director of the Cancer Research Lab at Bob Jones University. I have actually also published a book through SAS Press on biostatistics using JMP; Introduction to Biostatistics with JMP is, I guess, the official title. What we want to cover today is how to teach, or some thoughts on how to teach, introductory biostatistics to sophomore-level biology majors. Part of the reason this is such an interesting topic is that there is a problem we don't always realize, particularly those of us who like to go to Discovery conferences, and that's the fact that we are the outliers here: we enjoy statistics. We love doing data analytics, and we don't realize that that actually makes us sort of weird. I have a good friend here on faculty who is a medical doctor, and he once saw something of mine with "I love statistics" on it that I had gotten at a JMP Discovery conference. He looked at me and said, "You realize that you people are clinically insane." My response back to him was, "Yeah, but you know you need people like us." We sometimes forget that there are a lot of people who actually are scared of statistics. And biology majors in particular, particularly those who are starting off, identify very much with this quotation from a biostatistics text: let us consider the likely scenario that you are a student of the biosciences, whether you are a biomedic, a physiologist, a behavioral ecologist, or whatever. You like learning about living things; you enjoy learning about the human body, bugs, and plants. Now, lo and behold, you have been forced to take a course that will make you do things with numbers and, dread, oh dread, even do something with numbers using a computer. You have probably decided that the people who are making you do this are mindless sadists. So for those of us in this field, there's actually a large body of people out there who think we are, in fact, out of our minds, or at least clinically insane. And just to prove the point, if you don't think that's the case: I start off my course asking my students to provide a one-word description of how they feel, on day one, about taking my course. And this is the word cloud that I actually created using Text Explorer in JMP. You'll notice that the majority tend to be nervous or unsure or afraid and anxious, although we do have some that are excited or intrigued. I find it intriguing that anyone would use the word "intrigued" to describe a biostatistics course. So that's the problem before us; how do we deal with that? I want to share some insights that I've gained to date using JMP. First, don't require memorization, because in real life statistics is generally accomplished in consultation with others, using resources besides yourself; you don't have to rely upon your memory. And how can we go about doing that?
Well, we have plenty of in-class exercises and case studies that are open book and open teacher. This also lets us overcome the initial learning curve of learning how to use JMP, because these students have never used JMP before. So it's not just statistics that they're learning; they also have to learn the software package. By working open book and open teacher, being able to ask questions as they start off, they can learn these things not by memorizing them, but just from practice. I encourage collaboration amongst the students, so that they don't just rely upon me, the teacher, to learn the things they need for the exams they have to take; they are encouraged to collaborate with each other and help teach each other, because any teacher will tell you that you learn best whenever you actually have to teach someone else. So I encourage collaboration in class. I use take-home exams because, again, in real life you generally don't have one hour to solve a problem; you are told to go finish working a problem and report back as soon as possible, and you generally have more time to look at other resources and try to figure things out. So I use take-home exams. Of course, that does provide the potential for cheating on the exam, but generally, if you set up the exam properly, you can tell pretty easily whether or not someone is cheating. Another point is that I provide plenty of information and resources and flow charts and things, so that JMP itself is not an issue. I provide them tools so that, if they've figured out that they want to do a t test, they can find where in JMP you find the t test. So JMP itself is not the obstacle to learning the biostatistics, or learning how to use the tools, or how to interpret the results. The course actually puts more stress on how to interpret the results and actually achieve real-life solutions than it does on the mechanics of JMP. Another point I've found that seems to be helpful is to share your enthusiasm whenever you present or teach the course. Make sure that as you're standing in front of your students, they can see that you're having fun up there. You want to approach your class as a mentor wanting to help your students, not as the Grand Poobah and High Executioner of Statistics that they should bow down before. Don't be afraid to admit that you don't know something; of course, if it's something important that they need to know and you really should know, then you go look it up yourself and get back to them. You also want to show your enthusiasm by using examples that show the relevance of biostatistics to what the students actually need. In my first lecture of the course, I actually show them literature from the biomedical and biological side: abstracts, figure legends, and things that incorporate statistical terminology, because if you don't have any clue what that terminology means, you have no idea how to read that paper, no idea how to understand that paper. So you want to use examples that are relevant to the different areas of study that the students are actually interested in. And the last point I would say is to employ as much humor as you can muster in your interactions with your students; you want to be relaxed and interact with your students as much as possible.
Part of that is you want to begin your collection of memes now, so that whenever you actually show things... I mean, there's actually a fair amount of very humorous cartoons and things out there related to statistics and to some of its foibles, because statisticians in general are still human beings, and we still have our foibles and follies. There's more I could say, but we're limited because this is a poster, and I will welcome any further comments or thoughts in the future through further communication. And I encourage you, shameless plug here, to look up my Introduction to Biostatistics with JMP, so that you can potentially see whether or not it might be useful in your own efforts to understand statistics and/or to actually use it to teach your students.
Steve Figard, Director of Cancer Research Lab, Bob Jones University Evan Becker, student, Bob Jones University Luke Brown, student, Bob Jones University Emily Swager, student, Bob Jones University Rachel Westphal, student, Bob Jones University   Colorectal cancer is both the third most common and the third leading cause of deaths associated with cancer.  Previous studies in this lab have demonstrated the in vitro cytotoxic effect of almonds on human GI cancers.  An almond extract was prepared and processed using a pseudo-digestion procedure in order to mimic the effects the extract would have in a physiological system.  This extract demonstrated a dose response in vitro cytotoxicity to human gastric adenocarcinoma and was cytotoxic to a human colorectal adenocarcinoma, but had no effect on a healthy human colon epithelial cell line.  The extract was processed through filters with molecular weight cutoffs of 100,000 Da and 5,000 Da to estimate the size of any anticancer molecules, and it was found that the responsible molecules were less than 5,000 Da in molecular weight.  In addition, a polyphenol extract of the almonds was prepared and shown to have similar effects as the whole almond extract. It was concluded that the anticancer agents are likely polyphenols. Finally, four flavonoids commonly found in almonds were compared to the polyphenol extract, and they showed similar cytotoxicity.     Auto-generated transcript...   Speaker Transcript Evan Becker Hello, this is the 2019 BJU cancer research team. My name is Evan Becker, and along with me today are Luke Brown, Emily Swager, and Rachel Westphal; we are all under the direction of Dr. Stephen Figard in the Department of Biology at Bob Jones University. Our presentation is on the molecular weight characterization of an almond component cytotoxic to gastrointestinal cancer cell lines. All right, for the introduction. Colorectal cancer is currently of great concern in the medical community, as it is the third leading cause of cancer deaths for both men and women nationwide. Previous research from the BJU cancer lab has shown promise by demonstrating that almonds have a cytotoxic effect on LoVo colorectal cancer in vitro. A pseudo-digestion procedure for an almond extract was also used as a way to mimic how the extract would work in a physiological system. The same almond extract was shown to induce a dose response in the human gastric adenocarcinoma cancer cell line AGS. Also, this extract causes no negative effects in the human colon epithelial cell line CCD, indicating that the cytotoxic effects of almonds do not affect normal, healthy cells. Passing our almond extract through molecular weight cutoff filters of 100,000 Daltons and 5,000 Daltons, respectively, we were able to determine that the molecules present in the almonds inducing the cytotoxic effect must be smaller than 5,000 Daltons. As a result, polyphenols were determined to be a possible cause of the cytotoxic effect, and a polyphenol extraction was conducted on the almonds, with this treatment showing very similar cytotoxic effects on the cancer cell lines. I think we're ready. Rachel Westphal So the cell lines that we used included AGS, which is stomach cancer; LoVo, which is colon cancer; and CCD, which was our normal cell line. We used 5-fluorouracil as our positive control for cell death. And we used an in vitro pseudo digestion of the almonds to mimic physiological digestion.
We utilized the WST cell proliferation assay to determine absorbance with a plate reader and used that to calculate percent viability. JMP was then used for statistical analysis; we used ANOVA, the Tukey-Kramer HSD test, and the Wilcoxon nonparametric comparison. P values less than 0.05 were considered statistically significant. Can go to the next slide. Yeah. Luke Brown All right. The first set of tests I'd like to introduce you to are tests regarding the AGS human cancer cell line, which is a human gastric adenocarcinoma. Now, JMP was an important tool for us because it helped us to, first of all, determine the standard deviation in a few pilot studies we conducted. This then allowed us to use the software to run a power analysis to determine our sample size. Now, looking at some of the data we got here, first of all, you'll see that what Figure 1 and Figure 2 have in common is PBS, phosphate buffered saline, and 5-FU, or 5-fluorouracil, which is a well-documented and established treatment for cancers such as gastric and colorectal cancers. Now I'd like to direct your attention to Figure 1 here. Previous studies in our lab had already established that almond extract seems to have some sort of cytotoxic effect that's selective for these cancer lines. What we hoped to establish in Figure 1 is, first of all, the dosage effect. And second of all, we wanted to narrow down what molecular weight we'd be looking at to establish what compound or compounds are responsible for this effect. As you see, moving up from 12% almond extract all the way to 100% almond extract, we do see a dosage response. And we were able to establish this using the Tukey-Kramer HSD test inside JMP, to be able to establish that these groups are statistically significantly different. Now, you see that both the 100,000 molecular weight cutoff filter and the less-than-5,000 molecular weight cutoff filter are statistically the same as the 100% almond extract. This led us to believe that whatever compound or compounds are responsible for this effect are going to be relatively small, at less than 5,000 Daltons. Given this, we moved on to Figure 2. We had looked at some research showing that flavonoids have a similar effect in walnuts, so we tested four flavonoids commonly found in almonds: cyanidin, delphinidin, malvidin, and petunidin. As you can see, both the extract and all of these flavonoids were actually more effective than the 5-FU in this treatment. So given this information, I'm going to pass the next slide to Emily here to give you some more information about another set of data. Emily So to make sure that we weren't just looking at results particular to AGS, we also ran another cancer cell line called LoVo. LoVo is a colon cancer cell line. As you can see, we didn't do all of the extensive dosage treatments on this particular one, because we'd kind of shown that with AGS. What we were particularly looking at is: is this particular to a certain line? So for LoVo, you can see that we don't have as low a value for 5-FU.
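As a small aside on the statistics mentioned in the methods, here is a sketch of a one-way analysis of percent viability by treatment with the Tukey-Kramer HSD and Wilcoxon options turned on. The table and column names are made up, and the option names are written the way they appear in saved Oneway scripts in recent JMP versions, so treat them as assumptions to verify against a script saved from your own session.

// One-way comparison of viability across treatments (hypothetical table and columns).
dt = Open( "almond_viability.jmp" );
dt << Oneway(
	Y( :Viability ),
	X( :Treatment ),
	Means and Std Dev( 1 ),
	All Pairs Tukey HSD( 1 ),     // Tukey-Kramer HSD comparisons
	Wilcoxon Test( 1 )            // nonparametric check on the same groups
);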
Stanley Siranovich, Principal Analyst, Crucial Connection LLC   Much has been written in both the popular press and in the scientific journals about the safety of modern vaccination programs. To detect possible safety problems in U.S.-licensed vaccines, the CDC and the FDA have established the Vaccine Adverse Event Reporting System (VAERS). This database system now covers 20 years, with several data tables for each year. Moreover, these data tables must be joined to extract useful information from the data. Although a search and filter tool (WONDER) is provided for use with this data set, it is not well suited for modern data exploration and visualization. In this poster session, we will demonstrate how to use JMP Statistical Discovery Software to do Exploratory Data Analysis for the MMR vaccine over a single year using platforms such as Distribution, Tabulate, and Show Header Graphs. We will then show how to use JMP Scripting Language (jsl) to repeat, simply and easily, the analysis for additional years in the VAERS system.     Auto-generated transcript...   Speaker Transcript Stan Siranovich Good morning everyone. Today we're going to do a exploratory data analysis of the VAERS database. Now let's do a little background on what this database is. VAERS, spelled V-A-E-R-S, is an acronym for Vaccine Adverse Effect Reporting System. It was created by the FDA and the CDC. It gets about 30,000 updates per year and it's been public since 1990 so there's quite a bit of data on it. And it was designed as an early warning system to look for some effects of vaccines that have not previously been reported. Now these are adverse effects, not side effects, that is they haven't been linked to the vaccination yet. It's just something that happened after the vaccination. Now let's talk about the structure. VAERS VAX and VAERS DATA. Now there is a tool for examining the online database and it goes by the acronym of WONDER. And it is traditional search tool where you navigate the different areas of the database, select the type of data that you want, click the drop down, and after you do that a couple of times, or a couple of dozen times, what you do is send in the query. And without too much latency, get a result back. But for doing exploratory data analysis and some visualizations, there's a slight problem with that. And that is that you have to know what you want to get in the first place, or at least at the very good idea. So that's where JMP comes in. And as I mentioned, we're going to do an EDA and some visualization on on specific set of data, that is data for the MMR vaccine for measles, mumps, and rubella. And we're going to do for the most recent full year available, which will be 2019. So let me move to a new window. Okay, the first thing we did and which I omitted here was to download the CSVs and open them up in JMP. Now I want to select my data and JMP makes it very easy. After I get the window open, I simply go through rows, rows selection and select where and down here is a picture that I want the VAX_TYPE and I wanted it to equal MMR. Now there's some other options here besides equals, which we'll talk about in a second. And after we click the button, and we've selected those rows, the next thing we want to do is decide on which data that that we want. So I've highlighted some of the columns and in a minute or so you'll see why. And then when I do that, oh, before we go there, let's note row nine and row 18 right here. Notice we have MMRV and MMR. MMRV is a different vaccine. 
And if we wanted to look at that also, we could have selected contains here from the drop-down. But that's not what we wanted to do. So we click OK and we get our table. Now what we want to do is join that VAERS VAX table, which contains data about the vaccine, such as the manufacturer, the lot and so forth, with the VAERS DATA table, which contains data on the effects of the vaccine, so it's got things like whether or not the patient had allergies, whether or not the patient was hospitalized, number of hospital days, that sort of thing. And it also contains demographic data such as age and sex. So what we want to do is join, and we simply go to Tables, Join, and we select the VAERS VAX and VAERS DATA tables, and we want to join them on the VAERS ID. And again, JMP makes it pretty easy. We just click the column in each one of the separate tables, we put them here in the match window, and after that we go over to the table windows and we select the columns that we want. And this is what our results table looks like. Now let me reduce that and open up the JMP table. There we go, and I'll expand that. And for the purposes of this demonstration I just selected these columns here. We've got the VAERS ID, which is the identification obviously; the type, which is all MMR; and it looks like Merck is the manufacturer, with a couple of unknowns scattered through here. And I selected VAX LOT, because that would be important: if there's something the matter with one lot, you want to be able to see that. This looks like CAGE underscore YR, but that is calculated age in years. There are several age columns and I just selected one. And I selected sex because we'd like to know if males are more affected than females or vice versa. And HOSPDAYS is the number of days in the hospital if they had an adverse event that was severe enough to put them into the hospital. And NUMDAYS is the number of days between vaccination and the appearance of the adverse event, and it looks like we have quite a range right here. So let's get started on our analysis with Show Header Graphs. So I'm going to click on that and show the header graphs. And we get some distributions and some other information up here. We'll skip the ID and see that the VAX_TYPE is all MMR, you have no others there. And the vax manufacturer, yes, it's either Merck & Co. Inc. or unknown, and one nice feature about this is we can click on the bar and it will highlight the rows for us, and click away and it's unhighlighted. Moving on to VAX_LOT, we have quite a bit of information squeezed into this tiny little box here. First of all, we have the top five lots by occurrence in our data table, and here they are, and here's how many times they appear. And it also tells us that we have 413 other lots included in the table; plus five, by my calculation that's something like 418 individual lots. Now we go over to the calculated age in years, and we see most of our values are in the zero bin, which makes sense because it is a vaccination, and we'll just make a note of that. And we go over to the sex column, and it looks like we have significantly more females than males. Now, that tells us right away that if we want to do side-by-side group comparisons, we're going to have to randomly select from the females so that they equal the males, and we also have some unknowns here, quite a few unknowns. And we simply note that and move on. And we see HOSPDAYS. And we see NUMDAYS.
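For readers who would rather script these steps than click through the menus, here is a hedged JSL sketch of the subset-and-join just described. The table names are hypothetical placeholders; the column names (VAX_TYPE, VAERS_ID) follow the VAERS files, and the Join options are written from memory of JMP saved scripts, so treat the exact arguments as assumptions.

// Select the MMR rows and subset them into their own table
dtVax  = Data Table( "2019VAERSVAX" );    // hypothetical table name
dtData = Data Table( "2019VAERSDATA" );   // hypothetical table name
dtVax << Select Where( :VAX_TYPE == "MMR" );
dtMMR = dtVax << Subset( Selected Rows( 1 ) );

// Tables > Join on the VAERS ID key, keeping only matching rows
dtJoined = dtMMR << Join(
    With( dtData ),
    By Matching Columns( :VAERS_ID = :VAERS_ID ),
    Drop Multiples( 0, 0 ),
    Include Nonmatches( 0, 0 )
);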
Now here's another really nice feature. Let's say we'd like more details and we want to do a little bit of exploration to see how the age is distributed; we simply right-click and select Open in Distribution. And here we are in the Distribution window, with quite a bit of information here. For our purposes right now, we don't really need the quantiles. So let's click to close that, and it's still taking up some space, so let's go down here, close that outline, and for the orientation let's go with vertical. And we're left with a nice, easy-to-read window. It's got some information in there. We of course see our distribution down here, and we've got a box and whisker plot up here. There's not a whole lot of time to go into that; it just displays the data in a different way. And we see from our summary statistics that the mean happens to be 16.2, with a standard deviation of 20.6. Not an ideal situation. So if we want to do anything more with that, we may want to split the ages into two groups, where most of them are down here and all the skewed data is along the right, and examine those separately. And I will minimize that window, and we can do the same with hospital days and number of days. And let me just do that real quick. And here we see the same sorts of data, and I won't bother clicking through that and reducing it. But we might note also that again we have a mean of 6.7 and a standard deviation of 13.2; again, not a very ideal situation, and we simply make note of that. And I will close that. Now let's say we want to do a little bit more exploratory analysis, something caught our eye and all that. And that is simple to do here. We don't have to go back to the online database, and select through everything, click the drop-downs, or whatever. We can simply come up here to Analyze and Fit Y by X. So let's say that we would like to examine the relationship between, oh, hospital days, the number of days spent in the hospital, and calculated age in years. We simply do that. We have two continuous variables, so we're going to get a bivariate plot out of that. We click OK. And we get another nice display of the data. And yes, we can see that the mean is down around 5 or 6, which is a good thing, better than 10 or 12. We can, for purposes of reference, go up here to the red triangle and select Fit Mean, and we get the mean right here. And we notice there are quite a few outliers. Let's say we want to examine them right now and decide whether or not we want to delve into them a little bit further. So if we hover over one of our outlier points, or any of the points for that matter, we get this pop-up window and it tells us that this particular data point represents row 868. Calculated age is in the one-year bucket, and this patient happened to spend 90 days in the hospital. Now we could right-click and color this row or put some sort of marker in there. I won't bother doing that, but I will move the cursor over here into the window, and we see this little symbol up in the right-hand corner; click that and that pins it. So we can, of course, repeat that. And we can get the detail for further examination. I found this to be quite handy when giving presentations to groups of people, to call attention to one particular point. That's a little bit overbearing, so let's right-click and select...select Font, not Edit. And we get the font window come up, and see we're using 16 point font.
Let's, I don't know, let's go down to 9. And that's a little bit better, and it gives us more room if we'd like to call attention to some of the other outliers. So in summary, let me bring up the PowerPoint again. In summary, we were able to import and shape two large data tables from a large online government-maintained database. We were able to subset the tables, join the tables and select our output data, all seamlessly. And we were able to generate summaries and distributions, pointing out areas that may be of interest for more detailed analysis. And of course, that was all seamless and all occurred within the same software platform. Now, I supply some links right over here to the various data sites. This is the main site, which has all the documentation; the government did quite a good job there. And here is the actual data itself in the zip file.
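The abstract above also promises that the analysis can be repeated for additional years with JSL. A hedged sketch of that idea follows; the file paths and names are hypothetical placeholders, and the per-year steps simply reuse the subset-and-join pattern shown earlier, followed by the Distribution and Fit Y by X (Bivariate) reports from the demo.

// Repeat the MMR analysis for several VAERS years (paths and names are assumptions)
For( year = 2015, year <= 2019, year++,
    dtVax  = Open( "VAERS/" || Char( year ) || "VAERSVAX.csv" );
    dtData = Open( "VAERS/" || Char( year ) || "VAERSDATA.csv" );
    dtVax << Select Where( :VAX_TYPE == "MMR" );
    dtMMR = dtVax << Subset( Selected Rows( 1 ) );
    dt = dtMMR << Join( With( dtData ), By Matching Columns( :VAERS_ID = :VAERS_ID ) );
    // Same exploratory reports as in the single-year walkthrough
    dt << Distribution( Column( :CAGE_YR ), Column( :SEX ), Column( :HOSPDAYS ), Column( :NUMDAYS ) );
    dt << Bivariate( Y( :HOSPDAYS ), X( :CAGE_YR ), Fit Mean() );
);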
Ruth Hummel, JMP Academic Ambassador, SAS Rob Carver, Professor Emeritus, Stonehill College / Brandeis University   Statistics educators have long recognized the value of projects and case studies as a way to integrate the topics in a course. Whether introducing novice students to statistical reasoning or training employees in analytic techniques, it is valuable for students to learn that analysis occurs within the context of a larger process that should follow a predictable workflow. In this presentation, we’ll demonstrate the JMP Project tool to support each stage of an analysis of Airbnb listings data. Using Journals, Graph Builder, Query Builder and many other JMP tools within the JMP Project environment, students learn to document the process. The process looks like this: Ask a question. Specify the data needs and analysis plan. Get the data. Clean the data. Do the analysis. Tell your story. We do our students a great favor by teaching a reliable workflow, so that they begin to follow the logic of statistical thinking and develop good habits of mind. Without the workflow orientation, a statistics course looks like a series of unconnected and unmotivated techniques. When students adopt a project workflow perspective, the pieces come together in an exciting way.       Auto-generated transcript...   Speaker Transcript So welcome everyone. My name is ... Ambassador with JMP. I am now a retired professor of Business ... between a student and a professor working on a project. ... engage students in statistical reasoning, teach that ... to that, current thinking is that students should be learning about reproducible workflows, ... elementary data management. And, again, viewing statistics as ... wanted to join you today on this virtual call. Thanks for having ... and specifically in Manhattan, and you'd asked us so so you ... And we chose to do the Airbnb renter perspective. So we're ... expensive. So we started filling out...you gave us ... separate issue, from your main focus of finding a place in ... you get...if you get through the first three questions, you've ... know, is there a part of Manhattan, you're interested in? ... repository that you sent us to. And we downloaded the really ... thing we found, there were like four columns in this data set ... figured out so that was this one, the host neighborhood. So ... figured out that the first two just have tons of little tiny ... Manhattan. So we selected Manhattan. And then when we had ... that and then that's how we got our Manhattan listings. So ... data is that you run into these issues like why are there four ... restricted it to Manhattan, I'll go back and clean up some ... data will describe everything we did to get the data, we'll talk ... know I'm supposed to combine them based on zip, the zip code, ... columns, it's just hard to find the ... them, so we knew we had to clean that up. All right, we also had ... journal of notes. In order to clean this up, we use the recode ... Exactly. Cool. Okay, so we we did the cleanup ... Manhattan tax data has this zip code. So I have this zip code ... day of class, when we talked about data types.
And notice in the ... the...analyze the distribution of that column, it'll make a funny ... Manhattan doesn't really tell you a thing. But the zip code clean data in ... just a label, an identifier, and more to the point, when you want to join or merge ... important. It's not just an abstract idea. You can't merge ... nominal was the modeling type, we just made sure. ... about the main table is the listings. I want to keep ... to combine it with Manhattan tax data. Yeah. Then what? Then we need to ... tell it that the column called zip clean, zip code clean... Almost. There we go. And the column called zip, which ... Airbnb listing and match it up with anything in ... them in table every row, whether it matches with the other or ... main table, and then only the stuff that overlaps from the second ... another name like, Air BnB IRS or something? Yeah, it's a lot ... do one more thing because I noticed these are just data tables scattered around ... running. Okay. So I'll save this data table. Now what? And really, this is the data ... anything else, before we lose track of where we are, let's ... or Oak Team? And then part of the idea of a project ... thing. So if you grab, I would say, take the ... two original data sets, and then my final merged. Okay Now ... them as tabs. And as you generate graphs and ... even when I have it in these tabs. Okay, that's really cool. ... right, go Oak Team. Well, hi, Dr. Carver, thanks so ... you would just glance at some of these things, and let me know if ... we used Graph Builder to look at the price per neighborhood. And ... help it be a little easier to compare between them. So we kind ... have a lot of experience with New York City. So we plotted ... stand in front of the UN and take a picture with all the ... saying in Gramercy Park or Murray Hill. If we look back at the ... thought we should expand our search beyond that neighborhood to ... just plotted what the averages were for the neighborhoods but ... the modeling, and to model the prediction. So if we could put ... expected price. We started building a model and what we've ... factors. And so then when we put those factors into just a ... more, some of the fit statistics you've told us about in class. ... but mostly it's a cloud around that residual zero line. So ... which was way bigger than any of our other models. So we know ... reasons we use real data. Sometimes, this is real. This is ... looking? Like this is residual values. ... is good. Ah, cool. Cool. Okay, so I'll look for ... is sort of how we're answering our few important questions. And ... was really difficult to clean the data and to join the data. ... wanted to demonstrate how JMP in combination with a real world ... Number one in a real project, scoping is important. We want to ... hope to bring to the to the group.
Pitfall number two, it's vital to explore the ... the area of linking data combining data from multiple ... recoding and making sure that linkable ... reproducible research is vital, especially in a team context, especially for projects that may ... habits of guaranteeing reproducibility. And finally, we hope you notice that in these ... on the computation and interpretation falls by the ...
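One of the pitfalls Ruth and Rob highlight above is the zip-code join: the key columns have to be cleaned (recoded) and given matching character/nominal types before the Airbnb listings can be merged with the Manhattan tax data. A hedged JSL sketch of that step follows; the table and column names are hypothetical placeholders for the ones used in the class project.

// Make sure both zip columns are character/nominal so the join keys match exactly
dtListings = Data Table( "Manhattan Listings" );   // assumed table name
dtTax      = Data Table( "Manhattan Tax Data" );   // assumed table name
Column( dtListings, "zip_clean" ) << Data Type( Character ) << Modeling Type( "Nominal" );
Column( dtTax, "zip" ) << Data Type( Character ) << Modeling Type( "Nominal" );

// Tables > Join: keep every listing row, and only the tax rows that match
dtMerged = dtListings << Join(
    With( dtTax ),
    By Matching Columns( :zip_clean = :zip ),
    Include Nonmatches( 1, 0 )
);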
Nascif Neto, Principal Software Developer, SAS Institute (JMP Division) Lisa Grossman, Associate Test Engineer, SAS Institute (JMP Division)   The JMP Hover Label extensions introduced in JMP 15 go beyond traditional details-on-demand functionality to enable exciting new possibilities. Until now, hover labels exposed a limited set of information derived from the current graph and the underlying visual element, with limited customization available through the use of label column properties. This presentation shows how the new extensions let users implement not only full hover label content customization but also new exploratory patterns and integration workflows. We will explore the high-level commands that support the effortless visual augmentation of hover labels by means of dynamic data visualization thumbnails, providing the starting point for exploratory workflows known as data drilling or drill down. We will then look into the underlying low-level infrastructure that allows power users to control and refine these new workflows using JMP Scripting Language extension points. We will see examples of "drill out" integrations with external systems as well as how to build an add-in that displays multiple images in a single hover label.     Auto-generated transcript...   Speaker Transcript Nascif Abousalh-Neto Hello and welcome. This is our JMP Discovery presentation, from details on demand to wandering workflows: getting to know JMP hover label extensions. Before we start on the gory details, we always like to talk about the purpose of a new feature introduced in JMP. So in this case, we're talking about hover label extensions. And why do we even have hover labels in the first place? Well, I always like to go back to the visual information-seeking mantra from Ben Shneiderman, in which he tried to synthesize: overview first, zoom and filter, and then details on demand. Well, hover labels are all about details on demand. So let's say I'm looking at this bar chart on this new data set, and in JMP, up to JMP 14, as you hover over a particular bar in your bar chart, it's going to pop up a window with a little bit of textual data about what you're seeing here. Right. So you have labeled information, calculated values, just text, very simple. It gives you your details on demand. But what if you could decorate this with visualizations as well? So for example, if you're looking at that aggregated value, you might want to see the distribution of the values that produced that particular calculation. Or you might want to see a breakdown of the values behind that aggregated value. This is what we're going to let you do with this new feature. But on top of that, it's the famous "wait, there is more." This new visualization basically allows you to go on and start a visual exploratory workflow. If you click on it, you can open it up in its own window, which can also have its own visualizations, which you can also click to get even more detail. And so you go down, with a technique called drill down, and eventually you might get to a point where you're decorating a particular observation with information you're getting from, in that case, maybe even Wikipedia. I'm not going to go into a lot of details; we're going to learn a lot about all that pretty soon. But first, I also wanted to talk a little bit about the design decisions behind the implementation of this feature.
Because we wanted to have something that was very easy to use, that didn't require programming or, you know, lots of time reading the manual, and we knew that would satisfy 80% of the use cases. But for those 20% of really advanced use cases, or for those customers that know their JSL and just want to push the envelope on what JMP can do, we also wanted to make available something that you could do through programming, on top of the context of ??? on those visual elements. So we decided to go with an architectural pattern called plumbing and porcelain, and that's something we got from the Git source code control application. Basically you have a layer that is very rich and, because it's very rich, very complex, which gives you access to all that information and allows you to customize things that are going to happen as far as generating the visualization, or what happens when you click on that visualization. And on top of that, we built a layer that is more limited, it's purpose driven, but it's very, very easy to do and requires no coding at all. So that's the porcelain layer. And that's the one that Lisa is going to be talking about now. Over to you, Lisa. I'm going to stop sharing and Lisa is going to take over. Lisa Grossman Okay, so we are going to take a high-level look at some of the features and what kind of customization you can make to the graphic ??? So, let us first go through some of the basics. So by default, when you hover over a data point or an element in your graph, you see information displayed for the X and Y roles used in the graph, as well as any drop zone roles such as overlay, and if you choose to manually label a column in the data table, that will also appear as a hover label. So here we have an example of a labeled expression column that contains an image. And so we can see that image is then populated in the hover label in the back. And to add a graphlet to your hover label, you have the option of selecting some predefined graphlet presets, which you can access via the right mouse menu under Hover Label. Now these presets have dynamic graph role assignments and derive their roles from variables used in your graph. And presets are also preconfigured to be recursive, and that will support drilling down. And for preset graphlets that have categorical columns, you can specify which columns to filter by, by using the Next in Hierarchy column property that's in your data table. And so now I'm going to demo real quick how to make a graphlet preset. So I'm going to bring up our penguins data table that we're going to be using. And I'm going to open up Graph Builder. And I'm going to make a bar chart here. And then right-clicking under Hover Label, you can see that there is a list of different presets to choose from, but we're going to use histogram for this example. So now that we have set our preset, if you hover over a bar, you can see that there's a histogram preset that pops up in your hover label. And it is also filtered based on our bar here, which is the island Biscoe. And the great thing about graphlets is if I hover over this bar, I can see another graphlet. And so now you can easily compare these two graphlets to see the distribution of bill lengths for both the islands Dream and Biscoe.
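As the rest of the talk explains, each of these thumbnails is essentially a graph template combined with a local data filter for the hovered value. As a rough, hedged illustration of what the launched Biscoe histogram graphlet corresponds to (this is not the platform's internal script, and the penguin column names are assumptions), the JSL would look something like:

// A graph template plus a local data filter for the hovered category
Graph Builder(
    Variables( X( :bill length ) ),          // assumed measurement column
    Elements( Histogram( X ) ),
    Local Data Filter(
        Add Filter( columns( :island ), Where( :island == "Biscoe" ) )
    )
);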
And then you can take it a step further and click on the thumbnail of the graphlet and it will launch a Graph Builder instance in its own window and it's totally interactive so you can open up the control panel of Graph Builder and and customize this graph further. And then as you can see, there's a local data filter already applied to this graph, and it is filtered by Biscoe, which is the thumbnail I launched. So, that is how the graphlets are filtered by. And then one last thing is that if I hover over these these histogram bars, you can see that the histogram graphlet continues on, so that shows how these graphlet presets are pre configured to be recursive. So closing these and returning back to our PowerPoint. So I only showed the example of the histogram preset but there are a number that you can go and play with. So these graphlet presets help us answer the question of what is behind an aggregated visual element. So the scatter plot preset shows you the exact values, whereas the histogram, box plot or heat map presets will show you a distribution of your values. And if you wanted to break down your graph and look at your graph with another category, then you might be interested in using a bar, pie, tree map, or a line preset. And if you'd like to examine your raw data of the table, then you can use the tabulate preset. But if you'd like to further customize your graphlet, you do have the option to do so with paste graphlets. And so paste graphlet, you can easily achieve with three easy steps. So you would first build a graph that you want to use as a graphlet. And we do want to note here that it does not have to be one built from Graph Builder. And then from the little red triangle menu, you can save the script of the graph to your clipboard. And then returning to your base graph or top graph, you can right click and under hover label, there will be a paste graphlet option. And that's really all there is to it. And we want to also note that paste graphlet will have static role assignments and will not be recursive since you are creating these graph lets to drill down one level at a time. But if you'd like to create a visualization with multiple drill downs, then you can, you have the option to do so by nesting paste graphlet operations together, starting from the bottom layer going up to your top or base later. So, and this is what we would consider our Russian doll example, and I can demo how you can achieve that. So we'll pull up our penguins data table again. And we'll start with the Graph Builder and we'll we're going to start building our very top layer for this. So let's go ahead build that bar chart. And then let's go on to build our very second...our second layer. So let's do a pie with species. And then for our very last layer, let's do a scatter plot. OK, so now I have all three layers of our...of what we will use to nest and so I will go and save the script of the scatter plot to my clipboard. And then on the pie, I right click and paste graphlet. And so now when you hover, you can see that the scatter plot is in there and it is filtered by the species in this pie. So I'm going to close this just for clarity and now we can go ahead and do the same thing to the pie, save the script, because it already has the scatter plot embedded. So save that to our clipboard, go over to our bar, do the same thing to paste graphlet. And now we have... we have a workflow that is... 
that you can click and hover over and you can see all three layers that pop up when you're hovering over this bar. So that's how you would do your nested paste graphlets. And so we do want to point out that there are some JMP analytical platforms that already have pre integrated graphlets available. So these platforms include the functional data explorer, process screening, principal components, and multivariate control charts, and process capabilities. And we want to go ahead and quickly show you an example using the principal components. Lost my mouse. There we go. So I launch our table again and open up principal components. And let's do run this analysis. And if I open up the outlier analysis and hover over one of these points, boom, I can see that these graphlets are already embedded into this platform. So we highly suggest that you go and take a look at these platforms and play around with it and see what you like. And so that was a brief overview of some quick customizations you can do with hover label graphlets and I'm going to pass this presentation back to Nascif so he can move you through the plumbing that goes behind all of these features. Nascif Abousalh-Neto Thank you, Lisa. Okay, let's go back to my screen here. And we... I think I'll just go very quickly over her slides and we're back to plumbing, and, oh my god, what is that? This is the ugly stuff that's under the sink. But that's where you have all the tubing and you can make things really rock, and let me show them by giving a quick demo as well. So here Lisa was showing you the the histogram... the hover label presets that you have available, but you can also click here and launch the hover label editor and this is the guy where you have access to your JSL extension points, which is where you make, which is how those visualizations are created. Basically what happens is that when you hover over, JMP is gone to evaluate the JSL block and capture that as an in a thumbnail and put that thumbnail inside your hover label. That's pretty much, in a nutshell, how it goes. And the presets that you also have available here in the hover label, right, they basically are called generators. So if I click here on my preset and I go all the way down, you can see that it's generating the Graph Builder using the histogram element. That's how it does its trick. Click is a script that is gonna react to when you click on that thumbnail, but by default (and usually people stick with the default), if you don't have anything here, it's just, just gonna launch this on its own window, instead of capturing and scale down a little image. In here on the left you can see two other extension points we haven't really talked much about yet. But we will very soon. So I don't want to get ahead of myself. So, So let's talk about those extension points. So we created not just one but three extension points in JMP 15. And they are, they're going to allow you to edit and do different functionality to different areas of your hover label. So textlets, right, so let's say for example you wanted to give a presentation after you do your analysis, but you want to use the result of that analysis and present it to an executive in your company or maybe we've an end customer that wants a little bit more of detail in in a way that they can read, but you would like make that more distinct. So textlet allows you to do that. 
But since you're interfacing with data, you also want that to be not a fixed block of text, but something that's dynamic that's based on the data you're hovering over. So to define a textlet, you go back to that hover label editor and you can define JSL variables or not. But if you want it to be dynamic, typically, what you do is you define a variable that's going to have the content that you want to display. And then you're going to decorate that value using HTML notation. So, here is how you can select the font, you can select background colors, foreground colors, you can make it italic, and basically make it as pretty or rich of text as you as you need to. Then the next hover labelextension is the one we call gridlet. And if you remember the original or the current JMP hover label, it's basically a grid of name value pairs. To the left, you have names of your...that would be the equivalent to your column name, and to the right, you have the values which might be just a column cell for a particular row if it's a marked plot. But if it's aggregated like a bar chart, this is going to be a mean or an average medium, something like that. The default content from here, like Lisa said before, is derived at both from the...originally is derived both from whatever labeled columns you have in your data table and also, whatever role assignments you have in your graph. So if it's a bar chart, you have your x, you have your y. You might have an overlay variable and everything that in at some point contributes to the creation of that visual element. Well with gridlets you can now have pretty much total control of that little display. You can remove entries. It's very common that sometimes people don't want to see the very first row, which has the labeles or the number of rows. Some people find that redundant. They can take it out. You can add something that is completely under your control. Basically it's going to evaluate the JSL script to figure out what you want to display there. One use case I found was when someone wanted an aggregated value for a column that was not individualization. Some people call those things hidden columns or hidden calculations. Now you can do that, right, and have an aggregation for the same rows that the rest of that that are being displayed on that visualization. You can rename. We usually add the summary statistic to the left of anything that comes from a y calculated column. If you don't like that, now you can remove it or replace it with something else. And as well...and then you can do details like changing the numeric precision or make text bold or italics or red or... even for example, you can make it red and bold, if the value is above a particular threshold. So you can have something that, as I move over here, if the value is over the average of my data I make it red and bold so I can call attention to that. And that will be automatic for you. And finally, graphlets. We believe that's going to be the most useful and used one. Certainly don't want that to cause more attention because you have a whole image inside your tool tip and we've been seeing examples with data visualizations, but it's an image. So it can be a picture as well. It can be something you're downloading from the internet on the fly by making a web call. That's how I got the image of this little penguin. It's coming straight from Wikipedia. As you hover over, we download it, scale it and and put it here. 
Or you can, for example, that's a very recent use case, someone had a database of pictures in the laboratory and they have pictures of the samples they were analyzing and they didn't want to put them on the data table because the data table would be too large. Well, now you can just get a column, turn that column into a file name, read from the file name, and boom, display that inside your tool tip. So when you're doing your analysis, you know, exactly, exactly what you're looking at. And just like graph...gridlets, we're talking about clickable content. So again, for example, if I wanted and I showed that when I click on this little thumbnail here, I can open a web page. So you can imagine that even as a way to integrate back with your company. Let's say you have web services that they're supported in your company, and you want to, at some point, maybe click on an image to make a call to kind of register or capture some data. Go talking for a web call to that web service. Now that's something you can do. So I like to call, we talk about drill in and drill down, that would be a drill out. That's basically JMP talking to the outside world using data content from your exploration. So let's look at those things in the little bit more detail. So those those visualizations that we see here inside the hover label, they are basically... that's applied to any visualization. Actually it's a combination of a graph destination and the data subset. So in the Graph Builder, for example, you'll say, I want the bar chart of islands by on my x axis and on my y axis, I want to show the average of the body mass of the penguins on that island. Fine. How do you translate that to a graphlet, right? Well, basically when you select the preset or when you write in your code if you want to do it, but the preset is going to is going to use our graph template. So basically, some of the things are going to be predefined like that. The bar element, although if you're writing it your own, you could even say I want to change my visualization depending on my context. That's totally possible. And you're going to fill that template with a graph roles and values and table data, table metadata. So, for example, let's say I have a preset of doing that categorical drill down. I know it's going to be a bar chart. I don't know what a bar chart is going to be, what's going to be on my y or my x axis. That's going to come from the current state of my baseline graph, for example, I'm looking at island. So I know I want to do a bar chart of another category. So that's when the next in hierarchy and the next column comes into play. I'm making that decision on the fly, based on the information that user is giving me and the graph that's being used. For example, if you look here at the histogram, it was a bar chart of island by body mass. This is a histogram of body mass as well. If I come here to the graph and change this column and then I go back and hover, this guy is going to reflect my new choice. That's this idea of getting my context and having a dynamic graph. The other part of the definition of visualization is the data subset. And we have a very similar pattern, right. We have...LDF is local data filter. So that's a feature that we already had in JMP, of course, right. And basically, I have a template that is filled out from my graph roles here. It's like if it was a bar chart, which means my x variable is going to be a grouping variable of island. 
I know I wanted to have a local data filter of island and that I want to select this particular value so that it matches the value I was hovering over. This happens both when you're creating the hover label and when you're launching the hover label, but when you create a hover label, this is invisible. We basically create a hidden window to capture that window so you'll never see that guy. But when you launch it, the local data filter is there and as Lisa has shown, you can interact with it and even make changes to that so that you can progress your your, your visual exploration on your own terms. So I've been talking about context, a lot. This is actually something that you should need to develop your own graphlets, you need to be familiar with. We call that hover label execution context. You're going to have information about that in our documentation and it's basically if you remember JSL, it's a local block. We've lots of local variables that we defined for you and those those variables capture all kinds of information that might be useful for someone to find in the graphlet or a gridlet or a textlet. It's available for all of those extension points. So typically, they're going to be variables that start with a nonpercent... Not a nonpercent...I'm sorry. To prevent collisions with your data table column names, so it's kinda like reserved names in a way. But basically, you'll see here that that's that's code that comes from one of our precepts. By the way, that code is available to you through the hover label editor, so you can study and see how it goes. Here we're trying to find a new column. To using our new graph, it's that idea of it being dynamic and to be reactive to the context. And this function is going to look into the data table for that metadata. My...a list of measurement columns. So if the baseline is looking at body mass, body mass is going to be here in this value and at a list of my groupings. So if it was a bar chart of island by body mass, we're going to have islands here. So those are lists of column names. And then we also have any of numeric values, anything that's calculated is going to be available to you. Maybe you want to, like I said, maybe you want to make a logical decision based on the value being above or below the threshold so that you can color a particular line red or make it bold, right. You're going to use values that we provide to you. We also provide something that allow you to go back to the data. In fact, to the data table and fetch data by yourself like the row index of the first row on the list of roles that your visual element discovering, that's available to you as well. And then the other even more data, like for example the where clause that corresponds to that local data filter that you're executing in the context of. And the drill depth, let's say, that allows you to keep track of how many times you have gone on that thumbnail and open a new visualization and so on. So for example, when we're talking about recursive visualizations, every recursion needs an exit condition, right. So here, for example, is how you calculate the exit condition of one of your presets. If I don't have anything more to to show, I return empty, means no visualization. Or if I don't have...if I only show you one value, right, or any of my drill depth is greater than one, meaning I was drilling until I got to a point where just only one value to show in some visualizations doesn't make sense. So I can return empty as well. 
That's just an example of the kinds of decisions that you can make your code using the hover label execution context. Now, I just wanted to kind of gives you a visual representation of how all those things come together again using the preset example. When you're selecting a preset, you're basically selecting the graph template, which is going to have roles that are going to be fulfilled from the graph roles that are in your hover label execution context. And so that's your data, your graph definition. And that date graph definition is going to be combined with the subset of observations resulting from the, the local data filter that was also created for you behind the scenes, based on the visual element you're hovering over. So when you put those things together, you have a hover label, we have a graphlet inside. And if you click on that graphlet, it launches that same definition in here and it makes the, the local data filter feasible as well. When, like Lisa was saying, this is a fully featured life visualization, not just an image, you can make changes to this guy to continue your exploration. So now we're talking, you should think in terms of, okay, now I have a feature that creates visualizations for me and allow me to create one visualization from another. I'm basically creating a visual workflow. And it's kind of like I have a Google Assistant or an Alexa in JMP, in the sense that I can...JMP is making me go faster by creating, doing visualizations on my behalf. And they might be, also they might be not, just an exploration, right. If you're happy with them, they just keep going. If you're not happy with them, you have two choices and maybe it's easier if I just show it to you. So like I was saying, I come here, I select a preset. Let's say I'm going to get a categoric one bar chart. So that gives me a breakdown on the next level. Right. And if I'm happy with that, that's great. Maybe I can launch this guy. Maybe I can learn to, whoops... Maybe I can launch another one for this feature. At the pie charts, they're more colorful. I think they look better in that particular case. But see, now I can even do things like comparing those two bar charts side by side. And let's...but let's say that if I keep doing that and it isn't a busy chart and I keep creating visualizations, I might end up with lots of windows, right. So that's why we created some modifiers to...(you're not supposed to do that, my friend.) You can just click. That's the default action, it will just open another window. If you alt-click, it launches on the previous last window. And if you control-click it launches in place. What do I mean by that? So, I open this window and I launched to this this graphlet and then I launched to this graphlet. So let's say this is Dream and Biscoe and Dream and Biscoe. Now I want to look at Torgersen as well. Right. And I want to open it. But if I just click it opens on its own window. If I alt-click, (Oh, because that's the last one. I hope. I'm sorry. So let me close this one.) Now if I go back here in I alt-click on this guy. See, it replaced the content of the last window I had open. So this way I can still compare with visualizations, which I think it's a very important scenario. It's a very important usage of this kind of visual workflow. Right. But I can kind of keep things under control. And I don't just have to keep opening window after window. And the maximum, the real top window management feature is if I do a control-click because it replaces the window. 
And then, then it's a really a real drill down. I'm just going on the same window down and down and now it's like okay, but what if I want to come back. Or if you want to come back and just undo. So you can explore with no fear, not going to lose anything. Even better though, even the windows you launch, they have the baseline graph built in on the bottom of the undo stack. So I can come here and do an undo and I go back to the visualizations that were here before. So I can drill down, come back, branch, you can do all kinds of stuff. And let's remember, that was just with one preset. Let's do something kind of crazy here. We've been talking, we've been looking at very simple visualizations. But this whole idea actually works for pretty much any platform in JMP. So let's say I want to do a fit of x by y. And I want to figure out how...now, I'm starting to do real analytics. How those guys fit within the selection of the species. Right. So I have this nice graph here. So I'm going to do that paste graphlet trick and save it to the clipboard. And I'm going to paste it to the graphlet now. So as you can see, we can use that same idea of creating a context and apply that to my, to my analysis as well. And again, I can click on those guys here and it's going to launch the platform. As long as the platform supports local data filters, (I should have given this ???), this approach works as well. So it's for visualizations but in...since in JMP, we have this spectrum where the analytics also have a visual component, so works with our analytics as well. And I also wanted to show here on that drill down. This is my ??? script. So I have the drill down with presets all the way, and I just wanted to go to the the bottom one where I had the one that I decorated with this little cute penguin. But what I wanted to show you is actually back on the hover label editor. Basically what I'm doing here, I'm reading a small JSL library that I created. I'm going to talk about that soon, right, and now I can use this logic to go and fetch visualizations. In this case I'm fetching it from Wikipedia using a web call. And that visualization comes in and is displayed on my visualization. It's a model dialogue. But also my click script is a little bit different. It's not just launching the guy; it's making a call to this web functionality after getting a URL, using that same library as well. So what exactly is it going to do? So when I click on the guy, it opens a web page with a URL derived from data from my visualization and this can be pretty much anything JSL can do. I just want to give us an example of how this also enables you integration with other systems, even outside of JMP. Maybe I want to start a new process. I don't know. All kinds of possibilities. That I apologize. So So there are two customized...advanced customization examples, I should say, that illustrate how you can use graphlets as a an extensible framework. They're both on the JMP Community, you can click here if you get the slides, but one is called the label viewer. I am sorry. And basically what it does is that when you hover over a particular aggregated graph, it finds all the images on the graph...on the data table associated with those rows and creates one image. And that's something customers have asked for a while. I don't want to see just one guy. I want to see if you have more of them, all of them. Or, if possible, right. So when you actually use this extension, and you click on...actually no, I don't have it installed so... 
And the wiki reader, which was the other one, is the one I just showed to you. But what I was saying is that when you click and launch this particular...on this particular image, it launches a small application that allows you to page through the different images in your data table, and you have a filter that you can control and all that. This is one that was completely done in JSL on top of this framework. So just to close up, what did we learn today? I hope that you found that it's now very easy to add visualizations; you can visualize your visualizations, if you will. It's very easy to add those data visualization extensions using the porcelain features. You actually have not just richer detail on your thumbnails, but you have a new exploratory visual workflow, which you can customize to meet your needs by using either paste graphlet, if you want to have something easy to do, or even JSL, using the hover label editor. We're both very curious to see how you guys are going to use that in the field. So if you come up with some interesting examples, please call us back. Send us a screenshot in the JMP Community and let us know. That's all we have today. Thank you very much. And when we give this presentation, we're going to be here for Q&A. So, thank you.
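For readers who want to experiment with the kind of "drill out" click script described in this talk (opening a web page built from a hovered value), here is a minimal, hedged JSL sketch. The variable holding the hovered category is a hypothetical placeholder; in a real graphlet or click script that value would come from the hover label execution context documented with JMP 15.

// Minimal "drill out" sketch: build a URL from a hovered value and open it
hovered_species = "Adelie";   // placeholder for the value taken from the hover label context
url = "https://en.wikipedia.org/wiki/" || hovered_species || "_penguin";
Web( url );                   // opens the page in the default browser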
Jeremy Ash, JMP Analytics Software Tester, JMP   The Model Driven Multivariate Control Chart (MDMVCC) platform enables users to build control charts based on PCA or PLS models. These can be used for fault detection and diagnosis of high dimensional data sets. We demonstrate MDMVCC monitoring of a PLS model using the simulation of a real world industrial chemical process — the Tennessee Eastman Process. During the simulation, quality and process variables are measured as a chemical reactor produces liquid products from gaseous reactants. We demonstrate fault diagnosis in an offline setting. This often involves switching between multivariate control charts, univariate control charts, and diagnostic plots. MDMVCC provides a user-friendly way to move between these plots. Next, we demonstrate how MDMVCC can perform online monitoring by connecting JMP to an external database. Measuring product quality variables often involves a time delay before measurements are available, which can delay fault detection substantially. When MDMVCC monitors a PLS model, the variation of product quality variables is monitored as a function of process variables. Since process variables are often more readily available, this can aid in the early detection of faults. Example Files Download and extract streaming_example.zip.  There is a README file with some additional setup instructions that you will need to perform before following along with the example in the video.  There are also additional fault diagnosis examples provided. Message me on the community if you find any issues or have any questions.       Auto-generated transcript...   Speaker Transcript Jeremy Ash Hello, I'm Jeremy Ash. I'm a statistician in JMP R&D. My job primarily consists of testing the multivariate statistics platforms in JMP, but I also help research and evaluate methodology, and today I'm going to be analyzing the Tennessee Eastman process using some statistical process control methods in JMP. I'm going to be paying particular attention to the Model Driven Multivariate Control Chart platform, which is a new addition to JMP. I'm really excited about this platform, and these data provide a nice opportunity to showcase some of its features. First, I'm assuming some knowledge of statistical process control in this talk. The main thing you need to know about is control charts. If you're not familiar with these, they are charts used to monitor complex industrial systems to determine when they deviate from normal operating conditions. I'm not going to have much time to go into the methodology in Model Driven Multivariate Control Chart, so I'll refer you to these other great talks, which are freely available, for more details. I should also mention that Jianfeng Ding was the primary developer of Model Driven Multivariate Control Chart, in collaboration with Chris Gotwalt, and Tanya Malden and I were testers. So the focus of this talk will be using multivariate control charts to monitor a real-world chemical process. Another novel aspect of this talk will be using control charts for online process monitoring; this means we'll be monitoring data continuously as it's added to a database and detecting faults in real time. So I'm going to start with the obligatory slide on the advantages of multivariate control charts. So why not use univariate control charts? There are a number of excellent options in JMP. Univariate control charts are excellent tools for analyzing a few variables at a time.
However, quality control data sets are often high dimensional, and the number of charts that you need to look at can quickly become overwhelming. So multivariate control charts summarize a high-dimensional process in just a few charts, and that's a key advantage. But that's not to say that univariate control charts aren't useful in this setting; you'll see throughout the talk that fault diagnosis often involves switching between multivariate and univariate control charts. Multivariate control charts give you a sense of the overall health of a process, while univariate control charts allow you to look at specific aspects. And so the information is complementary, and one of the main goals of Model Driven Multivariate Control Chart was to provide some tools that make it easy to switch between those two types of charts. One disadvantage of univariate control charts is that observations can appear to be in control when they're actually out of control in the multivariate sense. So here I have two IR charts for oil and density, and these two observations in red are in control, but oil and density are highly correlated, and these observations are outliers in the multivariate sense; in particular, observation 51 severely violates the correlation structure. So multivariate control charts can pick up on these types of outliers when univariate control charts can't. Model Driven Multivariate Control Chart uses projection methods to construct its control charts. I'm going to start by explaining PCA because it's easy to build up from there. PCA reduces the dimensionality of your process variables by projecting into a low-dimensional space. This is shown in the picture to the right: we have p process variables and n observations, and we want to reduce the dimensionality of the process to a, where a is much less than p. To do this we use the P loading matrix, which provides the coefficients for linear combinations of our X variables, which give the score variables, as shown in the equations on the left. T times P transpose will give you predicted values for your process variables from the low-dimensional representation, and there's some prediction error; your score variables are selected in a way that minimizes this squared prediction error. Another way to think about it is that you're maximizing the amount of variance explained in X. PLS is more suitable when you have a set of process variables and a set of quality variables and you really want to ensure that the quality variables are kept in control, but these variables are often expensive or time consuming to collect. A plant can be making out-of-control quality product for a long time before a fault is detected, so PLS models allow you to monitor your quality variables as a function of your process variables. And you can see here that PLS will find score variables that maximize the variance explained in the Y variables. The process variables are often cheaper and more readily available, so PLS models can allow you to detect quality faults early and can make process monitoring cheaper. So from here on out, I'm just going to focus on PLS models because that's more appropriate for our example. So PLS partitions your data into two components. The first component is your model component; this gives you the predicted values. Another way to think about this is that your data have been projected into a model plane defined by your score variables, and T² charts will monitor variation in this model plane.
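To put that verbal description into formulas (standard PCA/PLS monitoring notation, a sketch rather than a transcription of the slides): with n observations on p process variables reduced to a score dimensions,

X \approx \hat{X} = T P^{\top}, \qquad T \in \mathbb{R}^{n \times a}, \quad P \in \mathbb{R}^{p \times a}

\hat{x}_i = P t_i, \qquad e_i = x_i - \hat{x}_i

T^2_i = \sum_{j=1}^{a} \frac{t_{ij}^2}{s_j^2}, \qquad \mathrm{SPE}_i = \lVert e_i \rVert^2 = \sum_{k=1}^{p} \left( x_{ik} - \hat{x}_{ik} \right)^2

where s_j^2 is the variance of the j-th score estimated from the historical data. T² monitors variation within the model plane, while SPE, the squared prediction error for the error component described next, monitors variation off the plane.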
The second component is your error component. This is the distance between your original data and the predicted data, and squared prediction error charts, or SPE charts, will monitor variation in this component. We also provide an alternative, distance to model X plane (DModX); this is just a normalized version of SPE. The last concept that's important to understand for the demo is the distinction between historical and current data. Historical data are typically collected when the process is known to be in control. These data are used to build the PLS model and define normal process variation, and this allows a control limit to be obtained. Current data are assigned scores based on the model, but are independent of the model. Another way to think about this is that we have a training and a test set, and the T² control limit is lower for the training data because we expect lower variability for observations used to train the model, whereas there's greater variability in T² when the model generalizes to a test set. And fortunately, there's some theory that's been worked out for the variance of T² that allows us to obtain control limits based on some distributional assumptions. In the demo we'll be monitoring the Tennessee Eastman process, so I'm going to present a short introduction to these data. This is a simulation of a chemical process developed by Downs and Vogel, two chemists at Eastman Chemical, and it was originally written in Fortran, but there are wrappers for it in MATLAB and Python now. The simulation was based on a real industrial process, but it was manipulated to protect proprietary information. The simulation covers the production of two liquid products from gaseous reactants, and F is a byproduct that will need to be siphoned off from the desired product. The Tennessee Eastman process is pervasive in the literature on benchmarking multivariate process control methods. So this is the process diagram. It looks complicated, but it's really not that bad, so I'm going to walk you through it. The gaseous reactants A, D and E are flowing into the reactor here; the reaction occurs, and the product leaves as a gas. It's then cooled and condensed into a liquid in the condenser. Then we have a vapor-liquid separator that will remove any remaining vapor and recycle it back to the reactor through the compressor, and there's also a purge stream here that will vent the byproduct and an inert chemical to prevent them from accumulating. Then the liquid product will be pumped through a stripper, where the remaining reactants are stripped off, and the final purified product leaves here in the exit stream. The first set of variables that are being monitored are the manipulated variables. These look like bow ties in the diagram; I think they're actually meant to be valves. The manipulated variables mostly control the flow rate through different streams of the process. These variables can be set to specific values within limits and have some Gaussian noise, and the manipulated variables can be sampled at any rate; we're using the default three-minute sampling interval. Some examples of the manipulated variables are the flow rate of the reactants into the reactor, the flow rate of steam into the stripper, and the flow of coolant into the reactor. The next set of variables are the measurement variables.
In the demo we'll be monitoring the Tennessee Eastman process, so I'm going to present a short introduction to these data. This is a simulation of a chemical process developed by Downs and Vogel, two chemists at Eastman Chemical. It was originally written in Fortran, but there are wrappers for it in MATLAB and Python now. The simulation was based on a real industrial process, but it was manipulated to protect proprietary information. The simulation models the production of two liquid products from gaseous reactants; F is a byproduct that needs to be siphoned off from the desired product. The Tennessee Eastman process is pervasive in the literature on benchmarking multivariate process control methods.
This is the process diagram. It looks complicated, but it's really not that bad, so I'm going to walk you through it. The gaseous reactants A, D, and E flow into the reactor here; the reaction occurs, and the product leaves as a gas. It's then cooled and condensed into a liquid in the condenser. Then we have a vapor-liquid separator that removes any remaining vapor and recycles it back to the reactor through the compressor. There's also a purge stream here that vents byproduct and an inert chemical to prevent them from accumulating, and then the liquid product is pumped through a stripper, where the remaining reactants are stripped off, and the final purified product leaves here in the exit stream.
The first set of variables being monitored are the manipulated variables. These look like bow ties in the diagram; I think they're actually meant to be valves. The manipulated variables mostly control the flow rate through different streams of the process. These variables can be set to specific values within limits and have some Gaussian noise, and they can be sampled at any rate; we're using a default three-minute sampling interval. Some examples of the manipulated variables are the flow rate of the reactants into the reactor, the flow rate of steam into the stripper, and the flow of coolant into the reactor.
The next set of variables are the measurement variables. These are shown as circles in the diagram, and they're also sampled at three-minute intervals; the difference is that the measurement variables can't be manipulated in the simulation. Our quality variables will be the percent composition of the two liquid products, and you can see the analyzer measuring the composition here. These variables are collected with a considerable time delay, so we're looking at the product composition in this stream because these variables can be measured more readily than the product leaving in the exit stream. We'll also be building a PLS model to monitor our quality variables by means of our process variables, which have substantially less delay and a faster sampling rate.
Okay, so that's the background on the data. In total there are 33 process variables and two quality variables. The process of collecting the variables is simulated with a series of differential equations, so this is just a simulation, but you can see that a considerable amount of care went into modeling this as a real-world process.
Here's an overview of the demo I'm about to show you. We'll collect data on our process and then store these data in a database. I wanted an example that was easy to share, so I'll be using a SQLite database, but this workflow is relevant to most types of databases. Most databases support ODBC connections; once JMP connects to the database, it can periodically check for new observations and update the JMP table as they come in. And if we have a Model Driven Multivariate Control Chart report open with automatic recalc turned on, we have a mechanism for updating the control charts as new data come in. The process of adding data to the database will likely be going on on a separate computer from the computer doing the monitoring, so I have two sessions of JMP open to emulate this; both sessions have their own journal in the materials provided on the Community. The first session adds simulated data to the database, and it's called the streaming session. The second session updates reports as data come into the database, and I'm calling that the monitoring session.
One thing I really liked about the Downs and Vogel paper was that they didn't provide a single metric to evaluate the control of the process. I have a quote from the paper here: we felt that the trade-offs among possible control strategies and techniques involved much more than a mathematical expression. Here are some of the goals they listed in their paper which are relevant to our problem: maintain the process variables at desired values, minimize variability of the product quality during disturbances, and recover quickly and smoothly from disturbances. We will assess how well our process achieves these goals using our monitoring methods.
So to start off, I'm in the monitoring session journal, and I'll show you our first data set. The data table contains all the variables I introduced earlier: the first set are the measurement variables, the next set are the composition variables, and the last set are the manipulated variables. The first script attached here fits a PLS model; it excludes the last hundred rows as a test set. Just as a reminder, this model predicts our two product composition variables as a function of our process variables, but PLS modeling is not the focus of the talk, so I've already fit the model and output score columns here.
If we look at the column properties, you can see that there's an MDMCC Historical Statistics property that contains all the information about your model that you need to construct the multivariate control charts. One of the reasons Model Driven Multivariate Control Chart was designed this way: imagine you're a statistician and you want to share your model with an engineer so they can construct control charts. All you need to do is provide the data table with these formula columns; you don't need to share all the gory details of how you fit your model.
Next I'll use the score columns to create our control charts. On the left I have two control charts, T-squared and SPE. There are 860 observations that were used to estimate the model, and these are labeled as historical; then I have 100 observations that were held out as a test set. You can see in the limit summaries down here that I performed a Bonferroni correction for multiple testing based on the historical data. I did this up here in the red triangle menu, where you can set the alpha level to anything you want. I did this correction because the data are known to come from normal operating conditions, so we expect no observations to be out of control, and after this multiplicity adjustment there are zero false alarms. On the right are the contribution proportion heat maps. These indicate how much each variable contributes to the out-of-control signal; each observation is on the Y axis, and the contributions are expressed as proportions. You can see in both of these plots that the contributions are spread pretty evenly across the variables. At the bottom I have a score plot. Right now we're just plotting the first score dimension versus the second, but you can look at any combination of the score dimensions using these drop-down menus or this arrow.
Okay, now that we're oriented to the report, I'm going to switch over to the streaming session, which will stream data into the database. In order to do anything for this example, you'll need to have a SQLite ODBC driver installed. It's easy to do; you can just follow this link here. I don't have time to cover it here, but I created the SQLite database I'll be using in JMP, and I have instructions on how to do this and how to connect JMP to the database on my Community web page. This example might be helpful if you want to try this out on data of your own. I've already created a connection to this database, and I've shared the database on the Community, so I'm going to take a peek at the data tables in Query Builder, where I can take a table snapshot.
The first data set is the historical data; I've used this to construct the PLS model, and there are 960 observations that are in control. The next data table is the monitoring data table; it just contains the historical data at first, but I'll gradually add new data to it, and this is what our multivariate control chart will be monitoring. Then I've already simulated the new data and added it to this data table here; you can see it starts at timestamp 961, and there's another 960 observations, but I've introduced a fault at some time point. I wanted something easy to share, so I'm not going to run my simulation script and add to the database that way. I'm just going to take observations from this new data table and move them over to the monitoring data table using some JSL with SQL statements. This is just a simple example emulating the process of new data coming into a database somehow; you might not actually do this with JMP, but it's an opportunity to show how you can do it with JSL.
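As a side note, once an ODBC driver and a data source are set up, pulling one of these tables into JMP from a script comes down to a single call. The DSN and table names below are placeholders, not the names used in the demo materials:

// Open the monitoring table through an ODBC connection (DSN and table names are assumed)
dtMonitor = Open Database(
	"DSN=TEP_SQLite;",           // hypothetical data source name for the SQLite database
	"SELECT * FROM monitoring",  // hypothetical monitoring table
	"Monitoring Data"            // name for the resulting JMP data table
);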
Next I'll show you the script we'll use to stream in the data. It's a simple script, so I'm just going to walk you through it quickly; I'll also include a rough sketch of the insert loop below. The first set of commands opens the new data table from the SQLite database; it opens in the background, so I have to deal with the window. Then I take pieces from this new data table and move them to the monitoring data table. I'm calling the pieces bites, and the bite size is 20. Then this creates a database connection, which allows me to send SQL statements to the database, and this last bit of code iteratively constructs SQL statements that insert new data into the monitoring table. So I'm going to initialize and show you the first iteration of this loop. This is just a simple INSERT INTO statement that inserts the first 20 observations; I'll comment that display out so it runs faster. And there's a Wait statement down here; this just slows down the stream so that we have enough time to see the progression of the data in the control charts. If I didn't have this, the streaming example would be over too quickly.
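Stepping back for a moment, the streaming script just walked through boils down to a loop like this minimal JSL sketch. The DSN, table names, and the bite size of 20 come from the narration, and the VALUES list is built from numeric cells only, so treat this as a simplified sketch rather than the published demo script:

// Stream rows from the "new data" table into the monitoring table in bites of 20
dtNew = Open Database( "DSN=TEP_SQLite;", "SELECT * FROM new_data", "New Data" ); // assumed DSN/table
dbc   = Create Database Connection( "DSN=TEP_SQLite;" );
biteSize = 20;

For( start = 1, start <= N Rows( dtNew ), start += biteSize,
	finish = Min( start + biteSize - 1, N Rows( dtNew ) );
	// Build one INSERT ... VALUES statement for this bite
	sql = "INSERT INTO monitoring VALUES ";
	For( i = start, i <= finish, i++,
		rowVals = "";
		For( j = 1, j <= N Cols( dtNew ), j++,
			rowVals ||= Char( dtNew[i, j] ) || If( j < N Cols( dtNew ), ", ", "" )
		);
		sql ||= "(" || rowVals || ")" || If( i < finish, ", ", "" );
	);
	Execute SQL( dbc, sql ); // send the bite to the database
	Wait( 2 );               // slow the stream so the chart updates are visible
);
Close Database Connection( dbc );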
Okay, so I'm going to switch back to the monitoring session and show you some scripts that will update the report. I'll move this over to the right so you can see the report and the scripts at the same time. This read-from-monitoring-data script simply checks the database every 0.2 seconds and adds new data to the JMP table, and since the report has automatic recalc turned on, the report updates whenever new data are added. I should add that, realistically, you probably wouldn't use a script that just iterates like this; you'd probably use Task Scheduler on Windows or Automator on Macs to schedule the runs. The next script here pushes the report to JMP Public whenever the report is updated. I was really excited that this is possible in JMP: it lets any computer with a web browser view updates to the control chart, and you can even view the report on your smartphone, so it's easy to share results across organizations. You can also use JMP Live if you want the reports on a restricted server. And then this script recreates the historical data in the data table in case you want to run the example multiple times.
Okay, so let's run the streaming script and look at how the report updates. You can see the data are in control at first, but then a fault is introduced; there's a large out-of-control signal, but there's a plant-wide control system implemented in the simulation which brings the system to a new equilibrium. I'll give this a second to finish. Now that the control chart has updated, I'm going to push the results to JMP Public. On my JMP Public page I have, at first, the control chart with the data in control at the beginning, and this should be updated with the addition of the new data. If we zoom in on when the process first went out of control, it looks like that was sample 1125. I'm going to color and label it so that it shows up in other plots. In the SPE plot this observation looks like it is still in control; which chart catches a fault earlier depends on your model and how many factors you've chosen.
We can also zoom in on that time point in the contribution plot, and you can see that when the process first goes out of control, a large number of variables contribute to the out-of-control signal, but once the system reaches a new equilibrium, only two variables have large contributions. I'm going to remove these heat maps so I have more room in the diagnostics section, and I've made everything fairly large so the text shows up on your screen. If I hover over the first point that's out of control, I get a peek at the top ten contributing variables. This is great for quickly identifying which variables contribute the most to the out-of-control signal. I can also click on that plot and append it to the diagnostics section, and you can see there's a large number of variables contributing. I'll zoom in here a little bit. If one of the bars is red, that means the variable is out of control in a univariate control chart, and you can see this by hovering over the bars. I'm going to append a couple of those. These graphlets are IR charts for the individual variables with three-sigma control limits. You can see that for the stripper pressure variable, the observation is out of control in the univariate chart, but the variable is eventually brought back under control by our control system, and that's true for most of the large contributing variables. I'll also show you one of the variables where the observation is in control.
So once the control system responds, many variables are brought back under control and the process reaches a new equilibrium. But there's obviously a shift in the process, so to identify the variables contributing to the shift, one thing you can look at is a mean contribution plot. If I sort this and look at the variables that contribute the most, it looks like just two variables have large contributions, and both of these measure the flow rate of reactant A in stream 1, which is coming into the reactor. They measure essentially the same thing, except one is a measurement variable and one is a manipulated variable. You can see in the univariate control chart that there's a large step change in the flow rate, and in this one as well; this is the step change that I programmed into the simulation. So these contributions allow us to quickly identify the root cause.
I'm going to present a few other alternative methods to identify the same cause of the shift. The reason is that in real data, process shifts are often more subtle, and some tools may be more useful than others in identifying them; we'll consistently arrive at the same conclusion with these alternative methods, which shows some of the ways the methods are connected. Down here I have a score plot, which can provide supplementary information about shifts seen in the T-squared plot. It's more limited in its ability to capture high-dimensional shifts, because only two dimensions of the model are visualized at a time; however, it can provide a more intuitive picture of the process, because it visualizes it in a low-dimensional representation. In fact, one of the main reasons multivariate control charts are split into T-squared and SPE in the first place is that this provides enough dimensionality reduction to easily visualize the process in a scatterplot.
So we want to identify the variables causing the shift. I'm going to color the points before and after the shift so that they show up in the score plot. Typically we would look through all combinations of the six factors, but that's a lot of score plots to look through, so something that's very handy is the ability to cycle through all combinations quickly with this arrow down here. We can look through the factor combinations and find one where there's large separation. And if we want to identify where the shift first occurred in the score plots, we can connect the dots and see that the shift occurred around sample 1125 again.
Another useful tool, if you want to identify the score dimensions where an observation shows the largest separation from the historical data without looking through all the score plots, is the normalized score plot. I'm going to select a point after the shift and look at the normalized score plot. Actually, I'm going to choose another one, because I want to look at dimensions five and six. These plots show the magnitude of the score in each dimension, normalized so that the dimensions are on the same scale. And since the mean of the historical data is at zero for each score dimension, the dimensions with the largest magnitude show the largest separation between the selected point and the historical data. It looks like dimensions five and six show the greatest separation here, so I'm going to move to those. There's large separation here between our shifted data and the historical data, and the score plot visualization can also be more interpretable because you can use the variable loadings to assign meaning to the factors. Here we have too many variables to see all the labels for the loading vectors, but you can hover over and see them, and if I look in the direction of the shift, the two variables that were the cause show up there as well.
We can also explore differences between subgroups in the process with the group comparisons tool. To do that, I'll select all the points before the shift and call that the reference group, and everything after and call that the group I'm comparing to the reference. This contribution plot gives me the variables contributing the most to the difference between these two groups, and you can see that it also identifies the variables that caused the shift. The group comparisons tool is particularly useful when there are multiple shifts in a score plot or when you can see more than two distinct subgroups in your data. In our case, since we're comparing a group in our current data to the historical data, we could also just select the data after the shift and look at a mean contribution score plot. This gives us the average contributions of each variable to the scores in the orange group, and since large scores indicate a large difference from the historical data, these contribution plots can also identify the cause. They use the same formula as the contribution formula for T-squared, but now with just the two factors from the score plot.
Okay, I'm going to find my PowerPoint again. So, real quick, I'm going to summarize the key features of Model Driven Multivariate Control Chart that were shown in the demo. The platform is capable of performing both online fault detection and offline fault diagnosis.
There are many methods provided in the platform for drilling down to the root cause of faults. I'm showing here some plots from the popular book Fault Detection and Diagnosis in Industrial Systems; throughout the book, the authors demonstrate how one needs to use multivariate and univariate control charts side by side to get a sense of what's going on in the process. One particularly useful feature of Model Driven Multivariate Control Chart is how interactive and user-friendly it is to switch between these two types of charts. So that's my talk. Here's my email if you have any further questions, and thanks to everyone who tuned in to watch this.
Meijian Guan, JMP Research Statistician Developer, SAS   Single-cell RNA-sequencing technology (scRNA-seq) provides a higher resolution of cellular differences and a better understanding of the function of an individual cell in the context of its microenvironment. Recently, it has been used to combat COVID-19 by characterizing transcriptional changes in individual immune cells. However, it also poses new challenges in data visualization and analysis due to its high dimensionality, sparsity, and varying heterogeneity across cell populations. JMP Project is a new way to organize data tables, reports, scripts as well as external files. In this presentation, I will show how to create an integrated Basic scRNA-seq workflow using JMP Project that performs standard exploration on a scRNA-seq data set. It first selects a set of highly variable genes using a dispersion or a variance-stabilizing transformation (VST) method. Then it further reduces data dimension and sparsity by performing a sparse SVD analysis. It then generates an interactive report that consists of data overview, variable gene plot, hierarchical clustering, feature importance screening, and a dynamic violin plot on individual gene expression levels. In addition, it utilizes the R integration feature in JMP to perform t-SNE or UMAP visualizations on the cell populations if appropriate R packages are installed.
Auto-generated transcript...   Speaker Transcript Meijian Guan All right. Hi, everyone. Thank you so much for attending this presentation. I'm happy to have this opportunity to share the work I have been doing with the JMP Life Sciences group at SAS Institute. Today's topic is building a single-cell RNA-sequencing workflow with JMP Project. This is a new feature I developed for JMP Genomics 10. If you don't know what JMP Genomics is, I will give you a brief overview of it. JMP Project is a new feature released in JMP 14, and it's a very nice tool that can help you organize your reports, so we took advantage of this new platform and organized a single-cell RNA-sequencing workflow in it.
First of all, a little background about JMP Genomics. It's one of the products in the JMP family, built on top of SAS and JMP Pro, so it takes advantage of both products, which makes it a very powerful analytical tool. It's designed for genomic data: it can read in different types of genomic data, it can do preprocessing, and it can handle next-generation sequencing data analysis. It is really good at differential gene expression and biomarker discovery, and many scientists use it for crop and livestock breeding. It's a very powerful tool, and I encourage everyone to check it out if you are doing anything related to genomics.
The next thing I want to share is single-cell RNA sequencing, which many of you may not be very familiar with. This is a relatively new technology used to examine data at the level of individual cells. Compared to traditional RNA sequencing, which surveys the average expression level of a group of cells, this new technology provides a higher resolution of cellular differences and gives you a better understanding of the function of an individual cell in the context of its microenvironment. It can help with things like uncovering new and rare cell populations, tracking trajectories of cell development, and identifying differentially expressed genes between cell types.
So it has very wide application. One recent application is scientists using it to combat COVID-19, because it can be used to characterize transcriptional changes in immune cells and help develop vaccines and treatments. In addition, it's widely used in cancer research, in immunology, and in many other research fields. It's a very powerful tool, but the data do pose some analysis challenges, and that's why we put together this workflow.
Just to give you an overview of the typical single-cell RNA-sequencing pipeline: the first thing you need is a sample, either from humans or from animals; it could be a tumor or a lab sample. Then you isolate those samples into individual cells. After isolation, you sequence every individual cell for all the genes; in humans, for example, we have about 30,000 genes. The final product looks like this read count table: 30,000 genes in rows and sometimes half a million cells as columns. As you can see, this is a very large data set with very high dimensions. Also notice the zeros in the table. Because of technical and biological limitations, there's no way to detect every single gene in every single cell, so it's not uncommon for 90% of the entries to be zeros. Sparsity is another challenge when you analyze single-cell RNA sequencing data. After preprocessing, cleaning up, and dimension reduction, you can apply regular methods such as clustering, principal components, and differential gene expression analysis to these data; those are all covered in my workflow.
I already mentioned some of the challenges, including high dimensionality, high sparsity, and varying heterogeneity across cell populations. There are also technical noise and reproducibility issues: with so many different sequencing protocols and so many different analytical packages in R, Python, or other tools, it's very hard to follow exact steps to analyze your data, and if you mix up the steps or don't do things in the correct order, you may not get reproducible results. That's one of the problems we tried to solve here.
I just want to show you an example of single-cell RNA-sequencing data. This data set will be used in my demonstration; it's a reduced blood sample data set, or PBMC. Here we have cells in rows and genes in columns, so it's about 8,000 columns and 100 rows, which are the cells, and you can see zeros pretty much everywhere; I believe it's more than 90% sparse in this specific data set.
So what's in our new single-cell RNA-sequencing workflow in JMP Genomics 10? We built this workflow for people who don't have a strong technical background or don't have time to learn to code and learn all the statistics. In this workflow we put the steps in the right order so users can automatically execute all the steps, and we also provide very interactive reports to help users navigate the results, change parameters, and check out different selections. The workflow includes data import and preprocessing, and a variable gene selection step, which is the backbone of this workflow. The goal of variable gene selection is to reduce the dimension of the genes: for a human sample we have about 30,000 genes, and not all of them are informative, so we try to pick the most informative ones. We offer two methods, a dispersion method and a variance-stabilizing transformation (VST) method based on loess regression. I won't go into the details, but these two methods are widely used in the research community, and I'm pretty happy that we were able to reproduce them.
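As a rough sketch of what these two selection criteria compute, in the style popularized by packages such as Seurat (the exact implementation in JMP Genomics may differ), each gene g is scored either by its dispersion or by its standardized variance after a loess fit of variability against mean expression:

d_g = \frac{\operatorname{Var}_c(x_{gc})}{\operatorname{Mean}_c(x_{gc})}

\log_{10}\hat{\sigma}_g = f_{\text{loess}}\!\left(\log_{10}\bar{x}_g\right), \qquad
s_g = \operatorname{Var}_c\!\left(\frac{x_{gc} - \bar{x}_g}{\hat{\sigma}_g}\right)

The top few thousand genes by score are then kept as the variable genes for the downstream steps.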
We also apply sparse SVD to further reduce the dimensions, and we apply hierarchical clustering and k-means clustering. We have feature importance screening using a bootstrap forest method in JMP, and if you have the R packages installed, we call out directly to t-SNE and UMAP visualizations, which are very popular in single-cell RNA-sequencing analysis. We also provide some dynamic visualizations, including a violin plot, a ridgeline plot, and a dot plot, as well as differential gene expression. All the reports are organized into one integrated report with JMP Project.
Next I will do a demo. There are two goals in this demo. The first is to classify the cell populations in this PBMC data set: we try to find which cell types are in the data. The second goal is to identify differentially expressed genes across subtypes and conditions.
First of all, let's go to the JMP Genomics Starter. The JMP Genomics interface looks quite different from regular JMP, but it's pretty easy to navigate. The Basic Single-Cell RNA-Seq Workflow lives here under Workflows; you click it to bring up this interface. The interface is pretty intuitive: you just provide a data set and specify the QC options, that is, what kinds of genes or cells you want to remove from your analysis, and the variable gene selection options, meaning which method you want to use. If you select VST, you can also specify the number of genes you want to keep, say 2,000 or 3,000. Then there are the clustering options: how many principal components you want to use for clustering, and whether you want the hierarchical or k-means clustering algorithm. Under more options, we have marker genes, which let you add a list of marker genes you want to use to identify the cell populations, a very handy tool here, and you can launch ANOVA and differential expression analysis, which produces a separate report that I won't discuss in this talk. Another thing we have is the experimental design file. If you add that, you can provide any information related to the study design, like treatment information or sex information. This is simulated data here; I just want to show you how we can compare gene expression levels and different measurements between groups. Finally, we have embedding options, which can call out to the t-SNE or UMAP R packages if you have them installed, and you can change the parameters for these two R algorithms.
After you specify all those options, just click run, and you get a report that looks like this one. It's a tabbed report with a total of seven tabs, organized in the order you would work through the analysis. The first tab shows QC information: how many genes are detected in the cells, how many read counts there are, the percentage of mitochondrial gene counts in your data, and the correlations between these three measurements. And notice that on the left side we have the action box. You can expand it to find options, and there are many things you can do with this box.
In this tab, specifically, you can split the graph based on the conditions you provided in the experimental design file. For example, we can split by treatment into Drug1, Drug2, and placebo, and see if there's any difference between the groups; we can unsplit if we want to go back to the original plot.
The second tab is variable gene selection, which is the backbone of this workflow. The red dots are the genes selected for subsequent analysis, and the gray dots are the genes that will be discarded. If you expand the action box, you can see that, since we used VST, we specified 2,000 genes in this analysis. If you change your mind, you can type in a different number of genes and click OK, and all the tabs will refresh based on the new number.
After you have a list of variable genes, the next step is to further reduce the dimensions by performing a sparse SVD analysis, which is equivalent to principal component analysis. After applying the SVD analysis, you can plot the top two SVDs, or principal components, to check the global structure of your data set. In this case we can see there are two big groups in the data, which is interesting. We also provide a 3D plot to help you explore further; sometimes there are insights you cannot identify in a 2D plot, and 3D plots can really provide additional value. We then take those SVDs, depending on how many you selected (20 or 30), and use them to perform clustering. In this case we selected hierarchical clustering, and we found nine clusters in the data set. In addition to the dendrogram, we also offer a constellation plot, which I really like because it is similar to t-SNE or UMAP in giving you a better idea of the distances between groups. For example, there are three big clusters that are kind of distinct from the other groups, and if you want to see where they sit in the global structure, you can select them; looking at the 3D plot, all the highlighted ones are over here, and going back to 2D, this is one of the two big clusters I highlighted. This interactivity really helps you observe and visualize your data in multiple ways. We also provide a parallel plot to help you identify the different patterns across groups.
The next tab is embedding, which shows the t-SNE and UMAP plots if you have the R packages installed. It calls out to R, runs the analysis, brings the data back, and visualizes it in JMP. Here is a t-SNE plot with the nine clusters very nicely separated. On the bottom is exactly the same plot, but this time colored by the marker genes you provided; we have 14 marker genes, and using this feature switcher you can click through them to see where each gene is expressed. For example, the GNLY gene is highly expressed in this little cluster, and if we wonder where those cells are in our data, we can select them, go back, and see that most of them are from cluster eight. GNLY is a marker gene for NK cells, so now we have an idea of what this group of cells is. We also have action buttons here to help you do more. If you prefer UMAP, you can switch to it, and the plots change to UMAP accordingly. It's exactly the same idea, but UMAP gives a little better separation and preserves more global structure in the visualization.
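To give a sense of how the R integration being described works at the scripting level, here is a minimal JSL sketch that sends a score matrix to R, runs t-SNE with the Rtsne package, and pulls the embedding back into a JMP table. The table name "SVD Scores" and the variable names are illustrative placeholders, not the workflow's actual internals:

// Minimal sketch of calling Rtsne from JSL (assumes the Rtsne package is installed in R)
dtScores = Data Table( "SVD Scores" );  // hypothetical table holding the sparse SVD scores
m = dtScores << Get As Matrix;          // numeric columns as a matrix

R Init();
R Send( m );                            // the matrix arrives in R under the name "m"
R Submit( "
	library( Rtsne )
	fit <- Rtsne( m, dims = 2, check_duplicates = FALSE )
	emb <- as.data.frame( fit$Y )
" );
dtEmb = R Get( emb );                   // data frames come back as JMP data tables
R Term();

dtEmb << Set Name( "tSNE Embedding" );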
We also provide ways to help you remove cells that might be contaminated or have quality problems. For example, if we don't like a group of cells here, we can remove them from the visualization to make it cleaner, and we can always bring them back. And again, we can split the plots with the split graph button; this time we split by gender, female and male, so we can compare gene expression levels across the gender groups, which is sometimes pretty useful. And we can unsplit.
The next tab provides more tools to visualize gene expression levels in the nine groups of cells. The first plot is a violin plot. Again, we have a feature switcher to help you go through all the genes across the clusters, and depending on how tall each part of the graph is and what its density is, you can clearly see where the genes are highly expressed. For example, using the GNLY gene again, you can see it's highly expressed in cluster eight. In the middle, the second plot is a ridgeline plot, which puts the clusters on the Y axis and the gene expression level on the X axis; it shows you basically the same thing, depending on which you prefer. For GNLY, again, we can see cluster eight has GNLY highly expressed but not the other clusters. At the bottom we have another plot called a dot plot, which is the newest plot added to this report. In addition to showing the gene expression levels, the dot plot also shows the percentage of cells expressing each gene. For example, take a look at the PPBP gene: we can see that 100% of cells in cluster seven express PPBP, so this gene is the marker gene for those cells, and it's now very clear that cluster seven is one type of blood cell, a PPBP cell. If you look at another cluster, for example cluster two, only 12% of the cells express this gene, so there might be some contamination, but that group of cells is definitely not PPBP cells. This plot shows both the expression level and the expression percentage in each cluster, which offers additional information.
The next tab is also very useful; it's called feature screening. It fits a bootstrap forest and uses the genes to predict the clusters, so the most important genes, the ones that contribute most to the separation of the cells, are ranked in this table. The right way to view these genes is to open the action box, select the top genes you want to visualize, maybe the top 35, and click OK. The next tab will then show only the 35 genes you selected. Those are the most informative genes; they can explain why the different groups of cells separate. Again, you can click through and look for patterns, and you'll notice that a lot of these genes, such as LYZ, CST3, and NKG7, are already in the marker gene list we provided, which means this feature screening method is really successful at picking up the most important genes in your data set.
Another visualization you can do is through the GTEx database. GTEx is a tissue-specific expression database that tells you which genes are expressed in which tissues of the human body. We can send the gene list directly to the database: you just click OK, we open the GTEx website, and it provides a heat map of the top 35 genes so you can see where they are expressed across human tissues and organs, which is a very convenient way to get additional information.
Now, with those marker genes in hand, you can probably identify what each group of cells is. There's a function here called recode; when you open it, you can recode the cluster numbers into actual cell names. For example, we already know cluster eight is NK cells, and I already have names for every one of them, so I just type them in: monocytes, two is actually DC cells, three is FCGR3A+ monocytes, then naive CD4+ T cells, group five is memory CD4+ T cells, CD8+ T cells for group six, seven is PPBP, as we already saw, and nine is B cells. With those entered, we click Recode. Since all the plots and tabs are connected, all the numbers change into the actual cell names, which helps you explore your data more easily: you know what the cells are, and you can continue exploring in your plots, including the clustering plots, where the cluster names have changed into the actual cell names. So that's it for today's topic. If you have any questions, you can send me an email or leave a message on the JMP Community. Thank you so much for your time.
John Cromer, Sr. Research Statistician Developer, JMP   While the value of a good visualization in summarizing research results is difficult to overstate, selection of the right medium for sharing with colleagues, industry peers and the greater community is equally important. In this presentation, we will walk through the spectrum of formats used for disseminating data, results and visualizations, and discuss the benefits and limitations of each. A brief overview of JMP Live features sets the stage for an exciting array of potential applications. We will demonstrate how to publish JMP graphics to JMP Live using the rich interactive interface and scripting methods, providing examples and guidance for choosing the best approach. The presentation culminates with a showcase of a custom JMP Live publishing interface for JMP Clinical results, including the considerations made in designing the dialog, the mechanics of the publishing framework, the structure of JMP Live reports and their relationship to the JMP Clinical client reports and a discussion of potential consumption patterns for published reviews.
Auto-generated transcript...   Speaker Transcript John Cromer Hello everyone. Today I'd like to talk about two powerful products that extend JMP in exciting ways. One of them, JMP Clinical, offers rich visualization, analytical and data management capabilities for ensuring clinical trial safety and efficacy. The other, JMP Live, extends these visualizations to a secure and convenient platform that allows a wider group of users to interact with them from a web browser. As data analysis and visualization become increasingly collaborative, it is important that both creating and sharing are easy. By the end of this talk, you'll see just how easy it is.
First, I'd like to introduce the term collaborative visualization. Isenberg et al. define it as the shared use of computer-supported interactive visual representations of data by more than one person with the common goal of contributing to joint information processing activities. As I'll later demonstrate, this definition captures the essence of what JMP, JMP Clinical and JMP Live can provide. When thinking about the various situations in which collaborative visualization occurs, it is useful to consult the space-time matrix. In the upper left of this matrix, we have the traditional model of classroom learning and office meetings, with all participants at the same place at the same time. Next, in the upper right, we have participants at different places interacting with the visualization at the same time. In the lower left, we have participants interacting at different times at the same location, as in the case of shift workers. And finally, in the lower right, we have flexibility in both space and time, with participants potentially located anywhere around the globe and interacting with the visualization at any time of day. JMP Live can facilitate this scenario.
A second way to slice through the modes of collaborative visualization is by thinking about the necessary level of engagement for participants. When simply browsing a few high-level graphs or tables, simple viewing can sometimes be sufficient. But with more complex graphics, and for those in which the data connections have been preserved between the graphs and underlying data tables, users can greatly benefit from also having the ability to interact with and explore the data.
This may include choosing a different column of interest, selecting different levels in a data filter, and exposing detailed data point hover text. Finally, authors who create visualizations often need to share them with others, and by necessity they also have the ability to view, interact with and explore the data; JMP and JMP Clinical serve authors who require all of these abilities.
A third way to think about formats and solutions is the interactivity spectrum. Static reports, such as PDFs, are perhaps the simplest and most portable, but generally the least interactive. Interactive HTML, also known as HTML5, offers responsive graphics and hover text. JMP Live is built on an HTML5 foundation, but also offers server-side computations for regenerating the analysis. While the features of JMP Live will continue to grow over time, JMP offers even more interactivity. And finally, there are industry-specific solutions such as JMP Clinical, built on a framework of JMP and SAS, that offer all of JMP's interactivity with some additional specialization. When we lay these out on the interactivity spectrum, we can see that JMP Live fills the sweet spot of being portable enough for those with only a web browser to access, while offering many of the prime interactive features that JMP provides.
The product I'll use to demonstrate creating a visualization is JMP Clinical. JMP Clinical, as I mentioned before, offers a way to conveniently assess clinical trial safety and efficacy. With several role-based workflows for medical monitors, writers, clinical operations and data managers, and three review templates, predefined or custom workflows can be conveniently reused on multiple studies, producing results that allow for easy exploration of trends and outliers. Several formats are available for sharing these results, from static reports and an in-product review viewer to, new to JMP Clinical, JMP Live reports. The product I'll use to demonstrate interacting on a shared platform is JMP Live. JMP Live allows users with only a web browser to securely and conveniently interact with the visualizations, and you can specify access restrictions for who can view both the graphics and the underlying data tables. With the ability to publish a local data filter and column switcher, the view can be refreshed in just a matter of seconds. Users can additionally organize their web reports through titles, descriptions and thumbnails, and leave comments that facilitate discussion between all interested parties. So: explore the data on your desktop with JMP or JMP Clinical, publish to JMP Live with just a few quick steps, share the results with colleagues across your organization, and enrich the shared experience through communication and automation.
Now I would like to demonstrate how to publish a simple graphic from JMP to JMP Live. I'm going to open the demographics data set from the sample study Nicardipine, which is included with JMP Clinical. I can do this either through the File > Open menu, where I can navigate to my data set, or with a script: dt = Open(), followed by the path to my data table. I'm going to click Run Script to open that data table. Now I'd like to create a simple visualization, let's say a box plot. I'll click Graph > Graph Builder, and here I have a dialog for moving variables into roles. I'm going to move the study site identifier into the X role and age into Y, click box plot, and click Done.
So that's one quick and easy way to create a visualization in JMP. Alternatively, I can do the same thing with a script, and this block of code encapsulates a data filter and a Graph Builder box plot in a Data Filter Context Box. I'm going to run this block of code, and here you see I have some filters and a box plot. Notice how interactive the filter and the corresponding graph are: I can select a different lower bound for age, or type in a precise value, say to exclude those under 30, and suppose I'm interested in only the first 10 study site identifiers.
Now I'd like to share this visualization with some colleagues who don't have JMP but have JMP Live. One way to publish this to JMP Live is interactively through the File > Publish menu, and here I have options for my web report: I can specify a title and description, add images, and choose who to share the report with. At this point I could publish it, but I'd like to show you how to do so with a script. I have a chunk of code where I create a new web report object, add my JMP report to the web report object, issue the Publish message to the web report, and then automatically open the URL (I'll show a rough sketch of these scripts in a moment). Let me go ahead and run that. You can see that I'm automatically taken to JMP Live with a very similar structure to my client report. My filter selections have been preserved; I can make filter selection changes, for example moving the lower bound for age down, and notice that I also have detailed data point hover text, filter-specific options, and platform-specific options. Any time you see these menus, you can explore them further to see what's available.
All right, now that you've seen how to publish a simple graphic from JMP to JMP Live, how about a complex one, as in the case of a JMP Clinical report? I'm going to open a new review, add the Adverse Events Distribution report to it, and run it with all default settings. Now I have my Adverse Events Distribution report, which consists of column switchers for demographic grouping and stacking, report filters, an adverse events counts graph, a Tabulate object for counts, and some distributions. Suppose I'm interested in stacking my adverse events by severity. I've selected that, and now I have the stoplight colors I've set for my adverse events: mild, moderate and severe. At this point I'd like to share these results with a colleague who may have JMP but sometimes prefers to work through a web browser to inspect the visualizations. So I'll click this report-level Create Live Report button, and now I have my dialog: I can choose to publish to either a file or JMP Live. I can choose whether to publish the data tables; I would always recommend publishing them for maximum interactivity. I can also specify whether to allow my colleagues to download the data tables from JMP Live. In addition to the URL, you can specify whether to share the results only with yourself, with everyone at your organization, or with specific groups. For demonstration purposes, I will publish only for myself. I'll click OK, and I get a notification that my web report has been published.
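For reference, here is a minimal JSL sketch of the two scripts from the simple publishing example described above: a local data filter and Graph Builder box plot wrapped in a Data Filter Context Box, followed by publishing the graph as a web report. The file path and column names are placeholders for the Nicardipine demographics table, and Publish() is left without arguments so it relies on whatever JMP Live connection is configured; treat this as a sketch rather than the exact demo script.

// Open the demographics table (path is a placeholder)
dt = Open( "C:/path/to/Nicardipine Demographics.jmp" );

// Local data filter plus Graph Builder box plot inside a Data Filter Context Box
New Window( "Age by Site",
	Data Filter Context Box(
		H List Box(
			dt << Data Filter( Local, Add Filter( Columns( :Age, :Study Site Identifier ) ) ),
			gb = dt << Graph Builder(
				Variables( X( :Study Site Identifier ), Y( :Age ) ),
				Elements( Box Plot( X, Y ) )
			)
		)
	)
);

// Publish the graph as a web report and open the resulting URL
wr = New Web Report();
wr << Add Report( gb );
url = wr << Publish();                   // uses the configured JMP Live connection
If( !Is Empty( url ), Web( url ) );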
Over on JMP Live, I have a very similar structure: my report filters and column switchers carry over, with my column of interest preserved, and you can see that my axes, legends and colors have also carried over. Within this web report, I can easily collapse or expand particular report sections, and many of the sections also offer detailed data point hover text and responsive updates for data filter changes. Another thing I'd like to point out is the Details button in the upper right of the live report, where I can get detailed creation information, a list of the data tables that were published, as well as the script. And because I've given users the ability to download these tables and scripts, there are download buttons for that purpose. I can also leave comments for my colleagues that they can read and take further action on, for example to follow up on an analysis.
For my final demo, I would simply like to extend my single clinical report to a review consisting of two other reports: Enrollment Patterns and Findings Bubble Plot. I'm going to run these reports. Enrollment Patterns plots patient enrollment over the course of a study by things like start date of disposition event, study day, and study site identifier. Findings Bubble Plot I will run on the laboratory test results domain. This report features a prominent animated bubble plot: you can launch the animation and see how specific test results change over the course of the study, pause the animation, scroll to precise values of study day, and hover over data points to reveal detailed information for each of them. Then I'll click Create Live Report for the review. I have the same dialog you've seen earlier, with the same options, and I'm just going to publish this now so you can see what it looks like when three clinical reports are bundled together in one publication.
When the operation completes, we're taken to an index page of report sections. Each thumbnail on this page corresponds to a report section; the binoculars icon in the lower left indicates how many views each page has had, and the three-dot menu gets you back to the Details view. If you click Edit, you can also see creation information and a list of data tables and scripts, and by clicking any of these thumbnails I can get down to the specific web report of interest. Just because it's one of my favorite interactive features, I've chosen to show you the Findings Bubble Plot on JMP Live. Notice that it has carried over the study day where we left off on the client, study day 7. I can continue the animation and watch the study day count up and the test results change over time; I can pause it again, go to a specific study day, and change things like bubble size to suit your preference. Again, I have data point hover text, I can select multiple data points, and I have numerous platform-specific options that will vary, but I encourage you to take a look at these any time you see the three-dot menu.
So to wrap up, let me jump to my second-to-last slide. How was all this possible? Behind the scenes, the code to publish a complex clinical report is simply a JSL script that systematically analyzes a list of graphical report object references and pairs them with the appropriate data filters, column switchers, and report sections into a web report object.
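In skeletal form, and ignoring the filter and column switcher pairing, that framework reduces to something like the following sketch; the report handles are hypothetical names standing in for the review's report objects:

// Bundle several clinical report objects into one web report (report handles are hypothetical)
wr = New Web Report();
wr << Add Report( aeDistributionReport );
wr << Add Report( enrollmentPatternsReport );
wr << Add Report( findingsBubblePlotReport );
url = wr << Publish();                 // bundles the needed data tables and applies the chosen visibility
If( !Is Empty( url ), Web( url ) );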
The JSL Publish command takes care of a lot of the work for you, bundling the appropriate data tables into the web report and ensuring that the desired visibility is met. Power users who have both products can use the download features on JMP Live to make changes on their clients and update the report that was initially published, even if they were not the original authors, and the cycle of collaboration between those on the client and those on JMP Live can continue. So, as you can see, both creating and sharing are easy. With JMP and JMP Clinical, collaborative visualization is truly possible. I hope you've enjoyed this presentation, and I look forward to any questions that you may have.
Mike Anderson, SAS Institute Anna Morris, Lead Environmental Educator, Vermont Institute of Natural Science Bren Lundborg, Wildlife Keeper, Vermont Institute of Natural Science   Since 1994, the Vermont Institute of Natural Science's (VINS) Center for Wild Bird Rehabilitation (CWBR) has been working to rehabilitate native wild birds in the northeastern United States. One of the most common raptor patients CWBR treats is the Barred Owl. Barred Owls are fairly ubiquitous east of the Rocky Mountains. Their call is the familiar "Who cooks for you, who cooks for you all." They have adapted swiftly to living alongside people and, because of this, are commonly presented to CWBR for treatment. As part of a collaboration with SAS, technical staff from JMP and VINS have been analyzing the admission records from the rehabilitation center. Recently we have used a combination of Functional Data Analysis, Bootstrap Forest Modeling, and other techniques to explore how climate and weather patterns can affect the number of Barred Owls that arrive at VINS for treatment — specifically for malnutrition and related ailments. We found that a combination of temperature and precipitation patterns results in an increase in undernourished Barred Owls being presented for treatment. This session will discuss our findings, how we developed them, and potential implications in the broader context of climate change in the Northeastern United States.
Auto-generated transcript...   Speaker Transcript Mike Anderson Welcome, everyone, and thank you for joining us. My name is Anna Morris, and I'm the lead environmental educator at the Vermont Institute of Natural Science, or VINS, in Quechee, Vermont. I'm Bren Lundborg, wildlife keeper at the VINS Center for Wild Bird Rehabilitation. And I'm Mike Anderson, JMP systems engineer at SAS. We're excited to present to you today our work on the effects of local weather patterns on the malnutrition and death rates of wild barred owls in Vermont. This study represents 18 years of data collected on wild owls presented for care at one avian rehabilitation clinic and a unique collaboration between our organization and the volunteer efforts of Mike Anderson at JMP.
Let's first get to know our study species, the barred owl, with the help of a non-releasable rehabilitated bird serving as an education ambassador at the VINS nature center. This owl was presented for rehabilitation in Troy, New Hampshire in 2013, and suffered eye damage from a car collision from which she was unable to recover. Barred owls like this one are year-round residents of the mixed deciduous forests of New England, subsisting on a diet that includes mammals, birds, reptiles, amphibians, fish and a variety of terrestrial and aquatic invertebrates. However, the prey they consume differs seasonally, with small mammals composing a larger portion of the diet in the winter. Their hunting styles differ in winter as well, due to the presence of snowpack, which can shelter small mammals from predation. Barred owls are known to use the behavior of snow plunging, or pouncing downward through layers of snow, to catch prey detected by ear. Here's a short video demonstrating this snow-plunging behavior. As you've seen in that quick clip, barred owls can be quite tolerant of human-altered landscapes, with nearly one quarter of barred owl nests utilizing human structures.
They are also the most frequently observed owl species by members of the public in Vermont, according to the citizen science project iNaturalist, with 468 research-grade observations of wild owls logged. As such, barred owls are commonly presented to wildlife rehabilitation clinics by people who discover injured animals. The Vermont Institute of Natural Science's Center for Wild Bird Rehabilitation, or CWBR, is a federally and state-licensed wildlife rehabilitation facility located in Quechee, Vermont. All wild avian species that are legal to rehabilitate in the state are admitted as patients to CWBR, and we received an average of 405 patients yearly from 2001 to 2019, representing 193 bird species. Ninety percent of patients presented at CWBR come from within 86 kilometers of the facility. Of the patients admitted during the 18-year period of the study, 11% were owls, comprising six species, with barred owls (Strix varia) being the most common.
However, year to year, the number of barred owls received as patients by CWBR has varied widely compared to another commonly received species, the American robin. Certain years, such as the winter of 2018 to 2019, have been anecdotally considered big barred owl years by CWBR staff and other rehabilitation centers in the northeastern US for the large number presented as patients. One explanation proposed by local naturalists attributes the big-year phenomenon to shifts in weather patterns. When freeze/thaw cycles occur over short time scales, these milder, wetter winters are thought to pose challenges to barred owls relying on snow plunging for prey capture. Specifically, the formation of a layer of ice on top of the snow can prevent owls from capturing prey using this technique, as the owls may not be able to penetrate the ice layer. In order to feed, the animals may therefore use alternative hunting locations or styles, or suffer from weakness due to malnutrition, which could lead to adverse interactions with humans, resulting in injury.
This study was undertaken to determine whether a relationship exists between higher-than-average winter precipitation and the number of barred owls presented at CWBR for rehabilitation in those years. Though there are several possible explanations for the variation in the number of patients associated with regional weather, we sought to determine if there was support for the ice-layer hypothesis by further investigating whether barred owls presented during wetter winters exhibited malnutrition as part of the intake diagnosis in greater proportion than in drier winters. This would suggest that obtaining food was a primary difficulty leading to the need for rehabilitation, rather than a general population increase, which would likely lead to a proportional increase in all intake categories.
Initially we expected there would be a fairly simple time series relationship here. We went and looked at the original admissions data and, as Bren said, compared the data between the barred owls and the American robins. You can see that for bad years, which I've marked here in blue (except for the gray one, which actually had a hurricane involved), there's a very strong periodic signal associated with the robins. The year-round resident barred owls should have something resembling a fairly steady intake rate, but we see some significant changes in that year to year.
Looking at the contingency analysis, we can see that the green bands, the starvation cases, correlate fairly nicely with those years where we have big barred owl years. Again, pointing out 2008, 2015, 2019, these being ski-season years, which I'll make clear in a moment. 2017 doesn't show up, but it does have a big band of unknown trauma and cause, and that was from a difference in how they were triaging the incoming animals that year. The one trick to working with this is that we needed to use functional data analysis to be able to take the year-over-year trends and turn them into a signal that we can analyze effectively against weather patterns and other data that we were able to find. Looking here, it's fairly easy to see that those years that we would call bad years have a very distinctive dogleg-type pattern. You can see 2008, 2017, 2019, 2015. Again, most importantly, those signals tend to correlate most strongly with this first eigenfunction in our principal component analysis. You can see quite clearly here that component one does a great job of discriminating between the good years and the bad years, with that odd hurricane year right in the middle where it should be. You can also look at the profiler for FPC one, and you can see that as we raise and lower that profiler, we see that dogleg pattern become more pronounced. The next question is, how do we get the data for that kind of an analysis? How do we get the weather data that we think is important? Well, it turns out that there's a great organization, a ski resort about 20 miles away from here, that has been collecting data from as far back as the '50s. And they've also been working with naturalists and conservation efforts, providing their environmental data to researchers for different projects, and they gave us access to their database. This is an example of base mountain temperature at Killington, Vermont, and you can see that the bad years, again colored in blue here, tend to have a flatter belly in their low temperature. You can see, for instance, looking at 2007, the first one in the upper left corner, that there's a steep drop down, followed by a steep incline back up. Whereas in 2008, which is one of the bad years for owl admissions, we have a fairly flat, if not maybe slightly inverted, peak in the middle. And that's fairly consistent, with the exception of maybe 2015, throughout the other years. So I took all of that data and used Functional Data Explorer to get the principal components for our responses. We end up having, therefore, a functional component on the response and a functional component on the factors. This is an example of one of those for what turns out to be one of the driving factors of this analysis, and you can see it does a very nice job of pulling out the principal components. The one we're going to be interested in in a moment is this Eigenfunction 4. It doesn't look like much right now, but it turns out to be quite important. So let's put all this together. I used a combination of generalized regression, along with the autovalidation strategy that was pioneered by Gotwalt and Ramsey a few years ago, to build a model of the behavior. We can see we get a fairly good actual by predicted plot for that. We get a nice R square around 99%, and looking at the reduced model, we see that we have four primary drivers. The cumulative rain shows up; that makes sense. 
We can't have ice without rain. Also a temperature factor; we need the right temperature to have ice. But we also have the sum of the daily snowfall, that is, the max total snowfall per year, and the sum of the daily rainfall as well. And taking all of this, we can start to put together a picture of what bad barred owl years look like from a data-driven standpoint. I'm going to show you first what a bad year looks like from the standpoint of the admission rates. That's a bad year; that's a good year. Fairly dramatic difference. Now we're going to have to pay fairly close attention to the other factors, because it's a very subtle change in the temperatures and in the rainfall that triggers this good year/bad year difference. It's kind of interesting how tiny the effects are. So first, this is the total snowfall per year. And we're going to pay attention to the slope of this curve for a good year and then for a bad year. Fairly tiny change, year over year. So it's a subtle change, but that subtle change is one of the big drivers. We need to have a certain amount of snowfall present in order to facilitate the snow diving. The other thing, if we look at rain, we're going to look at the belly of this rainfall right here, around week 13 in the ski season. There's a good year. And there's a bad year. Slightly more rain earlier in the year, and with a flatter profile going into spring. And again, looking at the cumulative rain over the season, a good year tends to be a little bit drier than a bad year. And lastly, most importantly, the temperature. This is that belly effect that we were seeing before. We see in good years that we have that strong decline down and strong climb out in the temperature, but for bad years we get a slightly more bowl-shaped effect overall. And I'm going to turn it over to Bren to talk about what that means in terms of barred owl malnutrition. Malnutrition has a significant negative impact upon survival of both free-ranging owls and those receiving treatment at a rehabilitation facility. Detrimental effects include reduced hunting success, lessened ability to compete with other animals or predator species for food, and reduced immunocompetence. Some emaciated birds are found too weak to fly and are at high risk for complications such as refeeding syndrome during care. For birds in care, the stress of captivity, as well as healing from injuries such as fractures and traumatic brain injuries, can double the caloric needs of the patient, thus putting further metabolic stress on an already malnourished bird. Additionally, scarcity or unavailability of food may push owls closer to human-populated areas, leading to increased risk for human-related causes of mortality. Vehicle strikes are the most common cause of intakes for barred owls in all years, and hunting near roads and human-occupied habitats increases that risk. In the winter of 2018 to 2019, reports of barred owls hunting at bird feeders and stalking domestic animals, such as poultry, were common. 
Hunting at bird feeders potentially increases exposure to pathogens, as feeders are common sites of disease transmission, and it may lead to higher rates of infectious diseases such as salmonellosis and trichomoniasis. Difficult winters also provide extra challenges for first-year barred owls. Young barred owls are highly dependent on their parents and will remain with them long after being able to fly and hunt. And once parental support ends, they are still relatively inexperienced hunters facing less prey availability and harsher conditions in their first winter. Additionally, the lack of established territories may lead them to be more likely to hunt near humans, predisposing them to risks such as vehicle collision related injuries. Previous research on a close relative of the barred owl, the northern spotted owl of the Pacific Northwest, shows a decline in northern spotted owl fecundity and survival associated with cold, wet weather in winter and early spring. In Vermont, the National Oceanic and Atmospheric Administration has projected an increase in winter precipitation of up to 15% by the middle of the 21st century, which may have specific impacts on populations of barred owls and their prey sources. The findings of this study provide important implications for the management of barred owl populations and those of related species in the face of a changing climate. Predicted changes to regional weather patterns in Vermont and New England forecast that cases of malnourished barred owls will only increase in frequency over the next 20 to 30 years as we continue to see unusually wet winters. Barred owls, currently listed by the International Union for Conservation of Nature as a species of least concern with a population trend that is increasing, will likely not find themselves threatened with extinction rapidly. However, ignoring this clear threat to local populations may allow it to cascade through the species at large and exacerbate the effects of other conservation concerns, such as accidental poisoning and nest site loss. These findings also highlight the need for protocols to be established on the part of wildlife rehabilitators and veterinarians for the treatment of severe malnourishment in barred owls, so as to avoid refeeding syndrome and provide the right balance of nutrients for recovery from an often lethal condition. Rehabilitation clinics would benefit from a pooling of knowledge and resources to combat this growing issue. Finally, this study shows yet another way in which climate change is currently affecting the health of wildlife species around us. Individual and community efforts to reduce human impacts on the climate will not be sufficient to reduce greenhouse gas emissions at the scale necessary to halt or reverse the damage that has been done. Action on the part of governments and large corporations must be taken, and individuals and communities have the responsibility to continue to demand that action. We would like to thank the staff and volunteers at the Vermont Institute of Natural Science, as well as at JMP, who helped collect and analyze the data presented here, especially Gray O'Tool. We'd also like to thank the Killington Ski Resort for providing us with the detailed weather data. Thank you.  
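The functional data analysis step described in this talk (reducing each season's weekly admission curve to a handful of principal component scores) can be sketched outside JMP's Functional Data Explorer. The following is only a minimal illustration, assuming a hypothetical CSV of weekly barred owl admission counts by ski season; it is not the presenters' actual workflow, and the file and column layout are placeholders.

```python
# Rough sketch (not the authors' JMP workflow): functional PCA on yearly
# admission curves, done by hand with NumPy. Assumes a hypothetical CSV
# with one row per week-of-season and one column per ski season, e.g.
# columns "2007", "2008", ..., holding weekly barred owl admission counts.
import numpy as np
import pandas as pd

curves = pd.read_csv("owl_admissions_by_season.csv", index_col=0)  # weeks x seasons

X = curves.to_numpy(dtype=float).T           # seasons x weeks
mean_curve = X.mean(axis=0)                  # average seasonal shape
Xc = X - mean_curve                          # center each week across seasons

# SVD of the centered curves: rows of Vt are the eigenfunctions,
# U * S gives each season's functional principal component scores.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                               # FPC scores per season
explained = S**2 / np.sum(S**2)

print("Variance explained by FPC1:", round(explained[0], 3))
for season, fpc1 in zip(curves.columns, scores[:, 0]):
    print(season, round(fpc1, 2))            # large |FPC1| ~ "dogleg" seasons
```

In this sketch, seasons with large scores on the first component would correspond to the distinctive dogleg-shaped admission curves described above.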
Roland Jones, Senior Reliability Engineer, Amazon Lab126 Larry George, Engineer who does statistics, Independent Consultant Charles Chen SAE MBB, Quality Manager, Applied Materials Mason Chen, Student, Stanford University OHS Patrick Giuliano, Senior Quality Engineer, Abbott Structural Heart   The novel coronavirus pandemic is undoubtedly the most significant global health challenge of our time. Analysis of infection and mortality data from the pandemic provides an excellent example of working with real-world, imperfect data in a system with feedback that alters its own parameters as it progresses (as society changes its behavior to limit the outbreak). With a tool as powerful as JMP it is tempting to throw the data into the tool and let it do the work. However, using knowledge of what is physically happening during the outbreak allows us to see which features of the data come from its imperfections, and avoid the expense and complication of over-analyzing them. Also, understanding of the physical system allows us to select an appropriate data representation, and results in a surprisingly simple way (OLS linear regression in the ‘Fit Y by X’ platform) to predict the spread of the disease with reasonable accuracy. In a similar way, we can split the data into phases to provide context for them by plotting Fitted Quantiles versus Time in Fit Y by X from Nonparametric density plots. More complex analysis is required to tease out other aspects beyond its spread, answering questions like "How long will I live if I get sick?" and "How long will I be sick if I don’t die?". For this analysis, actuarial rate estimates provide transition probabilities for Markov chain approximation to SIR models of Susceptible to Removed (quarantine, shelter etc.), Infected to Death, and Infected to Cured transitions. Survival Function models drive logistics, resource allocation, and age-related demographic changes. Predicting disease progression is surprisingly simple. Answering questions about the nature of the outbreak is considerably more complex. In both cases we make the analysis as simple as possible, but no simpler.     Auto-generated transcript...   Speaker Transcript Roland Jones Hi, my name is Roland Jones. I work for Amazon Lab126 as a reliability engineer. When my team and I put together our abstract for the proposal at the beginning of May, we were concerned that COVID 19 would be old news by October. At the time of recording on the 21st of August, this is far from the case. I really hope that by the time you watch this in October, things will be under control and life will be returning to normal, but I suspect that it won't. With all the power of JMP, it is tempting to throw the data into the tool and see what comes out. The COVID 19 pandemic is an excellent case study of why this should not be done. The complications of incomplete and sometimes manipulated data, changing environments, changing behavior, and changing knowledge and information make it particularly dangerous to just throw the data into the tool and see what happens. Get to know what's going on in the underlying system. Once the system is understood, the effects of the factors that I've listed can be taken into account, allowing the modeling and analysis to be appropriate for what is really happening in the system, and avoiding analyzing, or being distracted by, the imperfections in the data. It also makes the analysis simpler. 
The overriding theme of this presentation is to keep things as simple as possible, but no simpler. There are some areas towards the end of the presentation that are far from simple, but even here, we're still working to keep things as simple as possible. We started by looking at the outbreak in South Korea. It had a high early infection rate and was a trustworthy and transparent data source. Incidentally, all the data in the presentation comes from the Johns Hopkins database as it stood on the 21st of August, when this presentation was recorded. This is a difficult data set to fit a trend line to. We know that disease naturally grows exponentially, so let's try something exponential. As you can see, this is not a good fit. And it's difficult to see how any function could fit the whole dataset. Something that looks like an exponential is visible here in the first 40 days. So let's just fit to that section. There is a good exponential fit. What we can do is partition the data into different phases and fit functions to each phase separately: 1, 2, 3, 4 and 5. Partitions were chosen where the curve seemed to transition to a different kind of behavior. Parameters in the fit function were optimized for us in JMP's nonlinear fit tool. Details of how to use this tool are in the appendix. Nonlinear also produced the root mean square error results, the sigma of the residuals. So for the first phase, we fitted an exponential; the second phase was logarithmic; the third phase was linear; the fourth phase, another logarithmic; the fifth phase, another linear. You can see that we have a good fit for each phase; the root mean square error is impressively low. However, as partition points were specifically chosen where the curve changed behavior, a low root mean square error is to be expected. The trend lines have negligible predictive ability because the partition points were chosen by looking at existing data. This can be seen in the data that has come in since the analysis, which was performed on the 19th of June. Where extra data is available, we could choose different partition points and get a better fit, but this will not help us to predict beyond the new data. Partition points do show where the outbreak behavior changes, but this could be seen before the analysis was performed. Also, no indication is given as to why the different phases have a different fit function. This exercise does illustrate the difficulty of modeling the outbreak, but does not give us much useful information on what is happening or where the outbreak is heading. We need something simpler. We're dealing with a system that contains self-learning. As we, as a society, learn more about the disease, we modify behavior to limit its spread, changing the outbreak trajectory. Let's look into the mechanics of what's driving the outbreak, starting with the numbers themselves and working backwards to see what is driving them. The news is full of COVID 19 numbers: the USA hits 5 million infections and 150,000 deaths. California has higher infections than New York. Daily infections in the US could top 100,000 per day. Individual numbers are not that helpful. Graphs help to put the numbers into context. The right graphs help us to see what is happening in the system. Disease grows exponentially. One person infects two, who infect four, who infect eight. Human eyes differentiate poorly between different kinds of curves, but they differentiate well between curves and straight lines. 
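As an aside, the phase-by-phase exponential fit described a moment ago can be reproduced outside JMP's nonlinear fit tool. The sketch below is only an illustration in Python: the file name and starting values are assumptions, and only the early 40-day phase is fitted here.

```python
# A minimal sketch of the early-phase exponential fit described above, using
# SciPy instead of JMP's nonlinear fit tool. `cases` is assumed to be a 1-D
# array of cumulative confirmed infections; we fit only the first 40 days.
import numpy as np
from scipy.optimize import curve_fit

def exponential(t, a, r):
    return a * np.exp(r * t)

cases = np.loadtxt("south_korea_cumulative.csv")   # hypothetical input file
t = np.arange(len(cases))

phase = slice(0, 40)                               # "first 40 days" phase
popt, pcov = curve_fit(exponential, t[phase], cases[phase], p0=(1.0, 0.2))

residuals = cases[phase] - exponential(t[phase], *popt)
rmse = np.sqrt(np.mean(residuals**2))              # analogous to the RMSE reported by Nonlinear
print("a = %.2f, r = %.3f, RMSE = %.1f" % (*popt, rmse))
```

Fitting the later phases would just mean repeating the same call with different windows and candidate functions, which is exactly why such piecewise fits describe the past well but predict poorly.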
Plotting on a log scale changes exponential growth and exponential decline into straight lines. Also, on the log scale early data is now visible where it was not visible on the linear scale. Many countries show one, sometimes two plateaus, which were not visible in the linear graph. So you can see here for South Korea, there's one plateau, two plateaus and, more recently, it's beginning to grow for a third time. How can we model this kind of behavior? Let's keep digging. The slope on the log infections graph is the percentage growth. Plotting percentage growth gives us more useful information. Percentage growth helps to highlight where things changed. If you look at the decline in the US numbers, the orange line here, you can see that the decline started to slacken off sometime in mid April and can be seen to be reversing here in mid June. This is visible, but it's not as clear, in the infection graphs. It's much easier to see in the percentage growth graph. Many countries show a linear decline in percentage growth when plotted on a log scale. Italy is a particularly fine example of this. But it can also be seen clearly in China, in South Korea, and in Russia, and also to a lesser extent in many other countries. Why is this happening? Intuitively, I expected that when behavior changes, growth would drop down to a lower percent and stay there, not exponentially decline toward zero. I started plotting graphs on COVID 19 back in late February, not to predict the outbreak, but because I was frustrated by the graphs that were being published. After seeing this linear decline in percentage growth, I started taking an interest in prediction. Extrapolating that percentage growth line through linear regression actually works pretty well as a predictor, but it only works when the growth is declining. It does not work at all well when the growth is increasing. Again, going back to the US orange line, if we extrapolate from this small section here where it's increasing, which is from the middle of June to the beginning of July, we can predict that we will see 30% growth by around the 22nd of July, that it will go up to 100% weekly growth by the 26th of August, and that it will keep on growing from there, up and up and up and up. Clearly, this model does not match reality. I will come back to this exponential decline in percentage growth later. For now, let's keep looking at what is physically going on as the disease spreads. People progress from being susceptible to the disease, to being infected, to being contagious, to being symptomatic, to being non-contagious, to being recovered. This is the Markov SIR model. SIR stands for susceptible, infected, recovered. The three extra stages of contagious, symptomatic and non-contagious help us to model the disease spread and relate it to what we can actually measure. Note the difference between infected and contagious. Infected means you have the disease; contagious means that you can spread it to others. It's easy to confuse the two, but they are different and will be used in different ways further into this analysis. The timings shown are best estimates and can vary greatly. Infected to symptomatic can be from three to 14 days, and some infected people are never symptomatic. The only data that we have access to is confirmed infections, which usually come from test results, which usually follow from being symptomatic.   
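Before getting into the testing delay, here is a rough discrete-time sketch of the staged progression just described (susceptible to infected to contagious to non-contagious). It is an illustration only: the durations and the transmission value below are assumed placeholders, not the talk's fitted numbers.

```python
# Illustrative discrete-time version of the staged model described above
# (susceptible -> infected -> contagious -> non-contagious). Durations and
# transmission are assumed placeholder values, not fitted parameters.
import numpy as np

days = 120
pop = 1_000_000
latent_days, contagious_days = 5, 9
transmission = 0.12            # chance a contagious person infects someone per day (assumed)

susceptible = pop - 10
incubating = np.zeros(latent_days); incubating[0] = 10    # queue waiting to become contagious
contagious = np.zeros(contagious_days)                    # pool of contagious people

history = []
for day in range(days):
    new_infections = transmission * contagious.sum() * susceptible / pop
    new_infections = min(new_infections, susceptible)
    susceptible -= new_infections

    # shift each queue by one day: the oldest contagious cohort drops out,
    # the oldest incubating cohort becomes contagious
    newly_contagious = incubating[-1]
    incubating = np.roll(incubating, 1); incubating[0] = new_infections
    contagious = np.roll(contagious, 1); contagious[0] = newly_contagious

    history.append((day, new_infections, contagious.sum()))

reproduction = transmission * contagious_days   # people infected while contagious
print("implied reproduction number:", reproduction)
print("peak people contagious:", int(max(h[2] for h in history)))
```

The point of the sketch is simply that the pool of contagious people, fed and drained by these queues, is the quantity that drives everything else.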
Even if testing is performed on non-symptomatic people, there's about a five-day delay from being infected to having a positive test result. So we're always looking at old data. We can never directly observe the true number of people infected. So the disease progresses through individuals from top to bottom in this diagram. We have a pool of people that are contagious; that pool is fed by people that are newly infected becoming contagious, and the pool is drained by people that are contagious becoming non-contagious. The disease spreads to the population from left to right. New infections are created when susceptible people come into contact with contagious people and become infected. The newly infected people join the queue waiting to become contagious, and the cycle continues. This cycle is controlled by transmission, how likely a contagious person is to infect a susceptible person per day, and by reproduction, the number of people that a contagious person is likely to infect while they are contagious. This whole cycle revolves around the number of people contagious and the transmission or reproduction. The time individuals stay contagious should be relatively constant unless COVID 19 starts to mutate. The transmission can vary dramatically depending on social behavior and the size of the susceptible population. Our best estimate is that days contagious averages out at about nine. So we can estimate people contagious as the number of people confirmed infected in the last nine days. In some respects, this is an underestimate because it doesn't include people that are infected but not yet symptomatic, or that are asymptomatic, or that don't yet have a positive test result. In other respects, it's an overestimate because it includes people who were infected a long time ago but are only now being tested as positive. It's an estimate. From the estimate of people contagious, we can derive the percentage growth in contagious. It doesn't matter if the people contagious is an overestimate or underestimate. As long as the percentage error in the estimate remains constant, the percentage growth in contagious will be accurate. Percentage growth in contagious matters because we then use it to derive transmission. The derivation of the equation relating the two can be found in the appendix. Note that this equation allows you to derive transmission and then reproduction from the percentage growth in contagious, but it cannot tell you the percentage growth in contagious for a given transmission. That can only be found by solving numerically. I have outlined how to do this using JMP's fit model tool in the appendix. Reproduction and transmission are very closely linked, but reproduction has the advantage of ease of understanding. If it is greater than one, the outbreak is expanding out of control. Infections will continue to grow and there will be no end in sight. If it is less than one, the outbreak is contracting, coming under control. There are still new infections, but their number will gradually decline until they hit zero. The end is in sight, though it may be a long way off. The number of people contagious is the underlying engine that drives the outbreak. People contagious grows and declines exponentially. We can predict the path of the outbreak by extrapolating this growth or decline in people contagious. Here we have done it for Russia and Italy and for China.   
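For concreteness, the bookkeeping just described (estimate people contagious as infections confirmed in the last nine days, then extrapolate its exponential growth or decline) can be sketched in a few lines. This is not the presenter's JMP column-formula script; the file and column names are assumptions, and the conversion to transmission and reproduction via the appendix equation is not reproduced here.

```python
# Sketch of the "people contagious" bookkeeping described above, in pandas
# rather than JMP column formulas. Assumes a hypothetical data frame with a
# 'date' column and a 'cumulative_infections' column for one country.
import numpy as np
import pandas as pd

df = pd.read_csv("us_infections.csv", parse_dates=["date"]).sort_values("date")

df["new_infections"] = df["cumulative_infections"].diff()
# estimate of people contagious: infections confirmed in the last 9 days
df["contagious"] = df["new_infections"].rolling(9).sum()

# daily percentage growth of the contagious pool, from a 17-day regression on the log
window = df.tail(17).dropna()
t = np.arange(len(window))
slope, intercept = np.polyfit(t, np.log(window["contagious"]), 1)
daily_growth = np.expm1(slope)          # e.g. -0.007 means contagious shrinking ~0.7% per day

# extrapolate people contagious 28 days ahead, assuming behavior doesn't change
future = window["contagious"].iloc[-1] * np.exp(slope * 28)
print(f"daily growth in contagious: {daily_growth:+.1%}")
print(f"people contagious in 4 weeks if nothing changes: {future:,.0f}")
```

The 17-day window mirrors the choice discussed next; shorter windows leave in more noise and weekly seasonality.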
Remember the interesting observation from earlier: the percentage growth in infections declines exponentially, and here's why. If reproduction is less than one and constant, people contagious will decline exponentially towards zero. People contagious drives the outbreak. The percentage growth in infections is proportional to the number of people contagious. So if people contagious declines exponentially, the percentage growth in infections will also decline exponentially. Mystery solved. The slope of people contagious plotted on a log scale gives us the contagious percentage growth, which then gives us transmission and reproduction through the equations on the last slide. Notice that there's a weekly cycle in the data. This is particularly visible in Brazil, but it's visible in other countries as well. This may be due to numbers getting reported differently at the weekends or to people being more likely to get infected at the weekend. Either way, we'll have to take this seasonality into account when using people contagious to predict the outbreak. Because social behavior is constantly changing, transmission and reproduction change as well, so we can't use the whole distribution to generate reproduction. We chose 17 days as the period over which to estimate reproduction. We found that one week was a little too short to filter out all of the noise, two weeks gave a better result, and two and a half weeks was even better. Having the extra half week evened out the seasonality that we saw in the data. There is a time series forecast tool in JMP that will do all of this for us, including the seasonality, but because we're performing the regression on small sections of the data, we didn't find the tool helpful. Here are the derived transmission and reproduction numbers. You can see that they can change quickly. It is easy to get confused by these numbers. South Korea is showing a significant increase in reproduction, but it's doing well. The US, Brazil, India and South Africa are doing poorly, but seem to have a reproduction of around one or less. This is a little confusing. To help reduce the confusion around reproduction, here's a little bit of calculus. Driving a car, the gas pedal controls acceleration. To predict where the car is going to be, you need to know where you are, how fast you're traveling and how much you're accelerating or decelerating. In a similar way, to know where the pandemic is going to be, we need to know how many infections there are, which is the equivalent of distance traveled. We need to know how fast the infections are expanding, or how many people are contagious, both of which are the equivalent of speed. And we need to know how fast the people contagious is growing, which is the transmission or reproduction, which is the equivalent of acceleration. There is a slight difference. Distance grows linearly with speed and speed grows linearly with acceleration. Infections do grow linearly with people contagious, but people contagious grows exponentially with reproduction. There is a slight difference, but the principle is the same. The US, Brazil, India and South Africa have all traveled a long distance. They have high infections and they're traveling at high speed. They have high contagious. Even a little bit of acceleration has a very big effect on the number of infections. South Korea, on the other hand, is not going fast; it has low contagious. 
So it has the headroom to respond to the blip in acceleration and get things back under control without covering much distance. Also, when the number of people contagious is low, adding a small number of new contagious people produces a significant acceleration. Countries that have things under control are prone to these blips in reproduction. You have to take all three factors into account (number of infections, people contagious and reproduction) to decide if a country is doing well or doing poorly. Within JMP there are a couple of ways to perform the regression to get the percentage growth of contagious. There's the Fit Y by X tool and there's the nonlinear tool. I have details on how to use both these tools in the appendix. But let's compare the results they produce. The graphs shown compare the results from both tools. The 17 data points used to make the prediction are shown in red. The prediction lines from both tools are just about identical, though there are some noticeable differences in the confidence lines. The confidence lines for the nonlinear tool are much better. The Fit Y by X tool transposes the data into linear space before finding the best fit straight line. This results in the lower confidence line pulling closer to the prediction line after transposing back into the original space. Confidence lines are not that useful when the parameters that define the outbreak are constantly changing. Best case, they will help you to see when the parameters have definitely changed. In my scripts, I use linear regression calculated in column formulas, because it's easy to adjust with variables. This allows the analysis to be adjusted on the fly without having to pull up the tool in JMP. I don't currently use the confidence lines in my analysis, so I'm working on a way to integrate them into the column formulas. Linear regression is simpler and produces almost identical results. Once again, keep it simple. We have seen how fitting an exponential to the number of people contagious can be used to predict where people contagious will be in the future, and also to derive transmission. Now that we have a prediction line for people contagious, we need to convert that back into infections. Remember, new infections equals people contagious multiplied by transmission. Transmission is the probability that a contagious person will infect a susceptible person per day. The predicted graphs that result from this calculation are shown. Note that South Korea and Italy have low infections growth. However, they have a high reproduction extrapolated from the last 17 days' worth of data. So, South Korea here and Italy here: low growth, but you can see them taking off because of that high reproduction number. The infections growth becomes significant between two and eight weeks after the prediction is made. For South Korea, this is unlikely to happen because they're moving slowly and have the headroom to get things back under control. South Korea has had several of these blips as it opens up and always manages to get things back under control. In the predicted growth percent graph on the right, note how the increasing percentage growth in South Korea and Italy will not carry on increasing indefinitely, but plateaus out after a while. Percentage growth is still seen to decline exponentially, but it does not grow exponentially. It plateaus out. So to summarize, the number of people contagious is what drives the outbreak.   
This metric is not normally reported, but it's close to the number of new infections over a fixed period of time. New infections in the past week is the closest regularly reported proxy for the number of people contagious. This is what we should be focusing on, not the number of infections or the number of daily new infections. Exponential regression of people contagious will predict where the contagious numbers are likely to be in the future. Percentage growth in contagious gives us transmission and reproduction. The contagious number and transmission number can be combined to predict the number of new infections in the future. That prediction method assumes the transmission and reproduction are constant, which they aren't; they change as behavior changes. But the predictions are still useful to show what will happen if behavior does not change, or how much behavior has to change to avoid certain milestones. The only way to close this gap is to come up with a way to mathematically model human behavior. If any of you know how to do this, please get in touch. We can make a lot of money, though only for a short amount of time. This is the modeling. Let's check how accurate it is by looking at historical data from the US. As mentioned, the prediction works well when reproduction is constant but not when it's changing. If we take a prediction based on data from late April to early May, it's accurate as long as the reproduction number stays at around the same level of 1.0. After the reproduction number starts rising, you can see that the prediction underestimates the number of infections. The prediction based on data from late June to mid July, when reproduction was at its peak as states were beginning to close down again, overestimates the infections as reproduction comes down. The model is good at predicting what will happen if behavior stays the same, but not when behavior is changing. How can we predict deaths? It should be possible to estimate the delay between infection and death, and the proportion of infections that result in deaths, and then use this to predict deaths. However, changes in behavior such as increasing testing and tracking skew the number of infections detected. So to avoid this skew also feeding into the predictions for deaths, we can use the exact same mathematics on deaths that we used on infections. As with infections, the deaths graph shows accurate predictions when the deaths reproduction is stable. Note that the contagious and reproduction numbers for deaths don't represent anything real. This method works because deaths follow infections and so follow the same trends and the same mathematics. Once again, keep it simple. We have already seen that the model assumes constant reproduction. It also does not take into account herd immunity. We are fitting an exponential, but the outbreak really follows the binomial distribution. The binomial and a fitted exponential differ by less than 2% with up to 5% of the population infected. Graphs demonstrating this are in the appendix. When more than 5% of the population is no longer susceptible due to previous infection or to vaccination, transmission and reproduction naturally decline. So predictions based on recent reproduction numbers will still be accurate; however, long-term predictions based on an old reproduction number with significantly less herd immunity will overestimate the number of infections.   
On the 21st of August, the US had per capita infections of 1.7%. If only 34% of infected people have been diagnosed as infected, and there is data that indicates this is likely, we are already at the 5% level where herd immunity begins to have a measurable effect. At 5%, it reduces reproduction by about 2%. What can the model show us? Reproduction tells us whether the outbreak is expanding, greater than 1, which is the equivalent of accelerating, or contracting, less than 1, the equivalent of decelerating. The estimated number of people contagious tells us how bad the outbreak is, how fast we're traveling. Per capita contagious is the right metric for choosing appropriate social restrictions. The recommendations for social restrictions listed on this slide are adapted from those published by the Harvard Global Health Institute. There's a reference in the appendix. What they recommend is: when there are fewer than 12 people contagious per million, test and trace is sufficient. When we get up to 125 contagious per million, rigorous test and trace is required. At 320 contagious per million, we need rigorous test and trace and some stay-at-home restrictions. Greater than 320 contagious per million, stay-at-home restrictions are necessary. At the time of writing, the US had 1,290 contagious per million, down from 1,860 at the peak in late July. It's instructive to look at the per capita contagious in various countries and states when they decided to reopen. China and South Korea had just a handful of people contagious per million. Europe was in the tens of people contagious per million, except for Italy. The US had hundreds of people contagious per million when it decided to reopen. We should not really have reopened in May. This was an emotional decision, not a data-driven decision. Some more specifics about the US reopening. As I said, the per capita contagious in the US at the time of writing was 1,290 per million, with a reproduction of .94. With this per capita contagious and reproduction, it will take until the ninth of December to get below 320 contagious per million. The lowest reproduction during the April lockdown was .86.
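As a small restatement of those per capita contagious thresholds, a lookup like the following (a hypothetical helper, not from the talk) makes the tiers explicit.

```python
# Small helper restating the per capita contagious thresholds listed above
# (adapted from the Harvard Global Health Institute guidance cited in the talk).
def restriction_tier(contagious_per_million: float) -> str:
    if contagious_per_million < 12:
        return "test and trace"
    if contagious_per_million < 125:
        return "rigorous test and trace"
    if contagious_per_million < 320:
        return "rigorous test and trace plus some stay-at-home restrictions"
    return "stay-at-home restrictions"

print(restriction_tier(1290))   # the US figure at the time of writing -> stay-at-home restrictions
```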
Patrick Giuliano, Senior Quality Engineer, Abbott Charles Chen, Continuous Improvement Expert, Statistics-Aided Engineering (SAE), Applied Materials Mason Chen, High School Student, Stanford Online High School   Cooked foods such as dumplings are typically prepared without precise process control on cooking parameters. The objective of this work is to customize the cooking process for dumplings based on various dumpling product types. During the cooking process in dumpling preparation, the temperature of the water and the time duration of cooking are the most important factors in determining the degree to which dumplings are cooked (doneness). Dumpling weight, dumpling type, and batch size are also variables that impact the cooking process. We built a structured JMP DSD platform with special properties to build a predictive model of cooking duration. The internationally recognized ISO 22000 Food Safety Management and Hazard Analysis Critical Control Point (HACCP) schemas were adopted. JMP Neural Fit techniques using modern data mining algorithms were compared to RSM. Results demonstrated the prevalence of larger main effects from factors such as boiling temperature, product type, and dumpling size/batch, as well as interaction effects that were constrained by the mixture used in the dumpling composition. JMP Robust Design Optimization, Monte Carlo Simulation and HACCP Control Limits were employed in this design/analysis approach to understand and characterize the sensitivity of dumpling cooking factors on the resulting cooking duration. The holistic approach showed the synergistic benefit of combining models with different projective properties, where recursive partition-based AI models estimate interaction effects using classification schema and classical (Stepwise) regression modeling provides the capability to interpret interactions of 2nd order and higher, including potential curvature in quadratic terms. This paper has demonstrated a novel automated dumpling cooking process and analysis framework which may improve process throughput, lower the cost of energy, and reduce the cost of labor (using AI schema). This novel methodology has the potential to reshape thinking on business cost estimation and profit modeling in the food-service industry.     Auto-generated transcript...   Speaker Transcript Patrick Giuliano All right. Well, welcome everyone. Thank you all for taking the time to watch this presentation, Preparing the Freshest Steamed Dumplings. My name is Patrick Giuliano, and my co-authors are Mason Chen and Charles Chen from Applied Materials, as well as Yvanny Chang. Okay, so today I'm going to tell you about how my team and I harnessed the power of JMP to really understand dumpling cooking. And so the general problem statement here is that most foods like dumplings are made without precise control of cooking parameters. And the taste of a dumpling, as well as other outputs that measure how good a dumpling is, is adversely affected by improper cooking time, and this is intuitive to everyone who's enjoyed food, so we needn't talk too much about that. But sooner or later, AI and robotics will be an important part of the food industry.   
And our recent experience with Covid 19 has really highlighted that, and so I'm going to talk a little bit about how we can understand the dumpling process better using a very multi-faceted modeling approach, which uses many of JMP's modeling capabilities, including robust Monte Carlo design optimization. So why dumplings? Well, dumplings are very easy to cook. And by cooking them, of course, we kill any foreign particles that may be living on them. And cooking can involve very limited human interaction. So of course with that, the design and the process space related to cooking is very intuitive and extendable, and we can consider the physics associated with this problem and try to use JMP to help us really understand and investigate the physics better. AI is really coming sooner or later, because of Covid 19, of course. And why would robotic cooking of dumplings be coming? Other questions might be: what are the benefits? What are the challenges of cooking dumplings in an automated way, in a robotic setting? And of course, this could be a challenge because robots actually don't have a nose to smell with. And so because of that, that's a big reason why, in addition to an advanced and multifaceted modeling approach, it's important to consider some other structured criteria. And later in this presentation, I'm going to talk a little bit about the HACCP criteria and how we integrated that in order to solve our problem in a more structured way. Okay, so before I dive into a lot of the interesting JMP analysis, I'd like to briefly provide an introduction to heat transfer physics, food science and how different heat transfer mechanisms affect the cooking of dumplings. So as you can see in this slide, there's a Q at the top of the diagram in the upper right, and that Q refers to the heat flux density, which is the amount of energy that flows through a unit area per unit time, in the direction of temperature change. From the point of view of physics, proteins in raw and boiled meat differ in their amounts of energy. An activation energy barrier has to be overcome in order to turn the raw meat protein structure into a denatured or compactified structure, as shown here in this picture at the left. So the first task of the cook, when boiling meat in terms of physics, is to increase the temperature throughout the volume of the piece at least to reach the temperature of denaturation. Later, I'm going to talk about the most interesting finding of this particular phase of the experiment, where we discovered that there was a temperature cutoff. And intuitively, you would think that below a certain temperature dumplings wouldn't be cooked properly, they would be too soggy, and above a certain temperature, perhaps they would also be too soggy or they may be burned or crusty. One final note about the physics here is that at the threshold for boiling, the surface temperature of the water fluctuates and bubbles will rise to the surface of the boiler and break apart and collapse, and that can make it difficult to capture accurate readings of temperature. So that leads us to: what are some of the tools that we used to conduct the experiment? Well, of course, we used a boiling cooker, and that's very important.   
Of course, we used something to measure temperature, and for this we used an infrared thermometer; we used a timer, of course; and we used a mass balance to weigh the dumpling and all the constituents going into the dumpling. We might consider something called Gage R&R in future studies, where we may quantify the repeatability and reproducibility of our measurement tools. In this experiment, we didn't, but that is very important, because it helps us maximize the precision of our model estimates by minimizing the noise components associated with our measurement process. And those noise components could not only be a factor of, say, the accuracy tolerance for the gauge, but they could also be due to how the person interacts with the measurement itself. And in particular, I'm going to talk a little bit about the challenge of measuring boiling and cooking temperature at high temperature. Okay, so briefly, we set this experiment up as a designed experiment. And so we had to decide on the tools first. We had to decide on how we would make the dumpling, so we needed a manufacturing process and appropriate role players in that process. And then we had to design a structured experiment, and to do that we used a definitive screening design and looked at some characteristics of the design to ensure that the design was performing optimally for our experiment. Next we executed the experiment. And then we measured and recorded the response. And of course, finally, the fun part: we got to effectively interpret the data in JMP. And these graphs at the right here are showing scatter plot matrices generated in JMP, just using the Graph function. These actually give us an indication of the uniformity of the prediction space. I'll talk a little more about that in the coming slides. Okay, so here's our data collection plan, and at the right is the response that we measured, which is the dumpling rising time, or cooking time. We collected 18 runs in a DSD, which we generated in JMP using the DSD platform under the DOE menu. And we collected information on the mass of the meat, the mass of the veggies going in, the type of the meat, the composition of the vegetables (being either cabbage or mushroom), and of course the total weight, the size of the batch that we cooked, the number of dumplings per batch, and the water temperature. So this slide just highlights some of the amazing power of a DSD, and I won't go into this too much, but DSDs are much lauded for their flexible and powerful modeling characteristics. And they allow the great potential for conducting both screening and optimization in a single experiment. This chart at the right is a correlation matrix generated in JMP, in its design diagnostics platform under the DOE menu, and it's particularly powerful for showing the extent of aliasing or confounding among all the factor effects in your model. And what this graphic shows clearly is that with the darkest blue there's no correlation; as the correlation increases we get to shades of gray, and then finally, as we get to very positive correlation, we get to shades of red. 
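For readers who want to reproduce a correlation map like this outside JMP's design diagnostics, the sketch below expands a coded design into main-effect, interaction, and quadratic columns and computes absolute pairwise correlations. The file and factor names are placeholders, not the actual study table.

```python
# Rough recreation of the design-diagnostics correlation map described above:
# expand a design table into main-effect, two-factor-interaction, and quadratic
# columns, then look at the absolute correlations between them. Factor names
# here are placeholders for the dumpling study's coded (-1/0/+1) factors.
from itertools import combinations
import pandas as pd

design = pd.read_csv("dumpling_dsd.csv")          # hypothetical 18-run coded design
factors = ["meat", "veggie", "batch_size", "water_temp", "meat_type", "total_weight"]

terms = design[factors].copy()
for a, b in combinations(factors, 2):             # two-factor interactions
    terms[f"{a}*{b}"] = design[a] * design[b]
for a in factors:                                 # quadratic terms
    terms[f"{a}^2"] = design[a] ** 2

corr = terms.corr().abs().round(2)
print(corr.loc[factors, factors])                 # main effects vs main effects: ~0 off-diagonal for a DSD
```

Printing other blocks of `corr` (main effects against interactions, quadratics against quadratics) shows the same pattern of zero, partial, and stronger correlations that the color map encodes.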
So what we're really seeing is that main effects are completely uncorrelated with each other, which is what we like to see, and with two-factor interactions the main effects are also uncorrelated, as they are with quadratic effects, which are up in this right quadrant. And then the quadratic effects are actually only partially correlated with each other, and you have these higher-order interaction terms, which are really partially correlated with interaction effects. These types of characteristics make this particular design superior to the typical Resolution III and Resolution IV fractional factorial designs that we used to be taught before DSDs. Okay. So just quickly discussing a little bit about the design diagnostics. Here's a similar correlation plot, except the factors have actually been given their particular names after running this DSD design. And this is just a gray and white version of a correlation matrix to help you see the extent of orthogonality, or lack of it, among the factors. And so what you can see in our experiment is that we actually did observe a little bit of confounding between batch size and meat, unsurprisingly, and then, of course, between meat and the interaction between meat and the vegetables that are in the dumpling. And note that we imposed one design constraint here, which we did observe some confounding with, which is the very intuitive constraint that the total mass of the dumpling is the sum of each of the components of the dumpling itself. So why are we doing this? Why are we assessing this quote-unquote uniformity here, in the scatter plot matrix, and what is this kind of telling us? Well, in order to maximize prediction capability throughout the prediction space, for rising time in this case, we want to find the combinations of the factors that minimize the white areas. The white areas are where the prediction accuracy is thought to be weaker. And this is why we take the design and put it into a scatter plot matrix. This is analogous to the homogeneity of error assumption in ANOVA, where we look for the space of prediction to be equally probable, or the equal variance assumption in linear regression. We want this space to be equally probable across the range of the predictors. So in this experiment, of course, in order to reduce the number of factors that we're looking at, first we used engineering, our understanding of the engineering and the physics of the problem. And so we identified six independent variables, or variables that were least confounded with each other, and we proceeded with the analysis on the basis of these primary variables. Okay. So the first thing we did is take our generated design and use stepwise regression to try to simplify the model and identify only the active factors in the model. And here you can use a combination of forward selection, backward selection, and mixed as your selection approach, together with a stopping criterion, in order to determine the model that explains the most variation in your response. And I can also model meat type as discrete numeric, and in this way I can use this labeling to make the factor coding correspond to the meat type being the shrimp or the pork that we used.   
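JMP's Stepwise platform was used for this step; as a rough analogue only, the sketch below runs a forward-selection loop with a small-sample AIC (AICc) stopping rule on placeholder column names. It is meant to illustrate the idea of a stopping rule, which is discussed next, not to reproduce the authors' model.

```python
# Not JMP's Stepwise platform, but a rough forward-selection analogue in Python:
# add the candidate term that most improves a small-sample AIC (AICc) until no
# term helps. Column names are placeholders for the coded dumpling factors.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("dumpling_runs.csv")           # hypothetical 18-run results table
y = data["rising_time"]
candidates = ["meat", "veggie", "batch_size", "water_temp", "meat_type",
              "water_temp_sq", "water_temp_x_meat"]   # pre-built squared/interaction columns

def aicc(result):
    k, n = len(result.params), int(result.nobs)
    return result.aic + 2 * k * (k + 1) / (n - k - 1)

selected, best = [], float("inf")
improved = True
while improved:
    improved = False
    for term in [c for c in candidates if c not in selected]:
        X = sm.add_constant(data[selected + [term]])
        score = aicc(sm.OLS(y, X).fit())
        if score < best - 1e-9:
            best, best_term, improved = score, term, True
    if improved:
        selected.append(best_term)

print("selected terms:", selected, "AICc:", round(best, 1))
```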
So what kind of a stopping rule can you use in the framework of this type of a regression model? Well, unfortunately, when I ran this model again and again, I wasn't really able to reproduce it exactly. Model reproduction can be somewhat inconsistent, since this type of fitting schema involves a computational procedure to iterate to a solution. And so, in this stepwise regression, the overfit risk is typically higher. And oftentimes, if there's any curvature in the model or there are two-factor interactions, for example, the explanatory variance is shared across both of those factors, so you can't tease apart the variability associated with one or the other. And so what we can see here clearly, based on the adjusted R squared, is that we're getting a very good fit, and probably a fit that's too good, meaning that we can't predict the future based on the fit of this particular model. Okay. So here's where it gets pretty interesting. One of the things that we did first off, after running the stepwise, is that we assigned independent uniform inputs to each of the factors in the model. And this is a sort of Monte Carlo implementation in JMP, a different kind of Monte Carlo implementation. What's important to understand in this particular framework is that the difference between the main effect and the total effect can indicate the extent of interaction associated with a particular factor in the model. And so this is showing that, in particular, water temperature and meat, in addition to being most explanatory in terms of total effect, may likely interact with other factors in this particular model. And what you see, of course, is that we identified water temperature, meat, and meat type as our top predictors, using the Pareto plot of transformed estimates. The other thing I'd like to highlight here, before I launch into some of the other slides, is the sensitivity indicator that we can invoke under the profiler. After we assign independent uniform inputs, we can colorize the profiler to indicate the strength of the relationship between each of the input factors and the response. And we can also use the sensitivity indicator, which is represented by these purple triangles, to show us the sensitivity, or you could say the strength, of the relationship, similar to how a linear regression coefficient would indicate strength: the taller the triangle and the steeper the relationship, the stronger the effect, either in the positive or the negative direction, and the wider and flatter the triangle, the smaller the role that factor plays. Okay. So we went about reducing our model, using some engineering sense and using the stepwise platform. And what we get, and this is just a snapshot of our model fit originating from the DSD design, is a model with an RSM structure that includes curvature. And you can see this is an interaction plot, which shows the extent of interaction among all the factors in a pairwise way. And we've indicated where some of the interactions are present and what those interactions look like.   
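The independent-uniform-inputs study just described is close in spirit to a Sobol-style global sensitivity analysis. The sketch below assumes the SALib package and a placeholder response function standing in for the saved JMP model; the gap between the total effect (ST) and the first-order effect (S1) flags factors involved in interactions, mirroring the main-versus-total-effect comparison above.

```python
# The "independent uniform inputs" study described above, approximated outside
# JMP with a Sobol-style sensitivity analysis. Assumes the SALib package and a
# placeholder prediction function; bounds are assumed HACCP-style ranges.
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    "num_vars": 3,
    "names": ["water_temp", "meat", "batch_size"],
    "bounds": [[80, 95], [5, 15], [4, 12]],
}

def predicted_rising_time(x):
    temp, meat, batch = x
    # placeholder response surface, NOT the fitted dumpling model
    return 600 - 4.5 * (temp - 80) + 8 * meat + 5 * batch + 0.3 * (temp - 80) * meat

X = saltelli.sample(problem, 1024)
Y = np.apply_along_axis(predicted_rising_time, 1, X)
Si = sobol.analyze(problem, Y)

for name, s1, st in zip(problem["names"], Si["S1"], Si["ST"]):
    print(f"{name:12s} main effect {s1:5.2f}   total effect {st:5.2f}")
```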
So this is a model that we can really get more of a handle on. Okay, so I think one other thing to mention is that the design constraint we imposed is similar to what you might consider in a mixture design, where all the components add together and the constraint has to sum to 100%. Okay, so here's just a high-level picture of the profiler. We can adjust or modulate each of the input factors and then observe the impact on the response, and we did this in a very manual way, just to gain some intuition into how the model is performing. And of course, to optimize our cooking time, what we confirmed was that the time has to be faster and, of course, the variance associated with the cooking time should be lower. And the throughput and the power savings should be maximized. Those are two additional responses that we derived based on cooking time. Okay, so here's where we get more fully into the optimization of the cooking process. As I mentioned before, we created two additional response variables that are connected to the physics: maximum throughput, which depends on how many dumplings, but also on weight and time; and power savings, which is the product of the power consumed and the time for cooking, which is an energy component. And so, in order to engage in this optimization process, we need to introduce error associated with each of the input factors, and that's represented by these distributions at the bottom here. And we also need to consider the practical HACCP control window and, of course, measurement device capability, which is something that we would like to look at in future studies. And so here's just a nice picture of the HACCP control plan that we used, and this follows very similarly to something like a failure modes and effects analysis in the quality profession. It's just a structured approach to experimentation or manufacturing process development, where key variables and key owners are identified, along with what criteria are being measured against and how those criteria are being validated. HACCP is actually common in the food science industry, and it stands for Hazard Analysis Critical Control Point monitoring. And I think, in addition to all of these preparation activities, mainly I was involved in the context of this experiment as a data collector, and data integrity is a very important thing, so transcribing data appropriately is definitely important. So all the HACCP control points need to be incorporated into the Monte Carlo simulation range, and ultimately the HACCP tolerance range can be used to derive the process performance requirement. Okay, so we consider a range of inputs where Monte Carlo can confirm that this expected range is practical for the cooking time. We want to consider a small change in each of the input factors at each HACCP component level, and this is determined by the control point range. Based on the control point range, we can determine the delta x, the delta in each of the inputs, from the delta y in response time. And we can continue to increase the delta x incrementally and iteratively, while hoping that the increase is small enough so that the change in y is small enough to meet the specification. 
And usually in industry that's a design tolerance; in this case, it's our HACCP control point parameter range, or control parameter limit. And if that iterative procedure fails, then we have to make the increment in x smaller. We call this procedure tolerance allocation. Okay.   We did this somewhat manually, using our own special recipe, although it can be done in a more automated way in JMP. In this case, you can see we have all of our responses, so we could use multiple response optimization, which would involve invoking the desirability functions and maximizing the desirability under the prediction profiler, as well as Gaussian process modeling, which is also available under the prediction profiler.   Okay. So next, in the vein of using tools to complement each other and to further understand our process and our experiment, we used the neural modeling capability under the Analyze menu, under the predictive modeling tools, and we tried to use it to facilitate our prediction.   This model uses a TanH function, which can be more powerful for detecting curvature and nonlinear effects, but it's sort of a black box and it doesn't really tie back to the physics. So while it's also robust to non-normal responses and somewhat robust to aliasing and confounding, it has its limitations, particularly with a small sample size such as we have here. You can actually see that the R squared values for the training and validation sets are not the same — they vary — so this model isn't particularly consistent for the purposes of prediction.   Finally, we used the partition platform in JMP to run recursive partitioning on our rising time response.   This model is relatively easy to interpret in general, and I think particularly for our experiment, because we can see that for the rising time we have a temperature cutoff at about 85 degrees C, as well as some temperature differentiation with respect to maximum throughput; this 85 degree cutoff in particular is very interesting.   The R squared for this model is about 0.7, at least with respect to the rising time response, which is pretty good for this type of model considering the small sample size.   And what's most interesting with respect to this cutoff is that below 85 C, the water really wasn't boiling. There wasn't much bubbling, no turbulence, and the reading was very stable. However, as we increased the temperature, the water started to circulate; turbulence in the water caused non-uniform temperature, cavitation bubble collapse, and steam rising, and it's basically an unstable temperature environment.   In this type of environment, convection dominates rather than conduction.   Steam also blocks the light of the infrared thermometer, which further increases the uncertainty associated with the temperature measurement.   And the steam itself presents a burn risk which, in addition to being a safety concern, may affect the distance at which the operator places the thermometer, which is very important for accuracy of measurement.   This, in fact, was why we capped our design at 95 C: it was really impossible to measure water temperature accurately above that.   Okay.   So where have we arrived here?
Well, in summary, in this experiment we used a DSD (DOE) to collect the data.   Then we used stepwise regression to narrow down the important effects, but we didn't go too deep into the stepwise regression; we used common sense and engineering expertise to minimize the number of factors in our experiment.   We also used independent uniform inputs, which are very powerful for giving an idea of the magnitude of effects — for example, by colorizing the profiler, by looking at the rank of the effects, and by looking at the difference between the main effect and the total effect as an indication of interaction present in the model.   We also added sensitivity indicators under the profiler to help us quantify global sensitivity for the purposes of the Monte Carlo optimization scheme that we employed.   Among the main effects in the model, temperature is, of course, number one, and the physics really explains why, as I've shared in our findings.   In addition, between 80 and 90 degrees C, what we see from the profiler is a rapid transition and an increase in the sensitivity of the relationship between rising time and temperature, which is, of course, consistent with our experimental observations.   Secondly, with respect to factors interacting with each other — because there are really two different basic physics modes interacting, convection and conduction — the stepwise on the DSD is a good starting point, because it gives us a continuous model with no transformation, no advanced neural or black-box type transformation, so we can at least get a good handle on global sensitivity to begin with.   Our neural models and our partition models couldn't show us this, particularly given the small sample size in our experiment.   And finally, we used robust Monte Carlo simulation in our own framework, and we also did a little bit of multiple response optimization on rising time, throughput, and power consumption versus our important factors. Through this experiment, we began to really qualify and further our understanding of the most important factors using a multidisciplinary modeling approach.   Finally, I will share some references here that may be of interest. Thank you very much for your time.
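Editor's note: to make the tolerance-allocation loop summarized in this talk concrete, here is a minimal sketch. The model function, nominal settings, spec limit, and starting half-widths are all invented for illustration; they are not the speakers' actual recipe or data.

import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stand-in for the fitted cooking-time model and its spec limit.
def predicted_time(temp_c, meat_g, count):
    return 30 - 0.25 * (temp_c - 80) + 0.04 * meat_g + 0.6 * count

nominal = np.array([88.0, 40.0, 8.0])    # assumed HACCP target settings
upper_spec = 37.0                        # assumed design tolerance on cooking time (minutes)

def spec_met(delta, n=20000):
    """Monte Carlo check: with inputs varying uniformly +/- delta around nominal,
    does the 99th percentile of predicted cooking time stay within spec?"""
    x = rng.uniform(nominal - delta, nominal + delta, size=(n, 3))
    y = predicted_time(*x.T)
    return np.percentile(y, 99) <= upper_spec

# Tolerance allocation as described in the talk: widen the allowed control-point
# window on each input until the spec fails, then make the increment in x smaller.
delta = np.array([0.5, 1.0, 0.5])        # starting control-point half-widths
step = delta.copy()
for _ in range(40):
    trial = delta + step
    if spec_met(trial):
        delta = trial                    # spec still met: keep the wider window
    else:
        step = step / 2                  # failed: shrink the increment and retry
print("allocated half-widths (temp, meat, count):", np.round(delta, 2))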
Lucas Beverlin, Statistician, Intel, Corp.   The Model Comparison platform is an excellent tool for comparing various models fit within JMP Pro. However, it also has the ability to compare models fit from other software as well. In this presentation, we will use the Model Comparison platform to compare various models fit to the well-known Boston housing data set in JMP Pro15, Python, MATLAB, and R. Although JMP can interact with those environments, the Model Comparison platform can be used to analyze models fit from any software that can output its predictions.     Auto-generated transcript...   Speaker Transcript Lucas Okay, thanks everyone for coming and listen to my talk. My name is Lucas Beverlin. I'm a statistician at Intel. And today I'm going to talk about using JMP to compare models from various environments. Okay so currently JMP 15 Pro is the latest and greatest that JMP has out and if you want to fit the model and they've got several different tools to do that. There's the fit model platform. There's the neural platform on neural network partition platform. If you want classification and regression trees. The nonlinear platform for non linear modeling and there's several more. And so within JMP 15 I think it came out in 12 or 13 but this model comparison platform is a very nifty tool that you can use to compare model fits from various platforms within JMP. So if you have a tree and a neural network and you're not really sure which one's better. Okay, you could flip back and forth between the two. But now with this, you have everything on the same screen. It's very quick and easy to tell, is this better that that, so on so forth. So that being said, JMP can fit a lot of things, but, alas, it can't fit everything. So just give a few ideas of some things that can't fit. So, for example, those that do a lot of machine learning and AI might fit something like an auto encoder or convolutional neural network that generally requires lots of activation functions or yes, lots of hidden layers nodes, other activation functions than what's offered by JMP so JMP's not going to be able to do a whole lot of that within JMP. Another one is something called projection pursuit regression. Another one is called multivariate adaptive regression splines. So there are a few things unfortunately JMP can't do. R, Python, and MATLAB. There's several more out there, but I'm going to focus on those three. Now that being said, the ideas I'm going to discuss here, you want to go fit them in C++ or Java or Rust or whatever other language comes to mind, you should be able to use a lot of those. So we can use the model comparison platform, as I've said, to compare from other software as well. So what will you need? So the two big things you're going to need are the model predictions from whatever software you use to fit the model. And generally, when we do model fitting, particularly with larger models, you may split the data into training validation and/or test sets. You're going to need something that tells all the software which is training, which is validation, which is test, because you're going to want those to be consistent when you're comparing the fits. OK, so the biggest reason I chose R, Python and MATLAB to focus on for this talk is that turns out JMP and scripting language can actually create their own sessions of R and run code from it. So this picture here just shows very quickly if I wanted to fit a linear regression model to some output To to the Boston housing data set. 
I'll work a lot more with that data set later. But if you wanted to just very quickly fit a linear regression model in R and spit out the predictive values, you can do that. Of course you can do that JMP. But just to give a very simple idea. So, one thing to note so I'm using R 3.6.3 but JMP can handle anything as long as it's greater than 2.9. And then similarly, Python, you can call your own Python session. So here the picture shows I fit the linear regression with Python. I'm not going to step through all the lines of code here but you get the basic idea. Now, of course, with Python be a little bit careful in that the newest version of Python 3.8.5. But if you use Anaconda to install things, JMP has problems talking to it when it's greater than 3.6 so since I'm using 3.6.5 for this demonstration. And then lastly, we can create our own MATLAB session as well. So here I'm using MATLAB 2019b. But basically, as long as your MATLAB version has come out in the last seven or eight years, it should work just fine. Okay, so how do we tie everything together? So really, there's kind of a four-step process we're going to look at here. So first off, we want to fit each model. So we'll send each software the data and which set each observation resides. Once we have the models fit, we want to output those fits and their predictions and add them to a data table that JMP can look at. So of course my warning is, be sure you name things that you can tell where did you fit the model or how did you fit the model. I've examples of both coming up. So step three, depending on the model and you may want to look at some model diagnostics. Just because a model fits...appears to fit well based on the numbers, one look at your residual plot, for example, and you may find out real quickly the area of biggest interest is not fit very well. Or there's something wrong with residuals so on so forth. So we'll show how to output that sort of stuff as well. And then lastly we'll use the model comparison platform, really, to bring everything into one big table to compare numbers, much more easily as opposed to flipping back and forth and forth and back. Okay, so we'll break down the steps into a little more detail now. So for the first step where we do model fitting, we essentially have two options. So first off, we can tell JMP via JSL to call your software of choice. Send it the data and the code to fit it. And so, in fact, I'm gonna jump out of this for a moment and do exactly that. So you see here, I have some code for actually calling R. And then once it's done, I'll call Python and once it's done, I'll call MATLAB and then I'll tie everything together. Now I'll say more about the code here in a little bit, but it will take probably three or four minutes to run. So I'm going to do that now. And we'll come back to him when we're ready. So our other option is we create a data set with the validation. Well, we create a data set with the validation column and and/or a test column, depending on how many sets were splitting our data into. We're scheduled to run on whatever software, we need to run on, of course output from that whatever it is we need. So of course a few warnings. Make sure you're...whatever software you're using actually has what you need to fit the model. Make sure the model is finished fitting before you try to compare it to things. Make sure the output format is something JMP can actually read. Thankfully JMP can read quite a few things, so that's not the biggest of the four warnings. 
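Editor's note: the R and Python linear-regression calls in this talk are shown only as screenshots, so here is a rough sketch of what the external Python fit and prediction export might look like. The file name, the 'medv' response column, and the 'Validation' column are assumptions standing in for data exported from JMP; the actual code in the talk may differ.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed export from JMP: the 13 predictors, the response 'medv',
# and a 'Validation' column with values Training / Validation / Test.
df = pd.read_csv("boston_with_validation.csv")

X = df.drop(columns=["medv", "Validation"])
y = df["medv"]
train = df["Validation"] == "Training"   # fit only on the training rows

model = LinearRegression().fit(X[train], y[train])

# Predict every row so the predictions line up one-for-one with the JMP data table.
out = pd.DataFrame({"Pred medv Linear Reg Python": model.predict(X)})
out.to_csv("python_lm_predictions.csv", index=False)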
But as I've warned you earlier, make sure the predictions from each model correspond to the correct observations from the original data set. And so that comes back to the if it's training, if it's a training observation, when you fit it in JMP, it better be a training observation when you fit it in whatever software using. If it's validation in JMP, it is the validation elsewhere. It's test in JMP, it's going to be test elsewhere. So make sure things correspond correctly because the last thing we want to find out is to look at test sets and say, "Oh, this one fit way better." Well, it's because the observations fit in it were weighted different and didn't have any real outliers. So that ends up skewing your thinking. So a word of caution, excuse me, a word of caution there. Okay. So as I've alluded to, I have an example I'm currently running in the background. And so I want to give a little bit of detail as far as what I'm doing. So it turns out I'm going to fit neural networks in R and Python and MATLAB. So if I want to go about doing that, within R, two packages I need to install in R on top of whatever base installing have and that's the Keras package and the Tensorflow package. numpy, pandas and matplotlib. So numpy to do some calculations pretty easily; pandas, pandas to do data...some data manipulation; and matplotlib should be pretty straightforward to create some plots. And then in MATLAB I use the deep learning tool box, whether you have access to that are not. Okay. So step two, we want to add predictions to the JMP data table. So if you use JMP to call the software, you can use JSL code to retrieve those predictions and add them into a data table so then you can compare them later on. So then the other way you can go about doing is that the software ran on its own and save the output, you can quickly tell JMP, hey go pull that output file and then do some manipulation to bring the predictions into whatever data table you have storing your results. So now that we can also read the diagnostic plots. In this case what I generally am going to do is, I'm going to save those diagnostic plots as graphics files. So for me, it's going to be PNG files. But of course, whichever graphics you can use. Now JMP can't hit every single one under the sun, but I believe PNG that maps jpgs and some and they...they have your usual ones covered. So the second note I use this for the model comparison platform, but to help identify what you...what model you fit and where you fit it, I generally recommend adding the following property for each prediction column that you add. And so we see here, we're sending the data table of interest, this property called predicting. And so here we have the whatever it is you're using to predict things (now here in value probably isn't the best choice here) but but with this creator, this tells me, hey, what software did I use to actually create this model. And so here I used R. It shows R so this would actually fit on the screen. Python and MATLAB were a little too long, but we can put whatever string we want here. You'll see those when I go through the code in a little more detail here shortly. So, and this comes in handy because I'm going to fit multiple models within R later as well. So if I choose the column names properly and I have multiple ones where R created it, I still know what model I'm actually looking at. Okay, so this is what the typical model comparison dialog box looks like. 
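Editor's note: here is a rough sketch of the kind of Keras neural network fit described for Python, saving both predictions and a diagnostic PNG for JMP to pick up. The file names, column names, and the small architecture are assumptions for illustration; the talk's actual network and versions may differ.

import pandas as pd
import matplotlib
matplotlib.use("Agg")                              # write plots to files, no display needed
import matplotlib.pyplot as plt
from tensorflow import keras

df = pd.read_csv("boston_with_validation.csv")     # assumed export from JMP
X = df.drop(columns=["medv", "Validation"]).to_numpy()
y = df["medv"].to_numpy()
train = (df["Validation"] == "Training").to_numpy()
val = (df["Validation"] == "Validation").to_numpy()

mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
Xs = (X - mu) / sd                                 # scale using training rows only

# Small, arbitrary architecture purely for illustration.
model = keras.Sequential([
    keras.layers.Input(shape=(Xs.shape[1],)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
hist = model.fit(Xs[train], y[train], epochs=200, verbose=0,
                 validation_data=(Xs[val], y[val]))

# Diagnostic plot saved as a PNG so JMP can display it next to the model comparison.
plt.plot(hist.history["loss"], label="training")
plt.plot(hist.history["val_loss"], label="validation")
plt.xlabel("epoch"); plt.ylabel("MSE"); plt.legend()
plt.savefig("python_nn_history.png", dpi=150)

# Predictions for every row, in the original row order, for JMP to read back.
pd.DataFrame({"Pred medv NN Python": model.predict(Xs).ravel()}) \
  .to_csv("python_nn_predictions.csv", index=False)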
So one thing I'm going to note is that, so this is roughly what it would look like if I did a point and click at the end of all the model fitting. So you can see I have several predictors. So I've neural nets for a MATLAB, Python and R. Various prediction forms; I used to JMP to fit a few things. Now, oftentimes what folks will do is, they'll put this validation column as a group, so that it'll group the training validation and test. I actually like the output a bit better when I stick it in the By statement here. So I'll show that here a little later. But you can put it either or but I like the output this way better is the long and short of it. So this is the biggest reason why it is now I can clearly see, these are all the training, these are all the validation (shouldn't see by the headers) and these are all the test. If you use validation as a group variable, you're going to get one big table with 21 entries in it. Now, there'll be an extra column. It says training validation test or in my case, it will be 012 but this way with the words, I don't have to think as hard. I don't have to explain to anyone what 012 means so on so forth. So that was why I made the choice that I did. Okay, so in the example I'm gonna break down here, I'm going to use the classic Boston housing data set. Now this is included within JMP. So that's why I didn't include it as a file in my presentation because if you have JMP, you've already got it. So Harrison and Rubinfeld had several predictors of the median value of the house, such things such as per capita crime rate, the proportion of non retail business acres per town, average number of rooms within whatever it is you're trying to buy, pupil to the teacher ratio by town (so if there's a lot of teachers and not quite as many students that generally means better education is what a lot of people found) and several others. I'm not gonna really try to go through all 13 of them here. Okay, so let me give a little bit of background as far as what models I looked at here. And then I'm going to delve into the JSL code and how I fit everything. So some of the models, I've looked at. So first off, the quintessential linear regression model. So here you just see a simple linear regression. I just fit the median value to, looks like, by tax rate. But of course I'll use a multiple linear regression and use all of them. So, But with 13 different predictors, and knowing some of them might be correlated to one another, I decided that maybe a few other types of regression would be worth looking at. One of them is something called bridge regression. So really all it is, is it's linear regression with essentially an added constraint that the squared values of my parameters can't be larger than some constant. I can...turns out I can actually rewrite this as an optimization problem where some value of lambda corresponds to some value of C. And so then I'm just trying to minimize this with this extra penalty term, as opposed to the typical least squares that you're used to seeing. Now, this is called a shrinkage method because of course as I make this smaller and smaller, it's going to push all these closer and closer to zero. So, of course, some thought needs to be put into how restrictive do I want it to be. Now with shrinkage, it's going to push everybody slowly towards zero. But with another type of penalty term, I can actually eliminate some terms altogether. And I can use something called the lasso. 
And so the constraint here is, okay, instead of the squared parameter estimates, I'm just going to take the sum of the absolute values of those parameter estimates. And it turns out that, from that, the terms that are very weak actually get their parameter estimates set to zero, which serves as a kind of elimination of unimportant terms.   To give a little bit of a visual as to what lasso and ridge regression are doing: for ridge regression, the circle here represents the penalty term, and here we're looking at the parameter space. The true least squares estimates would be here, so we're not quite getting there, because we have this additional constraint. In the end, we find where we get the minimum value that touches the circle, and this would be our ridge regression parameter estimates.   For lasso, it's a similar drawing, but you can see that with the absolute value, this is more of a diamond as opposed to a circle. Now note, this is two dimensions; of course, in higher dimensions we get hyperspheres and related shapes. But notice it touches right at the tip of the diamond, and so in this case beta one is actually zero. So that's how it eliminates terms.   Okay, so another thing we're going to look at is what's called a regression tree. JMP uses the partition platform to fit these. Just to give a very quick demo of what this shows: I have all of my data, and the first question I ask myself is how many rooms are in the dwelling. I know I can't have 0.943 of a room, so basically, do I have six rooms or less? If so, I come down this part of the tree; if not, I come down the other part.   Now, if I have seven rooms or more, this tells me immediately I'm going to predict my median value to be 37 — remember it's in tens of thousands of dollars, so make that $370,000. If it's less than seven, then the next question I ask is, well, how big is lstat? If it's greater than or equal to 14.43, I'll look at this node, and my median housing estimate is about 150 grand; and if I come over here, it's going to be about 233 grand.   So what regression trees really do is partition your input space into different areas, and we give the same prediction to every value within an area. You can see here I've partitioned — in this case a two-dimensional example, because it's easier to draw — and you can see this tree, where I first look at x1, then look at x2, then ask another question about x1 and another question about x2, and that's how I end up partitioning the input space. Each of these five areas is going to have a prediction value, and that's essentially what this looks like: if I look at it from up top, I get exactly this, but the prediction is a little bit different depending upon which of the five areas you're in.   Now, I'm not going to get into too much of the detail on how exactly to fit one of these; James, Witten, Tibshirani and Friedman give a little bit, and Leo Breiman wrote the seminal book on it, so you can take a look there.   So next I'll come to neural networks, which are being used a lot in machine learning these days. This gives a visual of what a neural network looks like: here, the visual just uses five of the 13 inputs, passing them to this hidden layer, and each of these is transformed via an activation function.
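Editor's note: a minimal sketch of the ridge and lasso fits just described, using scikit-learn on standardized predictors. The data file name is the same hypothetical export assumed earlier, and the penalty strengths are arbitrary; the point is only to show ridge shrinking coefficients while lasso can zero some out entirely.

import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("boston_with_validation.csv")        # assumed export from JMP
X = df.drop(columns=["medv", "Validation"])
y = df["medv"]

Xs = StandardScaler().fit_transform(X)                # penalties need comparable scales

ols = LinearRegression().fit(Xs, y)
ridge = Ridge(alpha=10.0).fit(Xs, y)                  # arbitrary penalty strengths,
lasso = Lasso(alpha=0.5).fit(Xs, y)                   # chosen for illustration only

coefs = pd.DataFrame({"OLS": ols.coef_, "Ridge": ridge.coef_, "Lasso": lasso.coef_},
                     index=X.columns).round(2)
print(coefs)                                          # ridge shrinks estimates toward zero
print("lasso zeroed:", list(coefs.index[coefs["Lasso"] == 0]))  # lasso may drop terms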
And for each of these activation functions, you get an output and we'll use... oftentimes, we'll just use a linear regression of these outputs to predict the median value. Okay, so really, neural network are nonlinear models, at the end of the day, and really, they're called neural networks because the representation generally is how we viewed neurons as working within the human brain. So each input can be passed to nodes in a hidden layer. At the hidden layer your inputs are pushed through an activation function and some output is calculated and each output can be passed to a node in another hidden layer or be an output of the network. Now within JMP, you're only allowed two hidden layers. Truth be told, as far as creating a neural network, there's nothing that says you can't have 20 for all that we're concerned about now. Truth be told, there's some statistical theory that suggests that hey, we can approximate any continuous function, given a few boundary conditions, with two hidden layers. So that's likely why JMP made the decision that they did. linear, hyperbolic tangent and Gaussian radial basis. So in fact, on these nodes here, notice the little curve here. I believe that is for the hyperbolic tangent function; linear will be a straight line going up; and Gaussian radial basis, I believe, will look more like a normal curve. That's the neural network platform. So the last one we'll look at is something called projection pursuit regression. I wanted to pull something that JMP simply can't do just kind of give an example here. Um, so projection pursuit regression was a model originally proposed by Jerome Friedman and Steutzle over at Stanford. Their model takes prediction...makes predictions of the form y equals the summation of beta i, f sub i, and a linear transformation of your inputs. So really this is somewhat analogous to a neural network. You have one hidden layer here with k nodes and each with activation function f i L. Turns out with projection pursue regression, we're actually going to estimate these f sub i as well. Generally they're going to be some sort of smoother or a spline fit. Typically the f sub i are called ridge functions. Now we have alphas, we have betas and we have Fs we need to optimize over. So generally a stagewise fitting is done. I'm not going to get too deep in the details at this point. Okay, so I've kind of gone through all my models. So now I'm going to show some output and hopefully things look good. So one thing I'm going to note before I get into JMP is that it's really hard to set seeds for the neural networks in R, Python for Keras. So do note that if you take my code and run it, you're probably not going to get exactly what I got, but it should be pretty close. So with that said, let's see what we got here. So this was the output that I got. Now, unfortunately, things do not appear to have run perfectly. So, Lucas what do I have here? So I have my training, my validation, and my test. And so we see very quickly that one of these models didn't fit very well. The neural net within R unfortunately something horrible happened. It must have caught a bad spot in the input space to start from and whatnot. And so it just didn't fit a very good model. So unfortunately, starting parameters with nonlinear models matter; in some cases, we get bit by them. But if we take a look at everything else, everything else seems to fit decently well. Now what is decently well, we can argue over that, but I'm seeing R squares, one above .5. 
I'm seeing root average squared errors here around five or so, and even our average absolute errors are in the three range. Now for training, it looks like projection pursuit regressions did best. If I come down to the validation data set, it still looks like R projection pursuit did best. But if we look at the test data set, all of a sudden, no, projection pursuit regression was second, assuming we're gonna ignore the neural net from R, second worst. Oftentimes in a framework like this, we're going to look at the test data set the closest because it wasn't used in any way, shape, or form to determine the model fit. And we see based on that, It looks like the ridge regression from JMP fit best. We can see up here, it's R squared was .71 here before was about .73, and about .73 here, so we can see it's consistently fitting the same thing through all three data sets. So if I were forced to make a decision, just based on what I see at the moment, I would probably go with the ridge regression. So that being said, we have a whole bunch of diagnostics and whatnot down here. So if I want to look at what happened with that neural network from R, I can see very quickly, something happened just a few steps into there. As you can see, it's doing a very lousy job of fitting because pretty much everything is predicted to be 220 some thousand. So we know something went wrong during the fitting of this. So we saw the ridge regression looked like the best one. So let's take a look at what it spits out. So I'll show in a moment my JSL code real quick that shows how I did all this but, um, we can see here's the parameter estimates from the ridge regression. We can see the ridge diagnostic plots, so things hadn't really shrunk too much from the original estimates. You can see from validation testing with log like didn't whatnot. And over here on the right, we have our essentially residual plots. These are actual by predicted. So you can see from the training, looks like there was a few that were rather expensive that didn't get predicted very well. We see fewer here than in the test set, it doesn't really look like we had too much trouble. We have a couple of points here a little odd, but we can see for generally when we're in lower priced houses, it fits all three data sets fairly well. Again, we may want to ask ourselves what happened on these but, at the moment, this appears to be the best of the bunch. So we can see from others. See here. So we'll look at MATLAB for a moment. So you can see training test validation here as well. So here we're spitting out...MATLAB spits out one thing of diagnostics and you can see it took a few epochs to finish so. But thankfully MATLAB runs pretty quickly as we can tell. And then the actual by predicted here. We can see all this. Okay, so I'm going to take a few minutes now to take a look at the code. So of course, a few notes, make sure things are installed so you can actually run all this because if not, JMP's just going to fail miserably, not spit out predictions and then it's going to fail because it can't find the predictions. So JMP has the ability to create a validation column with code. So I did that I chose 60 20 20. I did choose that random seed here so that you can use the same training validation test sets that I do. So actually, for the moment, what I end up doing is I save what which ones are training, validation and test. I'm actually going to delete that column for a little bit. 
The reason I do that here is because I'm sending the data set to R, Python and MATLAB and it's easier to code when everything in it is either the output or all the inputs. So I didn't want a validation column that wasn't either and then it becomes a little more difficult do that. So what I ended up doing was I sent it the data set, I sent it which rows of training, validation, and test, and then I call the R code to run it. Now you can actually put the actual R code itself within here. I chose to just write one line here so that I don't have to scroll forever. But there's nothing stopping you. If it's only a few lines of code, like what you saw earlier in the presentation, I would just paste it right in here. So that once it's done, it turns out...this code spits out a picture of the diagnostics. We saw it stopped after six or seven iterations, let's have this say is that out. And also fits the ridge regression in this script so we get two pictures. So I spit that one out as well and save it and outline box. Now, these all put together at the end of all the code. And then I get the output and I'll add those to the data table here in a little while. Okay, so I give a little bit of code here in that. Let's say you have 6 million observations and it's going to take 24 hours to actually fit the model, you're probably not going to want to run it within JMP. So as a little bit of code that you could do from here, you can say, hey, I'm okay, I'm going to just open the data table I care about. I'm going to tell R to go run it somewhere else in the meantime, and once my system, when I gives me the green light that hey, it's done, I can say, okay, well go open the output from that and bring it into my data table. So this would be one way you could go about doing some of that. And of course you want to save those picture file somewhere and use this code as well. But this is gonna be the exact same code. Okay, so for Python, it's going to be very similar. I'm going to pass it these things. Run some Python code, spit out the diagnostic plot and spit out the predictions. And I give some Python code, you can see, it's very, very similar to what we did from JMP. I'm just going to go open some CSV file in this case. Copy the column in and close it, because I don't need it anymore. And then MATLAB again the exact same game. Asset things, run the MATLAB script. I get the PNG file that I spat out of here. Save it where I need to, save the predictions. And if you need to grab it rather than run it from within here, a little bit of sample code will do that. OK, so now that I'm done calling R, Python and Matlab, I bring back my validation columns so that JMP can use it. So since I remember which one's which. So by default, JMP looks at the values within the validation column which we'll use. The smallest value is training, the next largest is validation, the largest is test. Now if you do K fold cross validation, it'll tell it which fold it is. So coursing though 012345678 so on so forth. So then create this. I also then in turn created this here, so that way instead of 012, it'll actually say training, validation, and test in my output, so it's a little clearer to understand. So if I'm going to show someone else that's never run JMP before, they're not going to know what 012 means, but they should know a training, validation and test are. OK, so now I start adding the predictions to my data table. Um, so here's that set property I alluded to earlier in my talk. 
So my creator is MATLAB, and I've given the column a name so I know it's the neural net prediction from MATLAB. I may not necessarily need the creator property, but it helps in case I'm a little sloppy in naming things. We do the same for the projection pursuit regression, the neural nets, and so on. Then, I also noted that I fit ridge regression, lasso, and linear regression in JMP, so I did all of that here: I do my fit model, my generalized regression, get all of those spat out, save my prediction formulas, and plot my actual by predicted for my full output at the end.   Then I fit my neural network. I set the validation column, and I transform my covariates — neural networks generally tend to fit a little bit better when we scale things around zero as opposed to leaving the inputs on their original scale. My first hidden layer has three nodes, my second hidden layer has two nodes, and here they're both linear activation functions. It turns out that for the three networks above I used the rectified linear units activation function, so slightly different, but I found they seemed to fit about the same regardless. I also set the number of tours to 5, which means I'm going to try five different sets of starting values and keep whichever one does best. As you can tell from my code, I probably should have done that with R as well — run a for loop over several starting points and keep the one that does best — so for future work, that would be one spot I would go.   So then I save that output, and now I'm ready for the model comparison. I bring all those new columns into the model comparison and scroll over a little bit. Here I'm doing it by validation, as I alluded to earlier.   And lastly, I'm just doing a bit of coding to make it look the way I want it to look. I get these outline boxes that say training diagnostics, validation diagnostics, and test diagnostics, instead of the usual labels that JMP gives. I grab these diagnostic plots — here I'm saying I only want part of the output, so I'm grabbing a handle to that. I make some residual plots real quick, because not all of the tools instantly spit those out, particularly the ones from MATLAB, Python, and R. I set those titles, and then here I just create the big dialog. Then I journal everything so that it's nice and clean, and close a bunch of stuff out so I don't have to worry about things.   And then what I did at the end is this: what I wanted to happen is that when I pop one of these open, everything below it is immediately open, rather than having to click on six or seven different things. You can see I'd otherwise have to click here and here, and there are a few more over here. But this way, I don't have to click on any of these; they're automatically open. That's what this last bit of code does.   Okay. Lucas So this is just different output than if I had run it live, but this is what it can also look like, as I mentioned when we walked through the code.   So to wrap everything up, the model comparison platform is really a very nice tool for comparing the predictive ability of multiple models in one place. You don't have to flip back and forth between various things; you can just look at everything right in front of you. And its flexibility means it can even be used to compare models that weren't fit in JMP at all.
And so with this, if we need to fit very large models that take a long time to fit, we can tell the other software to go fit them, pull everything into JMP, and very easily look at all the results to try to determine next steps. And with that, thank you for your time.
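Editor's note: for readers working entirely outside JMP, the comparison measures this talk relies on (R square, root average squared error, average absolute error) can be tabulated directly from the saved predictions. A minimal sketch, assuming a CSV that holds the actual response, a Validation column, and one prediction column per model (all names hypothetical):

import numpy as np
import pandas as pd

df = pd.read_csv("boston_all_predictions.csv")     # assumed: actuals plus prediction columns
pred_cols = [c for c in df.columns if c.startswith("Pred ")]

rows = []
for subset, grp in df.groupby("Validation"):
    y = grp["medv"].to_numpy()
    for col in pred_cols:
        resid = y - grp[col].to_numpy()
        rows.append({
            "Set": subset,
            "Model": col,
            "RSquare": 1 - np.sum(resid**2) / np.sum((y - y.mean())**2),
            "RASE": np.sqrt(np.mean(resid**2)),     # root average squared error
            "AAE": np.mean(np.abs(resid)),          # average absolute error
        })

print(pd.DataFrame(rows).round(3).sort_values(["Set", "RSquare"], ascending=[True, False]))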
Shamgar McDowell, Senior Analytics and Reliability Engineer, GE Gas Power Engineering   Faced with the business need to reduce project cycle time and to standardize the process and outputs, the GE Gas Turbine Reliability Team turned to JMP for a solution. Using the JMP Scripting Language and JMP’s built-in Reliability and Survival platform, GE and a trusted third party created a tool to ingest previous model information and new empirical data which allows the user to interactively create updated reliability models and generate reports using standardized formats. The tool takes a task that would have previously taken days or weeks of manual data manipulation (in addition to tedious copying and pasting of images into PowerPoint) and allows a user to perform it in minutes. In addition to the time savings, the tool enables new team members to learn the modeling process faster and to focus less on data manipulation. The GE Gas Turbine Reliability Team continues to update and expand the capabilities of the tool based on business needs.       Auto-generated transcript...   Speaker Transcript Shamgar McDowell Maya Angelou famously said, "Do the best you can, until you know better. Then when you know better, do better." Good morning, good afternoon, good evening. I hope you're enjoying the JMP Discovery Summit, you're learning some better way ways of doing the things you need to do. I'm Shamgar McDowell, senior reliability and analytics engineer at GE Gas Power. I've been at GE for 15 years and have worked in sourcing, quality, manufacturing and engineering. Today I'm going to share a bit about our team's journey to automating reliability modeling using JMP. Perhaps your organization faces a similar challenge to the one I'm about to describe. As I walk you through how we approach this challenge, I hope our time together will provide you with some things to reflect upon as you look to improve the workflows in your own business context. So by way of background, I want to spend the next couple of slides, explain a little bit about GE Gas Power business. First off, our products. We make high tech, very large engines that have a variety of applications, but primarily they're used in the production of electricity. And from a technology standpoint, these machines are actually incredible feats of engineering with firing temperatures well above the melting point of the alloys used in the hot section. A single gas turbine can generate enough electricity to reliably power hundreds of thousands of homes. And just to give an idea of the size of these machines, this picture on the right you can see there's four adult human beings, which just kind of point to how big these machines really are. So I had to throw in a few gratuitous JMP graph building examples here. But the bubble plot and the tree map really underscore the global nature of our customer base. We are providing cleaner, accessible energy that people depend upon the world over, and that includes developing nations that historically might not have had access to power and the many life-changing effects that go with it. So as I've come to appreciate the impact that our work has on everyday lives of so many people worldwide, it's been both humbling and helpful in providing a purpose for what I do and the rest of our team does each day. So I'm part of the reliability analytics and data engineering team. 
Our team is responsible for providing our business with empirical risk and reliability models that are used in a number of different ways by internal teams. So in that context, we count on the analyst in our team to be able to focus on engineering tasks, such as understanding the physics that affect our components' quality and applicability of the data we use, and also trade offs in the modeling approaches and what's the best way to extract value from our data. These are, these are all value added tasks. Our process also entails that we go through a rigorous review with the chief engineers. So having a PowerPoint pitch containing the models is part of that process. And previously creating this presentation entailed significant copying and pasting and a variety of tools, and this was both time consuming and more prone to errors. So that's not value added. So we needed a solution that would provide our engineers greater time to focus on the value added tasks. It would also further standardize the process because those two things greater productivity and ability to focus on what matters, and further standardization. And so to that end, we use the mantra Automate the Boring Stuff. So I wanted to give you a feel for the scale of the data sets we used. Often the volume of the data that you're dealing with can dictate the direction you go in terms of solutions. And in our case, there's some variation but just as a general rule, we're dealing with thousands of gas turbines in the field, hundreds of track components in each unit, and then there's tens of inspections or reconditioning per component. So in in all, there's millions of records that we're dealing with. But typically, our models are targeted at specific configurations and thus, they're built on more limited data sets with 10,000 or fewer records, tens of thousands or fewer records. The other thing I was going to point out here is we often have over 100 columns in our data set. So there are challenges with this data size that made JMP a much better fit than something like an Excel based approach to doing this the same tasks. So, the first version of this tool, GE worked with a third party to develop using JMP scripting language. And the name of the tool is computer aided reliability modeling application or CARMA, with a c. And the amount of effort involved with building this out to what we have today is not trivial. This is a representation of that. You can see the number of scripts and code lines that testified to the scope and size of the tool as it's come to today. But it's also been proven to be a very useful tool for us. So as its time has gone on, we've seen the need to continue to develop and improve CARMA over time. And so in order to do this, we've had to grow and foster some in-house expertise in JSL coding and I oversee the work of developers that focus on this and some related tools. Message on this to you is that even after you create something like CARMA, there's going to be an ongoing investment required to maintain and keep the app relevant and evolve it as your business needs evolve. But it's both doable and the benefits are very real. A survey of our users this summer actually pointed to a net promoter score of 100% and at least 25% reduction in the cycle time to do a model update. So that's real time that's being saved. And then anecdotally, we also see where CARMA has surfaced issues in our process that we've been able to address that otherwise might have remained hidden and unable to address. 
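Editor's note: to give a feel for the kind of record filtering CARMA automates when it matches fleet data to a model's configuration, here is a minimal pandas sketch. The table, column names, and censor coding are invented for illustration and are not GE data or CARMA's actual JSL implementation.

import pandas as pd

# Tiny stand-in for the fleet records described in the talk (really millions of rows, 100+ columns).
records = pd.DataFrame({
    "unit": ["GT001", "GT001", "GT002", "GT003"],
    "exposures": [12000, 24000, 8000, 30000],
    "censor": [0, 1, 0, 0],                     # assumed coding: 1 = failure, 0 = suspension
    "combustion_system": ["A", "A", "B", "A"],
    "fuel_capability": ["gas", "gas", "dual", "gas"],
})

# One row of the active-model file: the configuration a given reliability model applies to.
model_config = {"combustion_system": "A", "fuel_capability": "gas"}

mask = pd.Series(True, index=records.index)
for col, value in model_config.items():
    mask &= records[col] == value               # keep only rows matching this model's configuration

subset = records[mask]
print(f"{len(subset)} of {len(records)} records apply to this model")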
And I have a quote — it's kind of long, but I wanted to pass on this caveat on automation from Bill Gates, who knows a thing or two about software development: "The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency." That's the end of the quote, but it's a great reminder that automation is not a silver bullet that will fix a broken process; we still need people to do that today.   Okay, so before we do a demonstration of the tool, I just wanted to give a high-level overview of the inputs and outputs in CARMA. The user has to point the tool to the input files. Over here on the left, you see we have an active models file, which is essentially the already-approved models, and then we have the empirical data. Then, in the user interface, the user does some modeling activities, and the outputs are the running models — updates to the active models — and a PowerPoint presentation, and we'll look at that as well.   As background for the data I'll be using in the demo, I started with the locomotive data set that JMP provides as sample data, which gives one population, and then I added two additional populations of models. The big message here is that what we're going to see is all made-up data. It's not real; it doesn't represent the functionality or the behavior of any of our parts in the field — it's all contrived. So keep that in mind as we go through the results, but it should give us a way to look at the tool nonetheless.   So I'm going to switch over to JMP for a second; I'm using JMP 15.2 for the demo. This data set is simplified compared to what we normally see, but like I said, it should exercise the core functionality in CARMA. First, I'm just going to go to the Help menu, Sample Data, and you'll see the Reliability and Survival menu here — that's where we're going. One of the nice things about JMP is that it has specialized tools and functionality for a lot of different disciplines, and in my case, with reliability, there's a lot here, which also lends to the value of using JMP as a home for CARMA.   I wanted to point you to the locomotive data set and show you that it originally came out of a textbook, Applied Life Data Analysis. In that book, there's a problem that asks what the risk is at 80,000 exposures, and we're going to model that today in our data set in what we've called an oxidation model; CARMA will give us an answer. It's a really simple answer, but I was going to show you that you can get the same thing by clicking in the analysis menu: go to Analyze, Reliability and Survival, Life Distribution, put the time and censor columns where they need to go, and use a Weibull, so it creates a fit for that data. The two parameters I wanted to point out are the beta, 2.3, and then what's called the Weibull alpha here — in our tool it'll be called Eta — which is 183.   Okay, so we've seen how to do that here. Now I just want to look at a couple of the other files, the input files, so I'll pull those up. Okay, this is the model file. I mentioned I made three models, and these are the active models that we're going to be comparing the data against.
You'll see that oxidation is the first one, as I mentioned, and in addition to having model parameters, it has some configuration information. These are just two simple examples here (combustion system, fuel capability), but there are many, many more columns like them. Essentially, one of the things I like about CARMA is that when you have a large data set with a lot of different, varied configurations, it can go through and find which of those rows of records applies to your model and do that sorting in real time, and it can do that for every model you need in the data set. That's what we're going to demonstrate.   Let's also jump over to the empirical data for a minute. Just to highlight: we have a censor column, we have exposures, we have the interval at which we're going to evaluate those exposures, we have modes, and then the last two columns are the ones I just talked about, combustion system and fuel capability.   Okay, so let's start up CARMA as an add-in — I'll just get it going — and you'll see I already have it pointing to the location I want to use. In today's presentation, I'm not going to have time to talk through all of the features that are in here, but they're all things that can help you look at your data, decide the best way to model it, and do some checks on it before you finalize your models. For the purposes of time, I'm not going to explain and demonstrate all of that; I just want to take a minute to build the three models we talked about and create a presentation, so you can see that portion of the functionality.   So we've got oxidation. We see the number of failures and suspensions — the same as what you'll see in the textbook — and we add that. Let's scroll down for a second. That's the first model added, Oxidation. We see the old model had 30 failures and 50 suspensions; this one has 37 and 59. The beta is 2.33, like we saw externally, and the Eta is 183. And the answer to the textbook question, the risk at 80,000 exposures, is about 13.5% using a Weibull model. So that's a high-level way to do that here.   Let's also look at adding the other two models. Okay, we've got cracking, and I'm adding in creep. You'll see in here there are different boxes presented that represent things like the combustion system or the fuel capability where, for this given model, this is what the LDM file calls for. But if I wanted to change that, I could select other configurations here, and that would change which rows of failures and suspensions get included. I can then create new populations and segment the data accordingly.   Okay, so we've gotten all three models added, and we're not going to spend more time playing with the model options; I'm going to generate a report. I have some options on what to include in the report: a presentation, and an LDM file — the running models — that comes out as a table. All right, so I just need to select the appropriate folder where I want my presentation to go, and now it's going to take a minute to go through and generate this report. This does take a minute.
But I think what I would just contrast it to is the hours that it would take normally to do this same task, potentially, if you were working outside of the tool. And so now we're ready to finalize the report. Save it. And save the folder and now it's done. It's, it's in there and we can review it. The other thing I'll point out, as I pull up, I'd already generated this previously, so I'll just pull up the file that I already generated and we can look through it. But there's, it's this is a template. It's meant for speed, but this can be further customized after you make it, or you can leave placeholders, you can modify the slides after you've generated them. It's doing more than just the life distribution modeling that I kind of highlighted initially. It's doing a lot of summary work, summarizing the data included in each model, which, of course, JMP is very good for. It, it does some work comparing the models, so you can do a variety of statistical tests. Use JMP. And again, JMP is great at that. So that, that adds that functionality. Some of the things our reviewers like to see and how the models have changed year over year, you have more data, include less. How does it affect the parameters? How does it change your risk numbers? Plots of course you get a lot of data out of scatter plots and things of that nature. There's a summary that includes some of the configuration information we talked about, as well as the final parameters. And it does this for each of the three models, as well as just a risk roll up at the end for for all these combined. So that was a quick walkthrough. The demo. I think we we've covered everything I wanted to do. Hopefully we'll get to talk a little more in Q&A if you have more questions. It's hard to anticipate everything. But I just wanted to talk to some of the benefits again. I've mentioned this previously, but we've seen productivity increases as a result of CARMA, so that's a benefit. Of course standardization our modeling process is increased and that also allows team members who are newer to focus more on the process and learning it versus working with tools, which, in the end, helps them come up to speed faster. And then there's also increased employee engagement by allowing engineers to use their minds where they can make the biggest impact. So I also wanted to be sure to thank Melissa Seely, Brad Foulkes, Preston Kemp and Waldemar Zero for their contributions to this presentation. I owe them a debt of gratitude for all they've done in supporting it. And I want to thank you for your time. I've enjoyed sharing our journey towards improvement with you all today. I hope we have a chance to connect in the Q&A time, but either way, enjoy the rest of the summit.  
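Editor's note: the life-distribution fit demonstrated in this talk — a two-parameter Weibull with failures and suspensions, plus the cumulative risk at 80,000 exposures — can be reproduced outside JMP by maximizing the censored Weibull likelihood directly. A minimal sketch follows; the data below are placeholders, not the talk's actual records.

import numpy as np
from scipy.optimize import minimize

# Made-up exposures in thousands of cycles; event=1 is a failure, event=0 a suspension.
t = np.array([ 23.,  51.,  84., 102., 135., 158., 176., 190., 200., 200., 200., 200.])
event = np.array([  1,    1,    1,    1,    1,    1,    1,    1,    0,    0,    0,    0])

def neg_log_lik(params):
    log_beta, log_eta = params                 # optimize on the log scale to keep parameters positive
    beta, eta = np.exp(log_beta), np.exp(log_eta)
    z = (t / eta) ** beta
    log_f = np.log(beta / eta) + (beta - 1) * np.log(t / eta) - z   # failures: log density
    log_s = -z                                                      # suspensions: log survival
    return -np.sum(np.where(event == 1, log_f, log_s))

fit = minimize(neg_log_lik, x0=[0.0, np.log(t.mean())], method="Nelder-Mead")
beta_hat, eta_hat = np.exp(fit.x)
risk_80 = 1 - np.exp(-(80.0 / eta_hat) ** beta_hat)   # F(80): cumulative failure risk
print(f"beta={beta_hat:.2f}  eta={eta_hat:.1f}  risk at 80k exposures={risk_80:.1%}")
# For reference, the Weibull estimates quoted in the talk (beta 2.33, eta 183) imply
# a risk at 80,000 exposures of about 13.5%.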
Monday, October 12, 2020
Laura Castro-Schilo, JMP Research Statistician Developer, SAS Institute, JMP Division James Koepfler, Research Statistician Tester, SAS Institute, JMP Division   This presentation provides a detailed introduction to Structural Equation Modeling (SEM) by covering key foundational concepts that enable analysts, from all backgrounds, to use this statistical technique. We start with comparisons to regression analysis to facilitate understanding of the SEM framework. We show how to leverage observed variables to estimate latent variables, account for measurement error, improve future measurement and improve estimates of linear models. Moreover, we emphasize key questions analysts’ can tackle with SEM and show how to answer those questions with examples using real data. Attendees will learn how to perform path analysis and confirmatory factor analysis, assess model fit, compare alternative models and interpret all the results provided in the SEM platform of JMP Pro.   PRESENTATION MATERIALS The slides and supplemental materials from this presentation are available for download here. Check out the list of resources at the end of this blog post to learn more about SEM. The article and data used for the examples of this presentation are available in this link.     Auto-generated transcript...   Speaker Transcript Laura Castro-Schilo Hello everyone. Welcome. This session is all about the ABCs of Structural Equation modeling and what I'm going to try is to leave you with enough tools to be able to feel comfortable   to specify models and interpret the results of models fit using the structural equation modeling platform and JMP Pro.   Now what we're going to do is start by giving you an introduction of what structural equation modeling is and particularly drawing on the connections it has to factor analysis and regression analysis.   And then we're going to talk a little bit about how path diagrams are used in SEM and their important role within this modeling framework.   I'm going to try to keep that intro short so that we can really spend time on our hands-on examples. So after the intro, I'm going to introduce the data that I'm going to be using for demonstration. It turns out these data are   about perceptions of threats of the Covid 19 virus. So after introducing those data,   we're going to start looking at how to specify models and interpret their results within the platform, specifically by answering   a few questions. Now, these, these questions are going to allow us to touch on two very important techniques that can be done with an SEM. One is confirmatory factor analysis and also multivariate regression.   And to wrap it all up, I'm going to show you just a brief   model in which we bring together both the confirmatory factor model, regression models and that way you can really see the potential that SEM has for using it with your own data for your own work.   Alright, so what is SEM? Structural equation modeling is a framework where factor analysis and regression analysis come together.   And from the factor analysis side, what we're able to get is the ability to measure things that we do not observe directly.   And on the regression side, we're able to examine relations across variables, whether they're observed or unobserved. So when you bring those two tools together, we end up with a very flexible framework, which is SEM, where we can fit a number of different models.   
Path diagrams are a unique tool within SEM, because all statistical structural equation models, the actual models, can be depicted through a diagram. And so we have to learn just some   notation as to how those diagrams are drawn. And so squares represent observed variables,   circles represent latent variables, variances or covariances are represented with double-headed arrows, and regressions and loadings are   represented with one-headed arrows. Now, as a side note, there's also a triangle that is used for path diagrams.   But that's outside the scope of what we're going to talk about today. The triangle is used to represent means and intercepts. And unfortunately, we just don't have enough time to talk about all of the awesome things we can do with means as well.   Alright. But I also want to show you just some fundamental...kind of the building blocks of   x, which is in a box, is predicting y,   which is also in a box. So we know those are observed variables and each of them have double-headed arrows that start and end on themselves, meaning those are variances.   For x, this arrow is simply its variants, but for y, the double-headed arrows represent a residual variance.   Now, of course, you might know that in SEM, an outcome can be both an outcome and a predictor. So you can imagine having another variable z that y could also predict, so we can put together as many regressions as we are interested in within one model.   The second basic block for building SEMs are confirmatory factor models and at the most basic   most basic example of that is is shown right here, where we're specifying one factor or one latent variable, which is unobserved but it's   it's shown here with arrows pointing to w, x, and y because the latent variable is thought to cause the common variability we see across w, x, and y.   Now, this right here is called confirmatory factor model, it's, it's only one factor. And I think it's really important to understand the distinctions of a factor in the factor analytic   perspective and distinguish that from principal components or principal component analysis, which sometimes are easy to get confused.   So I'll take a quick tangent to show you what's different about a factor from a factor analytic perspective and from PCA.   So here these squares are meant to represent observed variables, the things that we're measuring, and I colored here in blue   different amounts from each observed variable, which represents the variance that those variables are intended to measure. So it's kind of like the signal. I mean, there's   it's this stuff we're really interested in. And then we have these gray areas which represent a proportion of variance that comes from other sources. It can be systematic variance, but it's simply variance that is not   what we want it to pick up from our measuring instrument. So, it contains   sources are variance that are unique to each of these variables and they also contain measurement error.   So what's the difference between factor analysis and PCA is that in factor analysis, a latent variable is going to capture only the common variance that   exists across all of these observed variables, and that is the part that the latent variable accounts for.   And that is in contrast to PCA where a principle component represents the maximal amount of variance that can be explained in the dimensions of the data.   
And so you can see that the principal component is going to pick up as much variance as it can explain   and that means that there's going to be an amalgamation of variance as due to what we intended to measure and perhaps other sources of variance as well. So this is a very important distinction,   because when we want to measure unobserved variables, factor analysis is indeed a better choice, unless you know if our goal truly is dimension reduction, then PCA is an ideal tool for that.   Also notice here that the arrows are pointing in different directions. And that's because in factor analysis, there really is a an underlying assumption that that unobserved variable is causing the variability we observe.   And so that is not the case in PCA, so you can see that distinction here from the diagram. So anytime we talk about   a factor or a latent variable in SEM, we're most likely talking about a latent variable from this perspective of factor analysis. Now here's a   large structural equation model where I have put together a number of different elements from those building blocks I showed before. When we see numbers because we've estimated this model.   And you see that there's observed variables that are predicting other variables. We have some sequential relations, meaning that   one variable leads to another, which then leads to another. This is also an interesting example because we have a latent variable here that's being predicted by some observed variable. But this latent variable also   in turn, predicts other variables. And so it illustrates, this diagram illustrates nicely, a number of uses that we can have in, for basically reasons that we can have for using SEM, including the five that we have unobserved variables that we want to model within a larger   context. We want to perhaps account for measurement error. And that's important because latent variables are purged of measurement error because they only account for the variance that's in common across their indicators.   We...if you also have the need to study sequential relations across variables, whether they're observed or unobserved, SEM is an excellent tool for that.   And lastly, if you have missing data, SEM can also be really helpful even if you just have a regression, because   all of the data that are available to you will be used in SEM during estimation, at least within the algorithms that we have in JMP pro.   So that...those are really great reasons for using SEM. Now I want to use this diagram as well to introduce some important terminology that you'll for sure come across multiple times if you decide that SEM is a   a tool that you are going to use in your own work. So we talked about observed variables. Those are also called manifest variables in the SEM jargon.   There's latent variables. There are also variables called exogenous. In this example, there's only two of them.   And those are variables that only predict other variables and they are in contrast to endogenous variables, which actually have variables predicting them. So here, all of these other variables are endogenous.   Latent variables, the manifest variables that they point to, that they predict, those are also called latent variable indicators.   And lastly, we talked about the unique factor variance from a factor analytic perspective. 
And those are these residual variances from a factor,   from a latent variable, which is variance that is not explained by the latent variable, and represents both the combination of systematic variance that's unique to that variable plus measurement error.   All right.   I have found that in order to understand structural equation modeling in a bit easier way, it's important to shift our focus into realizing that the data that we're really   modeling under the hood is the covariance structure of our data. We also model the means, the means structure, but again, that's outside the scope of the presentation today.   But it's important to think about this because it has implications for what we think our data are.   You know, we're used to seeing our data tables and we have rows and variables are our columns, and and yes that is...   these data can be used to launch our SEM platform. However, our algorithms, deep down, are actually analyzing the covariance structure of those variables. So when you think about the data, these are really the data that are   being modeled under the hood. That also has implications for when we think about residuals, because residuals in SEM are with respect to variances and covariances of   you know, those that we estimate, in contrast to those that we have in the sample. So residuals, again, we need a little bit of a shift in focus to what we're used to, from other standard statistical techniques to really   wrap our heads around what structural equation models are. And things...concepts such as degrees of freedom are also going to be degrees of freedom with respect to the covariance   matrix. And so once we make this shift in focus, I think it's a lot easier to understand structural equation models.   Now I want to go over a brief example here of how is it that SEM estimation works. And so usually what we do is we start by specifying a model and the most   exciting thing about JMP Pro and our SEM platform is that we can specify that model directly through a path diagram, which makes it so much more intuitive.   And so that path diagram, as we're drawing it, what we're doing is we're specifying a statistical model that implies the   structure of the covariance matrix of the data. So the model implies that covariance structure, but then of course we also have access to the sample covariance matrix.   And so what happens during estimation is that we try to match the values from the sample covariance as close as possible, given what the model is telling us the relations between the data are, not within the variables.   And once we have our model estimates, then we use those to get an actual   model-implied covariance matrix that we can compare against the sample covariance matrix. So, by looking at the difference between those two matrices,   we are able to obtain residuals, which allows us to get a sense of how good our models fit or don't fit. So in a nutshell, that is how structural equation modeling works.   Now we are done with the intro. And so I want to introduce to you, tell you a little bit of context for the data I'm going to be using for our demo.   These data come from a very recently published article in the Journal of Social, Psychological and Personality Science.   And the authors in this paper wanted to answer a straightforward question. They said, "How do perceived threats of Covid 19 impact our well being and public health behaviors?" 
Now what's really interesting about this question is that perceived threats of   Covid 19 is a very abstract concept. It is a construct for which we don't have an instrument to measure it. It's something we do not observe directly.   And that is why in this article, they had to figure out first, how to measure those perceived threats.   And the article focused on two specific types of threats. They call the first realistic threats, because they're related to physical or financial safety.   And also symbolic threats, which are those threats that are posed on one's sociocultural identity.   And so they actually came up with this final threat scale, they went through a very rigorous process to develop this survey, this scale to measure realistic threat and symbolic threat.   Here you can see here that people had to answer how much of a threat, if any, is the corona virus outbreak for your personal health,   you know, the US economy, what it means to be an American, American values and traditions. So basically, these questions are focusing on two different aspects of the   threat of the virus, one that is they labeled realistic, because it deals with personal and financial health issues.   And then the symbolic threat is more about that social cultural component. So you can see all of those questions here.   And we got the data from one of the three studies that they that they did. And those data from 550 participants who answered all of these questions in addition to a number of other surveys and so we'll be able to use those data in our demo.   And we're going to answer some very specific questions. The very first one is, how do we go about measuring perceptions of Covid 19 threats. There's two types of threats, we're interested in. And the question is, how do we do this, given that it's such an abstract concept.   And this will take us to talk about confirmatory factor analysis and ways in which we can assess the validity and reliability of our surveys.   One thing we're not going to talk about though is the very first step, which is exploratory factor analysis. That is something that we do outside of SEM   and it is something that should be done as the very first step to develop a new scale or a new survey, but here we're going to pick up from the confirmatory factor analysis step.   A second question is do perceptions of Covid 19 threats predict well being and public health behaviors? And this will take us to talk more about regression concepts.   And lastly, are effects of each type of threat on outcomes equal? And this is where we're going to learn about a very unique   helpful feature of SEM, which is allowing us to impose equality constraints on different effects within our models and being able to do systematic model comparisons to answer these types of questions.   Okay, so it's time for our demo. So let me go ahead and show you the data, which I've already have open right here.   Notice my data tables has a ton of columns, because there's a number of different surveys that participants   responded to, but the first 10 columns here are those questions, or the answers to the questions I showed you in that survey.   And so what we're going to do is go to analyze, multivariate methods, structural equation models and we're going to use those   answers from the survey. All of the 10 items, we're going to click model variables so that we can launch our platform with those data.   Now, right now I have a data table that has one observation per row. 
That's what the wide data format is, and so that's what I'm going to going to use.   But notice there's another tab for summarize data. So if you have only the correlation or covariance matrix, you can now input that in...well, you will be able to do it in JMP 16,   JMP Pro 16 and so that's another option because, remember that at the end of the day, what we're really doing here is modeling covariance structures. So you can use summarize data to launch the platform.   Alright, so let's go ahead and click OK. And the first thing we see is this model specification window which allows us to do all sorts of fun things.   Let's see, on the far right here we have the diagram and notice our diagram has a default specification. So our variables   all have double headed arrows, which means they all have a variance   They also have a mean, but notice if I'm right clicking on the canvas here and I get some options to show the means or intercepts. So   again, this is outside the scope of today, so I'm going to keep those hidden but do know that the default model in the platform has variances and means for every variable that we observe.   The list tab contains the same information as the diagram, but in a list format and it will split up your   paths or effects based on their type. We also have a status step which gives you a ton of information about the model that you have at that very moment. So right now, it tells us,   you know, the degrees of freedom we have, given that this is the model we have specified here is just the default model. And it also tells us, you know, data details and other useful things.   Notice that this little check mark here changes if there is a problem with your model. So as you're specifying your model, if we encounter something that looks problematic or an error, this tab will change in color and type   and so it will be helpful to hopefully help you solve any type of issues with the specification of the model.   Okay, so on the far left, we have an area for having a model name. And we also have from and to lists. And so this makes it very easy to select variables here,   in the from and then in a to role, wherever those might be. And we can connect them with single-headed arrows or double-headed arrows, which we know, they are regressions or loadings or variances or covariances.   Now for our case right now, we really want to figure out how do we measure this unobserved construct of perceptions of Covid 19   threat. And I know that the first five items that I have here are the answers to questions that the authors labeled realistic threats. So I'm going to select those   variables and now here we're going to change the default name of latent one to realistic because that's going to be the realistic threat latent variable. So I click the plus button. And notice, we automatically give you   this confirmatory factor model with one factor for realistic threat. An interesting observation here is that there is a 1 on this first loading   that indicates that this path, this effect of the latent variable on the first observed variable is fixed to one.   And we need to have that, because otherwise our model would not be identified.   So we will, by default, fix what your first loading to one in order to identify the model and be able to get all of your estimates.   An alternative would be to fix the variance of the latent variable to one, which would also help identify the model, but it's a matter of choice which you do.   Alright, so we have a model for realistic threat. 
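Written out, the one-factor model just specified for the five realistic-threat items is (standard CFA notation, added for reference):

$$x_1 = 1\cdot\eta + \varepsilon_1, \qquad x_j = \lambda_j\,\eta + \varepsilon_j \;(j = 2,\dots,5),$$

with $\mathrm{Var}(\eta)=\psi$ and $\mathrm{Var}(\varepsilon_j)=\theta_j$, so the model-implied covariance matrix is $\Sigma(\theta) = \boldsymbol{\lambda}\,\psi\,\boldsymbol{\lambda}^{\top} + \boldsymbol{\Theta}$. Fixing the first loading to 1, as the platform does by default, gives the latent variable the scale of its first indicator; the alternative mentioned here, fixing $\psi = 1$, standardizes the latent variable instead. Either constraint identifies the model, and estimation then chooses the remaining parameters so that $\Sigma(\hat{\theta})$ reproduces the sample covariance matrix as closely as possible, which is the matching idea described in the introduction.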
And now I'm going to select those that are symbolic threat and I will call this symbolic and we're going to hit go ahead and add   our latent variable for that. I changed my resolution. And so now we are seeing a little bit less than what I was expecting. But here we go. There's our model. Now   we might want to   specify, and this is actually very important,   realistic and symbolic threats. We expect those to vary, to co-vary and therefore, we would select them in our from and to list and click on our double-headed arrow to add that covariance.   And so notice here, this is our full two factor confirmatory factor analysis. So we can name our model and we're ready to run. So let's go ahead and click Run.   And we have all of our estimates very quickly. Everything is here. Now   what I want to draw your attention to though, is the model comparison table. And the reason is because we want to make sure that our model fits well before we jump into trying to interpret the results.   So let's talk about what shows up in this table. First let's note that there's two models here that we did not fit but the platform fits them by default upon launching.   And we use those as sort of a reference, a baseline that we can use to compare a model against. The unrestricted model I will show you here,   what it is, if we had every single variable covarying with each other, that right there is the unrestricted model. In other words, is a baseline for what is the best we can do in terms of fit.   Now the chi square statistic represents the amount of misfit in a model. And because we are estimating every possible variance and covariance without any restrictions here,   that's why the chi square is zero. Now, we also have zero degrees of freedom because we're estimating everything. And so it's a perfectly fitting model and it serves as a baseline for...   serves as a baseline for understanding what's the very best we can do.   All right, then the independence model is actually the model that we have here as a default when we first launched the platform. So that is a baseline for the worst, maybe not the worst, but a pretty bad model.   It's one where the variables are completely uncorrelated with each other. And you can see indeed that the chi square statistic jumps to about 2000 units. But of course, we now have 45 degrees of freedom because we're not estimating much at all in this model.   And then lastly we have our own two factor confirmatory factor model and we also see that the chi square is large, is 147   with 34 degrees of freedom. It's a lot smaller than the independence model, so we're doing better, thankfully, but it's still a significant chi square, suggesting that there's significant misfit in the data.   Now, here's one big challenge with SEM. Back in the day was that people realize that the chi square is impacted, it's influenced by the sample size. And here we have 550 observations. So it's very likely that even well fitting models are going to have a little   Going to turn out to be significant because of the large sample size. So what has happened is that fit indices that are unique to SEM have been developed to allow us to assess model fit, irrespective of the chi square and that's where these baseline models come in.   The first one right here is called the comparative fit index. It actually ranges from zero to one. You can see here that one is the value for a perfectly fitting model and zero is the value for the really poor fitting model, the independence model.   
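For reference, the two fit statistics discussed here and in the next section are conventionally defined as follows (standard formulas, not quoted in the talk; some programs use $N$ rather than $N-1$ in the RMSEA denominator):

$$\mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\,0)}{\max(\chi^2_B - df_B,\,0)},
\qquad
\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\,0)}{df_M\,(N-1)}},$$

where $M$ is the fitted model and $B$ is the independence (baseline) model. Plugging in the approximate values quoted in this demo ($\chi^2_M = 147$, $df_M = 34$, $\chi^2_B \approx 2000$, $df_B = 45$, $N = 550$) gives CFI $\approx 1 - 113/1955 \approx 0.94$ and RMSEA $\approx \sqrt{113/(34\cdot 549)} \approx 0.08$, consistent with the numbers discussed next.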
And I keep sorting this by accident. Okay, so what this index means for our own model, .9395: it's the proportion of how much better we are doing with our model in comparison to the independence model. So we're about, you know, 94% better than the independence model, which is pretty good. The guidelines are that a CFI of .9 or greater is acceptable. We usually want it to be as close to one as possible, and .95 is ideal. We also then have the RMSEA, the root mean square error of approximation, which represents the amount of misfit per degree of freedom. And so we want this to be very small. It also ranges from zero to one. And you see here, our model has about .07 to .08, and generally speaking .1 or lower is adequate, and that's what we want to see. And then on the right, we get some confidence limits around this statistic. So what this suggests is that the model is an acceptable fitting model, and therefore we're going to go ahead and try to interpret it. But it's really important to assess model fit before really getting into the details of the model, because we're not going to learn very much, or at least not a lot of useful information, if our model doesn't fit well from the get-go. Alright, so, because this is a confirmatory factor analysis, we're going to find it very useful to show (I'm right-clicking here) the standardized estimates on the diagram. All right. This means that we're going to have estimates here for the factor loadings that are in the correlation metric, which is really useful and important for interpreting these loadings in factor analysis. This value here is also going to be rescaled so that it represents the actual correlation between these two latent variables, which in this case is a substantial .4 correlation. In the red triangle menu, we can also find the standardized parameter estimates in table form, so we can get more details about standard errors, Z statistics and so on. But you can see here that all of these values are pretty acceptable. They are the correlation between the observed variable and the latent variable, and they're generally about .48 to about .8 or so around here. So those are pretty decent values; we want them to be as high as possible. Another thing you're going to notice here about our diagrams, which is a JMP Pro 16 feature, is that these observed variables have a little bit of shading, a little gray area here, which actually represents the portion of variance explained by the latent variable. So it's really cool, because you can see just visually from the shading that the variables for symbolic threats actually have more of their variance explained by the latent variable. So perhaps this is a better factor than the realistic threat factor, just based on looking at how much variance is explained. Now I want to draw your attention to an option called Assess Measurement Model, which is going to be super helpful in allowing us to understand whether the individual questions in that survey are actually good questions. We want our questions to be reliable, and the indicator reliability is what we are plotting over on this side.
So notice we give a little   line here for a suggested threshold of what's good acceptable reliability for any one of these questions and you can see in fact that the symbolic threat is a better   Seems to be doing a little better there. The questions are more reliable in comparison to the realistic threat. But generally speaking, they're both fairly good   You know, the fact that they're not crossing the line is not terrible. I mean, they're they're around...these two questions are around the the   adequate threshold that we would expect for indicator reliability. We also have statistics here that tell us about the reliability of the composite. In other words, if we were to grab all of these questions and   maybe grab all of these questions for a realistic threat and we get an average score for all of those answers   per individual, that would be a composite for realistic threat and we could do the same for symbolic.   And so what we have here is that index of reliability. All of these indices, by the way, range from zero to one.   And so we want them to be as close to one as possible because we want them to be very reliable and we see here that both of these   composites have adequate reliability. So they are good in terms of using an average score across them for other analyses.   We also have construct maximal reliability and these are more the values of reliability for the latent variables themselves rather than   creating averages. So we're always going to have these values be a bit higher because when you're using latent variables, you're going to have better measures.   The construct validity matrix gives us a ton of useful information. The key here is that the lower triangular simply has the correlation between the two factors in this case.   But the values in the diagonal represent the average variance extracted across all of the indicators of the factor.   And so here you see that symbolic threats have more explained variance on average than realistic threat, but they both have substantial values here, which is good.   And most importantly, we want these diagonal values to be larger than the upper triangular because the upper triangular represents the overlapping barriers between the two factors.   And you can imagine, we want the factors to have more overlap and variance with their own indicators than with some other   construct with a different factor. And so this right here, together with all of these other statistics   are good evidence that the survey is giving us valid and reliable answers and that we can in fact use it to pursue other questions. And so that's what we're going to do here. We're going to close this and I'm going to   run a different model, we're going to relaunch our platform, but this time I'm going to use...   I have a couple of variables here that I created. These two are   composites, they're averages for all of the questions that were related to realistic and symbolic threats. So I have those composite scores right here.   And I'm going to model those along with...I have a measure for anxiety. I have a measure for negative affect. And we can also add a little bit of the CDC   adherence. So these are whether people are adhering to the recommendations from the CDC, the public health behaviors. And so we're going to launch the platform with all of these variables.   And what I want to do here is focus perhaps on fitting a regression model. So I selected those variables, my predictors, my outcomes. And I just click the one-headed arrow to set it up. 
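For reference, the composite reliability and average variance extracted (AVE) summaries just described are conventionally computed from the standardized loadings $\lambda_j$ and residual variances $\theta_j$ of a factor's $k$ indicators roughly as follows (standard formulas; the platform's exact definitions may differ in detail):

$$\omega = \frac{\bigl(\sum_{j=1}^{k}\lambda_j\bigr)^{2}}{\bigl(\sum_{j=1}^{k}\lambda_j\bigr)^{2} + \sum_{j=1}^{k}\theta_j},
\qquad
\mathrm{AVE} = \frac{1}{k}\sum_{j=1}^{k}\lambda_j^{2}.$$

The discriminant-validity check described above (the Fornell–Larcker criterion) compares each factor's AVE, on the diagonal of the construct validity matrix, with the shared variance between factors in the upper triangle; AVE larger than the shared variance is evidence that each factor overlaps more with its own indicators than with the other construct.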
Now the model's not fully...   correctly specified yet because we want to make sure that both our.   Covid   threats here, we want to make sure that those are covarying with each other, because we don't have any reason to assume they don't covary.   Same with the outcome, they need to covary because we don't want to impose any weird restrictions about them being orthogonal.   And so this right here is essentially a path analysis. It's a multivariate multiple regression analysis. And so we can just put it here, multivariate regression, and we are going to go with this and run our model.   Okay. So notice that because we fit every...   we have zero degrees of freedom because we've estimated every variance and covariance amongst the data. So even though this suggest is a perfect fit,   all we've done so far is fit a regression model. And what I want to do to really emphasize that is show you...I'm going to change the layout of my diagram   to make it easier to show...the associations between both types of threat. I want to hide variances and covariances and you can see here, just...   I've hidden the other edges so that we can just focus on the relations of these two threats have on our three outcomes. Now in my data table, I've already fit a, using fit model, I used anxiety that same   variable as my outcome and the same two threats as the predictors. And I want to put them side by side because I want to show you that in fact   the estimates for the regression of the prediction of anxiety here are exactly the same values that we have here in the SEM platform for both of those predictions.   And that's what we should expect. Fit model, I mean, this is regression. So far we're doing regression.   Technically, you could say, well, if I just, I'm comfortable running three of those fit models using these three different outcomes, then what is it that SEM is buying me that's, that's better.   Well, it might, you might not need anything else and this might be great for your needs, but one unique feature of SEM is that we can test directly whether there are   equality...whether equality constraints in the model are tenable. So what we mean by that is that I can take these two effects (for example, realistic and symbolic threat effects on anxiety) and I can use this set equal button to impose an equality constraint. Notice here,   these little labels indicate that these two paths will have an equal effect. And we can then   run the model and we can now select those models in our model comparison table and compare them with this compare selected models option.   And what we'll get is a change in chi square. So, we see just the change of chi square going from one model to the next. So this basically says, okay,   the model is going to get worse because you've now added a constraint. So you gained a degree of freedom, you have now more misfit, and the question is, is that misfit significant?   The answer in this case is, yes. Of course, this is influenced by sample size. So we also look at the difference in the CFI and RMSEA and anything that's .01 or larger suggests that   the misfit is too much to ignore. It's a significant additional misfit added by this constraint.   
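The comparison just performed is the standard likelihood-ratio (chi-square difference) test for nested models, stated here for reference:

$$\Delta\chi^{2} = \chi^{2}_{\text{constrained}} - \chi^{2}_{\text{free}},
\qquad
\Delta df = df_{\text{constrained}} - df_{\text{free}}.$$

Under the null hypothesis that the equality constraint holds, $\Delta\chi^{2}$ follows a chi-square distribution with $\Delta df$ degrees of freedom; with $\Delta df = 1$, as here, a $\Delta\chi^{2}$ above roughly 3.84 is significant at the .05 level. Because $\Delta\chi^{2}$ grows with sample size, the talk supplements it with the rule of thumb that a change of .01 or more in CFI or RMSEA signals additional misfit that is too large to ignore.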
So now that we know that, we can say that the better fitting model is the one that did not have that constraint, and we can assert that realistic threats have a greater positive association with anxiety in comparison to the symbolic threats, which also have a positive, significant effect, but one that is statistically not as strong as that of the realistic threats. All right. And there are other interesting effects that we have here. So what I'm going to do, as we are approaching the end of the session, is just draw your attention to this interesting effect down here, where the two types of threats have different effects on adherence to the CDC behaviors. And the article really pays a lot of attention to this finding because, you know, these are both threats. But as you might imagine, those who feel threats to their personal health or to their financial safety are more likely to adhere to the CDC guidelines of social distancing and avoiding social gatherings, whereas those who feel the threat is symbolic, a threat to their sociocultural values, are significantly less likely to adhere to those CDC behaviors, perhaps because they feel those sociocultural norms being threatened. So it's an interesting finding, and we can of course test the equivalence of a number of other paths in this model. Okay, so the last thing I wanted to do is just show you (we're not going to describe this full model) what happens when you bring together both (let's make this a little bigger) regression and factor analysis. To really use the full potential of structural equation models, you ought to model your latent variables. We have here both threats as latent variables, which allows us to really purge the measurement error from those survey items, and to model associations between latent variables, which gives us unbiased, unattenuated effects, because we are accounting for measurement error when we model the latent variables directly. And notice, we are taking advantage of SEM because we're looking at sequential associations across a number of different factors, and down here you can see our diagram, which I can move around to try to show you all the effects that we have, and also to highlight the fact that our diagrams are fully interactive and really visually appealing, and that we can very quickly identify significant effects based on the type of line versus non-significant effects, in this case these dashed lines. So again, to really have the full power of SEM, you can see how here we're modeling those threats as latent variables and looking at their associations with a number of other public health behaviors and with well-being variables. And so with that, I am going to stop the demo here and let you know that, in addition to the slides, we have a really useful document that the tester for our platform, James Koepfler, put together, where he gives a bunch of really great tips on how to use our platform, from specifying the model to tips on how to compare models, what is appropriate to look at, and what a nested model is; all of this information I think you're going to find super valuable if you decide to use the platform. So I definitely suggest that you go to the JMP Community to get these materials, which are supplementary to the presentation.
And with that, I'm going to wrap it up and we are going to take questions once this all goes live. Thank you very much.
Heidi Hardner, Senior Staff Engineer, Seagate Technology Serkay Olmez, Senior Staff Data Scientist, Seagate Technology   A Keras/TensorFlow deep learning project motivated a set of JSL Add-Ins for working with images inside JMP data tables. In turn, the new tools revealed widespread applicability of embedded images for our more conventional JMP analyses. JMP has had the ability to incorporate images in tables for some time but we had not fully utilized it. We will share the value we have found in embracing this feature as well as a suite of scripted tools to help unlock that value. The JSL tools include pulling images into a table, reviewing, marking up and measuring images as well as exporting them to PowerPoint or to a file structure with organization and properties based on other data columns.     Auto-generated transcript...   Speaker Transcript Heidi Hardner Hello, I'm Heidi Hardner, speaking to you from Minneapolis, Minnesota. Serkay Olmez And this is Serkay Olmez, and I'm dialing in from Longmont, Colorado. Heidi Hardner We work at Seagate Technology, where we both have a long history of using JMP and JSL and sharing JMP scripts with our colleagues. We're going to demonstrate some use cases for embedding images in JMP data tables, and we'll share some related JSL tools. At Seagate, recording heads undergo a large number of factory processes, from the wafer fab to installation in the drive. We therefore frequently look at very wide data where each head is a row and there are many, many columns. Many of the processes produce images, so we can have many different images per head, or we might have very tall data from many heads passing through a single process, with metrology images that look very similar to each other. The data sets we're sharing for download today aren't Seagate images, but they'll have similarities to these situations, and I hope drawing these parallels may help you draw parallels to your own use cases. JMP has had images in data tables for several years now, and I was personally slow to appreciate it. I hope you'll see in these use cases different ways it can be valuable, especially through the flexible, interactive connection of images to other data. I also want to highlight the value of storing image paths in the table, especially if you can have shared paths. And lastly, we're both sharing the JSL we'll demo in the supporting materials, so that's a literal takeaway; there's a directory of those at the end of this PowerPoint. Much of the exploration that's useful with images is quite basic, and our scripts are ready to use without JSL knowledge, but they also contain a wide range of things you might build on in your own JMP scripts. We're sharing three things that include publicly accessible image paths, which ties back to my second takeaway. The first is a data table from the Met's online catalog. It includes examples with multiple images per object, or row, in the data table. Browsing the catalog was a huge rabbit hole for me that I highly recommend. The second data set is astronomy images, thanks to Russ Durkee at the Shed of Science Observatory. This is row-per-image metrology data where the images are very similar to each other at first glance, so note the parallels to what I was just saying about our recording head factory data use cases. And speaking of rabbit holes I've been down, the third thing is JSL that implements the Omeka API to access scholarly collections online. This is thanks to Dr. Heather Shirey for help accessing these databases.
Her urban art mapping project, that's another fun thing you should go check out online.   But the technical point here is that anyone can use this code to experience querying the database and getting results that mix image paths with other data to make the kind of table that our tools are centered around.   So we're going to move right on to a live demo in JMP. Before we look at those type of data sets, what if you just have a pile of images?   image table from folders. And what this does is we could pick a folder. I have a folder with some sample images from the   Met catalog data. And you can see that it's got some folders and when I select this folder I quickly get a table with   in column all these tabs to the images. It's dug down looking for file extensions it thinks are images and it's also split out some folders for   the, you know, the subfolders as attributes in the data. Now fairly often, we have use cases where the folder names amount of some kind of meaningful attribute you'd like to have the data.   And this is a script you could customize further in my Seagate version, usually I have a head identifier embedded in the file name, so I have parsing for that that's built in automatically to my own personal version of this script.   So quickly, let's get over to one of our main example data sets. So this is the museum data that I mentioned. And so this is meatier, you can see that it has   it has a row per object, but it's got some numerical data. Right so it lists the objects. It's got a bunch of attribute data. So there's a bunch of...   a more substantial data set here. But then if we scroll over, you can see the inside of the data set are paths to the images, the URLs where they images   are hosted and you can see that they're up to five distinct images here for some of the rows.   And so my second of my tools image table important images is all that actually brings images into the table, and it can accommodate multiple paths. These could be local paths as well. They don't have to be a URL.   Once those are shared, the demo data, you'll be able to go to get... Tt's already done. I wanted to quickly say... to talk about quickly what it's doing behind the scenes.   When JMP open something that it recognizes an image, it can be an image object. And there are a bunch of JSL command messages that the image object can take, things like get size, get pixels, save image. And you'll see those in the in the code that we're sharing.   Let's go, I want to see the results. ...the images into the table. This script is doing a few things besides just sticking the images in.   It is...let me open another example to some other code snippet about this. So the images, they go in column, an expression type column, so it's making an expression data type, how to put the images in.   I'm also creating an event handler for that column. This is using a special message for the graph box, add image to the image object and get added right into a graph box. I will show you what that does, but it's, it's ... Click event handler.   And it's also resizing, so the fact that this got tall to accommodate these a little bit bigger.   
That's something that you can do, my by script does that, but also by default, you know, if you put images in, you might get something really tiny like this, which could be cool   for seeing a large array, but maybe just worth knowing that, you know, that seems like, I don't know how I could ever look at my images are supposed to come in so small in JMP, you can widen them manually as well.   And the script also embeds itself in the table. That's a useful thing to know how to do, and the logic here is that, you know, the   data table got really big when we embedded all these images, and it's full images embedded in   the data for the full images embedded when this happens, and so I might want to send this data set to somebody and have them be able to go just get this without knowing anything about my special tool menu. They could run this to reimport the images.   On the On Click I wanted to show is where when you click on this, it makes a full size or a bigger version of the image here for you to see.   So like I said, much of what's great about having images in the data table turns out to be very simple JMP exploratory capabilities. So for example, I can take these columns and I can reorder them.   And just move them over by some other things will be more interesting to look at the whole file sizes table.   And this way, you know, I can just browse, I can sort, I can see here that I had the images sorted by how many, or the row sorted by how many images that work.   So if I scroll down, I can just see visually that, you know, by rearranging, getting different columns of interest next to these. I can actually get a lot of images in a grid.   on the table. Here, I have this group and made up my own grouping column. And so I could rearrange that. Just reordering the columns I could stick that over next to the images.   in case I wanted to review that by scrolling through. So a lot of just moving around, sorting, and rearranging is part of what's good about having images in a   table. So one of my personal use cases is monitoring some important metrics that come from images and in my case, they actually came from a combination of different images.   And I'd be looking at this data or looking for outliers and suspicious modes that would recurrently show up in the data.   And in this use case, I would have not 85, I have tens of thousands of rows I'm looking at and the images will be a useful part of exploring like what's going wrong with data.   Let me make an example plot here of just so...in this case height by width of these museum objects. And let me quickly go back and show one thing you can do. By right clicking here and I can make a label out of this first column.   And so then if I'm looking at the data, I can be examining this and say there's something kind of sticking out. Why is this so much wider than it is tall? Well, that wasn't a mistake. It's really a wide   artwork there, so that makes sense. So I can have, you know, one fact pop up. If I have a lot of different things I'm interested in sandwiched in this cluster down here, it seems like there's a lot   of these guys that are similar in size, maybe that seems like it's interesting.   Again, I can, I can look at them individually with the labels. But if my data is really large I could do other exploratory things. I can go back to the data and I could use F7 to page through   for things that have been selected. 
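For readers following along in JSL, here is a minimal sketch of the image-object mechanics Heidi mentions above. Get Size, Get Pixels, and Save Image are the message names from the talk; the use of New Image to open a file, the file paths, and the exact argument forms are assumptions on my part and may vary by JMP version, so treat this as an illustration rather than the shared add-in:

```jsl
// Open an image file as a JSL image object and query it.
img = New Image( "C:/demo/cat_statue_01.jpg" );   // hypothetical path
size = img << Get Size;                           // image dimensions in pixels
pix  = img << Get Pixels;                         // pixel data for further matrix work
// Write a copy back out, e.g. after resizing or markup.
img << Save Image( "C:/demo/cat_statue_01_copy.png", "png" );
```

As described above, the add-in stores such images in an Expression-type data table column, which is what makes the row-by-row browsing shown in the demo possible.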
Of course, having selected things, I could just subset some of those to get a smaller data set. One of the exploratory things I'd like to show, one of my favorites, is under row selection: I can do Name Selection in Column. So say this is a strange mode that I'd like to call oddballs. I'm going to say, hey, these are some oddballs in the data I want to go examine, and it's going to give me a column at the end here with ones and zeros for whether things are oddballs or not. Part of what's useful about that: let's just imagine that my data set is really big, so just paging through the selections is not so good. And let's say it's so big that I don't want to have the images in there; pulling them all in would be unfeasible. So here I am back with my data set, and I have these oddballs that I'm interested in. I can go to subset, and instead of just subsetting all of them, say maybe there are thousands in this group, I can use that column to do stratification. So here I could say, I've only selected 11, so let's say I'm going to pick five, and I'm going to choose to stratify on the oddballs. What I'll get is five oddballs and five non-oddballs, randomly chosen, and here I could sort those and bring the images back. So now I could say, all right, maybe I just want to see images one and two, whatever, just for the small set. So again, hopefully that's showing how, by dragging along those paths, even with really large data where it doesn't seem useful or feasible to embed the images, I can just get a subset, and then I can browse through these, ask what's different about the oddballs, and sort them out. Now I want to show one other use case. This use case was me learning to do some deep learning classification of images, and it really forcibly struck me in this case that the images really are literally the data. So here's an example of this. In my case, of course, I'm looking at recording heads and different views of them, and I was trying to classify these different views, some focused on one part of the head and some on another. Here I'm simulating that with the cat statues from the art data and categorizing them: here is the back, facing right, facing front, and so on. This whole project of doing machine learning isn't happening in JMP, but there's a need to provide training data for the Python Keras/TensorFlow categorization project and, in the end, to review the results that come back, which end up in the form of files nested in folders. So let me quickly show another of our tools, image table to folders. Here we go. I can pick the images; these are the paths or images in the data table. And I can choose some attribute columns and turn those into file folders, for nested file folder categories. Here I'm going to tell it to use my file name column as the name, and I can optionally add a prefix or change the size. Resizing is another thing that really came up in this machine learning work; the model I've set up expects uploaded images of a certain size. And then when this operates, it's going to send my images into the folders. There's a snafu here with where it went, so let me do it one more time: the images...image table to folders.
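As a rough JSL sketch of what this image-table-to-folders export is doing (an illustration only, not the actual add-in; the column names "ImagePath" and "Category" are hypothetical stand-ins for the path and attribute columns chosen in the dialog, and the destination path is made up):

```jsl
// Copy each row's image file into a folder named after its category value.
dt = Current Data Table();
dest = "C:/exported images/";                  // hypothetical destination root
For Each Row(
	folder = dest || Char( :Category ) || "/";
	Try( Create Directory( folder ) );         // ignore the error if the folder already exists
	fname = Word( -1, :ImagePath, "/\" );      // last path component reused as the file name
	Copy File( :ImagePath, folder || fname );
);
```

The same loop structure could also write resized copies via the image-object messages shown earlier instead of copying the original files.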
I think I clicked on the wrong thing. I've got my machine learning images folder here that I want to put these into quickly.   Revise that.   Clicking on the file names.   And it's going to arrange them in the folder.   So here I have my test and training data sets and oops, I picked I picked the wrong thing for my categories. Let's categorize these by the things that I chose in the columns.   So the idea there is that, you know, I need to put my, I need to put my categorized data into the file folder structure.   And the other thing about that is it's useful in the data in...to do this in JMP in the data table so it's useful to come up with these categories training.   groups inside JMP. I already showed that we could do the stratify and here's another example where, because I have the data, the other data columns, like the department, in the   in the data set. I can go back and see, like, hey, did I split up my test and training set in a balanced way? So when I have,say Egyptian art, I did a good job on Egyptian art   Being partly in the task, partly in the training, not so much on the other category. So in JMP I know I have the other data   columns. They're convenient for forming up my testing training set, making sure that I have balanced groups and also I can save this table and have a record of what was in the...   what was in the training that I did. So there's that. The last thing I wanted to say about this is that this museum data doesn't really capture very well   the idea of metrology images looking all very similar. So because these cats are all so unique, you might have to really look closely when you're scrolling through here to see,   you know, is this the back of a cat or the front with the face worn off or whatever else, and   I'll talk more about the case where metrology images really very similar to each other. And it's very useful potentially to scroll through them just looking at things by eye. After we hear from Serkay with his PowerPoint tool. Serkay Olmez Okay, so let me share my screen.   Alright.   So what I will be talking about is a script that enables people to push images listed in a in a JMP table into PowerPoint.   So what what you want to do is to collect the images from those links listed in the table and push them into PowerPoint with a template you choose. And so you can create this table with the   code snippets Heidi already showed and this this script assumes that you have a table already and you have the links in there.   And I will just run the script and tell you what the script does.   So if I run this, it gives you this interface in which you can select columns as attributes. First of all, you will need to select the image path so that JMP will know where to collect those images from and   you can also add labels. Those labels will go into PowerPoint as columns and you can also decide what kind of layout you want in the PowerPoint. So I will just show a grid   layout first. So there will be four columns, two rows in this case. And you hit export. What it does is, it goes and triggers PowerPoint and PowerPoint will   go and collect those images and it will build the slides and this will take 10 to 20 seconds because it's pulling in 20 images from a server.   So once it completes, you you will have the PowerPoint with a nicely laid laid out structure. And those columns, remember, those are the ones the table comes from the columns you selected at the beginning in the UI. So you get this one quickly and you can, of course, change   the layout here. 
For example, let's assume you want to do a different thing. There's this layout selection here. Let's say you want to do one row and maybe leave some space underneath so that you can put some comments in there. So you can just rerun this, and then JMP will tell PowerPoint how to set up the layout. And then once this completes, it will have a row of images per slide, which will look like this. One other thing the script does is split your slides into categories. Maybe there's a top attribute in your data and you want to start a new slide when that attribute changes. Let's, for example, take the medium and put it into slide labels, so it'll create slide labels at the top, like titles, and it also splits this little PowerPoint presentation into categories. So it will start a new slide every time this medium attribute changes. If I just do this, it will do the same thing again in 10 to 20 seconds, but now it will be grouping them into different categories. So you will have a nicely laid out set of slides, which will be split. For example, you have a slide for medium equal to brass, and you will have a bronze one, and there seems to be only one such item among the rows I selected in the data, and there will be ceramic, etc. So this separates out the slides. The last thing I will show as a demonstration is that sometimes people will want a little more complicated layout, a template; for example, they will want to look at one image in a bigger way, so they want to put a big image to the left and maybe a couple more to the right. You can do such complicated templates by selecting a different layout, and those are done by scripts. So I had this one-big-two-small layout: there will be one big image on the left-hand side and two small ones stacked to the right. So if you just do this, it'll go and do the same thing, but now with the different layout. I'll give it 10 to 20 seconds to build the slides. There you go: a big image on the left-hand side and two small ones on the right-hand side. And the last thing I will show is about customizing the tool. We will be releasing this tool as a part of this large add-in that does all of this image manipulation. So if I just open this one very quickly, this is user customizable, so users can go in and change what kind of columns they want to put into roles automatically, and users can make it look different for themselves. One last thing I will mention is about the specialized templates. Those templates, like the one-big-two-small one I just showed, are created by scripts that live externally to the tool. We will be providing two of them, but the idea is that an experienced user can develop his or her own template and put it into the tool, and the tool will recognize it, or the developer can develop more templates specific to particular customers or particular users. So this will be a part of the add-in we're going to release. This is all I have. Heidi can take it back. Heidi Hardner Okay, okay. Sorry about that. I'm back, and now we're looking at a different data set, these Shed of Science astronomy images. And as advertised, these images are metrology images; you're looking at a piece of the night sky here in this very small image.
And even in this very small size and scope of images,   you can see, you can see movement there. And you can see some anomalous things like here when something pops up.   And this is part of the point I wanted to make when I was looking at like classify for my machine learning project and I have images that nominally look very, very similar to each other. It's very easy to scroll through and see something that's that's anomalous like this.   But I'm actually going to make a more graphic demonstration of that where we're going to blow this up and scroll through again. So let me quickly   describe what we're going to see. So here's a view of the intensity of white intensity on these images as a function of time. So these are time ordered here when we scroll through them.   And one thing you're not going to see, that's not very visible by eye, is this big, slow trend here in the intensity.   We'll have the labels on. You can see here again that the way you use labels to see. Yeah, whether intensity is very high. It's because there's a streak across the 60-second exposures are something passing by in those images. We can also see though the bumps in the data. And that's where you see the   the telescope has drifted, a little bit of the background of this moves and it resets itself, see that when we scroll through. So just a little bit of a preview.   I'm going to show expanding this much bigger so that even inside the data table, we can see a quite big view of the images, just a little bit bigger.   And start back at the beginning in time. What I want you to do is watch this spot, watch that object as we scroll through the data.   And hopefully what we're seeing is you can see that moving to the left, against the background stars. And that's an asteroid, 2171 Kiev.   That is, we're seeing it was a real intention of these metrology images. So we just a graphic demonstration that within the right circumstances, just scrolling through images in the table can be used.   image table annotation. So this is for the opposite case where you really want to just examine images, one by one.   What happens when you click on this, is that, again, you have a choice of using the path or the embedded images.   You have the option to have some sort of identifier that we put the date here or it would just give you a row numbers and you can have one other attribute, let's put the telescope here,   on the data. And what this does is that it's forming a little markup tool and this tool will let you   do some generic classifications. This one's a three, could say something here. Some kind of comment about the data.   And also I have cross hairs, so I can do measuring. It's going to measure in pixels on the image and it's going to move the crosshairs to record the positions and the   delta there and pixels when I click record. Now, before we go do that, you know, part of the reason there's a lot of different ways you do mark up with commercial   tools, part of the motivation for doing this in JMP is, again, that you have, along with the image, all this other data in the table.   It's really just represented here by this one column. But again, this is something that is ripe for customization. In my own little version, I'm looking at images   and I'm printing here in this view, a little map of where the head was on the wafer or various other pieces of data. Or maybe the image recipe said what the dimensions that I can try to verify.   
So the ability to have this joining between image and items there, is it possible even for doing that? But then when we click record what happens is, go over to the right. We'll see them to add columns and   (The columns are so tall, let's shrink them a little bit.)   You can see here in Row 3 and recorded. No, I don't want to move the white cross hairs so I'm getting a vertical distance and the positions of the crosshairs and my comments and classifications are stuck right back in the table in the correct row.   Another thing we're seeing here is another column of images. And so this is where I have put some images in that are JMP plots, JMP plots can be images.   And in the data table that we're sharing, I have embedded code. You'll see here that makes these images and it involves using get pixels message to the images that we have,   getting a matrix of the data, slicing out a slice along where that asteroid is moving and averaging up the intensity. You can flip through and see   the peak moving in these images. But the real point of this was for me to just mentioned.   Some of our use cases that we have at Seagate all the time where we'll have one row of data for a head and what values in there, such as a peak frequency that came from an entire spectrum or an optimum value that came from an optimization sweep etc and   those sweeps could actually become images in the data. And of course there are good reasons why we do that kind of feature extraction on images or complex curves.   The simple values like frequency can be used more directly in a variety in JMP analyses like   JMP's statistical platforms, but I hope that we both demonstrated that they're even a lot of useful things you could do if you get the images into the table in relation to a bunch of other data columns.   In particular, you know, I hope maybe seeing plots as images, you're thinking right back right away to Serkay's PowerPoint demo and plots are kind of things like to arrange in a PowerPoint in certain ways.   Anyway, hope the various bits of JSL we've shared might be useful to you and that you feel inspired to do more with images yourself. Thanks.
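To make the intensity-slice idea concrete, here is a small stand-alone sketch that averages a horizontal band of a pixel matrix. The matrix below is random stand-in data; in the demo's embedded code the matrix comes from the image via the Get Pixels message.
gray = J( 100, 200, Random Uniform() );        // stand-in for a grayscale pixel matrix
band = gray[40 :: 60, 0];                      // rows 40 through 60, all columns
profile = J( 1, N Col( band ), 0 );
For( j = 1, j <= N Col( band ), j++,
    profile[j] = Sum( band[0, j] ) / N Row( band )   // mean intensity down each column of the band
);
Show( Max( profile ) );                        // a peak would mark the bright object in that band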
Ned Jones, Statistician, 1-alpha Solutions   Simulation has become a popular tool used to understand processes. In most cases the processes are assumed to be independent; however, many times this is not the case. A process can be viewed as physically independent, but this does not necessarily equate to stochastic independence. This is especially true when the processes are in series, such that the output of one process is the input for the next process and so forth. Using the JMP simulator, a simple series of processes is set up, represented by JMP random functions. The process parameters are assumed to have a multivariate normal distribution. By modifying the correlation matrix, the effect of independence versus dependence is examined. These differences are shown by examining the tails of the resulting distributions. When the processes are dependent, the effect of synergistic versus antagonistic process relationships is also investigated.     Auto-generated transcript...   Speaker Transcript nedjones The Good, the Bad, the Ugly, or Independence and Dependence, Synergistic and Antagonistic. I am Ned Jones. I have a small consulting business called 1-alpha Solutions. You can see my contact information there. Let's get into the simulation discussion. I'm going to be running the simulation in JMP. The simulator lets you model how random variation in the inputs, plus any noise you add in the simulator, produces random variation in the model outputs. The simulator sits in the profiler: you define the random inputs, run a simulation, and produce output tables of the simulated variables. The next thing I want to do is talk about a couple of different types of simulation. With a simple simulation, one input and one output, there is no issue of dependence in the simulation. The ones we're concerned about primarily are simulations where we have multiple inputs and we are simulating one or more system outputs. The concern is that there could be dependence among the input variables. Scroll down here a little bit and we'll talk about what it means to have stochastic independence. Two events A and B are independent if and only if their joint probability equals the product of their probabilities. That's the end result we want. I'm going to define it and look at it a little bit differently, and this should make it a little clearer. If the probability of the intersection of events A and B is equal to that product, that implies that the probability of A is equal to the probability of A given B, and similarly, the probability of B is equal to the probability of B given A. Thus the occurrence of A is not affected by the occurrence of B, and vice versa. For example, roll a fair die, and suppose event A covers two of the six faces, one of them the 2, and event B is rolling a 2, 4, or 6. You can easily see that the probability of A is 2 in 6, or one third, and the probability of B is 3 in 6, or one half. If you look at the intersection of A and B, the only outcome is the 2, so its probability is 1 in 6, and the probability of A times the probability of B is one third times one half, which is also 1 in 6. And the probability of A given B is equal to one third.
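As a quick check of those numbers, the sketch below simulates die rolls and compares the joint frequency of A and B with the product of their individual frequencies. The two outcomes in A are an assumption here ({1, 2}), consistent with the description above.
n = 100000;
countA = 0; countB = 0; countAB = 0;
For( i = 1, i <= n, i++,
    roll = Random Integer( 1, 6 );
    a = (roll == 1 | roll == 2);                // event A: two faces, one of them the 2 (assumed {1, 2})
    b = (roll == 2 | roll == 4 | roll == 6);    // event B: an even roll
    countA += a; countB += b; countAB += a & b;
);
Show( countA / n, countB / n, countAB / n, (countA / n) * (countB / n) );  // the last two values should agree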
And if we look at that and we realize that we're saying, okay, B has occurred, we know that we have a 2, a 4 or a 6, but there's still a one third chance that A could have occurred. So we can see that still, it stays at one third. And similarly, we see the same thing happening with the probability of B. Therefore, A and B are independent. Now I'm a role on in and look at the example I have and talk about that. What I'm doing is I'm simulating the pest load And the probability of a mating pair. What we have is we have a fruit harvest population and from that fruit harvest population, we're going to have some cultural practices that are applied in the... in our orchard Grove or vineyard to get a pest load...will have a pest load after those cultural practices are applied. Then we have to the harvest...the crop is harvested, we'll do some manual culling and will estimate a pest load there. And then after that we can...you can see that we have a cold storage and we'll have a pest load after the cold storage. We're going to try to freeze them to death. And the final thing we do is once we get this pest load here, we're going to break it up in a marketing distribution and split that population into several smaller pieces. And we'll be able to calculate the probability of a mating pair from that. Well, the problem, you can see immediately is that these things become very dependent because the output of the harvest population is the input for treatment A. The output of the treatment A is the input for treatment B and so forth on to C on down to the meeting pair. Now here's, here's a table I...here's a table we'll work with and we'll start with. And here is what we have is we set this up in for the simulator to work with and we have a population range of 1 million to 2.5 million fruit. We have a treatment here, a treatment range of the efficacy of mitigation that we're seeing. Here's the number of survivors we would expect from this treatment population and we have a Population A is a result of that. We're going to have a population B as a result of that, we can take a look at the formulas here that are used. And what is what is being done here, this a little differently is, I'm putting a random variable in the...that is going to go into the profiler. So the profiler is going to see this immediately as a random variable going in. So we're simulating the variable coming in, even before it comes in. So with that, you can go...we can go across. You can see the rest of the table. We're going over. We have another set, we have survivors that's after Treatment C, the same type of thing. Then we have this distribution and we had a probability of a mating pair. I'll show you that formula. It's a little different. The probability of a mating pair. Well, this is just using an exponential to estimate the probability of mating pair so you know what's going on. I haven't hidden anything from you behind the curtain and so forth. Let's take a look. So to open up the profiler, we're going to go to graph and down to profiler. All right. And then from there, we're going to select our variables that have a formula that we're interested in. So we're gonna have...we're gonna have the Pop A, Pop B, survivors and the probability of a mating pair. Going to put those in and we're going to say, uh-oh, we got to extend the variation here and we're going to say, okay. We got a nice result. Very attractive graphs here. 
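The serial structure described above, where each stage's output feeds the next stage, can be sketched directly with JMP random functions. The efficacy values below are made up for illustration and are not the ones used in the talk.
pop = Random Integer( 1000000, 2500000 );      // harvest population, range from the talk
effA = Random Normal( 0.95, 0.01 );            // treatment A efficacy (illustrative)
survA = Random Binomial( pop, 1 - effA );      // survivors after treatment A
effB = Random Normal( 0.90, 0.02 );            // treatment B efficacy (illustrative)
survB = Random Binomial( survA, 1 - effB );    // the output of A is the input of B
Show( pop, survA, survB );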
And first thing you're going to see is, you're going to see squiggly lines in this profiler that if you use a profiler that you're probably not used to seeing lines like that. It's just a little different approach and so forth that you can see how these things work and Doing a little adjustment here so you can see the screens better. Now from this point what we're gonna do is we're going to open up the simulator in the profiler. We go up here and just click on the simulator and it gives us these choices down here. First thing I want to do with this is I want to increase the number of simulation runs to 25,000. Okay. And what I'm going to do...what we do if we have independence, one of the tests, quite often for that, we use for that is that we we'll look at the correlation. So I'm going to use a correlation here. Use the correlations and set up some correlations. So for this first population, I'm going to call it multivariate and immediately you can see we get a correlation specification down below. And we'll set up another multivariate here and another multivariate for treatment B. Another multivariate for for treatment A. Now what this is doing is, this is taking those treatment parameters that we had up above, we had before. And it is putting those in our multivariate relationship with each other. We also got the last thing, this marketing distribution. I don't want it to be continuous so I'm going to make it random and we're going to make it an integer. We'll make that an integer and we've got that run and we can see the results. Now this is the...I'll call this the Goldilocks situation with all the zeros down here, that implies that all of these relationships are completely independent and we can run our simulator here. And see the results. Do little more adjustment here on these axes. This come to life. Please look for here. Okay. Now you see those results. But what we're going to have here and look at this, is, we have the rate at which it's exceeding a limit that's been put in there. I put those spec limits back in the variables, but the one that I'm most concerned about is the probability of a mating pair. And wouldn't you know, I've run this real time and it hasn't come out exactly the way it should. Let's try a couple more times here. See. What we got the probability of a mating pair and that is supposed to be coming up as .5, but it certainly isn't. I have something isn't...oh, here, let's try this and fix this. This would be 4 and 14 Let's try the simulation one more time. Still didn't come up. Well, the example hasn't worked quite right, but in the previous example I was having .4 here. So that was saying, the rate was creating less than time but I'm having that probability is A .15 probability of a mating pair, but that's what happens sometimes when you try to do things on on the fly. So let me go up and I have a window that I can, we can look at that result with...we can look to that result. And let's...that has a little bit differently. And you can see now that that probability is under 5%. That's the target we're aiming for. in this thing, in this simulation. So if we go up and we can run those simulations, again you can see those bouncing around, staying under .5, so it's happening less than 5% of the time that the probability of a mating pair is greater than 1.5. Now because now, again, I'll say this is kind of a Goldilocks scenario because we're assuming all these relationships are independent. 
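For readers who want to see what the correlation specification is doing, here is a stand-alone sketch of drawing correlated inputs outside the Simulator. The correlation matrix, means, and standard deviations are illustrative values, and the sketch assumes JMP's Cholesky function returns the lower-triangular factor.
corr = [1 0.4 0.2, 0.4 1 0.3, 0.2 0.3 1];      // must be positive definite
L = Cholesky( corr );                           // lower-triangular factor, so L * L` = corr (assumed orientation)
mu = [0.90, 0.85, 0.95];                        // illustrative mean efficacies for treatments A, B, C
sd = [0.03, 0.03, 0.03];
draw = mu + sd :* (L * J( 3, 1, Random Normal() ));   // one correlated draw of the three efficacies
Show( draw );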
I have an example that I can show you that we have, where we have one that is antagonistic and synergistic. So I'll pull up the first one here and in this one we have that the relationships are antagonistic. Now when you...if you are are creating an example like this to work with it, you can't, at least I wasn't able to make everything negative. If you notice I have these two as being positive. This wants that the matrix to be positive definite. And it doesn't come out as positive definite if you set the...if you set those all to zero, but we can run that simulation again. And you can see that... you can see here that those simulations with a negative, it really makes things very, very attractive. We're getting a low, real low rate of... that we have...we have 1., .15 probability of a mating pair so that you can see just the effect. And what I really want to show is the effect of this correlation specifications, correlation matrix down here, covariance matrix that you specified. Now let's look at one other, we'll look at the one if it's positive. And we've got we've got an example here where it's positive. And you can see I have down here. I've said here. Now, I haven't been real heroic about making those correlations very high. I've tried to keep them fairly low and so forth to be fairly realistic, after all this is biological data. And we can run those simulations again and you can see very quickly that we're exceeding that 5% rate which is...becomes a great deal of concern here and so forth. And if you were...if most of the time these simulations like this are run with no consideration of the correlation between variables and that is kind of like covering your eyes and saying, I didn't see this and so forth. But it really if there is, if there is a correlation relationship and most likely there is, because one of these in...one of these outputs is the input to the next process, so pretty well has to be dependent, and what the dependencies are, estimating these correlations will be a great task to have to come up with most of the time. Work in this area is done based on research papers and they don't have correlations between different types of treatments so. But having some estimate of those is a good thing, a good thing to have. Now the next step is to show you the what else we can do here. We can create a table. And if we create this table and... Well, we'll create this table and I'm just going to highlight everything in the table out to the mating pairs here. And then I'm just going to do an analysis distribution. And run all of those and say, okay. Now we get all these grand distributions, fill up the paper with it. But what we can do is we can go in and we can select these distributions that are exceeding our limit out here. We can just highlight those and it becomes very informative as you look back and you can see the mitigations, what's happening, and so forth. What is affecting these things greatly. And one of the things that really ...first of all, our initial population, and this has been based on what we've seen in real life, is as the population gets to be higher, when we have large, large populations of the fruit, the tendency is that we have failures of the system, Treatment A and so forth. 
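The selection step just described can also be scripted against the simulated-output table. A minimal sketch, assuming that table is the current table and using an illustrative column name and limit:
sim = Current Data Table();                        // the table created from the simulation output
sim << Select Where( :Prob Mating Pair > 0.05 );   // flag the runs that exceed the limit
Distribution( Column( :Prob Mating Pair ) );       // the selected rows highlight in the histogram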
The one that I thought was most interesting was the marketing distribution: if we push the shipments out, if we require that as shipments come into the country the marketing distribution has to break them up into smaller lots to be distributed, the probability of a mating pair pretty well becomes zero. With these examples and so forth, I want to go ahead and open it up for questions. But let me just say one last thing. I think of George Box. He was at one of our meetings a few years ago, and two quotes that he said were really interesting. He said, "Don't fall in love with a model." And he also said, "All models are wrong, but some are useful." I hope this information and these examples give you something to think about when you're doing simulation: you need to consider the relationships between the variables. Thank you.
John Moore, Elanco Animal Health, Elanco Animal Health   Elanco has several custom JMP add-ins. Moore has created a system in JSL that allows:  •    Distribution of add-ins  •    Version control of add-ins  •    Tracking of add-in usage  •    Easy construction and publishing of add-ins  This is a complete redo of the add-in management system that Moore presented a few years ago. He has focused on making the new system more user-friendly and easier to manage.     Auto-generated transcript...   Speaker Transcript John Moore Hi, thanks for joining me. My name is John Moore and today and we'll be talking about Lupine, a framework I've developed for creating, publishing, managing and monitoring JMP add-ins. So first just a tiny little bit about me. My name is John Moore and my degrees are in engineering and management, but really I'm a data geek at heart. I've worked at Elanco Animal Health as an analyst for the last 19 years. I have one son, one husband, one dog, and at the moment, approximately about 80,000 bees. So, you know, how did we get here? Well at Elanco, we have TCs, which are technical consultants typically like vets or nutritionists spread across the globe, and made them use JMP and use JMP scripting. And many of these TCs only have intermittent access to the internet because just because they're so remote. So we kind of had a state where we had ad hoc scripting passing from person to person. There was no version control. There's no idea of who is using what script. And getting script out was difficult, updating scripts was difficult. So I created Lupine to try to get us to a state where we had standardized scripts, we had version control, we could easily know who was doing what, who's using which scripts. We could easily distribute updates and as an added bonus, we threw in a couple of formulas into the Formula Editor. So let's just take a look at Lupine here. Okay, so once you install Lupine, you'll see if there's not much to it. There's just one Lupine menu up on JMP and you only have check for update. When you look at check for updates, you'll see it'll list all of the add ins that are controlled by Lupine. And if it's not up to date, it'll say up to date. If it's up to date. It'll just say, okay, so it's very simple. Just to say, here are the add ins that are available. Do you have the most recent version or not? Also in the Formula Editor, just because we found ourselves using these particular formulas a lot, you know, like, month/year, which returns the first day of the of the month of the of the date has passed to it. We have year quarter, we have season. Also we have counter week where you can do things like week starting, week ending, doing those sorts of things. Also in Lupine we have usage logging. So every time a user uses a script managed by Lupine, we create a log file. So we know who used what script and what add in at what timestamp. And this really helps us figure out which of our scripts are useful and which of them aren't. So let's say you want to set up Lupine in your organization. Okay, first step we're going to download some zip files. We have a Lupine zip file, a Lupine manager, and Lupine template. And we also have Lupine example. So I suggest that you find one folder someplace and just download the three of those there. The next thing you need to decide as a manager setting this up is, where am I going to put my network folders? There are two network folders that are really important for Lupine. The first is, you're add in depot folders. 
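As an example of those Formula Editor helpers, the month/year idea, returning the first day of the month of a date, takes only a couple of lines of JSL. This is a sketch of the idea, not Lupine's exact formula.
Names Default To Here( 1 );
monthStart = Function( {d},
    Date MDY( Month( d ), 1, Year( d ) )
);
monthStart( Today() );   // returns the first day of the current month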
So this is a folder that's going to contain all your add ins, and the PublishedAddins.jmp file which contains metadata about those add ins. The other one we need to think about is our central logs folder. So this is the folder that will contain all of the uploaded log files that people create every time that they execute a Lupine managed script. So you can see here, we have an example of all the add ins we have in our published add ins file. And here's just an example of the log files. So every log file has a unique ID. And we'll talk about this later, but whenever we go in and grab those, we only grab the new one so you don't have to download everything, every time. So once you've decided where you want to put those, then we need to tell Lupine where those are. And the way we do that is to go into that Lupine folder, after you've unzipped it, and under the add in management subfolder, you'll see this file called setup and this is the part to look for in setup. So here is where you define our add in depot path and our central logs path. Now the in production, you'll need to have these the network locations, but just for this little simple example, I created these just on my own machine here, but this is where you would define your network path. Once you've done that, now is the time to build Lupine for your own organization. So in the Lupine folder, you want to grab everything except if you have a gitattributes file in there. I use GitHub Desktop for managing my scripts. I strongly recommend it. It's made my life so much better. But you don't want put those in your script, so grab everything but the gitattributes file. You're going to right click on those, say send to and send those to a compressed zip folder. This will create a zip file of all those. Now JMP add in files are really just zip files. So what we're going to do is rename that Lupine.jmpaddin. And once you name it, you just have to double click on that thing, and it'll install it. Then you'll have Lupine on your machine. Now we're not gonna do anything with it yet, but it's there. If you look up on your menu bar, you'll see Lupine. Next, let's talk about Lupine Manager. This is a separate add in that's designed for the person who's administering, and keeping track of, and updating all the add ins within an organization. So this is probably not going to be on your typical user's machine, just because it's not useful. So what it has is a collection of tools that will help your...you as a manager manage the add ins. The first thing you do when you open up Lupine Manager is just go down to setup and then we're going to say our unique users file, which gets a all the unique users in your organization, and log data. OK, so the unique user table links user IDs. So we're grabbing user IDs from your computer's identity, but usually those don't say anything too particularly meaningful. Like for instance, mine is 00126545. We'd like to link that something that says John Moore. So that's what user unique users does. The log data file has a row for each log. So this is a summary of all those log files that we created before. So we can analyze these to our heart's content. Okay, so let's say you want to start using Lupine Manager. First thing you need to do is tell it which add ins you want to manage. To do that, we're going to go to build add in, then head on down to manage available add ins. And then this will be blank. 
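For orientation, writing one of those per-use log files takes very little JSL. The path and field layout below are hypothetical; Lupine's own log format and unique-ID scheme may differ.
logDir = "C:/Lupine/LocalLogs/";   // assumed local log folder
entry = Format( Today(), "yyyy-mm-dd" ) || "," || "00126545" || "," || "MyAddin" || "," || "MyScript";
Save Text File( logDir || "log_" || Char( Tick Seconds() ) || ".txt", entry );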
When you get to it, click Add add in, and you can select the folders that contain the files for the add ins you want to put under control. Once you do, you'll see that those will be listed in the available add ins, click on build add ins. Okay, so you have two options here. Build add in is just going to create an add in file in the folder where the scripts are, so this is great for, like, testing. I want to do a quick build so I can test to see if it works. Build and publish will do the same thing, but also it will take that add in, copy it to your add in depot folder and update your published add ins file. So this is when you're ready to distribute it to the company, you've done your testing, it's ready to go. Now if you want to import all that log data, all you have to do is go to import and process log data and it will bring in all the new log files and add them to your log data file. We have a user summary here, which is just a quick tabulate to say who's using which add in. So this gives you a really quick view of who are my biggest users of which add in. You know here we can see, for this example, Lupine and LupineManager used a lot; LupineExample1 some; LupinExample, not so much. Okay, so let's say you've got Lupine installed, you've got Lupine Manager installed. Now you have some add ins that you would like to manage under Lupine. So let's talk about the steps for that. First thing you need to do is go out to that add in template, Lupine template file, we have and copy that someplace. So this contains all the necessary scripts and all the things that make Lupine work with your add in. So go ahead and copy that. Next we're going to kind of work through some steps here. The first is we're going to edit the addin.def file. We're going to create all the scripts that you're going to have in your add in or update them if you already have the scripts. Build the add in will customize the menus, so that the menu looks right. Assuming you click on it, it looks like the way you want it to. We'll build the add in again, once we get the menu fixed so that we can test it, make sure it's right. After we're done testing, we can publish it. So let's talk about this addin.def file; addin.def is really a tiny little text file, but it's required for every add in. This contains the core important information about the add in, what's its ID and what's its name. So you'll need to go in and edit that to change it to what the name of your add in will be. And this is a one-time thing. Once you've done this, you shouldn't have to change this for the add in going forward. So this is just once...you do it once when you set it up. Next, you need to decide which script am I going to put in this add in? Now I've created a, Lupine make log file. And this is what's going to create that log file for any add in that's under using Lupine. So this is what actually creates logs and allows you to do the version monitoring. So I recommend putting this header at the end...at the beginning of all of your files. Your script you're going to use, because that's what's going to allow you to monitor things. Now, so you've got all your scripts in there, next thing you do is make the menu file look the way you want it to. Right now, it's going to just say templates. So we're going to build this thing first. So just like we built Lupine before, where we did the select everything except the gitattributes files, right click, zip, change the name. We're going to do the same thing. 
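For reference, addin.def is just a few key=value lines of plain text; at a minimum it carries the add-in's ID and name, for example (values here are placeholders):
id=com.example.myaddin
name=My Add-In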
And then we'll have the wrong menu, because right now, it just still has that template menu. So we need to go in and update that template menu to do what we want to do. Okay, so once you've installed it, so, you know, you'll kind of see your new script up here, your new add in up here, template. We're going to right click up on the menu and you'll see this long list. And way at the bottom, we'll see customize and all we want to do is go to menus and toolbars. Okay, once we're there you'll see something like this. We, the first thing we need to do is tell JMP, well, what do we want to edit. We want to edit the menu that goes with our particular add in, which right now is called template. So we'll go up to this change button. Click on that and we have a whole host of choices here. But what we want to do is go to click on JMP add ins radio button here and then go down to the add in we're interested in. Okay. Now here I just left this as template, but if you...since you've already changed it in the addin.def file, it will be what that is. Once you do that, what you're going to see is the bits and pieces that belong to your add in are going to show up in blue on the left here. So this is what we're going to edit and change to make it what we want for our new add in. So let's take a look at this. I included the About in the What's New as a standard scripts within this, but when you get it, what you'll see is it's pointing to this template add in, which isn't what you want. You've got your own add in name that you want. So what we're going to do is click on the run add in this file. Click on this use add in folder and then select this add in that you want here. So that's going to point JMP to the new scripts that you just created. And then we can click Browse. And then we can browse and navigate and go to that particular script we want and say, "No, no, I don't want to use it in the template one. I want you to use it in my new add in." Likewise, we can do the same thing with the what is new item in the menu. And once you've done those, really it's just putting in the rest of them we want. If you do a right click over on the blue area, you'll see these options for insert before, insert after, where you can add things or delete things from your menu. So you can add things in, most of them are going to be commands. You can also do a submenu so it can go deeper, or you can put a separator where it makes it nice little line across there. And so you're going to build your menu, put all the items that you...for the scripts that you just created in there. And then we're going to save it, but we're going to save it twice. When you save it here, what that's going to do is just change it in the version of the add in that's installed on your machine. And that's good, but what we would also need to do is save it to the folder where you're building the add in. So we need to save it there so the next time we build the add in, it has the right menu file. So after we click save, we're going to click save as and go to the folder that contains our add in files. Okay, so we have our scripts. We have our addin.def defined, we've got our menu worked out. The next thing we need to do is actually build this thing and test it. So typically, I'll send it out to users. They can tell me what's working, what's not working. After that, I can go in and actually publish it. So when you go in and publish, it's going to prompt you to say, update the information. 
The most important thing, and this is in Lupine Manager, the most important thing is to change the release date. So this is what lets people know that there's a more current version available, right. So it...what's going to happen is this gets compared to the date tha't published, and that's what's to Lupine is going to use and say, hey, there's a more current version for you to upload. So you can also add notes to it like revision notes. These are the things that I changed this time. These are the great updates and things that I did to my add in. And once you do that, if you click Publish, then Lupine Manager will take a copy of that add in, put it in the folder that creates the add in, where you have the source code. It will also publish it to the add in depot and it will update the published add ins file to have the most recent version, which most importantly, your release date in it. And then when, if you, if a user were to click on check for updates, it would say, hey, this add in has a new version available, would you want to download it? They can click on that and it'll install it, it'll be on your machine...on their machine. Okay, that was a really brief introduction. I hope there's enough material in here for you to do this yourself. If not, please contact me at john_moore@elanco.com. I'm happy to help you set this up. Many thanks, and thanks again. Bye bye.  
Fred Girshick, Principal Technologist, Infineum USA, L.P.   Wherever there are moving parts, surfaces come into contact and need to be lubricated. Development of lubricants, particularly engine oils, relies on fundamental chemical knowledge, applied physics, bench-top experiments and small-scale fired engine tests; but the ultimate — and only certain — proof of performance are full-scale field tests in the engines under actual operating conditions. The results of these tests, for example, oxidation of the hydrocarbon oil, are inherently non-linear. After a brief introduction of engine oil characteristics and parameters, this paper will present several examples from passenger cars, heavy-duty trucks, railroad, stationary natural gas and marine engines where the JMP non-linear platform and graphing capabilities were used to differentiate performance of engines and engine oils. Both single-variable and categorical cross variable models are used.     Auto-generated transcript...   Speaker Transcript Fred Hello and thank you for inviting me to present at Discovery Summit Americas 2020. My name is Fred Girshick and I'm going to talk about lubricant research using JMP non-linear regression. So my agenda, I'll do some introductions about myself and my technical field, a little bit about the background of lubricants research, the types of questions we want to answer using statistical tools, give some examples of nonlinear models, show how I do a nonlinear analysis, and then talk about conclusions and plans for the future. So introductions, that's the building where I work. So who am I If any of you shouted out 24601, you get extra credit at the final exam. My name is Fred Girshick. I'm a researcher for a specialty chemical company, you can see our logo in the upper right hand. We are a manufacturer of chemical additives for lubricants and fuels. We do have a global statistics group for help with more complicated situations, but we encourage researchers to do their own statistical analysis or get closer to the data, understand better the assumptions and the sources of error. My experience is with various forms of engine oils for reciprocating internal combustion engines passenger car, on-highway trucks, off-highway construction equipment, railroad, aviation, stationary engines. And my specialty for the past 19 years is large engines and large is a relative term. So, later on, I'll show you what I mean when I say a large engine. I was a previous user of SAS. I'm currently user of JMP. I don't consider myself to be a sophisticated user I might be intermediate. I tend to do the same type of analyses over and over again because I'm generally asking the same questions over and over again, I, I'm still learning. I'd like to expand and become more sophisticated. I am better at Microsoft Excel than I am at JMP, so I use Excel to prepare the data set to import into JMP, and I very often export the results to Excel, just because I'm more adept at the graphing tools and then export to PowerPoint for customer presentation or into Word for ....for a technical report. I'm not going to do a real time live action demonstration getting into JMP and going through it. So I just have screenshots and I'll be pointing to now I would do this now, I would do that. engines, transmissions, gears, pumps, motors. Lubricants can be solid, liquid or gas. I concentrate on liquid engine oils. detergent, dispersant, and antioxidant, etc. 
Now within each of those additive types there are many different chemical options, so detergent isn't one thing, it's a family of things. And of course, not all engine oils contain all additive types. You only use what's needed for efficiency. Today's talk is going to focus on liquid lubricants, engine oils, for reciprocating internal combustion engines (RICE). That's what's generally in your car. detergent, dispersant, antioxidant, anti-wear, friction modifier. One example of each. So detergent might be calcium sulphonate. polyisobutylene succinic anhydride polyamine (PIBSA/PAM). Antioxidant is...might be a hindered phenol. Anti-wear might be zinc dialkyl dithio phosphate (ZDDP) complicated molecule. And friction modifier, something like glycerol dioleate. You'll notice all of these have a more polar part of the molecule and a less polar part of the molecule. In general terms, the polar part is what gives it its function and the non polar part is what makes it soluble in petroleum oil. And now I'll tell you what I mean by large engines. Here's an example of one of our laboratories So large engines. First, what is not a large engine. So this is a car. It's my wife's car. It has a reasonably powerful engine of 200 horsepower. That's not a large engine. What about trucks going down the highway, you know, making power, pulling freight. No, they are about 500 horsepower. That's not a large engine. My examples of large engine is a railroad locomotive engine. So here's a picture of a locomotive engine being installed in locomotive. The red outline is the engine and there is the person who's installing it, gives you an idea of the size. Stationary natural gas engines, which today's talk is going to be about. Just get my laser pointer. So here's an example of one of those 5900 horsepower. The green thing. And then at the end is the person working on it. And another example of a large engine is a marine engine, ships at sea. And here's a marine engine. This particular one generates 98,000 horsepower and there is the person working on it. So this is what I mean when I say large engines. They really are large. The types of questions we want to answer in lubricants research, things like how well does this bench test predict real real-world performance, you know, if I'm doing screener tests, rig tests. How does performance depend on concentration of an additive? How much do I need to put in? How does the structure of the additive effect performance? If I make that hydrocarbon chain longer or more branched? If I change the ratio of the polar part to the non polar part? Can I predict performance from composition, just looking at the molecule can I predict what it will do? How long will the product last before it needs to be changed? So in your car, you're told to change your oil every 3000 miles or 5000 or 7500 or 10,000, depending who you ask. How much better is my premium product than my main line product? Is there a differentiation between those? Is there a value proposition? How does my product compare to my competitor's product? So, these are the sorts of things we do. Now, let me talk about nonlinear. What do I mean by nonlinear. So in the equation world, in the statistics world, nonlinear means nonlinear in the parameters, not in the variables. Okay, so let's play a very fast game of linear or no linear... nonlinear. So y equals mx plus b. This is the classic linear equation. If you don't do anything else you know that. Well, what about a quadratic. Now I have x squared. And that's not linear. 
That's quadratic, but a, b, and c are the parameters. They are linear. That's also a linear equation. What about cubic? Same thing. Any polynomial is going to be linear in the parameters. So it's linear regression. a times e to the bx. So I have a pre exponent...exponential a, and I have b in the exponent, but this equation can be easily rewritten as log and now it's a linear equation again. So I would just regress log of y against x. Y equals A plus B x to the C. So this is not e to the x. It's x to the something. And this is nonlinear. So this is an example of a non linear equation. Y equals A over x plus b. Okay. A and B are in different places. There's a sum, but again I can rewrite this, if I take reciprocals. So I would regress 1 over y against x and then this becomes a linear situation. Y equals A times x over b plus x. This is a nonlinear. This is very classical Michaelis-Menten, something to do with biology and enzyme kinetics. And then here's a sort of complicated equation. This is the equation we're going to be using today. And this is distinctly nonlinear. So the example I've chosen for today is a natural gas engine oxidation, so in service, the oil oxidizes. This is a picture of one of the engines that we did the test in. This is a compressor station. So the blue outline is the engine. The red outline is the compressor that it's driving, the pump, and that rather large structure highlighted in green in the back, that's the radiator. So in general, oxidation for our purposes, oxidation is degradation caused by reaction with oxygen. Now, strictly speaking in chemistry class you learn oxidation can occur without oxygen. It has to do with loss of electrons, but that's not what's going on here. Common examples are when apple slices turn brown or old milk goes sour. As we said before, engine oils are mostly hydrocarbon molecules. They are exposed to high temperatures during engine operation. Fuel combustion generates free radicals that get into the oil and promote oxidation. Free radicals are molecular fragments with unpaired electrons. They are unstable and reactive and they attack other molecules to pair their electrons. During fuel combustion molecules are just blown apart to form these fragments Oxidation of the engine oil leaves to undesirable consequences like oil thickening. So higher viscosity than the engine design needs also lowers fuel economy, because you have to push around a thicker liquid. Oxidation forms acids and acid can corrode metal parts. Oxidation causes deposits and the deposits can block oil passages and starve the engine from lubricating oil or impede moving parts, just sticking things together. Oxidation in our field is often measured by infrared and the units are absorbance per centimeter. Engine manufacturers publish limits at which the oil must be changed. So when oxidation reaches a certain point, you're required to change the oil to maintain your warranty. In the test design I'll be talking about today, there were two different natural gas engine manufacturers, which I've just called X and Y to not name names and make it a generic There were three oils, which I've called blue, red and green, or a, b, and c for this example. And it was a 3x2 design, so each oil was in each engine design. The total...it was run for about 14 months which is 10,000 or 11,000 operating hours. That's about 15 months 14-15 months of continuous operation at greater than 95% of maximum load. The engines were only shut down when it was necessary to do an oil change. 
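To show how a nonlinear fit is set up in JMP, here is a minimal sketch using the Parameter formula convention and the Nonlinear platform. The asymptotic exponential form below is only an illustration; it is not the oxidation model actually fitted in the study.
dt = Current Data Table();
dt << New Column( "Ox Model",
    Formula(
        Parameter( {A = 0.1, B = 0.001},
            A * (1 - Exp( -B * :Oil ))   // :Oil is the oil age in hours
        )
    )
);
Nonlinear( Y( :Ox ), X( :Ox Model ) );   // iteratively estimates A and B by least squares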
Now if you think about that period of time, if you, if you drove your car near its maximum engine output, and let's say average 70 miles an hour, that'd be 700,000 miles of a test. Now for today's example I'm only going to be looking in detail at one of the engine models, the one I've called x. Oil samples were taken every week to 10 days, so there's total of about 600 samples. Many parameters were mentioned about 20 oil properties. Also we examined the engine, made physical measurements of wear, physical measurements of deposits. I'm only going to look at oxidation for now. So here's, here's the data set. It's been truncated. I've simplified it to only show what will need for today. So the variables of color is the oil formulation, the blue, green, and red. Oil is variable, that's the age of the oil in hours. So that's the hours since the last oil change, whereas test is the hours accumulated, the total since the test started, and Ox is oxidation in these units absorbance per centimeter. Now you'll notice the rows, I've color coded the rows according to the color. So all the oil what I've called blue oil, I've colored the markers in blue and red in red, etc. And I've also assigned shapes. So the x manufacturer has a square and the y manufacturer has a circle. Now, as I say, the full data set is much more extensive than this. Okay, so let's take our data set. plot the data. It's always the first thing you do. So in JMP, so what I'll be doing, again, I'm not doing a live JMP demonstration. So I'll show the action On the left hand words and that arrow means it's an action and then there will be a red circle showing where you do it. So use the graph platform, a box comes down, Graph Builder, and that brings you to this box, Graph Builder. So take oil to be the x variable. So you just click on that and drag it down here. And in the absence of a Y, it shows you like a histogram. And then ox is our dependent variable; ox, the y variable and you get this Now there's an obvious outlier. Even I can tell that. We'll get rid of that later. So it looks pretty messy. So I'm going to separate the two different OEMs, original equipment manufacturer. So if I take OEM and put that into overlay, now I have blue and red just JMP picks the colors. So it turns out the blue is the x and red is the y. And you can see the two different engine models behave very differently. Blue obviously has a shorter lifetime; red has a longer lifetime lifetime of oil. So for today, we're only going to look at a...  
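The Graph Builder steps just described can equivalently be run as a short script, using the column names from the truncated data table (Oil, Ox, OEM):
Graph Builder(
    Variables( X( :Oil ), Y( :Ox ), Overlay( :OEM ) ),
    Elements( Points( X, Y ) )
);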
Stephen Czupryna, Process Engineering Consultant & Instructor, Objective Experiments   Manufacturing companies invest huge sums of money in lean principles education for their shop floor staff, yielding measurable process improvements and better employee morale. However, many companies indicate a need for a higher return on investment on their lean investments and a greater impact on profits. This paper will help lean-thinking organizations move to the next logical step by combining lean principles with statistical thinking. However, effectively teaching statistical thinking to shop floor personnel requires a unique approach (like not using the word “statistics”) and an overwhelming emphasis on data visualization and hands-on work. To that end, here’s the presentation outline:   A)    The Prime Directive (of shop floor process improvement) B)    The Taguchi Loss Function , unchained C)    The Statistical Babelfish D)    Refining Lean metrics like cycle time, inventory turns, OEE and perishable tool lifetime E)    Why JMP’s emphasis on workflow, rather than rote statistical tools, is the right choice for the shop floor F)    A series of case studies in a what-we-did versus what-we-should-have-done format.   Attendee benefits include guidance on getting-it-right with shop floor operators, turbo-charged process improvement efforts, a higher return on their Lean training and statware investments and higher bottom line profits.     Auto-generated transcript...   Speaker Transcript Stephen Czupryna Okay. Welcome, everyone. Welcome to at the corner of Lean Street and Statistics Road. Thank you for attending. My name is Stephen Czupryna. I work as a contractor for a small consulting company in Bellingham, Washington. Our name is Objective Experiments. We teach design of experiments, we teach reliability analysis and statistical process control, and I have a fairly long history of work in manufacturing. So here's the presentation framework for the next 35 odd minutes. I'm going to first talk about the Lean foundation of the presentation, about how Lean is an important part of continuous improvement. And then in the second section, we'll take Lean to what I like to call the next logical step, which is to teach and help operators and in particularly teach them and helping them using graphics and, in particular, JMP graphics. And we'll talk about refining some of the common Lean metrics and we'll end with a few case studies. But first, a little bit of background, what I'm about to say in the next 35 odd minutes is based on my experience. It's my opinion. And it will work, I believe, often, but not always. There are some companies that that may not agree with my philosophy, particularly companies that are, you know, really focused on pushing stuff out the door and the like, probably wouldn't work in that environment, but in the right environment, I think a lot of what I what I say will work fine. All the data you're about to see is simulated and I will post or have posted detailed paper at JMP.com. You can collect it there, or you're welcome to email me at Steve@objexp.com and I'll email you a copy of it or you can contact me with some questions. Again, my view. My view, real simple, boil it all down production workers, maintenance workers are the key to continuous improvement. Spent my career listening carefully to production operators, maintenance people learning from them, and most of all, helping them. So my view is a little bit odd. 
I believe that an engineer, a process engineer or quality engineer, really needs to earn the right to enter the production area, to earn the support of the people who are working there day in, day out, eight hours a day. Again, my opinion. So who is the presentation for? The short list is people in production management, supervisors, manufacturing managers and the like; process engineers, quality engineers, manufacturing engineers; folks who are supposed to be out on the shop floor working on things. And this presentation is particularly for people who like the view in the photograph: that the customer, the internal customer if you will, is the production operator, and that the engineer or the supervisor is really a supplier to that person. And to quote Dr. Deming, "Bad system beats a good person, every time." And the fact is the production operators aren't responsible for the system; they work within the system. So the goal of the presentation is to help you work with your production people.
Sam Edgemon, Analyst, SAS Institute Tony Cooper, Principal Analytical Consultant, SAS   The Department of Homeland Security asked the question, “how can we detect acts of biological terrorism?” After discussion and consideration, our answer was “If we can effectively detect an outbreak of a naturally occurring event such as influenza, then we can find an attack in which anthrax was used because both present with similar symptoms.” The tools that were developed became much more relevant to the detection of naturally occurring outbreaks, and JMP was used as the primary communication tool for almost five years of interactions with all levels of the U.S. Government. In this presentation, we will demonstrate how those tools developed then could have been used to defer the affects of the Coronavirus COVID-19. The data that will be used for demonstration will be from Emergency Management Systems, Emergency Departments and the Poison Centers of America.     Auto-generated transcript...   Speaker Transcript Sam Edgemon Hello. This is Sam Edgemon. I worked for the SAS Institute, you know, work for the SAS Institute, because I get to work on so many different projects.   And we're going to tell you about one of those projects that we worked on today. Almost on all these projects I work on I work with Tony Cooper, who's on the screen. We've worked together really since since we met at University of Tennessee a few years ago.   And the things we learned at the University of Tennessee we've we've applied throughout this project. Now this project was was done for the Department of Homeland Security.   The Department of Homeland Security was very concerned about biological terrorism and they came to SAS with the question of how will we detect acts of biological terrorism.   Well you know that's that's quite a discussion to have, you know, if you think about   the things we might come back with. You know, one of those things was well what do you, what are you most concerned with what does, what do the things look like   that you're concerned with? And they they talked about things like anthrax, and ricin and a number of other very dangerous elements that terrorists could use to hurt the American population.   Well, we took the question and and their, their immediate concerns and researched as best we could concerning anthrax and ricin, in particular.   You know, our research involved, you know, involved going to websites and studying what the CDC said were symptoms of anthrax, and the symptoms of   ricin and and how those, those things might present in a patient that walks into the emergency room or or or or takes a ride on an ambulance or calls a poison center or something like that happens. So what we realized in going through this process was   was that the symptoms look a lot like influenza if you've been exposed to anthrax. And if you've been exposed to ricin, that looks a lot like any type of gastrointestinal issue that you might might experience. So we concluded and what our response was to Homeland Security was that   was that if we can detect an outbreak of influenza or an outbreak of the, let's say the norovirus or some gastrointestinal issue,   then we think we can we can detect when when some of these these bad elements have been used out in the public. And so that's the path we took. 
So we we took data from EMS and and   emergency rooms, emergency departments and poison centers and we've actually used Google search engine data as well or social media data as well   to detect things that are you know before were thought as undetectable in a sense. But but we developed several, several tools along the way. And you can see from the slide I've got here some of the results of the questions   that that we that we put together, you know, these different methods that we've talked about over here. I'll touch on some of those methods in the brief time we've got to talk today, but let's let's dive into it. What I want to do is just show you the types of conversations we had   using JMP. We use JMP throughout this project to to communicate our ideas and communicate our concerns, communicate what we were seeing. An example of that communication could start just like this, we, we had taken data from from the EMS   system, medical system primarily based in North Carolina. You know, SAS is based in North Carolina, JMP is based in North Carolina in Cary and   and some of them, some of the best data medical data in the country is housed in North Carolina. The University of North Carolina's got a lot to do that.   In fact, we formed a collaboration between SAS and the University of North Carolina and North Carolina State University to work on this project for Homeland Security that went on for almost five years.   But what what I showed them initially was you know what data we could pull out of those databases that might tell us interesting things.   So let's just walk, walk through some of those types of situations. One of the things I initially wanted to talk about was, okay let's let's look at cases. you know,   can we see information in cases that occur every, every day? So you know this this was one of the first graphs I demonstrated. You know, it's hard to see anything in this   and I don't think you really can see anything in this. This is the, you know, how many cases   in the state of North Carolina, on any given day average averages, you know, 2,782 cases a day and and, you know, that's a lot of information to sort through.   So we can look at diagnosis codes, but some of the guys didn't like the idea that this this not as clear as we want want it to be so so we we had to find ways to get into that data and study   and study what what what ways we could surface information. One of those ways we felt like was to identify symptoms, specific symptoms related to something that we're interested in,   which goes back to this idea that, okay we've identified what anthrax looks like when someone walks in to the emergency room or takes a ride on an ambulance or what have you.   So we have those...if we identify those specific symptoms, then we can we can go and search for that in the data.   Now a way that we could do that, we could ask professionals. There was there's rooms full of of medical professionals on this, on this project and and lots of physicians. And kind of an odd thing that   I observed very quickly was when you asked a roomful of really, really smart people question like, what what is...what symptoms should I look for when I'm looking for influenza or the norovirus, you get lots and lots of different answers.   So I thought, well, I would really like to have a way to to get to this information, mathematically, rather than just use opinion. And what I did was I organized the data that I was working with   to consider symptoms on specific days and and the diagnosis. 
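The first graph described above, daily case counts, is a one-line summary plus a plot in JSL. The column names are assumptions about the EMS extract, not the project's actual schema.
ems = Current Data Table();
daily = ems << Summary( Group( :Date ) );   // one row per day, with an N Rows count
daily << Graph Builder(
    Variables( X( :Date ), Y( :N Rows ) ),
    Elements( Line( X, Y ) )
);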
I was going to use those diagnosis diagnosis codes.   And what I ended up coming out with, and I set this up where I could run it over and over, was a set of mathematically valid symptoms   that we could go into data and look and look for specific things like influenza, like the norovirus or like anthrax or like ricin or like the symptoms of COVID 19.   This project surfaced again with with many asks about what we might...how we might go about finding the issues   of COVID 19 in this. This is exactly what I started showing again, these types of things. How can we identify the symptoms? Well, this is a way to do that.   Now, once we find these symptoms, one of the things that we do is we will write code that might look something similar to this code that will will look into a particular field in one of those databases and look for things that we found in those analyses that we've   that we've just demonstrated for you. So here we will look into the chief complaint field in one of those databases to look for specific words   that we might be interested in doing. Now that the complete programs would also look for terms that someone said, Well, someone does not have a fever or someone does not have nausea. So we'd have to identify   essentially the negatives, as well as the the pure quote unquote symptoms in the words. So once we did that, we could come back to   JMP and and think about, well, let's, let's look at, let's look at this information again. We've got we've got this this number of cases up here, but what if we took a look at it   where we've identified specific symptoms now   and see what that would look like.   So what I'm actually looking for is any information regarding   gastrointestinal issues. I could have been looking for the flu or anything like that, but this is this is what the data looks like. It's the same data. It's just essentially been sculpted to look like you know something I'm interested in. So in this case, there was an outbreak   of the norovirus that we told people about that they didn't know about that, you know, we started talking about this on January 15.   And and you know the world didn't know that there was a essentially an outbreak of the norovirus until we started talking about it here.   And that was, that was seen as kind of a big deal. You know, we'd taken data, we'd cleaned that data up and left the things that we're really interested in   But we kept going. You know that the strength of what we were doing was not simply just counting cases or counting diagnosis codes, we're looking at symptoms that that describe the person's visit to   the emergency room or what they called about the poison center for or they or they took a ride on the ambulance for.   chief complaint field, symptoms fields,   and free text fields. We looked into the into the fields that described the words that an EMS tech might use on the scene. We looked in fields that describe   the words that a nurse might use whenever someone first comes into the emergency room, and we looked at the words that a physician may may use. Maybe not what they clicked on the in in the boxes, but the actual words they used. And we we developed a metric around that as well.   This metric   was, you know, it let us know   you know, another month in advance that something was was odd in a particular area in North Carolina on a particular date. So I mentioned this was January 15 and this, this was December 6   and it was in the same area. 
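The chief-complaint search described above amounts to keyword matching with some handling of negated phrases. A minimal sketch with an illustrative column name, illustrative terms, and a deliberately simplified negation check:
dt = Current Data Table();
dt << New Column( "GI Symptom Flag", Numeric,
    Formula(
        (Contains( Lowercase( :Chief Complaint ), "nausea" )
            | Contains( Lowercase( :Chief Complaint ), "vomit" )
            | Contains( Lowercase( :Chief Complaint ), "diarrhea" ))
        & !Contains( Lowercase( :Chief Complaint ), "no nausea" )   // the real programs handle many negation patterns
    )
);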
What is really registering here is how much people are talking about a specific thing. If one person is talking about it, it's not weighted very heavily, so it wouldn't be a big deal. If two people are talking about it, if a nurse and an EMS tech are talking about a specific set of symptoms, or mentioning a symptom several times, then we measure that and build a metric from that information. And if three people, the doctor, the nurse and the EMS tech, if that's the information we have, are all talking about it, then it's probably a pretty big deal. That's what happened here on December 6: a lot of people were talking about symptoms that would describe something like norovirus. This was related to an outbreak that the media started talking about in the middle of February, so this was seen as us telling the world about something a month before the media did. Specifically, we were drawn to this Cape Fear region, because a lot of the cases were in that area of North Carolina around Wilson, in Wilson County. So that was seen as something of interest, that we could drill in that far in advance and talk about something going on.

Now, we carried on with that type of work, using those tools for biosurveillance. What we did later, after we had set up systems that were essentially running every hour of every day, was respond whenever the system predicted an outbreak. The information it provided was really noise-free in a sense; looking back over time, we were predicting somewhere between 20 and 30 total alerts a year. So there were 20 or 30 situations a year where we gave people notice that they might want to look into something, check something out, because there might be a situation occurring. In one of those instances, the fellow we worked with so much at Homeland Security came to us and said, okay, we believe your alert, so tell us something more about it. Tell us what it's made up of. That's how he put the question. So what we did was develop a model, right in front of him.

The reason we were able to do that (and here are the results of that model) was that by then we had realized the value of keeping data on symptoms relative to time and place, along with all the other pieces of data we could keep in relation to that, like age and ethnicity. So when we were asked what the alert was made up of, we could answer. Let me put this right in the middle of the screen and close some of the other information around it so you can focus on it. When we were asked what this outbreak was made up of, we built a model in front of them (Tony actually did that), and that seemed to have quite an impact: okay, you're right, we've told you today that there's an alert, and you should pay attention to influenza cases in this particular area because it appears to be abnormal.
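A minimal sketch of that kind of agreement metric follows, again in Python and for illustration only; the field names, toy records and weighting scheme are hypothetical, since the real metric was built in SAS against the EMS, emergency department and poison center feeds.

import pandas as pd

# Hypothetical extract: one row per case per documentation source
# (EMS tech, nurse, physician), with the symptom terms found in that
# source's free text. The schema is illustrative, not the real one.
mentions = pd.DataFrame({
    "case_id": [101, 101, 101, 102, 102, 103],
    "source":  ["ems", "nurse", "physician", "nurse", "physician", "ems"],
    "symptom": ["vomiting", "vomiting", "vomiting", "vomiting", "diarrhea", "fever"],
    "date":    pd.to_datetime(["2019-12-06"] * 6),
    "region":  ["Cape Fear"] * 6,
})

# Weight each case-symptom pair by how many distinct sources mention it:
# one voice counts a little; EMS tech, nurse and physician together count a lot.
per_case = (mentions
            .groupby(["date", "region", "symptom", "case_id"])["source"]
            .nunique()
            .rename("n_sources"))

# Daily metric per region and symptom: sum of the per-case source counts.
daily_metric = (per_case
                .groupby(level=["date", "region", "symptom"])
                .sum()
                .rename("agreement_score"))
print(daily_metric)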
Beyond the alert itself, we could also tell them that these cases were primarily made up of young people, people under the age of 16. The symptoms they're talking about when they go into the emergency room or get on an ambulance are fever, coughing, respiratory issues, pain, and gastrointestinal issues. The key piece of information, we feel, is the interactions between age groups and the symptoms themselves. While one of these may not be seen as important because it's down the list, we think it is, and even the ones further down. We talked about young people and dyspnea, young people and gastro issues, and then older people, because we were starting to see older people come into the data here as well. So we could talk about younger people and older people, and note that people in their 20s, 30s, 40s and 50s were not showing up in this outbreak at that time.

So there are a couple of things here: we could give people intel on the day an alert happened, and we could give them a symptom set to look for. When COVID-19 was well into our country, you still seemed to turn on the news every day and hear of a different symptom. This is how we can deal with those types of things. We can understand what symptoms are surfacing, so that people actually have the information to recognize when a problem is occurring. These are the kinds of things we're talking about here, and you can think about how to apply them now, using the alerting approach I showed you earlier, which I generally refer to as the TAP method: text analytics and proportional charting. We're probably beyond that point now; it's already on us, and we didn't have the tool in place to go looking then. But these types of tools may still help us say, these are the symptoms we're looking for, and these are the age groups we're interested in learning about. So let's keep walking through some ways we could use what we learned on that project to help the situation with COVID-19.

One of the things we did, of course, was build the symptoms database we've talked about. The symptoms database gives us information on a daily basis about the symptoms that arise, who's sick, and where they're sick. Here's an extract from that database: it has information on a date, on gender, ethnicity, and regions of North Carolina, and we could take this down to towns and ZIP codes or whatever is useful. I mentioned TAP, the text analytics information; well, now we've got TAP information on symptoms. If people are talking about, say, nausea, then we know how many people are talking about nausea on a given day, and eventually in a given place. This is just an extract of symptoms from the database. So let's look at how we could use it. Say an ER doctor, or someone investigating COVID-19, came to me and asked: where are people getting sick now, or where might an outbreak be occurring in a particular area? Well, this is the type of thing we might do to demonstrate that.
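For concreteness, here is a hypothetical extract in the spirit of the symptoms database just described; the column names and counts are invented, and the real table also carried gender, ethnicity, and finer geography such as towns and ZIP codes.

import pandas as pd

# Hypothetical daily extract of TAP counts: how often each symptom term
# shows up in the free-text fields, by region, on one date.
symptoms_db = pd.DataFrame({
    "date":        pd.to_datetime(["2020-03-02"] * 4),
    "region":      ["Charlotte", "Greensboro", "Asheville", "Raleigh-Durham"],
    "fever":       [120, 35, 30, 80],
    "respiratory": [140, 40, 25, 90],
    "nausea":      [30, 50, 45, 35],
})

# "Where are people getting sick?" One quick answer is to roll the TAP
# counts up by region for the symptoms of interest.
print(symptoms_db.groupby("region")[["fever", "respiratory"]].sum())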
I use Principal Components Analysis a lot. Because we've got the data set up this way, I can use that tool to identify what I'm interested in analyzing. In this case it's the regions; the question was where, and what. What are you interested in knowing about? I hear people talk about respiratory issues concerning COVID, and I hear people talking about having a fever, and these are kind of elevated symptoms. These are issues that people are talking about even more than they're writing down; that's the idea of TAP, getting into those text fields and understanding interesting things. Once we run this analysis, JMP creates this wonderful graph for us. It's great for communicating what's going on. And what's going on in this case is that Charlotte, North Carolina, is really maybe inundated with physicians and nurses and maybe EMS techs talking about their patients having a fever and respiratory issues. If you want to get as far away from that as you can, you might spend time in Greensboro or Asheville, and if you're in Raleigh-Durham, you might want to be aware of what's on the way. So this is a way we can use this type of information for, essentially, intelligence into what might be happening next in specific areas. We could also talk about severity in the same instance; we could talk about the severity of cases and measure where they are the same way.

So the key here is getting the symptoms database organized and utilized. We used JMP to communicate these ideas. A graph like this may have been shown to Homeland Security, and we could easily talk about it for two hours, not just about validity, where the data came from and so forth. We could talk about that, and we could also talk about the information they need to know, the information that helps you understand where people are getting sick, so that warnings can be given and lives saved.

So that, in a sense, is the system we put together. The underlying key is the data. Again, the data we've used is EMS, ED, and poison center data. I don't have an example of the poison center data here, but I have a whole other talk about how we used poison center data to surface foodborne illness in ways very similar to what we've shown here. And then there's the ability to be fairly dynamic in developing our story in front of people, talking to them and selling belief in what we do. JMP helps us do that; SAS code helps us do that. That's a good combination of tools, and that's all I have for this particular topic. I appreciate your attention, hope you find it useful, and hope we can help you with this type of work. Thank you.
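To round out that walkthrough, here is a minimal sketch of the regional principal-components view described above, in Python with a made-up region-by-symptom table of TAP counts; the actual analysis was done in JMP's Principal Components platform, so this is only an illustration of the idea.

import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical region-by-symptom table of TAP counts, standing in for the
# symptoms database; the numbers are invented.
tap = pd.DataFrame(
    {"fever":       [120, 35, 30, 80],
     "respiratory": [140, 40, 25, 90],
     "nausea":      [30,  50, 45, 35]},
    index=["Charlotte", "Greensboro", "Asheville", "Raleigh-Durham"])

# Standardize the columns, then project the regions onto the first two
# principal components, roughly what the JMP biplot is showing.
z = (tap - tap.mean()) / tap.std(ddof=1)
pca = PCA(n_components=2)
scores = pca.fit_transform(z)

for region, (pc1, pc2) in zip(tap.index, scores):
    print(f"{region:>15}: PC1={pc1:6.2f}  PC2={pc2:6.2f}")
print("explained variance ratio:", pca.explained_variance_ratio_.round(2))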
Tony Cooper, Principal Analytical Consultant, SAS Sam Edgemon, Principal Analytical Consultant, SAS Institute   In process and product development, Design of Experiments (DOE) helps to answer questions like: Which X's cause Y's to change, in which direction, and by how much per unit change? How do we achieve a desired effect? And which X's might allow looser control, or require tighter control, or a different optimal setting? Information in historical data can help partially answer these questions and help run the right experiment. One of the issues with such non-DOE data is the challenge of multicollinearity. Detecting multicollinearity and understanding it allows the analyst to react appropriately. Variance Inflation Factors from the analysis of a model and Variable Clustering without respect to a model are two common approaches. The information from each is similar but not identical. Simple plots can add to the understanding but may not reveal enough information. Examples will be used to discuss the issues.

Auto-generated transcript...

Speaker Transcript Tony Hi, my name is Tony Cooper and I'll be presenting some work that Sam Edgemon and I did; I'll be representing both of us. Sam and I have been research statisticians in support of manufacturing and engineering for much of our careers, and today's topic is Understanding Variable Clustering Is Crucial When Analyzing Stored Manufacturing Data.

I'll be presenting from a single data set. The focus of this presentation will be reading output, and I won't really have time to fill out all the dialog boxes to generate that output. But I've saved all the scripts in the data table, which of course will be available in the JMP Community. Once a script is run, you can always use the red arrow to do the redo option and relaunch the analysis, and you'll see the dialog box that would have been used to produce that piece of output.

I'll be using this single data set of manufacturing data, so let's have a quick look at how it looks. First of all, I have a timestamp: this is a continuous process, and at some interval of time I come back and collect a bunch of information. There are some Y variables, some major outputs, some KPIs that the customer really cares about; these have to be in spec and so forth in order for us to ship, so these would definitively be outputs. Then there's line speed, the set point for the vibration, and a bunch of other things. You can see all the way across that I've got 60-odd variables being measured at each moment in time. Some of them are sensors and some of them are set points. In fact, some of them are indeed set points, like the manufacturing heat set point. Some of them are commands, meaning what the PLC told the process to do. Some of them are measures, which is the actual value right now. Some are ambient conditions; I think that one is an external temperature. Some are raw material quality characteristics, and there's certainly some vision system output. So there's a bunch of things in the inputs. And you can imagine, for instance, that if I've got the command and the measure, what I told zone one to be at and what it actually measured, we'd hope those are correlated.
We need to investigate some of that in order to think about what's going on, and that's multicollinearity: understanding the interrelationships among the inputs and, separately, among the outputs. By and large we're not doing Y-cause, X-effect right now; this is not a supervised technique, it's an unsupervised one. I'm just trying to understand what's going on in the data. All my scripts are over here; here's the abstract we talked about, and here's a review of the data. As we just said, the data may explain why there is so much multicollinearity, because I have set points and actuals in there. But we'll learn about those things.

What we're going to do first is fit Y as a function of X, and I'm only going to look at these two variables right now: the zone one command, what I told zone one to be, and a thermocouple in the zone one area that measures the temperature. You can see clearly that as zone one temperature gets higher, this Y4, this response variable, gets lower. That's true for the measurement as well, and you can see it in the estimates: both are negative, and by the way these are fairly predictive variables, in the sense that just one of them explains about 50% of what's going on in Y4.

Now let's do the multivariate version. Imagine I go to Fit Model and move both of those variables into my model, still analyzing Y4. Oh, look at this. Now it's suggesting that as the command goes up, Y4 does the opposite of what I expect. This other estimate is still negative, in the right direction, but look at some of these numbers. They aren't even in the same ballpark as what I had a moment ago, which was negative .04 and negative .07; now I have positive .4 and negative .87. I'm not sure I can trust this model from an engineering standpoint, and I really wonder how it's characterizing and helping me understand this process.

And there's a clue. It may not be on by default, but you can right-click on the parameter estimates table and ask for the VIF column. That stands for variance inflation factor, and it's an estimate of the instability of the model due to this multicollinearity. We'll need a little intuition on how to think about that value, but just to be clearer about what's going on, I'm going to plot the temperature zone one command against the measure. As you would expect, when you tell the zone to increase in temperature, the zone does increase in temperature, and by and large maybe even to the right values. I've colored this by Y4, so it's suggesting that at low values of temperature I get high values of Y4, just as I saw on the individual plots. But you can maybe start to see why the multivariate model didn't work: you're trying to fit a three-dimensional surface over this knife edge, and obviously there's some instability. You can easily imagine that the surface isn't well pinned on the sides, so it can rock back and forward, and that's what you're getting.
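To make the diagnostic concrete, here is a minimal sketch of what that VIF column is reporting, computed by hand in Python on made-up command and measure columns; the talk itself uses JMP's Fit Model output, and nothing below comes from the real data set.

import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor for each column of X: regress column j on
    the remaining columns and return 1 / (1 - R^2_j)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])       # add an intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical zone-one command and measure: nearly collinear, so both VIFs blow up.
rng = np.random.default_rng(1)
command = rng.uniform(200, 260, size=200)
measure = command + rng.normal(0, 0.5, size=200)   # tracks the set point closely
X = np.column_stack([command, measure])
print(vif(X))   # both values land far above the usual cutoff of 10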
That is what the variance inflation factor is picking up: OLS, ordinary least squares, typical regression analysis, just can't handle it. In some sense we could also talk about the X'X matrix being almost singular in places. So we've got some heuristic sense of why it's happening.

Let's go back and think more about the values. We know the variance inflation factor actually considers more than just pairwise relationships, but pairwise intuition helps. If two variables are 60% correlated in the R-square sense, and it were all pairwise, the VIF would be about 2.5. If two variables were 90% correlated in that sense, I would get a VIF of 10. The literature says that when you have VIFs of 10 or more, you have enough instability that you should worry about your model, because in some sense you've got 90% correlation between things in your data. Now, whether 10 is a good cutoff really depends on your application. If you're making real design decisions based on the model, I think a cutoff of 10 is way too high, and maybe something near two is better. But suppose I'm thinking about which factors to put in an experiment. I want to learn how to run the right experiment, I've got to pick factors to go in it, and we too often go after the usual suspects, so is there a way to push myself to think of better factors? In that case a cutoff of 10 might be useful to help narrow things down. So it really depends on what the purpose is.

More on this idea of purpose. To me there are two main purposes of modeling. One is what will happen; that's prediction. That's sometimes different from why it will happen, which is more like explanation. As we just saw with the very simple command and measure on zone one, you cannot do very good explanation: I would not trust that model to explain why something's happening when the estimates seem to go in the wrong direction like that. So I wouldn't use it for explanation, and I'm sort of suggesting I wouldn't use it for prediction either, because if I can't understand the model, if it's not intuitively doing what I expect, extrapolation seems dangerous. And of course, by definition, prediction of a future event is extrapolation in time. We've been talking about ordinary least squares so far, but all the modeling techniques I see, like decision trees and partition analysis, are in some way affected by this issue, in different and unexpected ways. So it seems a good idea to take care of it if you can.

This isn't unique to manufacturing data, but it's probably exaggerated in manufacturing data, because often the variables are controlled. If we have zone one temperature and we've learned that it needs to be a particular value to make good product, then we will control it as tightly as possible to that desired value.
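The rule of thumb quoted above reads as the standard relationship VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others; a two-line check, for illustration only:

# VIF = 1 / (1 - R^2): an R-square of 0.6 between predictors gives about 2.5,
# and an R-square of 0.9 gives 10, the usual "start worrying" threshold.
for r2 in (0.6, 0.9):
    print(f"R^2 = {r2}:  VIF = {1 / (1 - r2):.2f}")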
So the extrapolation gets harder and harder; this is exaggerated in manufacturing because we can, and do, control these variables. There are some other things about manufacturing data, which you can read here, that cut both ways; these are the opportunities and challenges. On the better side, you can understand manufacturing processes; there's a lot of science and engineering around how they run. You can understand that stoichiometry, for instance, requires the amount of chemical A you add to be in relationship to chemical B, or that you don't want to jump from one zone one temperature to something vastly different, because you need to ramp it up slowly or you'll just create stress. So you can and should understand your process, and that may be one way, even without any empirical evidence of multicollinearity, to get rid of some of it.

Another advantage of manufacturing data is that it's almost always time based, so do plot the variables over time. It's often interesting that manufacturing data seems to move in blocks of time: we think a setting should be 250 and we run it there for months, maybe even years, and then suddenly someone says, we've done something different, we've got a new idea, let's move the temperature. So the behavior looks very different.

And if you're thinking about why there is multicollinearity: we've talked about it possibly being due to physics or chemistry, but it could be due to some latent variable, especially when a variable shifts in time like we just saw; anything else that changes at that rate could be the actual thing affecting Y4. It could be due to a control plan. It could be correlated by design. For each of these reasons, the question I always ask is: is it physically possible, and does it make sense, to try other combinations? If so, you're leaning towards doing experimentation, and this retrospective modeling of your data is very helpful for designing better experiments. Sometimes, though, the expectations are a little too high, to my mind; we seem to expect a definitive answer from retrospective data.

So we've talked about two methods, two and a half maybe, to address multicollinearity and understand it. One is the VIF. Here's the VIF on a bigger model with all the variables in. How would I figure out which variables are correlated with which? This tells me I have a lot of problems, but it does not tell me how to deal with them. So it's good at helping you say, oh, this is not going to be a good model, but it's not so helpful at getting you out of it. And if I were to put interactions in here, it would be even worse, because interactions are almost always more correlated. So we need another technique, and that is variable clustering. It's available in JMP, and there are two ways to get to it: you can go through Analyze > Multivariate Methods > Principal Components, or you can go straight to Clustering > Cluster Variables. If you go through Principal Components, you use the red triangle option to get to cluster variables. I still prefer the PCA route because I like the extra output.
But it is based on PCA, so we're going to talk about PCA first and then look at the output from variable clustering. And there's the JMP page. In order to talk about principal components, I'm actually going to work with standardized versions of the variables first. Let's remind ourselves what standardized means: the mean is now zero and the standard deviation is now one, so all the variables are on the same scale. Implicitly, when you do principal components on correlations in JMP, you are working with standardized variables. JMP is of course more than capable, more than smart enough, to take the original values, work the correct way, and figure out what the right formula should be on the unstandardized scale when it outputs one. But just as a first look, so we can see where the formulas come from, I'm going to work with standardized variables for a while; we'll quickly go back, but I want to see one formula.

And that's this formula right here. Now let's think about the output. What is the purpose of PCA? It's called a variation-reduction technique, but what it does is look for linear combinations of your variables, and if it finds a linear combination it likes, that's a principal component. It uses eigenanalysis to do this. Another way to think about it: I put 60-odd variables in as inputs, but there are not 60 independent dimensions that I can manipulate in this process; the variables are correlated. What I do to the command for temperature zone one dictates almost entirely what happens with the actual temperature in zone one, so those aren't two independent variables; you don't have two dimensions there. So how many dimensions do we have? That's what the eigenvalues tell you. They're like variances, estimates of variance, and the cutoff is one. If I had the whole table here, there would be 60 eigenvalues going all the way down; the report here stops around 0.97, but the next one down is probably about 0.965, and they keep going down. The guideline is that if an eigenvalue is greater than one, that linear combination is explaining signal, and if it's less than one, it's just explaining noise.

So what JMP does when I go to variable clustering is say: you have a second dimension here, meaning a second group of variables that explains something a little bit different, so I'm going to separate the variables into two groups and then do PCA on both. The first eigenvalue in each group will be big, but what does the second one look like after the split? If the second eigenvalues are now less than one, you're done; if one is still greater than one, that group can be split even further. And it keeps splitting, iteratively and divisively, until there are no second components greater than one anymore. So it's creating these linear combinations, and when I save principal component one, it is exactly this formula.
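For intuition, here is a toy Python sketch of that divisive, eigenvalue-greater-than-one logic. It only mirrors the idea described above; it is not JMP's actual Cluster Variables algorithm, and the variable names and data are invented.

import numpy as np

def cluster_variables(Z: np.ndarray, names: list) -> list:
    """Toy divisive variable clustering: if the second eigenvalue of a group's
    correlation matrix exceeds 1, split the group by assigning each variable
    to whichever of the first two principal components it loads on most
    heavily, then recurse. (JMP's real algorithm is more refined.)"""
    if len(names) < 2:
        return [names]
    corr = np.corrcoef(Z.T)
    eigvals, eigvecs = np.linalg.eigh(corr)            # ascending order
    if eigvals[-2] <= 1.0:                             # second-largest eigenvalue
        return [names]                                 # one real dimension: stop
    loadings = eigvecs[:, [-1, -2]] * np.sqrt(eigvals[[-1, -2]])
    to_pc2 = loadings[:, 1] ** 2 > loadings[:, 0] ** 2
    groups = []
    for mask in (~to_pc2, to_pc2):
        idx = np.where(mask)[0]
        if idx.size == 0:
            return [names]                             # degenerate split: stop
        groups += cluster_variables(Z[:, idx], [names[i] for i in idx])
    return groups

# Hypothetical standardized columns: three correlated temperature signals and
# two correlated feed amounts should fall into two separate clusters.
rng = np.random.default_rng(0)
t, f = rng.normal(size=500), rng.normal(size=500)
cols = [t, t + 0.1 * rng.normal(size=500), t + 0.1 * rng.normal(size=500),
        f, f + 0.1 * rng.normal(size=500)]
Z = np.column_stack(cols)
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)
print(cluster_variables(Z, ["temp_cmd", "temp_meas", "temp_zone3", "chem_A", "chem_B"]))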
It's exactly this formula, the one with the .0612 coefficient; that's the formula for Prin1, and it will be exactly right as long as you standardize. If you don't standardize, JMP just figures out how to do it on the unstandardized scale, which it is more than capable of doing.

So let's start working with the initial data and do our example. You'll see this is very similar output; in fact it's the same output except for one thing I've turned on, this option right here, cluster variables. What I get down here is the cluster variables output, and you can see these numbers are the same. You can already start to see some of the thinking it's going to have to do. Look at these two right here: the amount of A I put in seems to be highly related to the amount of B I put in, and that would make sense for most chemical processes if it's part of the reaction you're trying to accomplish; if you want to put 10% more A in, you're probably going to put 10% more B in. So even in this ray plot you start to see things that suggest the multicollinearity, and we're getting somewhere.

But I want to put the variables in distinct groups, and that's a little harder. Watch this one right here, temperature zone 4. It's actually the opposite: roughly the same angle but in the opposite direction, almost 180 degrees from A and B, so it's negatively correlated with them. I want the variables in exclusive groups, and that's what we get when we ask for variable clustering. It took those very same variables and put them into eight distinct groups. And here are the standardized coefficients, the formulas for the individual clusters. When I save the cluster components, I get something very similar to what we did with Prin1, except it's just for cluster 1: notice that in the row that has zone one command with a .44, everywhere else is zero. Every variable is exclusively in one cluster or another.

So let's talk about some of the output. We're doing variable clustering and... oops, sorry. Tony And then we've got some tables in our output. I'm going to minimize this window and talk about what's in here. The first thing I want to point us to is the standardized estimates. If you want to do it by hand, so to speak, and ask how I get a .35238 here, I could run PCA on just the cluster one variables, the ones staying in that group, and then look at the eigenvalues and eigenvectors; those are exactly these numbers. So the .352 is just what I said: it keeps divisively doing multiple PCAs, and you can see that the second principal component here is less than one, which is why it stopped at that component. Who's in cluster one? Well, there's temperature: the two zone one measures and a zone three A measure, and the amount of water seems to matter alongside those, with some of the other temperatures over here in cluster six. This is a very helpful color-coded table, organized by cluster. This is cluster one; I can hover over it and read all of the variables in it, like added water
(sorry, let me get the right one; that one is my temperature three). And that's a positive correlation; interestingly, zone 4 has a negative correlation there. You will definitely see blocks of color: this is cluster one obviously, this is maybe cluster three, and this I know is cluster six. But look over here: as you can imagine, clusters one and three are somewhat correlated. We start to see some ideas about what we might do, some intuition as to what's going on in the data.

Let's explore this table right here: the R-square with own cluster, with next cluster, and the one minus R-square ratio. I'm going to save that to my own data set and run a multivariate on it, so I've got cluster components one through eight and all the factors. What I'm really interested in is this part of the table: for all of the original variables, how they correlate with the cluster components. Let me save that table and then delete some extra rows; it's the same table, I'm just deleting rows so we can focus on certain ones. What I've got left: the columns are the cluster components, and the rows are the 27 variables (not 60, sorry) that we were thinking about, placed into eight different groups, and I've added that group number, which can come out automatically. To keep working, I don't want these as correlations, I want them as R-squares, so I'm going to square all those numbers. Now we can start thinking about it.

Let's look at row one. The temperature one measure we've talked about is 96% correlated with cluster one and 1% correlated with cluster two, so it really, really wants to be in cluster one and doesn't have much interest in cluster two. Let's color code some things so we can find them faster. We're talking about zone one meas, and the cluster it would like to be in next, if anything, is cluster five. It's 96% correlated with its own cluster, but if it had to leave, it would think about cluster five. And what's in cluster five? You could certainly start looking at that: more temperatures, the moisture heat set point and so forth. So this number, the R-square with its own cluster, tells you how much a variable likes being in its cluster, and the R-square with the next cluster tells you how much it would like to be in a different one. Those are the two things reported in the table. JMP doesn't show all the correlations, but it may be worth plotting them, and as we just demonstrated, it's not hard to do. So this one says, I really like being in my own cluster; this one says, if I had to leave, I don't mind going over here. Let's compare those two: take one minus this number, divided by one minus that number. That's the one minus R-square ratio, which weighs how well a variable fits its own cluster against how tempted it would be to go to another cluster. And let's plot some of those.
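As a small illustration of that ratio, here is a Python sketch with made-up data roughly in the spirit of the vacuum set point example that follows (own-cluster R-square near .97, next-cluster R-square near zero); the function and numbers are hypothetical and not taken from the talk's data table.

import numpy as np

def one_minus_r2_ratio(x: np.ndarray, own: np.ndarray, others: list) -> tuple:
    """R^2 of a variable with its own cluster component, with the best of the
    other cluster components, and the (1 - R^2 own) / (1 - R^2 next) ratio.
    Lower ratios mean the variable sits comfortably in its own cluster."""
    r2_own = np.corrcoef(x, own)[0, 1] ** 2
    r2_next = max(np.corrcoef(x, c)[0, 1] ** 2 for c in others)
    return r2_own, r2_next, (1 - r2_own) / (1 - r2_next)

# A variable that tracks its own cluster component closely and is essentially
# unrelated to the next nearest one gets a ratio near zero.
rng = np.random.default_rng(2)
own_comp = rng.normal(size=300)
next_comp = rng.normal(size=300)
x = own_comp + 0.18 * rng.normal(size=300)
print(one_minus_r2_ratio(x, own_comp, [next_comp]))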
Let me put a local data filter on there, for cluster. And here's the thing: lower values of this ratio are, in a sense, better; those are the ones over here. Let's highlight one; I like this one down here, the vacuum set point. You can see it really, really liked its own cluster over the next nearest, .97 versus .0005, so the ratio is near zero. You wouldn't want to move that one. And you could start to do things like looking at just the cluster one variables: if any of them wanted to leave, maybe it's Pct_reclaim; maybe it got in there by fortuitous chance. And if I were looking at cluster two, I could start thinking about the line speed.

The last table I'll talk about is the cluster summary table, this table here. It has gone through the list and found that, looking down R-square own cluster, the highest number is .967 for cluster one, so maybe that's the most representative variable. To me, it may not be wise to let software decide to keep this variable and only this variable, although certainly every software package has a button that launches the analysis with just the main ones. That may give you a stable model in some sense, but I think it shortchanges the kinds of things you can do, and hopefully, with the techniques provided, you now have the ability to explore those other things. You can calculate this number by again doing the PCA on just that cluster's own variables; it's saying that in its own cluster the first eigenvalue, the first component, explained 69%. Let me close these and summarize.

So we've given you several techniques. Failing to understand multicollinearity can make your data hard to interpret and even make your models misleading. Understanding the data as it relates to the process can explain why the multicollinearity is there; that's the subject matter expertise standpoint. Pairwise correlations and scatterplots are easier to interpret but miss multicollinearity, so you need more. VIF is great at telling you that you have a problem, but it's only really available for ordinary least squares; there's no comparable diagnostic for other prediction methods, just some guidelines. And variable clustering, which is based on principal components, shows membership and the strength of relationship within each group much better. I hope this review, or introduction, encourages you to grab this data set from JMP Discovery; you can run the scripts yourself, go more slowly, make sure you feel good about it, and practice on your own data.

One last plug: there is other material from times Sam and I have spoken at JMP Discovery conferences or written white papers for JMP, and you might want to look at how we've thought about manufacturing data, because we have found that modeling manufacturing data is a little unique compared to, say, the customer data in telecom that a lot of us learned on when we went through school. I thank you for your attention, and good luck with further analysis.