Laura Lancaster, JMP Principal Research Statistician Developer, SAS
Jianfeng Ding, JMP Senior Research Statistician Developer, SAS
Annie Zangi, JMP Senior Research Statistician Developer, SAS

JMP has several new quality platforms and features – modernized process capability in Distribution, CUSUM Control Chart and Model Driven Multivariate Control Chart – that make quality analysis easier and more effective than ever. The long-standing Distribution platform has been updated in JMP 15 with a more modern and feature-rich process capability report that now matches the capability reports in Process Capability and Control Chart Builder. We will demonstrate how the new process capability features in Distribution make capability analysis easier with an integrated process improvement approach. The CUSUM Control Chart platform is designed to help users detect small shifts in their process over time, such as gradual drift, where Shewhart charts can be less effective. We will demonstrate how to use the CUSUM Control Chart platform and how to use average run length to assess chart performance. The Model Driven Multivariate Control Chart (MDMCC) platform, new in JMP 15, is designed for users who monitor large numbers of highly correlated process variables. We will demonstrate how MDMCC can be used in conjunction with the PCA and PLS platforms to monitor multivariate process variation over time, give advance warning of process shifts, and suggest probable causes of process changes.
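To make the CUSUM idea above concrete, here is a minimal sketch (in Python, outside JMP) of a one-sided tabular CUSUM for an upward mean shift. The target, sigma, reference value k and decision interval h are hypothetical choices; JMP's CUSUM Control Chart platform computes these statistics and the associated average run length for you, so this only illustrates how small, sustained drifts accumulate into a signal.

```python
import numpy as np

def cusum_high(x, target, sigma, k=0.5):
    """One-sided upper CUSUM: C[i] = max(0, x[i] - (target + k*sigma) + C[i-1])."""
    c = np.zeros(len(x))
    for i, xi in enumerate(x):
        prev = c[i - 1] if i > 0 else 0.0
        c[i] = max(0.0, xi - (target + k * sigma) + prev)
    return c

# Hypothetical process: in control at mean 10, then a small upward drift.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(10, 1, 50), rng.normal(10.8, 1, 30)])
c_plus = cusum_high(x, target=10, sigma=1)
h = 5 * 1  # decision interval of 5 sigma is a common textbook default
signals = np.flatnonzero(c_plus > h)
print("first signal at sample:", signals[0] if signals.size else "none")
```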
The purpose of this poster presentation is to display COVID-19 morbidity and mortality data available online from Our World in Data, whose contributors ask the key question: “How many tests to find one COVID-19 case?” We use SAS JMP Analyze to help answer the question. Smoothing test data from Our World in Data yields seven-day moving average, or SMA(7), total tests per thousand in five countries for which coronavirus test data are reported: Belgium, Italy, South Korea, the United Kingdom and the United States. Similarly, seven-day moving average, or SMA(7), total cases per million were derived using the Time Series Smoothing option. Coronavirus tests per case were calculated by dividing smoothed total tests by smoothed total cases and multiplying by a factor of 1,000. These ratios of smoothed tests to smoothed cases were themselves smoothed. Additionally, Box-Jenkins ARIMA(1,1,1) time series models were fitted to smoothed total deaths per million to graphically compare smoothed case-fatality rates with smoothed tests-per-case ratios.   Auto-generated transcript...   Speaker Transcript Douglas Okamoto In our poster presentation we display COVID-19 data available from Our World in Data, whose database sponsors ask the question: why is data on testing important? We use JMP to help us answer the question. Seven-day moving averages are calculated from January 21 to July 21 for daily per capita COVID-19 tests and coronavirus cases in seven countries: the United States, Italy, Spain, Germany, Great Britain, Belgium and South Korea. Coronavirus tests per case were calculated by dividing smoothed tests by smoothed cases and multiplying by a factor of 1,000. Daily COVID-19 test data yield smoothed tests per thousand in Figure 1. Testing in the United States, in blue, trends upward, with two tests per thousand daily on July 21st, 10 times more than South Korea, in red, which trends downward. The x axis in Figure 1 is normalized to days since moving averages numbered one or more tests per thousand. In Figure 2, smoothed coronavirus cases per million in Europe and South Korea trend downward after peaking months earlier than the US, in blue, which averaged 2,200 cases per million on July 21st, with no end in sight. The x axis is normalized to the number of days since moving averages of 10 or more cases per million. Combining tabular results from Figure 1 and Figure 2, smoothed COVID-19 tests per case in Figure 3 show South Korean testing, in red, peaking at 685 tests per case in May, 38 times US performance, in blue, of 22 tests per case in June. Since the x axis is dated, Figure 3 represents a time series. The reciprocal of tests per case, cases per test, is a measure of positivity: one in 22, or 4.5%, positivity in the US compares with 0.15% positivity in South Korea and 0.5 to 1.0% in Europe. At a March 30 WHO press briefing, Dr. Michael Ryan suggested a positive rate of less than 10%, or even better, less than 3%, as a general benchmark of adequate testing. JMP Analyze was used to fit Box-Jenkins time series models to smoothed tests per case in the US from March 13 to April 25; predicted values from April 26 to May 9 were forecast from a fitted autoregressive integrated moving average, or ARIMA(1,1,1), model. In Figure 4, a time series of smoothed tests per case from mid-March to April shows a rise in the number of US tests per case, not a decline as predicted, during the 14-day forecast period. 
In summary, 10 or more tests were performed per case to provide adequate testing in the United States. COVID-19 testing in Europe and South Korea was more than adequate, with hundreds of tests per case. Equivalently, the positive rate, or number of cases per test, was less than 10% in the US, whereas positivity in Europe and South Korea was well under 3%. When our poster was submitted, the US totaled 4 million coronavirus cases, more than the European countries and South Korea combined. The US continues to be plagued by state-by-state disease outbreaks. Thank you.  
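As a rough illustration of the calculations described in this poster, the sketch below reproduces the SMA(7) smoothing, the tests-per-case ratio (with the factor of 1,000 that reconciles per-thousand tests with per-million cases) and the implied positivity outside JMP. The file and column names follow the Our World in Data download but should be treated as assumptions.

```python
import pandas as pd

# Hypothetical extract of the Our World in Data file; column names are assumptions.
df = pd.read_csv("owid-covid-data.csv", parse_dates=["date"])
us = df[df["location"] == "United States"].sort_values("date").set_index("date")

sma_tests = us["total_tests_per_thousand"].rolling(7).mean()   # SMA(7)
sma_cases = us["total_cases_per_million"].rolling(7).mean()    # SMA(7)

# per-thousand divided by per-million, so multiply by 1,000 to get tests per case
tests_per_case = 1000 * sma_tests / sma_cases
positivity = 1 / tests_per_case                                 # cases per test

print(tests_per_case.tail())
print("latest positivity: {:.2%}".format(positivity.iloc[-1]))
```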
Pranjal Taskar, Formulation Scientist II, Thermo Fisher Scientific
Brian Greco, Formulation Scientist I, Thermo Fisher Scientific
Sabrina Zojwala, Formulation Scientist I, Thermo Fisher Scientific
Kat Brookhart, Manager, Formulation & Process Development, Thermo Fisher Scientific
Sanjay Konagurthu, Sr. Director, Science and Innovation, Drug Product NA Division Support, Thermo Fisher Scientific

Pharmaceutical tableting is a process in which an active moiety is blended with inert excipients to achieve a compressible mixture. This mixture is consolidated into the final dosage form: a tablet. The process of tableting considers different composition-related and process variables that impact the quality attributes of the final product. This work focuses on using JMP software to identify main effects. An I-optimal, 19-run custom design was outlined with the factors being type and ratio of filler used (microcrystalline cellulose, mannitol vs. lactose; categorical), percentage active spray-dried dispersion loading (continuous), order and amount of addition (intragranular vs. extragranular; continuous), and ribbon solid fraction (continuous). The responses were outlined as bulk density, Hausner ratio, percentage fines, blend compressibility and tablet disintegration. The model was evaluated with main effects and second-degree interaction terms, and the data were analyzed using Standard Least Squares in the Fit Model function. Results determined that lactose provided the blend with a higher initial bulk density; however, mannitol maintained bulk density post-compression. Microcrystalline cellulose improved flow properties of the blend, and a high percentage of intragranular addition provided material with higher bulk density and improved material flow.     Auto-generated transcript...   Speaker Transcript Pranjal Taskar All right. Thank you, Peter. So I'm going to get started now. Hello everyone. Today I'm going to talk about my poster. This poster covers a systematic analysis of tableting, including the effect of formulation and process variables on the final quality attributes of my product. Before delving into the statistical analysis, I wanted to give some background about what exactly we're talking about. What is tableting? Tableting is a pharmaceutical process in which your active ingredient, or active moiety (API), is blended with other excipients to form a free-flowing blend, and this blend is compressed into our final dosage form, which is a tablet. In a lot of situations, there are some active moieties, or APIs as we would call them, that have a low bioavailability, and that could be due to their crystalline nature. They're just too stable, too rigid in their ways. So our site specializes in making this crystalline API into a little bit more soluble, a little bit more reactive amorphous form, which makes it more bioavailable. And when we do that, we fortify this API with a polymer. The intermediate that we form is a tablet intermediate called a spray-dried intermediate, SDI. And this is what we basically use in our tablets as our active intermediate. But when you look at it, it has poor flowability and it's extremely fluffy. So when you have to incorporate this API into your tablet, you need other pharmaceutical processes involved to make it more streamlined, to make the blend more flowable. So this is what we're going to do. 
In this study, we are going to identify our critical quality attributes, the variables that matter, or our dependent variables and then we are going to identify variables that impact our critical quality attributes, which are the composition of that tablet of that blend and then different process processing parameters that we used in us in tableting. Which of these are main effects? Are there any interactions? And then we'll use JMP to identify all of these main effect and interaction variables and try to catch out the tableting process basically. So this was the introduction. Moving on to the methods and objectives. So how do we do this? For this study we looked at a placebo formulation. There is no active product or actor moiety and we used a commonly used spray-dried polymer which is hypromellose acetate succinate. We spray dried it and made it into the fluffy blend that it usually is. And Figure 2 talks about our usual granulation tabulating process. So, what, what we do is basically have our spray-dried intermediate (SDI) blended along with other excipients using this blender. We move on to roller compaction, which is densification of this blend using these there are rollers right here and these rollers move slowly to densify the blend which goes into this hopper and you get ribbons out of the roller compactor. Now what you have done is you have made that fluffy material into densified ribbons and you mill it down using a comil. And you get granules. These granules are more dense and they are a lot better flowing than your API or your SDI. So looking at this entire process, there are a lot of variables that go in there that you need to change and look out for. So what are those variables? This diagram over here will identify different kinds of variables, the independent variable variables that go into the formulation and process. so The first variable would be a bit more base formulation related than the...rather than the process related. So it would talk about different types of ??? excipients that are used. And the ratio of these excipients that I used the percent of SDI loading, or active loading, and in our case, the placebo loading. And then the order of addition and the point of addition at where the SDI, or other excipients are loaded into the formulation. And then sorting process related parameters such as ribbon solid fraction, which basically talks about this equipment, the roller compactor and the speed at which the rollers and the spools move. We have also identified independent variables of our critical quality attributes that we look out for, which is bulk density of our blend, Which we look at before and after granulation and you have labeled it bulk density 1 and 2. Hausner ratio, which is again a ratio that depicts the flow of your blend and we also identify that before and after granulation, labeled as Hausner ratio 1 and 2. And the percent of fines that collect...are collected in the roller compaction process. And this is usually monitored after granulation. So all of these points out to talk about basically our method and why we chose our variables. What we did was we had an I optimal, 19-run custom design looking at all of these independent variables impacting on the dependent variables. And the way we analyze this model or the way we constructed effects, was that we looked at the main effect and the second degree interactions and we analyzed the data using the standard least squares personality in the fit model function. 
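A minimal stand-in for the Standard Least Squares fit just described (main effects plus second-degree interactions from the I-optimal, 19-run design) is sketched below outside JMP. The file and column names are assumptions, and in practice this model is fit in JMP's Fit Model platform.

```python
import pandas as pd
import statsmodels.formula.api as smf

# 19-run custom design exported from JMP; file and column names are assumptions.
runs = pd.read_csv("tableting_doe.csv")

# Main effects plus all two-factor interactions for one response (Bulk Density 2).
formula = ("bulk_density_2 ~ (C(filler_type) + pct_sdi + pct_intragranular"
           " + filler_to_mcc_ratio + ribbon_solid_fraction) ** 2")
fit = smf.ols(formula, data=runs).fit()
print(fit.summary())
```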
So, Identifying the process and the objectives, we will move on to results, but before doing that really quickly, I wanted to look at the JMP window which I have pulled up right now. These different columns are my independent and dependent variables and I'm going to highlight right here, these are the different independent variables that we are going to be looking at. So type of filler, which is the type of inert excipient and we have looked at mannitol and lactose. percent SDI, which is the active or in our case placebo loading, looking at highs and lows away here; and amount intragranular, so the amount of our excipients that we add before the roller compaction versus after the roller compaction and outline here are 75 and 95; and mannitol and lactose, which is a filler to MCC, which is micro course design cellulose ratio. Mannitol lactose are, I would say a little bit more excipient and MCC is more ???, gives more strength to the blend. So we have looked at a ratio of this to see how it impacts our tableting blend overall. And on the right are our responses. Bulk density 1, Hausner ratio 1, which is before granulation. Bulk density 2 and Hausner ratio 2, which is after granulation, and percent fines. So I'm gonna go over here quickly into this window and look at how we created our model, our response variables y, that I just talked about. And then our model effects which are secondary interactions and main effects. Standard least squares. That's what we used and I run the model. This is my effect summary right here and based on this data that we're looking at and prior experience, I'm going to take off the last two effects. Just remove that extra noise and then over here, I have my responses and how the data kind of impacts these responses. It would be just easier if we go down and look at the prediction profiler over here. And how all of these dependent variables are impacted by this. So I think it might just be easier if I pull up my poster and... Alright, so looking at the results over here, what we found out from Figure 3 was that, look, the two fillers lactose had higher bulk density initially, but post ruler compaction, the bulk density two of these fillers dropped and you can see a corresponding increase in the fines. So what we think would have happened is that lactose is more brittle in comparison to mannitol. And this generated all of that attrition and that fines and that impacted the flow, making it less bulky, drop in the bulk density. And the Hausner ratio, a little bit higher with the lactose. So basically, what we're doing is targeting a higher bulk density and we want a lower Hausner because a lower Hausner indicates a better flowing blend. So looking at the data, mannitol had a slight edge over lactose as a filler. And the, the second point would be talking about the solid fraction and overall we saw that there was a slight plateauing effect at around .6 solid fraction. Overall, we see that .7 has the least number of fines, which is why we see a recommended .7 with a maximum and desirability, but the plateau effect in terms of your flow properties (bulk and Hausner) start bottoming out at around .6 and onwards. that having lower SDI in general in the formulation had overall better flow properties. Just because the SDI, it's fluffy and it causes the blend to flow a lot worse. So the design just suggested us to have lower SDI loading. 
a higher amount of that ingredient of that excipient added in an intragranular fashion than an extragranular, just because it improves your bulk, it has a lower Hausner which means that your blend is flowing smoother. We also observed that mannitol to lactose ratio having more of that critical component was more desirable and I see that because overall, the fines have dropped in the presence of having a little bit more of the mannitol lactose component. And that could be the reason why we are seeing this. We also have in the Figure 4, a couple of surface plots of a few interesting trends that I saw. And in Figure 4A, you can see that having a lower SDI loading and having more amount intragranularly resulted in this hotspot right here of a very high Hausner ratio. So when you add a lot of...when you have a low....I'm sorry...have a higher SDI and higher intragranular had an extremely high Hausner ratio. So what this says is basically when you have more of that fluffy material intragranularly, your flow is going to be bad, but you correspond that after granulation, when you again have more more of your excipient intragranularly and you're targeting a solid fraction of about .6 and about, your bulk density improves. So you're basically post granulation, your blend is getting more denser and this is what these two diagrams talk about. So all of the result points basically talk about these things that I discussed right now. Overall, we conclude from our study that in order to optimize this process and maximize desirability for formulations, 1, a higher ratio intragranularly and a lower SDI loading would be a preferable formulation and targeting a solid fraction of around 0.6 would also be beneficial to the formulation. Thank you very much. I would welcome your questions.  
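For reference, the Hausner ratio discussed throughout this talk is the tapped density divided by the bulk density, with Carr's index as the closely related compressibility measure; a tiny sketch with hypothetical densities follows.

```python
def hausner_ratio(bulk_density, tapped_density):
    """Tapped density divided by bulk density; values near 1 indicate better flow."""
    return tapped_density / bulk_density

def carr_index(bulk_density, tapped_density):
    """Carr's compressibility index, in percent."""
    return 100 * (tapped_density - bulk_density) / tapped_density

# Hypothetical densities in g/mL for a pre-granulation blend.
print(round(hausner_ratio(0.45, 0.58), 2), round(carr_index(0.45, 0.58), 1))
```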
Kelci Miclaus, Senior Manager, Advanced Analytics R&D, JMP Life Sciences, SAS

Reporting, tracking and analyzing adverse events occurring to patients is critical in the safety assessment of a clinical trial. More and more, pharmaceutical companies and the regulatory agencies to whom they submit new drug applications are using JMP Clinical to help in this assessment. Typical biometric analysis programming teams may create pages and pages of static tables, listings and figures for medical monitors and reviewers. This leads to inefficiencies when the doctors who understand the medical impact of the occurrence of certain events cannot directly interact with adverse event summaries. Yet even simple count and frequency distributions of adverse events are not always so simple to create. In this presentation we focus on key reports in JMP Clinical that compute adverse event counts, frequencies, incidence, incidence rates and time to event occurrence. The out-of-the-box reports in JMP Clinical make fully dynamic adverse event analysis look easy even while performing complex computations that rely heavily on JMP formulas, data filters, custom-scripted column switchers and virtually joined tables.      Auto-generated transcript...   Speaker Transcript Kelci J. Miclaus Hello and welcome to JMP Discovery Online. Today I'll be talking about summarizing adverse events in clinical trial analysis. I am the Senior Manager of the advanced analytics group for the JMP Life Sciences division here at SAS, and we work heavily with customers using genomic and clinical data in their research. So before I go through the details around using JMP for adverse event analyses, I want to introduce the JMP Clinical software, which our team creates. JMP Clinical is one of a family of products that now includes five official products as well as add-ins, which can extend JMP to really allow you to have as many types of vertical applications or extensions of JMP as you want. My development team supports JMP Genomics and JMP Clinical. JMP Genomics and JMP Clinical are vertical applications, customized and built on top of JMP, that are used for genomic research and clinical trial research, respectively. And today I'll be talking about how we've created reviews and analyses in JMP Clinical for pharmaceutical companies that are doing clinical trial safety and early efficacy analysis. The original purpose of JMP Clinical, and the instigation of this product, actually came through assistance to the FDA, which is a heavy JMP user, and their CDER group, the Center for Drug Evaluation and Research. Their medical reviewers were commonly using JMP to help review drug submissions. And they love it. They're very accomplished with it. One of the things they found, though, is that certain repetitive actions, especially on very standard clinical data, could be pretty painful. An example here is the idea of something called a shift plot, for laboratory measurements, where you compare the trial average of a laboratory value versus the baseline across treatment groups. In order to create this, it took at least eight to 10 steps within the JMP interface: opening up the data, normalizing the data, subsetting it out into baseline versus trial, doing statistics for those groups, merging it back in, then splitting that data by lab test so you could make this type of plot for each lab. And that's not even counting the number of steps within Graph Builder to build it. 
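For readers outside JMP Clinical, the data preparation behind a lab shift plot can be sketched roughly as below: average the on-trial visits per subject and lab test, pair that with the baseline value, and attach the treatment arm from demography. CDISC-style names (USUBJID, LBTEST, LBSTRESN, VISIT, ARM) are used, but the files and columns here are assumptions.

```python
import pandas as pd

# Stacked lab results (one row per subject/test/visit) and one-row-per-subject
# demography; CDISC-style names, treated as assumptions here.
lb = pd.read_csv("lb.csv")
dm = pd.read_csv("dm.csv")

baseline = (lb[lb["VISIT"] == "BASELINE"]
            .groupby(["USUBJID", "LBTEST"])["LBSTRESN"].mean().rename("baseline"))
trial = (lb[lb["VISIT"] != "BASELINE"]
         .groupby(["USUBJID", "LBTEST"])["LBSTRESN"].mean().rename("trial_mean"))

shift = (pd.concat([baseline, trial], axis=1).reset_index()
           .merge(dm[["USUBJID", "ARM"]], on="USUBJID"))
# One scatter of trial_mean vs. baseline per LBTEST, colored by ARM, gives the shift plot.
print(shift.head())
```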
So JMP clearly can do it, but what we wanted to do is solve their pain at this very standard type of clinical data with a one-click lab shift plots, for example. In fact, we wanted to create clinical reviews in our infrastructure that we call the review builder that are one-click standardized reproducible reviews for many of the highly common standard analyses and visualizations that are required or expected in clinical trial research to evaluate drug safety and efficacy. So JMP Clinical has evolved since that first instigation of creating a custom application for a shift plot into a full-service clinical...clinical trial analysis software that covers medical monitoring and clinical data science, medical writing teams, biometrics and biostatistics, as well as data management around the study data involved with clinical trial collection. This goes for both safety and efficacy but also operational integrity or operational anomalies that might be found in the collection of clinical data as well. Some of the key features around JMP Clinical that we find to be especially useful for those that are using the JMP interface for any types of analyses are things like virtual joins. So we have an idea of a global review subject filter, which I'll show you during the demonstrations for adverse events, that really allow you to integrate and link the demography information or the demographics about our subjects on a clinical trial to all of the clinical domain data that's collected. And this architecture, which is enabled by virtual joins within the JMP interface with row state synchronization, allow you to really have instantaneous interactive reviews with very little to no data manipulation across all the types of analyses you might be doing in a clinical trial data analysis. Another new feature we've added to the software that also leverages some of the power of the JMP data filter, as well as creation of JMP indicator columns, is this ability to, while you're interactively reviewing clinical trial data, find interesting signals that say, in this example, the screenshot shown is subjects that had a serious adverse event while on the clinical trial, find those interesting signals, and quite immediately, create an indicator flag that is stored in metadata with your study in JMP Clinical that's available for all other types of analyses you might do. So you can say, I want to look now at my laboratory results for patients that had a serious adverse event versus those that didn't to see if there's also anomalies that might be related to an adverse event severity occurrence. Another feature that I'll also be showing with JMP Cclinical and the demonstration around adverse event analysis is the JMP Clinical API that we've built into the system. One of the most difficult things of providing and creating and developing a vertical application that has out-of-the box one-click reports is that you get 90% of the way there and then the customer might say, oh, well, I really wanted to tweak it, or I really wanted to look at it this way, or I need to change the way the data view shows up. So one of the things we've been working hard on in our development team is using JMP scripting JSL to surface an API into the clinical review, to have control over the objects and the displays and the dashboards and the analyses and even the data sets that go into our clinical reviews. So I'll also be showing some of that in the adverse event analysis. 
So let's back up a little bit and go into the meat of adverse events and clinical trials now that we have an overview of JMP Clinical. There's really two kind of key ways of thinking of this. There's that safety review aspect of a clinical trial where that's typically counts and percentages of the adverse events that might occur. And a lot of the medical doctors, monitors, or reviewers often use this data to understand medical anomalies, you know, a certain adverse event starts showing up more commonly, with one of the treatments that could have medical implications. There's also the statistical signal detection, the idea of statistically assessing our adverse events occurring at an unusual rate in one of the treatment groups versus the other. So here, for example, is a traditional static table that you see in many of the types of research or submissions or communications around a clinical trial adverse event analysis. Basically it's a static table with counts percents and if it is more statistically oriented, you'll see things like confidence intervals and p values as well around things like odds ratios or a relative risks or rate differences. Another way of viewing this can also be visually instead of with a tabular format so signal detection, looking at say odds ratio or the, the risk difference might use the Graph Builder in this case to show the results of a statistical analysis of the incidence of certain adverse events and how they differ between treatment groups, for example. So those are two examples. And in fact, from the work we've done and the customers we've worked with around how they view and have to analyze adverse events, the JMP Clinical system now offers several common adverse event analyses from simple counts and percentages to incidence rates or occurrences into statistical metrics such as risk difference, relative risk, odds ratio, including some exposure adjusted time to event analyses. We can also get a lot more complex with the types of models we fit and really go into mixed or Bayesian models as well in finding certain signals with our adverse event differences. And also we use this data heavily in reviewing just the medical data in either a medical writing narrative or patient profile. So now I'm going to jump right into JMP Clinical with a review that I've built around many of these common analyses. So one of the things you'll notice about JMP Clinical is it doesn't exactly look like JMP, but it is. It's a combined integrated solution that has a lot of custom JSL scripting to build our own types of interfaces. So our starter window here lays out studies, reviews, and settings, for example. And I already have a review built here that is using our example nicardapine data. This is data that's shipped with the product. It's also available in the JMP sample library. It's a real clinical trial, looking at subarachnoid hemorrhage. It was with about 900 patients. And so what this first tab of our review is looking at is just the distribution of demographic features of those patients, how many were males versus females, their race breakdowns, what treatment group they were given, their sites that the data was taken from, etc. So this is very common, just as the first step of understanding your clinical data for a clinical trial. You'll notice here we have a report navigator that shows the rest of the types of analyses that are available to us in this built review. 
I'm going to walk through each of these tabs, just quickly to show you all the different flavors of ways we can look at adverse events with the clinical trial data set. Now, the typical way data is collected with clinical trials is an international standard called CDISC format, which typically means that we have a very stacked data set format. Here we can see it, where we have multiple records for each subject indicating the different adverse events that might have occurred over time. This data is going to be paired with the demography data, which is one row per each subject as seen here in this demographic. So we have about 900 patients and you'll see in this first report, we have about 5,000 or 5,500 records of different adverse events that occurred. So this is probably the most commonly used reports by many of the medical monitors and medical reviewers that are assessing adverse event signals. What we have here is basically a dashboard that combines a Graph Builder counts plot with an accompanying table, as they are used to seeing these kind of tables. Now the real value of JMP is its interactivity and that dynamic link directly to your data so that you can select anywhere in the data and see it in both places. Or more powerfully, you can control your views with column switchers. Now here we can actually switch from looking at distribution of treatments to sex versus race. You'll notice with race, if we remember, we had quite a few that were white in this study, so this isn't a great plot when we look at it by percent or by counts, so we might normalize and show percents instead. And we can also just decide to look at the overall holistic counts of adverse events as well. Another part of using this as this column switcher is the ability to you know categorize what kind of events those were. Was it a serious adverse event? What was the severity of it? Was the outcome that they are when they recovered from it or not? What was causing it? Was it related to study drug? All of these are questions that medical reviews will often ask to find interesting or anomalous signals with adverse events in their occurrences. Now one of the things you might have already noticed in this dashboard is that I have a control group as column switcher here that's actually controlling both my graph and my table. So when I switched to severity, this table switches as well. This was done with a lot of custom JSL scripting specifically to our purposes, but I'll tell you a secret, in 16 the developer for column switcher is going to allow us to have this type of flexibility so you can tie multiple platform objects into the same columns switcher to drive a complex analysis. I'm going to come back to this occurrence plot, even though it looks simple. Here's another instance of it that's actually looking at overall occurrence where certain adverse events might have occurred multiple times to the same subject. I'm going to come back to these but kind of quickly go through the rest of the analyses and these reviews before coming back to some of the complexities of the simple graph builder and tabulate distribution reports. The next section in our review here is an adverse event incident screen. So here we're making that progression from just looking at counts and frequencies or possibly incidence rates into more statistical framework of testing for the difference in incidence of certain adverse events in one treatment group for another. And here we are representing that with a volcano plot. 
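The count-and-percent dashboard just described hinges on the numerator coming from the stacked adverse event records while the denominator comes from the one-row-per-subject demography table. A rough sketch of that calculation outside JMP, with assumed CDISC-style column names:

```python
import pandas as pd

# Stacked adverse events (one row per event) and demography (one row per subject);
# CDISC-style column names, treated as assumptions.
ae = pd.read_csv("ae.csv")
dm = pd.read_csv("dm.csv")

n_per_arm = dm.groupby("ARM")["USUBJID"].nunique()              # denominator per treatment

counts = (ae.merge(dm[["USUBJID", "ARM"]], on="USUBJID")
            .groupby(["AEDECOD", "ARM"])["USUBJID"].nunique()   # subjects, not events
            .unstack(fill_value=0))
percents = 100 * counts.div(n_per_arm, axis=1)
print(percents.sort_values(percents.columns[0], ascending=False).head(10))
```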
So we can see actually that phlebitis, hypotension and isothenuria occur much more often in our treatment group, those that were treated with nicardipine, versus those on placebo. So we can actually select those and drill into a very common view for adverse events, which is our relative risk for a cell plot as well, which is lots of lot of times still easier to read when you're only looking at those interesting signals that have possibly clinical or statistical significant differences. Sometimes clinical trials take a long time. Sometimes they're on them for a few weeks, like this study was only a few weeks, but sometimes they're on them for years. So sometimes it's interesting to think of adverse event incidents differences as the trial progresses. We have this capability as well within the incidence screen report where you can actually chunk up the study day, study days into sections to see how the incidents of adverse events change over time. And a good way to demonstrate that might be with an exploding volcano plot here that shows how those signals change across the progression of the study. So another powerful idea with this, especially as you have longer clinical trials or more complex clinical trials, is instead of looking at just direct incidence among subjects you can consider their time to event or their exposure adjusted rate at which those adverse events are occurring. And that's what we offer within our time to event analyses, which once again, shown in a volcano plot looking here using a Kaplan Meier test at differences in the time to event of certain events that occur on a clinical trial. One of the nice things here is that you can select these events and drill down into the JMP survival platform to get the full details for each of the adverse events that had perhaps different time to event outcomes between the treatment groups. Another flavor of time to event is often called an incidence density ratio, which is the idea of exposure adjusted incidence density. Basically the difference here is instead of using some of the more traditional proportional hazards or Kaplan Meier analyses, this is more like a a poisson style distribution that's adjusted for how long they've actually been exposed to a drug. And once again here we can look at those top signals and drill down to the analogous report within JMP using a generalized linear model for that specific type of model with an adverse event signal detection. And we actually even offer some really complex Bayesian analyses. So one of the things with with this type of data is typically adverse events exist within certain body systems or classes...organ classes. And so there is a lot of posts...or prior knowledge that we can impose into these models. And so some of our customers, their biometrics teams decide to use pretty sophisticated models when looking at their adverse events. So, so far we've walked from what I would say consider pretty simplistic distribution views of the data into distributions and just count plots of adverse events into very complex statistical analyses. I'm going to come back now, back to what is that considered simple count and frequency information and I want to spend some time here showing the power of JMP interactivity that we have. As you recall one of the differences here is that this table is a stacked table that has all of the occurrences of our adverse events for each subject, and our demography table, which we know we have 900 subjects, is separate. 
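A conceptual sketch of the incidence-screen idea (not JMP Clinical's implementation): for each adverse event term, compare the proportion of subjects with the event between two arms and attach a simple Fisher exact p-value, which is what a volcano plot of risk difference versus -log10(p) summarizes. Column names are assumptions.

```python
import pandas as pd
from scipy.stats import fisher_exact

def incidence_screen(ae, dm, arm_a, arm_b):
    """Risk difference, relative risk and Fisher p-value per AE term (assumed columns)."""
    n = dm.groupby("ARM")["USUBJID"].nunique()
    merged = ae.merge(dm[["USUBJID", "ARM"]], on="USUBJID")
    rows = []
    for term, grp in merged.groupby("AEDECOD"):
        a = grp.loc[grp["ARM"] == arm_a, "USUBJID"].nunique()
        b = grp.loc[grp["ARM"] == arm_b, "USUBJID"].nunique()
        rd = a / n[arm_a] - b / n[arm_b]
        rr = (a / n[arm_a]) / (b / n[arm_b]) if b else float("inf")
        _, p = fisher_exact([[a, n[arm_a] - a], [b, n[arm_b] - b]])
        rows.append({"AEDECOD": term, "risk_diff": rd, "rel_risk": rr, "p_value": p})
    return pd.DataFrame(rows)

# Plotting -log10(p_value) against risk_diff (or log2 rel_risk) gives the volcano view.
```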
So what we wanted was not a static graph, like we have here, or what we would have in a typical report in a PDF form, but we wanted to be able to interactively explore our data and look at subgroups of our data and see how those percentages would change. Now, the difficulty is that the percent calculation needs to come from the subject count in a different table. So we've actually done this by formula...like creating column formulas to dynamically control recalculation of percents upon selection, either within categorizing events or, more powerfully, using our review subject filter tool. So here for example, we're looking at all subjects by treatment. Perhaps serious versus not serious adverse events, but we can use this global data filter which affects each of the subject level reports in our review and instantaneously change our demography groups and change our percentages to be interactive to this type of subgroup exploration. So here, now we can actually subgroup down to white females and see what their adverse event percentage and talents are, or perhaps you want to go more granular and understand for each site, how their data is changing for different sites. So what we really have here is instead of a submission package or a clinical analysis where the biometrics team hands 70 different plots and tables to the medical reviewer to go through, sift through, they have the power to create hundreds of different tables and different subsets and different graphics, all in one interface. In fact, you can really filter down into those interesting categories. So if they were looking say at serious adverse events and they wanted to know serious adverse events that were related to drug treatment very quickly, now we got down to a very small subset from our 900 patients to about nine patients that experienced serious adverse events that were considered related to the treatment. So as a medical reviewer this is a place where Ithen might want to understand all of the clinical details about these patients. And very quickly, I can use one of our action buttons from the report to drill down to what's called a kind of a complete patient profile. So here we see all of the information now, instead of at a summary level, at a subject individual level of everything that occurred to this patient over time, including when they had serious adverse events occur and their laboratory or vital measurements that were taken alongside of that. One of the other main uses of our JMP Clinical system along with this medical review, medical monitor is medical writing teams. So another way of looking at this instead of visually in a graphic or even in a table which these are patient profile tables, you can actually go up here and generate an automated narrative. So here we're going to actually launch to our adverse event narrative generation. Again, one of the benefits and values of our JMP Clinical being a vertical application relying on standard data is that we get to know all the data and the way it is formatted up up up front, just by being pointed to the study. So what we can do here is actually run this narrative that is going to write us the actual story of each of those adverse events that occurred. And this is going to open up a Word doc that has all of the details for this subject, their demography, their medical history, and then each of the adverse events and the outcomes or other issues around those adverse events. 
And we can do this for one patient at a time or we can actually even do this for all 900 patients at a time and include more complex details like laboratory measurements, vitals, either a baseline or before. And so, medical reviewers find this incredibly valuable be able to standardly take data sources and not make errors in a data transfer from a numeric table to an actual narrative. So I think just with that you can really see some of the power of these distribution views, these count plots that allow you to drill into very granular levels of the data. This ability to use subject filters to look either within the entire population of your patients on a clinical trial or within relevant subgroups that you may have found. Now one thing about the way our global filter works through our virtual joins is this is only information that's typically showing the information about the demography. One of the other custom tools that we've scripted into this system is that ability to say, select all subjects with a serious adverse event. And we can either derive a population flag and then use that in further analyses or we can even throw that subject's filter set to our global filter and now we're only looking at serious...at a subject who had a serious adverse event, which was about...almost 300 patients on the clinical trial had a serious adverse event. Now, even this report, you'll see is actually filtered. So the second report is a different type of aspect of a distribution of adverse events that was new in our latest version which is incidence rates. And here, the idea is instead of normalizing or dividing to calculate a percent by the number of subjects who had an event. If you are going with ongoing trials or long trials or study trials across different countries that have different timing startup times, you might want to actually look at the rate at which adverse events occur. And so that's what this is calculating. So in this case, we're actually subset down to any subjects that had a serious adverse event. And we can see the rate of occurrence in patient years. So for example, this very first one, see, has about a rate of 86 occurrences in every 10 patient years on placebo versus 71 occurrences In nicardipine. So this was actually one which this was to treat subarachnoid hemorrhage, intracranial pressure increasing likely would happen if you're not being treated with an active drug. These percents are also completely dynamic, these these incidence rates. So once again, these are all being done by JMP formulas that feed into the table automatically that respect different populations as they're selected by this global filter. So we can look just within say the USA and see the rates and how they change, including the normalized patient years based on the patients that are from just the USA, for example. So even though these reports look pretty simple, the complexity of JSL coding that goes beyond building this into a dashboard is basically what our team does all day. We try to do this so that you have a dashboard that helps you explore the data as you know, easily without all of these manipulations that could get very complex. Now the last thing I wanted to show is the idea of this custom report or customized report. So this is a great place to show it too, because we're looking here at adverse events incidence rates. And so we're looking by each event. And we have the count, or you can also change that to that incidence rate of how often it occurs by patient year. 
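The exposure-adjusted rate described here (occurrences per 10 patient-years) can be sketched as below; the exposure-days column and the adverse event term are hypothetical, and JMP Clinical's report keeps these rates dynamic as the subject filter changes.

```python
import pandas as pd

def rate_per_10_patient_years(n_events, exposure_days):
    """Events per 10 patient-years of total exposure."""
    return 10 * n_events / (exposure_days.sum() / 365.25)

ae = pd.read_csv("ae.csv")
dm = pd.read_csv("dm.csv")          # assumed to carry days on study per subject

events = ae[ae["AEDECOD"] == "INTRACRANIAL PRESSURE INCREASED"]
for arm, grp in dm.groupby("ARM"):
    n = len(events.merge(grp[["USUBJID"]], on="USUBJID"))   # occurrences in this arm
    print(arm, round(rate_per_10_patient_years(n, grp["EXPOSURE_DAYS"]), 1))
```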
And then an alternative view might be really wanting to see these occurrences of adverse events across time. And so I want to show that really quick with our clinical API. So the data table here is fully available to you. One of the things I need to do first off is just create a numeric date variable, which we have a little widget for doing that in the data table, and I'm going to turn that into a numeric date. Now you'll notice now this has a new column at the end of the numeric start date time of the adverse event. You'll also notice here is where all that power comes from the formulas. These are all actually formulas that are dynamically regenerated based on populations for creating these views. So now that we have a numeric date with this data, now we might want to augment this analysis to include a new type of plot. And I have a script to do that. One of the things I'm going to do right off the bat is just create a couple extra columns in our data set for month and year. And then this next bit of JSL is our clinical API calls. And I'm not going to go into the details of this except for that it's a way of hooking ourselves into the clinical review and gaining access to the sections. So when I run this code, it's actually going to insert a new section into my clinical review. And here now, I have a new view of looking at the adverse events as they occurred across year by month for all of the subjects in my clinical trial. So one of the powers, again, even with this custom view is that this table by being still virtually joined to our main group can still fully respond to that virtual join global subject filter. And so just with a little bit of custom API JSL code, we can take these very standard out-of-the-box reports and customize them with our own types of analyses as well. So I know that was quite a lot of an overview of both JMP Clinical but, as well as the types of clinical adverse event analyses that the system can do and that are common for those working in the drug industry or pharma industry for clinical trials, but I hope you found this section valuable and interesting even if you don't work in the pharma area. One of the best examples of what JMP Clinical is is just an extreme extension and the power of JSL to create an incredibly custom applications. So maybe you aren't working with adverse events, but you see some things here that can inspire you to create custom dashboards or custom add ins for your own types of analyses within JMP. Thank you.  
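Outside the JMP Clinical API, the same month-by-year view of adverse events can be sketched by parsing the event start date and counting events per calendar month; AESTDTC is the usual CDISC start-date variable, but the file and columns here are assumptions.

```python
import pandas as pd

ae = pd.read_csv("ae.csv")                                    # assumed AE table
ae["start"] = pd.to_datetime(ae["AESTDTC"], errors="coerce")  # AE start date, CDISC-style
ae = ae.dropna(subset=["start"])
ae["month"] = ae["start"].dt.to_period("M")

by_month = ae.groupby("month").size()
print(by_month)   # bar-chart this for the year-by-month view added to the review
```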
Bill Worley, JMP Senior Global Enablement Engineer, SAS

In the recent past, partial least squares has been used to build predictive models for spectral data. A newer approach using Functional Data Explorer and covariate design of experiments will be shown that allows fewer spectra to be used in the development of a good predictive model. This method uses one-fourth to one-third of the data that would otherwise be used to build a predictive model based on spectral data. Newer multivariate platforms like Model Driven Multivariate Control Chart (MDMCC) will also be shown as ways to enhance spectral data analysis.     Auto-generated transcript...   Speaker Transcript Bill Worley Hello everyone, my name is Bill Worley and today we're going to be talking about analyzing spectral data. I'm going to talk about a few different ways to do it. One is using Functional Data Explorer and design of experiments to help build better predictive models for your spectral data. The data set I'm going to be using is actually out of a JMP book, Discovering Partial Least Squares. I will post this on our Discovery website, or community page, so everything will be out there for you to use. First and foremost, I'll talk about the different things we're going to look at. Traditionally, when you're looking at spectral data, you're going to use partial least squares to analyze it, and that's fine; it really works very well. But there are some newer approaches to try out. One is using principal components, then a covariate design of experiments, and then partial least squares to analyze the data. And an even newer, more novel approach uses Functional Data Explorer, then the covariate design of experiments, partial least squares, and an opportunity to use something like generalized regression or neural networks. Okay, so I'm going to go through a PowerPoint first to give you a little bit of background. And again, we're going to be talking about using Functional Data Explorer and design of experiments to build better predictive models for your spectral data. A little bit of history: the spectral data approach is based on a QSAR-like material selection approach that was developed previously by gentlemen named Silvio Michio and Cy Wegman. So I took it and looked for opportunities to apply this approach to other highly correlated data. The first thing that really came out was spectral data, which is truly highly correlated, almost autocorrelated data, where we can use this approach. The data that I've got is, again, continuous response data for octane rating, but I've since added mass spectral data and near-IR data for categorical responses as well. This is where we're going to go: we're going to build these models and compare them. This is the traditional PLS approach, this is the newer approach using principal components, and then the final approach here is using Functional Data Explorer, and you can see that for the most part we really don't lose anything with these models as we build them. As a matter of fact, the slide's a little bit older; the models that I've built more recently are actually a little bit better. We'll show you that when we get there. So again, this is a twist on analyzing your spectral data; partial least squares has been used in the past. 
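The traditional approach named in the abstract, partial least squares on the full spectra, can be sketched briefly with scikit-learn; the CSV and column names stand in for the gasoline NIR table that ships with JMP and are assumptions here.

```python
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

gas = pd.read_csv("gasoline_nir.csv")      # file and column names are assumptions
X = gas.filter(like="NIR")                 # the ~400 wavelength columns
y = gas["octane"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
pls = PLSRegression(n_components=5).fit(X_train, y_train)
print("holdout R^2:", round(pls.score(X_test, y_test), 3))
```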
We're going to be applying several multivariate techniques to show how to build good predictive models with far fewer conditions. So when I say far fewer conditions, in this case, I mean less spectra. So you'll see where that comes from. And why would you want to do this analysis differently? Well, first and foremost, there's a huge time savings. You get as good or better predictive model with 25% of the data or less. It's your choice, and then you can use penalize regression to help determine the important factors, making for simpler models. And when I say important factors, I mean important way wavelengths and again, you'll see that when we get there. This is looking at 60 gasoline near IR spectra overlay and might, as we all know, that would be pretty hard to build a predictive model to determine the difference between the different spectra for their octane rating. So what we're going to do is use JMP to help get us there. And most of this kit that I'm going to be showing can be done in regular JMP but what I'm going to be showing today is almost all JMP Pro. Okay, just so you know. So how it's done. I'm not going to read you the different steps. I'll let you do that if you so choose. But there are two important ones. First is number two, when you want to identify these prominent dimensions in the spectra. And that's where we're going to use functional principal components from the functional data explorer. It's not used in the traditional sense, because we're not going to build models using these things. We're going to use these functional principal components to help us pick which spectra we're going to analyze and then we're going to use those in a custom design to help us select those different spectra. And last but not least, this number seven here is use this sustainable model to determine the outcome for all future samples. This that's a little bit of it. I'm a chemist by training by education and an analytical chemist at that and overall I don't know how well a calibrated or instruments hold their calibration anymore. So this holds true, the model that you build will hold true as long as the instrument is calibrated and good to go from that respect. Okay. Bill Worley So some important concepts. And again, I'm not going to read these. I just want you to know that will be looking at partial least squares, principal components, and functional data analysis. Functional data analysis, really this is something newer in JMP that a lot of our, you won't see in other places. It helps you analyze data, providing information about curved surfaces or anything over a continuum. And that's taken from Wikipedia. Okay. A newer platform that I'm going to show that is in regular JMP is multivariate control, model driven multivariate control charts. And this allows you to see where the differences in some of the spectra and how you can pull those apart and maybe dive a little deeper into where you're seeing these differences in the spectra of what they really are. So with that, let's go to the demo. And you go to my home window. So, The data set. This is the again the gasoline data set where we're looking at octane rating as, you know, how do we determine octane rating for different gasolines. Right. So where do they come from, how do we determine that. You really don't need this value until you do this preprocessing or setup that I'm going to be showing you. So we'll get there, we know those numbers are there. We'll get there as we need to. But let's look at the data first. 
And that's always something you want to do as you want to do anyway. Whenever you get a new data set, you want to look at the data and see where things fall out. So let's go to Graph Builder. We're going to pull in octane and the wavelengths. Alright, so we're going to drop those down on our x axis. And before I do that, let me close that out. I want to color and mark by the octane reading and I'm going to change the scale. The green to black to red Say, Okay, let's see that colors in the data set. Let's go back to Graph Builder. And we'll pull this back in. Drop those. Little more colorful there. Now we've got these there, it's really hard to tell anything at all. I mean, we wouldn't know anything, you know. What we saw before the overlay was bad enough, but we're looking at, you know, more jumbled grouping of points, but let's turn on the parallel plots. Alright, so again, that kind of pulls things in and we can see again a jumbled mess, but we've got another tool that will help us investigate the data a little further. And that's the local data filter. So we're going to go there and we're going to pull in sample number and octane rating. We'll add those you transfer this out a little bit so it's not so See that. So now we could actually go into the single spectra. See over here in the green, so we can dive into those separately. I'm going to take that back off. Alright, so that's grouping and that we could actually pull this in and start looking at the different octane ratings right and see which spectra associated with the higher octane ratings are the lower. It's your choice. It's just gives you a tool to investigate the data. Do you need to do any more pre rocessing to get the spectra in line with each other or setup where you get you can see the different groupings better. Okay, so that's looking at Graph Builder. And I'm going to clear out the row states here. From here, we want to better understand what's going on with the data. Like I said, this is We're looking at spectral data and it's very highly colinear or multi colinear and this is something you may want to prove to yourself. So let's go to analyze multivariate methods, multi variant. And we're going to select all our wavelengths Right. And fairly quickly, we get back that You know, we get these Pearson correlation coefficients. And they're all closer to one, right, for the most part in these early early wavelengths. And that's just telling us that things are very highly correlated. So, you know, they'll figure that's that's the way it is. And, you know, we need to deal with that as we go forward. Okay, so we're looking at the data, we're set up and now we can look at another piece of information and this is newer in JMP 15 It's also in regular JMP. So we're going to go to Analyze Quality and process, and model driven multivariate control chart. Okay, so again we pull in all our wavelengths Okay, say Okay. And now we're looking at the data in a different way. This is basically for every spectra, all 400 wavelengths. And now we can see where some of these are little bit out of what would be considered control. All right, for all 400 wavelengths. That's the red line. So if we highlight those Right, I know if I right click in there and say contribution plots are selected samples. Now I can see Differences in the spectra compared to some, you know, as they're compared to the other spectra in the overall data set, we can see which parts of the spectra that are considered more or less out of control. 
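A rough stand-in for what the Model Driven Multivariate Control Chart is doing with these spectra: fit a few principal components and compute a Hotelling-style T² per sample from the variance-scaled scores, flagging spectra that sit far from the bulk of the data. This is a conceptual sketch, not JMP's implementation, and it reuses the assumed gasoline file from the earlier sketch.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

gas = pd.read_csv("gasoline_nir.csv")          # same assumed file as above
X = gas.filter(like="NIR").to_numpy()

pca = PCA(n_components=4).fit(X)
scores = pca.transform(X)

# Hotelling-style T^2: sum of squared scores scaled by each component's variance.
t2 = np.sum(scores**2 / pca.explained_variance_, axis=1)
print("spectra with the largest T^2:", np.argsort(t2)[-3:] + 1)   # 1-based sample numbers
```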
And if I can get this to work. We can get there and then for that particular wavelength, we can see those three samples, you know, are out of spec, more or less out of spec, based on this control chart for the compared to the other samples. Alright, so again, that allows you to dive deeper and you know, tells you what group it is, again, this is all about learning more about your data, which ones are good, which ones are bad or which ones may be different. Right. So it's an added tool to help you better understand where you may be seeing differences. Okay. So with that, we've got things pretty much set up right and we want to go into the the analysis part. So as we go into Analyze, we have to set things up so we get what we want, when we want and how we want to analyze it. So we're going to go to Analyze and this is where we're going to select the samples. This is where we're going to use functional data explorer to help us select the samples. Alright, so go to Analyze, functional data explorer. And this is a JMP Pro 15 thing. So we're going to use instead of stack data, we're going to use rows as functions. And again, we're going to use select all of our wavelengths We're going to use octane as our supplementaries variable, right. And then the sample number is our ID function. Right, so we've got it set up, ready to go. And now for looking at the data. Remember how we had everything lined up and we were looking at it before. So this is all the data overlaid again. And if we needed to do some more preprocessing, we can do that over here in the transform area where we could actually center the data and standardize it For the most part, this data is fairly clean. We don't have to do that. And we're going to go ahead and do the analysis from here. Okay, so b-splines, p-splines and fourier basis. These clients will give you a decent model and a fairly simple model. Spectral data is again so highly correlated and the data, the wavelengths are so close, we want to understand where we're seeing differences on a much closer basis as opposed to something like a b-spline, which would spread the knots out. All right. We want to cap the knots as close together as possible to help better understand what's going on. So this takes a few seconds to run, but I'm going to click p-splines And it gives you an idea of, you know, so it's going to take, I don't know, 15 or 20 seconds to run, but it's going to fit all those models. And it's almost done. Alright, so now we fit those models. Now if I had run a b-spline, it would probably would have been around 20 knots and most We're looking at 200 knots. So it's basically taking that those 400 wavelengths and split them into virtually groups of two, right, so it's looking at individual, like individual groups of two And this is the best overall model based on these AICc BIC scores negative log likelihood. We could go backwards. It's a linear function, we could go backwards and use a simpler model, if we want. We could also go forward and see how many more knots would take to get you an even better model. I can tell you from experience. That around 203 to 204 knots is as good as it gets. And there's no reason to really go that far, you know, for that little bit of effort, or a little bit of improvement that we would get SO fit those now. The, the, you can see we fit all the models are all the spectra and let's go down to our functional data explorer or functional principal components. This is the mean spectra right here. 
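As a loose analogue of fitting basis functions to each spectrum before extracting functional principal components, the sketch below smooths one spectrum with a smoothing B-spline from SciPy; the smoothing factor plays a role only loosely comparable to the knot count chosen in FDE, and the file name is again an assumption.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import splrep, splev

gas = pd.read_csv("gasoline_nir.csv")                 # assumed file, as above
spectra = gas.filter(like="NIR").to_numpy()
x = np.arange(spectra.shape[1])                       # wavelength index as the argument

# Smoothing B-spline fit to the first spectrum; s is an arbitrary smoothing factor here.
tck = splrep(x, spectra[0], s=1e-4)
smoothed = splev(x, tck)
print("max absolute fit error:", np.abs(smoothed - spectra[0]).max())
```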
These are the functional principal components from that data. Each one of these eigenvalues, or functional principal components, explains some portion of the variation that we're seeing. You can see the first functional principal component explains about 50% of the variation, and so on, and it's additive: by the second row you're at about 72%, and by number four we get to our rule of thumb or cutoff point, which is that if we can explain 85% of the variation, that's our cutoff for the number of principal components we want to grab and build our DOE from. So we're going to go with four. Some other things you can look at are the score plots. It looks like spectrum number five is kind of out there, and if you really wanted to look at that one you could, so you can pop that out or pin it to the graph. But you get an idea of which spectra are out there and what they might look like; in this case we see some differences for 15 and 5, and remember 41 was kind of out there too. But we can see some other things. The functional principal component profiler is down here. Now if you wanted to make changes or better understand things, you would ask, as I move my functional principal components around, what do I do, how do I change my data? Well, it's hard to really visualize that. So something that's newer in JMP Pro 15 is this functional DOE analysis, and that's why I added that octane rating as our supplementary variable. So I'm going to go down here and minimize some of these things a little bit. Down here in this FDOE profiler, we've actually done some generalized regression; it's built in. As we look through these different wavelengths, we can see what happens with the octane as we get to the different wavelengths, so a particular wavelength may be different for the different octane ratings, and that's what you're looking for. You want to see differences. So where can we see the biggest differences? I don't know if you saw that happen out here, but right here on the end, at the higher wavelengths, we're seeing some significant changes. So I'm going to go out here, and as you can see the curve is bowed a little bit there, and as I go back to the lower wavelengths this curve starts to flatten out, or actually gets a little steeper; it's not as flat as it was at the higher octane ratings. OK. So again, this is all about investigating the data. But what we're going to do is go ahead and save those functional principal components, and we'll do that through our function summaries right here. We need to customize that: I'm going to deselect all the summaries, and I'm going to put four in there because that's the number I want to save. And just as a watch-out, make sure you say OK and save; if you just say OK, it's fine, it just won't get you where you want to be. So we're going to say OK and save, and we get a new table with our functional principal component scores in there, all four of them, for the different samples and the different octane ratings. Now what we have to do is get that information back to our main data table. You could do this through a virtual join; what I actually did is copy these over, and there's a way to do that fairly simply. So I need to go back over to my main data table.
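The 85% rule of thumb is easy to mimic outside JMP: run a PCA on the smoothed curves and keep the smallest number of components whose cumulative explained variance reaches 85%. This sketch uses random stand-in data in place of the smoothed spectra and is not the FDE algorithm itself.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
smoothed = rng.normal(size=(60, 400))          # stand-in for 60 smoothed spectra x 400 wavelengths

pca = PCA().fit(smoothed)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_fpc = int(np.searchsorted(cumvar, 0.85) + 1)         # smallest count reaching 85%
print(f"keeping {n_fpc} components ({cumvar[n_fpc - 1]:.1%} of variation)")

fpc_scores = PCA(n_components=n_fpc).fit_transform(smoothed)   # one row of scores per sample
# These scores play the role of the saved FPC columns used as covariates in the DOE step.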
And if this works for me, I don't need to keep this table; I just want to get these scores over to the main table. There we go. You just grab them and drag them over to your main data table and drop them. I've already done it, so I'm not going to drop them in there, but that's one way to get the data over there quickly. Let me minimize that for now. So this is the data; it's right there. I've copied it over, so we've got the scores. Now we're going to do what I consider the most important step: we're going to pick the samples that we're going to analyze. This is going to get you down to a much smaller number of samples to build your models on, and this is where we're going to use design of experiments to get us there. So we select DOE, Custom Design. Don't worry about the response right now; we're going to add factors, and specifically covariate factors, and you'll see in a minute why we're doing this. So I add covariates, and you have to select what your covariate factors are; we're going to choose the functional principal components and say OK. We're going to look in this functional principal component space to figure out which samples to analyze to build our model. So I select Continue. Right now it's saying select all 60 and build the model from there. Well, we want to take that down to a much smaller number; we're going to use 15. That allows us to select a smaller number of spectra. We don't have to run as many, but you have them all, and you can select from them. So I say Make Design, and while this is building... alright, we don't need this, I'm going to get rid of it; that's just some information. What we see now is that in our data table 15 rows have been selected, highlighted in blue. I'm going to right-click on that blue area and put some markers on them, a star, and I'm actually going to color those as well. So let's take those blue. Okay. And before I forget, what you do now is take these and do a table subset. So Tables, Subset, with selected rows and all columns, say OK, and this is where we're going to be doing our modeling. But before I go there, let's go back to our main data table and go to Analyze, Multivariate Methods, Multivariate, and instead of using the wavelengths, I'm going to use the functional principal components. Put those in as our Y, say OK, and now look at this: before, we had almost complete correlation for a lot of the wavelengths, and we've taken that out of play. And if you look at the space now, at the markers, the stars, we're pushing things out to the corners of our four-dimensional space, but we're also looking through the center of the space as well. So this is more or less a space-filling design, but it's spreading the points out to the point where we're hopefully going to get a decent model out of it. Okay. So we've got that, and I need to pull up my data table again. Pull this one up. These are the samples we're interested in, the ones we're going to build our model on, and I'm going to slide back over here to the beginning.
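JMP's custom designer handles the covariate-factor selection for you; as a rough stand-in for the idea, here is a greedy D-optimality-style exchange in Python that picks 15 rows whose FPC scores spread well through the score space. The score matrix is a random stand-in and the algorithm is only illustrative, not what JMP uses.

import numpy as np

def greedy_d_optimal(scores, n_pick=15, n_pass=20, seed=0):
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    X = np.column_stack([np.ones(n), scores])        # intercept plus FPC scores

    def logdet(idx):
        sign, val = np.linalg.slogdet(X[idx].T @ X[idx])
        return val if sign > 0 else -np.inf

    chosen = list(rng.choice(n, n_pick, replace=False))
    best = logdet(chosen)
    for _ in range(n_pass):
        improved = False
        for i in range(n_pick):
            for cand in range(n):
                if cand in chosen:
                    continue
                trial = chosen.copy()
                trial[i] = cand                      # try swapping one selected row
                val = logdet(trial)
                if val > best:
                    chosen, best, improved = trial, val, True
        if not improved:
            break
    return sorted(chosen)

fpc_scores = np.random.default_rng(3).normal(size=(60, 4))   # stand-in for the saved FPC scores
picked_rows = greedy_d_optimal(fpc_scores)
print("rows selected for the subset table:", picked_rows)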
So these are the rows that were selected, and now we're going to go to Analyze, Fit Model. Octane is what we're looking for, so that's our response, and we're going to use all of our wavelengths. The next thing I'm going to show you is a JMP Pro feature, where you select partial least squares in the Fit Model platform. You can also do the same partial least squares analysis in regular JMP, just so you know, but we're setting it up here in case you wanted to look for interactions. We're not going to worry about that in this model. Select Run, and we've got to make a few modifications. Here you can choose which method you want, NIPALS or SIMPLS. SIMPLS is probably a little more statistically rigorous, but NIPALS works for our purposes. For the validation method, we do want validation, but we don't have very many samples, so we're going to use leave-one-out: each row in turn is pulled out and used as the validation. So we're going to start and just say Go. As you can see up here in the percent variation explained for the X and the Y, we're doing very well; the model is explaining quite a bit of the variation for both X and Y, from 90-some percent to almost 100%. That's great, but it's using nine latent factors. Remember, we only had four functional principal components, so let's see what happens when we change to four. Select Go, and we do lose something in our model, but it's not bad; we're still getting a decent overall fit, and that's the one we're going to go with. We're going to use that model instead of the more complicated model with nine latent factors, so I'm actually going to remove this fit, and then we're going to look at this four-factor partial least squares fit. What we're looking for down here is that the data isn't spread out in some wild fashion; in the score plots, the data sits somewhere close to the fit line, and we're okay with that. Looking at other parts of this, again at how much of the variation is being explained, we're at about 97% there and almost 99% here for Y, and that's good. Let's look at a couple of other things while we're here. Look at the percent variation plots, which give us an idea of how these spectra are different, and we can see that latent factor one explains a fair amount of the differences, but latent factor two explains the more important part. So that's where we're dialing in; three and four are still part of the model but they're not as important. Something else we can look at is this variable importance plot. There is a cutoff value here, 0.8, that dotted red line. If you wanted to do variable reduction, you could do it here; you could actually lower the number of wavelengths you're looking at, but we're going to leave that as is. And the way to actually make that change, to do the variable reduction, would be through this variable importance table of coefficients on centered and scaled data; you could make a model based on just the important variables. Again, that dotted line is the cutoff line, and a fair number of those wavelengths would be cut out of the model.
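Outside JMP, the same comparison of four versus nine latent factors can be sketched with scikit-learn's PLS (NIPALS-based) and leave-one-out cross validation; the spectra and octane values below are random stand-ins, so only the mechanics carry over.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(4)
X_sub = rng.normal(size=(15, 400))           # stand-in for the 15 selected spectra
y_sub = rng.normal(loc=88, size=15)          # stand-in octane ratings

for n_comp in (4, 9):
    pls = PLSRegression(n_components=n_comp, scale=True)
    pred = cross_val_predict(pls, X_sub, y_sub, cv=LeaveOneOut()).ravel()
    press = float(np.sum((y_sub - pred) ** 2))
    print(f"{n_comp} latent factors: leave-one-out PRESS = {press:.3f}")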
But again, we're going to keep that, and we're going to go up here to the red hotspot, go to Save Columns, and save the prediction formula back to the data table. I'll minimize that. Now we've got this formula out here; that's our new formula. If I go to Analyze, Fit Y by X, use octane, grab that formula and say OK, great, we fit the model and our R squares are around .99. That's really great, but the question is, how does that work for the rest of the data? I'll show you that in a minute. Before we get there, I want to show you a separate method, another option, and I'll show you the setup and the model. You would go to Analyze, Fit Model, and do Recall, and this time, instead of partial least squares, we're going to use generalized regression. Select that, we've got a normal distribution, and we can go ahead and select Run. We're going to change a few things here. Instead of using the lasso, we're going to use the elastic net, and then under the advanced controls we'll make a change in a second. For the validation method, remember we used leave-one-out, so we'll change to that, and we're going to select the early stopping rule. And we're also going to make this change here under the advanced controls. This is what really drives why I use generalized regression at all: it helps make a simpler model. If you blank out that elastic net alpha and then run your model, when I click Go it steps through the lasso, the elastic net, and ridge regression, all those steps, so it fits all those models, or tries to, and then gives you the best elastic net alpha. Doing that takes a little bit of time, because you're building all those different models, so I'll show you the outcome in this fit right here, which I had done earlier. This is the actual output I got from that model, again with leave-one-out, and it gave me 41 nonzero parameters. If I show you the other model, the partial least squares model uses 400 wavelengths. So we've basically reduced the number of active factors by a factor of 10 with this elastic net model. We can look at the solution path and change things, reduce the number of factors or add more, but for the most part we'll just leave the model as is. We would save this model back to our data table; I've already done that. Now let's compare those. That's this model right here, the information over here on the left. I went past it too fast. This highlighted column: if I right-click there and go to the formula, I can look at it, and these are the important wavelengths, the important wavelengths for that model for predicting octane. If I look at the partial least squares model and go to the formula there, this is the partial least squares model, and again it uses all 400 wavelengths. So it's a much more complicated model. You're more than welcome to use it; it's actually a very good model, so there's no reason not to, but if you can build a simpler model, that's always a good thing. Alright, so we've got these formulas in our new subset table and we want to transfer them back to the original data table.
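A comparable elastic net fit, with the mixing parameter chosen by cross validation and a count of the nonzero wavelength coefficients, can be sketched with scikit-learn as below; this is only an analogy to the generalized regression step, not JMP's implementation, and the data are the same kind of random stand-ins as in the PLS sketch.

import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X_sub = rng.normal(size=(15, 400))           # stand-in spectra
y_sub = rng.normal(loc=88, size=15)          # stand-in octane ratings

# Let cross validation pick the mixing parameter (l1_ratio here, "elastic net alpha" in JMP terms)
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=LeaveOneOut(), max_iter=50000)
make_pipeline(StandardScaler(), enet).fit(X_sub, y_sub)

n_active = int(np.sum(enet.coef_ != 0))
print(f"chosen l1_ratio = {enet.l1_ratio_}, nonzero wavelength coefficients = {n_active}")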
So again, right click formula, copy this formula, right, and then you would go back to your data table, make a new formula column over here. Right click Go to formula and paste that formula in your data set. All right. Well, I've already done that to save some time. Okay. And we've got, I've got both models there. I've got the partial least squares model in there, and really, what we're going to come down to, is we're going to go to Analyze fit Y by X. And we're going to go to octane rating, right, and I've previously done the PLS analysis. Now this model was built with a...48 samples were the training set and 12 for the validation side. Alright, so that's there. I've got my generalized regression formula, and I've got my octane prediction formula. Actually, this is the other PLS approach right here. And this one. And we're going to add those, and we're gonna say okay and compare those. And now you can see in here where we're doing very well overall. The models are doing well. We're still doing about 97% for our generalized regression model, in the end, which is still good. The PLS model beats it out a little bit, but then, remember, that's a much more complicated model. And overall, you know, we've built this nice predictive model that we can share with others. And as you get new spectra entries, analyze new spectra, all you have to do is drop those wavelengths into your data table and see what the octane rating is. All right. So you've made that analysis, you've made that comparison. And if nothing changes, in the day or over a period, of course, of a couple of days with your calibration, this should be a good model. It should be a sustainable model for you. So with that, I believe, I'm going to go back to, well, let me show you one more thing. I'm going to go to another data set that I wanted to share with you. This is the... as I'm trying to find it...I go to my home window... And this is a mass spec study for, actually it's a prostate cancer study and there's some unusual data with that. Right. And I'll want to show you... there's a couple of different ways, but what I want to show you is, pull this in here. Right. So instead of...this is abnormal versus normal status and... Showing you the power of the tools for...let's go to Analyze and then all the process, model driven multivariate...before I go there, let me color on status. Alright, so we'll do that, we'll give them some markers here. Okay. We're gonna go to Analyze... quality and process, model driven multivariate control chart. All of our wavelengths. Right. It takes a second to output, but I've got all the, right now, looks like I've got all the normal data selected, right, so that's what you're seeing there, if I click there. The red circles are the abnormal data, and for the most part, we see that there's a lot more of those out of control, compared to the normal data, right. The nice thing about this is if I could pull up one of those, I can start seeing which portion of the data is different than what we're seeing with the so called in control or normal data, right, and... Oh, Back to that. There. Gonna show you something else. Go to...we want to monitor the process again or look at the process a little deeper. So let's go to the right hotspot, monitor the process and then we're going to go to the score plot, right. So now we can compare these two groups. Well, we have to do a little bit of selection here. So let's go back to the data table. Right click, say select matching cells. 
And we're gonna go back over here and that's all selected, so that we're going to make that abnormal group, our group A, right. Go back to the data table. And scroll down. Select normal. Select matching cells and now that's going to be our Group B so now we can compare those. And now we can see where there's differences in the spectra, like, so this is maybe on the more normal side that you won't see in the abnormal side, right. But you're gonna...there are a lot more differences that you're going to see in the abnormal side that you would not see in the normal side, right. So this allows you to, again, dig deeper and better understand that. And finally, if I do this analysis for the functional data explorer with this grouping... Again rows as functions. Right. Y output. Status is our supplementary variable. Sample IDs, ID function. Say okay. And we'll fit this again with a P-spline model. This will take a second. While we're waiting for this to happen, I'm just going to show you, at the end, the generalized regression portion of this will be done, but I just want to show you what it's like looking at a categorical data set with the functional data explorer. Using that functional DOE capability. It ends up being, could be very valuable. And when you're looking for differences in spectra. And again, this is mass spec data. This isn't your IR data. This is mass spec data. We fit it, we've looked at our different spetra, how it's fit and we're happy with that. We can look at the functional principal components. Can look at the score plots. Let's look at the functional DOE. And again, where do we see differences? If I go over here and we're looking at abnormal spectra. It doesn't have this peak that the normal does, right, so now we can look at that and see, you know, again, help us better understand what differences we might see. All right. And in closing, let's go to back to the PowerPoint. Alright so what this process allows you to do is compress the available information with respect to wavelengths or mass or whatever it happens to be. Use this covariate DOE to help you select the so called corners of the box for getting a good representative sample of data to analyze. Model that data with a partial least squares, generalized regression. You can also use more sophisticated techniques like neural nets. And as new spectra comes in, you put the data into the data table and you see where it falls out. So this is highly efficient or helps you be more highly efficient with your experimentation and your analysis. And again, build that sustainable empirical model. Looking forward, the data that I've used is fairly clean and we're looking at working with the our developers and looking at how we can preprocess the spectral data and get even better analysis and better predictive models.  
Christian Stopp, JMP Systems Engineer, SAS Don McCormack, Principal Systems Engineer, SAS   Generations of fans have argued as to who the best Major League Baseball (MLB) players have been and why, oft citing whichever performance measures best supported their case. Whether the measures were statistics of a particular season (e.g., most home runs) or cumulative of their career (e.g., lifetime batting average), such statistics do not fully relate a player’s performance trajectory. As the arguments progress, it would be beneficial to capture the inherent growth and decay of player performance over one’s career and distill that information with minimal loss. JMP’s Functional Data Explorer (FDE) has opened doors to new ways of analyzing series data and capturing ‘traces’ associated with these functional data for analysis. We will explore FDE’S application in examining player career performance based on historical MLB data. With the derived scores we will see how well we can predict whether a player deserves their plaque in the Hall of Fame…or is deserving and has been overlooked, as well as compare these predictions with those based solely on the statistics of yore. We’ll confirm Ted Williams really was the greatest MLB hitter of all time. What, you disagree?! Must be a Yankees fan…     Auto-generated transcript...   Speaker Transcript Christian So thank you, folks, for joining us here today at the JMP Discovery Summit, the virtual version. My name is Christian Stopp. I am a JMP systems engineer. And I'm joined today by my colleague Don McCormack, who's a principal systems engineer for JMP as well. And you probably got here because you saw the title of the talk. And you saw this was...you're a baseball fan about Major League Baseball players and wanted in or you saw it was about functional data explorer and you wanted to learn a little bit more about how to employ functional data explorer in different environments. So we're going to marry those two topics today. Don and I and I'm going to gear my conversation a little more for the baseball fans first. Just as we're having kind of common conversations among baseball players and baseball fans, you might think about how your favorite player does relative to other players and you might have with your friends, these conversations and hopefully they're kept, you know, polite about about who your favorite player is and why. And so that's kind of how I imagined this infamous conversation between Alex Rodriguez and Varitek going was just about who...comparing notes about who their favorite player was. And so for me, my origin started off, and like Don's, with respect to just be having a love for baseball and being interested in the baseball statistics that you'll find in the back of the bubble gum cards we used to collect. And so as you have these conversations about who your favorite player is, you might note that players differ with respect to how good they are, but also different things like when they age... as they age, where they peak, like where the performance starts to go off over time. And so as you're thinking about maybe like me the career trajectories of these players, you might want to question, Well, how do I capture or model that performance over time? Now, if you're oddly like me, you decide that you want to pursue statistics so that you can do exactly that. 
But I would encourage you to skip that route and be smarter than me and just use a tool like functional data explorer to help you turn those statistics...statistical curves into numbers to use for your endeavors. So for those of you who are a little less familiar with baseball, but what we'll be seeing is data reflecting things that are measures of baseball performance. So I'm going to be speaking about position players and position players bat. And so one of the metrics of their batting prowess is on-base percentage plus slugging percentage or OPS. And so on the Y axis, I've got that that measure for a couple of different players as they age. And the blue is Babe Ruth and the red is Ted Williams. And as you can see, you get a sense from these trajectories that they both appear to have about the same quality of performance over most of their careers. But you might know that where they peak might seem to be a little at an older age for Ted Williams, as opposed to maybe Babe Ruth. And Babe Ruth, it looks like he maybe needed just a little bit of time to just get up to speed to get to that measure if you're just looking at this plot without any other knowledge. So there's a lot of...this is just two players in the thousands of players or tens of thousands that you might be considering and just look at comparing, you can imagine there's a lot of variability about these characteristics of their career trajectories. So there's also clearly variability within a player's trajectory, too. So I might use the smoothing function of the Graph Builder here and just smooth out the noise associated with those curves a little bit, to get a better sense of the signal about that player's trajectory. And it turns out that that smoothing is is very similar to what's going on in that process that functional data explorer employs. So here I've got functional data explorer and again I'm...my metric here is on-base percentage plus slugging percentage, OPS. And I'm looking just to see...like we're comparing these these player trajectories, now, in FDE is, functional data explorer, is smoothing out those player curves, as you can see, and then extracting information about what's common across those curves. And so for every player now, what you get in return for doing that is, are scores that are associated with that player's performance. And so these scores describe the player's career trajectory in a nice little quantitative way for us to take away and use another analyses like we'll be doing. So it's just, you can see that a little bit, these are Hank Aaron scores. And in the profiler that you'll...that you can access in the functional data explorer, you can actually change...you can look at that trajectory here for that player's OPS over age and then change those values to reflect what that player's scores are and get a better...replicate their their career trajectory with those scores. Right, so that's a little bit about FDE and and how to employ it here. So you'll see Don and I talking about these statistics that we're now equipped with, these player scores that we get out from the functional data explorer, that gets it from those curves that we started off with. And so we're going to use that...some what we're doing is predicting like maybe Hall of Fame status. 
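For reference, OPS is simply on-base percentage plus slugging percentage; a minimal helper with the standard formulas is shown below, using hypothetical Lahman-style counting-stat inputs.

def ops(h, bb, hbp, ab, sf, singles, doubles, triples, hr):
    # On-base percentage and slugging percentage from standard counting stats
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    slg = (singles + 2 * doubles + 3 * triples + 4 * hr) / ab
    return obp + slg

# Example: 600 AB, 180 H (110 1B, 30 2B, 5 3B, 35 HR), 80 BB, 5 HBP, 5 SF -> about .926
print(round(ops(180, 80, 5, 600, 5, 110, 30, 5, 35), 3))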
And not only who's in the Hall of Fame that they belong there, or more more interestingly, like maybe, who are the players who are in the Hall of Fame that maybe shouldn't be because the stats don't support it or maybe identify players who the Hall of Fame committee seems to have snubbed. So we'll talk a little bit about just the different metrics that we used and how we kind of revised them. And then taking those career trajectories using FDE and then getting the scores out and doing the prediction, like we normally would with other things. So if you haven't followed baseball, the Hall of Fame eligibility...eligibility requirements are that a player had to play at least 10 years, so 10 seasons, and had to wait...you have to wait five years before you're eligible. And then you have 10 years during which you're eligible and folks can vote you in. So there's a couple of players we'll see that are still have...that are still waiting for the call. The hall uses a different selection criteria are primarily around how well the player performed, but also take into account these other things that the data source we're using, Lahman Database, doesn't include, so it's hard to measure. So we're just stick with analyses that reflect their statistical prowess on the on the field. And of course after, you know, 150 years of baseball players playing baseball, you might recognize that they're playing in different eras. And so we want to make sure that we're comparing the players to their peers. And so we're going to take that, you know, maybe the year that they played into account, or the position that they played since different requirements are associated...would typically be associated with different positions. And then different leagues have different rules; we'll weigh that in, too. That's where I'm gonna stop. Don's gonna kick over to pitching and then I'll come back and talk about position players. donmccormack So like Christian said, I'm going to talk a little bit about pitching but while I'm doing that, before that I'm doing that, what I would like to do is, I would like to illustrate some of those initial points that that Christian mentioned. The things that are good data analytic techniques, things that really need to be done, regardless of what modeling technique that you use, however, it turns out that they are good things to do before you model your data using FDE. I'm going to talk specifically about cleaning the data, about normalizing the data, so you can compare people equally, and then finally modeling the data. So as an illustration, what I've got...what you see on the screen right now, we are looking at three very different pitchers that are all in the Hall of Fame. The red line is Nolan Ryan, a very long career, about a 27-year career. The green line, the middle line, that's Hoyt Wilhelm. For some of you younger folks, you might not know who Hoyt Wilhelm; he pitched starting in the early 50s through 72, I believe. Fairly long career; spanned multiple eras. He was mostly a reliever but not a reliever like you might know of the relievers today. He's a guy who when he went out to relieve, yYou know, he might pitch six innings. Okay, so very, very atypical from the relievers today. The blue line is Trevor Hoffman, great closer for the San Diego Padres. But again, very different pitcher. So question is, I mean, what do we do, how do we get this data ready and set up in such a way where we can compare all three of these people equally? 
So first thing I mentioned is we want to clean up your data. And by the way, I'm going to use four different metrics. I'm going to use WHIP (walks and hits per innings pitched), strikeouts per nine, home runs per nine and a metric I've easily created called percent batters faced over the minimum, where I've just taken the number of batters a pitcher's faced divided by the total outs that they've gotten and subtracted one. The idea here is that if every batter that was faced made it out, then that would be a perfect one. Okay, I'm going to look at those four metrics. I've got different criteria in terms of how I define my normalization, in terms of how I am screening outliers and I'm going to include a PowerPoint deck for you to look at to get the details, but I'm not going to talk about them here for the sake of time. So first thing I'm gonna do is going to clean up the data. So you'll notice that, for example, that very first year Nolan Ryan pitched three innings pitched; very, very high WHIP. As a couple of seasons in here, I think that Trevor Hoffman pitched a low amount. So, so I'm going to start by excluding the data. That's nice. It's shrunk the range and it's always good to get out, get the outliers out of the data before you do the analysis. One other step that that I want to mention is that when I did FDE, when I used FDE on this data within the platform, it allows you to do some additional outlier screening where, even if you have multiple of columns that you're using, you only are screen...you're not screening out the entire row; you're only screening out the values for that given input, which is a very, very nice feature. So I use that as well because there were still, even with the my initial screening, there was still a few anomalies that I needed to get rid of. clean the data. Normalize it is the second. So by normalization, what I've done is, I basically normalized on the X axis. And I've normalized on the Y axis. So, what we're looking at here is the number of seasons. So each one of these seasons is taken as a separate whole entity, but we all know that in some seasons, some pitchers throw more innings than other seasons. So rather than looking at seasons as my entities, I'm going to look at the cumulative percent of career outs. So I know that, I know that at the end of the season pitchers made so many cumulative career outs, and that's a certain proportion out there, whole or total career outs. So I'm going to use that to scale my data. Now the great thing about that is, you'll notice that now all three pitchers are on the same x scale. Everything, everything is scaled from zero to one. So, so, really nice... from the standpoint of FDE analysis, a really nice thing to have. And then finally, I want to scale on the Y axis as well. And all I've done is I've divided the WHIP by the average WHIP for the pitcher type and for the era that they pitched in. So I have a relative WHIP. Now the other nice thing about about using these relative values is that I know where my line in the sand is. I know that a pitcher that has a relative WHIP of one is is an average pitcher. So in this case, I'm going to be looking for those guys that throw with WHIPs under one. So you'll notice that all three of these pitchers for the most of their career, they were under that that line at one. Now the final thing I'm going to do, is I want to use my FDE to model that trajectory, the trajectory. Now, one of the problems with using the data as is, the two problems with using the data, as is. 
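The metrics and the two normalizations described here are straightforward to reproduce; the sketch below uses a few made-up seasons for a generic pitcher (IPouts = outs recorded, BFP = batters faced) and a stand-in era average, just to show the arithmetic.

import pandas as pd

seasons = pd.DataFrame({
    "player": ["p1", "p1", "p1"],
    "H":      [200, 180, 160],
    "BB":     [90, 80, 70],
    "SO":     [150, 180, 210],
    "HR":     [15, 12, 10],
    "IPouts": [600, 630, 660],
    "BFP":    [900, 910, 920],
})

ip = seasons["IPouts"] / 3                                   # innings pitched
seasons["whip"] = (seasons["BB"] + seasons["H"]) / ip
seasons["so9"] = 9 * seasons["SO"] / ip
seasons["hr9"] = 9 * seasons["HR"] / ip
seasons["pct_bf_over_min"] = seasons["BFP"] / seasons["IPouts"] - 1

# x-axis normalization: cumulative percent of career outs
seasons["cum_pct_outs"] = (seasons.groupby("player")["IPouts"].cumsum()
                           / seasons.groupby("player")["IPouts"].transform("sum"))

# y-axis normalization: divide by the average WHIP for that era and pitcher type (stand-in value)
era_avg_whip = 1.35
seasons["relative_whip"] = seasons["whip"] / era_avg_whip
print(seasons[["whip", "so9", "hr9", "pct_bf_over_min", "cum_pct_outs", "relative_whip"]])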
One is that it's pretty bumpy, and it would be really hard to to estimate what the career trajectory is with all of these ups and downs. Second thing is, eventually what I want to do is, I want to use that metric that I've generated from the FDE, this trajectory to come up with some overall career estimate. So rather than looking at my seasons or at my cumulative percent as discrete entities, I want to be able to model that over entire continuous career. And we'll see that a little bit later on. So I am going to replace my percent my...I'm sorry about...conditional...my WHIP, my my relative WHIP with this conditional FDE estimate. Now, you might have seen me flip back to those two, you might say, oh boy, that is what a...what a what a huge difference between the two, is that really doing a good job? Kind of hard to tell from that graph. So, so what I, what I want to do is I'm going to actually show you what that looks like. So here what I've done is I pulled up the, the, the, the discrete values. This is Nolan Ryan, by the way. The discrete measurements for Nolan Ryan, along with his curve for his for his conditional FDE estimate, you'll see that it doesn't follow the same jagged path or bumpy path, but it does a good job estimating what his career trajectory is. And in general with his WHIP high, at first, he walked a lot of people, was a very, very wild pitcher, much more wild in the beginning part of his career, believe it or not. But as his career went on, that got better. And this is you'll see this in in any of the pitchers that I that I picked. So for example, if I go to, let's go to Hoyt Wilhelm. Here's Wilhelm. Again it doesn't capture the absolute highs and lows, but it does a good job at modeling the general direction of where, of where his career went. Okay, so let's let's use that to ask. I only have a limited amount of time. I wish I had more time because there's just some neat things I can show you. But I'm...I'm going to start with what I call the snubbed. Okay so these are the players that...so I used FDE on those four metrics I'd mentioned. I use those as inputs, along with the pitcher type and I tried a whole bunch of predictive modeling techniques. The two that that worked the best for me were naive Bayes and discriminate analysis. And I use those two modeling techniques to tell me who got in...who should be in and and and and who shouldn't be in and and that's what...what we're looking at here is, we're looking at those pitchers where both the naive Bayes and the discriminate analysis said yes, but the Hall of Fame said no. So these are my...this snubbed. So you'll notice that in this case...and let me switch to this. This is the apps. This is the relative WHIP. Let's go with the conditional WHIP. And let me go ahead and put that reference line back in there at one and you'll see, for the most part, these are pitchers, who spent the top...the bulk of their career under that one line. Now the other thing that you might might think of, looking at this data, is that wow, it would be really hard to tell these players apart. How do I compare these now, if I if I were to put, let's say, a few pitchers that were in the hall in this list, too. I mean, they would be...it'd be hard to separate them just by eyeballing them, because some of their career, they would be better than others, and they would switch on other parts of their careers. How do I, how do I deal with this on a career level? 
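The classification step itself is standard; as a sketch, the FPC scores (random stand-ins below) can be fed to naive Bayes and linear discriminant analysis in scikit-learn and compared by cross validation, with the "snubbed" list being the rows both models call yes while the actual indicator is no.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(5)
scores = rng.normal(size=(1400, 12))          # stand-in FPC scores (several metrics x a few FPCs)
in_hof = rng.random(1400) < 0.06              # stand-in Hall of Fame indicator

for name, model in [("naive Bayes", GaussianNB()),
                    ("discriminant analysis", LinearDiscriminantAnalysis())]:
    acc = cross_val_score(model, scores, in_hof, cv=5).mean()
    print(f"{name}: 5-fold accuracy = {acc:.2f}")
# "Snubbed" = rows where both fitted models predict True but in_hof is False.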
So as I mentioned earlier, one of the nice things about functional data explorer is that I can take that data, and I can I can I can create a career trajectory. Estimate a whole bunch of data points along that career trajectory. And I did that I actually broke up careers into 100 units and I summed over all those hundred units for each one of my curves. So basically, what I did is I got something like an area under the curve. If it were above that one line, I'd subtract, if it were below that one line, I would...I'm sorry...if I were above that line, I would add; below the line I would subtract. And if we look at total career trajectories...this is a, this is actually...this is a larger list. This is approximately 1300 or 1400 pitchers, so absolutely everyone who was... absolutely everyone who was Hall eligible, 10 years or more. So let's let's really quickly go into a couple of things we can do with this. Let's start...let me start out by by looking at the players that were snubbed. So these are...these are our player...this these are my players that were snubbed. So okay, so these are 100 values. So, so, so the line in the sand here would be 100 because I've got 100 different values I've measured. So you'll notice that for the most part, these players were above 100. Here's the list of, of the, of the players that didn't make the list. And if you take a look at these players, you'll notice that there are a couple of guys in here that are obvious. People like Curt Schilling and the and the and the Roger Clemens for non non non career reasons for the for the for the for the for the...that some of the other criteria that Christian mentioned,are not in there. But there's some guys, for example, Clayton Kershaw who's still not done with his career. But there certainly are other people who you might consider..that are Hall eligible. So let's actually, let's look at that, too. So let's look at those folks who are who are Hall eligible, who have not been in the hall... BJ Ryan; again Curt Schilling is in there; Johan Santana, not sure why he didn't make it in the hall; Smokey Joe Wood, pitcher from the early part of the 1900s; and so on. So, the ability for FDE to allow me to extract values from anywhere along their career trajectory is is is an extra tool for me to be able to estimate some additional criteria, in terms of who belongs in the hall and who doesn't belong in the hall. So, enough said about the pitchers. Let's...I'm gonna turn it back to Christian so we can talk a little bit more about the position players. Christian Excellent. Thank you, Don. Right. Okay so Don was talking about the pitching...the pitchers and so I'm looking, I'll be looking at the position donmccormack players, and so there's two different components that go into that. Christian You have your, your batting prowess, as well as your fielding prowess and I took a little different take than than Don did, with respect to just looking at the statistics and then building models. I ended up starting off with just four of the more common batting statistics, and those are the first four on the list here, some of what you'd find the backs of baseball cards. And then as I was progressing, as we'll see, I needed something to capture stolen bases, because the first four don't really...don't do that at all. And so I created a metric I call the base unit average that brings into other base runner movements that...to give credit to the batter for those things. 
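The career-summary idea can be sketched as follows: evaluate the smoothed relative metric at 100 evenly spaced points along the scaled career and sum them, so an exactly average pitcher totals about 100; the curve below is made up purely for illustration.

import numpy as np

def career_total(curve, n_points=100):
    # Sum of the relative metric at 100 points along the scaled career (0 to 1).
    # A total of 100 is "average"; for WHIP-type metrics lower is better,
    # for strikeouts per nine higher is better.
    grid = np.linspace(0, 1, n_points)
    return float(np.sum([curve(t) for t in grid]))

example_curve = lambda t: 0.9 + 0.15 * (t - 0.5) ** 2     # made-up relative-WHIP trajectory
print(round(career_total(example_curve), 1))               # a bit under 100, i.e., better than average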
And then the fielding, of course, is a factor as well as we'll see, so I included a couple of metrics for fielding. And so like Don like just mentioned earlier, I wanted to make sure I compared apples to apples, so I'm looking at with reference to position and league and year for this those statistics I mentioned. And then when like Don, I wanted to make sure I I weighted those smaller sample sizes appropriately so they weren't gumming up the system. And so I ended up weighting players' performance relative to the number of plate appearances relative to kind of the average for that league year at a particular lineup slot on how many plate appearances that slot should get over the course of the season. So that's how they're weighted. Right. So let's, let's see what that looks like. We're going to go back and visit Ted Williams again here. So we've got Tim Williams' career on the left here, we saw, and these are the raw scores. And then it looks like he had a really poor season here. But if, once you take a relative component of that, you can see it's actually an average season like Don, it's still above that average line of one. And so it was just a kind of a poor season for Ted Williams on his own standards. And then we saw earlier that these two peaks for Ted Williams might have resembled were his peak performance, but it turns out that those are seasons where he had smaller numbers of samples...of played appearances due to his being...going off to the Korean War. So he ended up having that impact his scores. I weighted accordingly back toward the average again because of the smaller sample sizes. So, that's how we, the types of data. I'm going to focus on just the relative statistics in my conversation here and just focus on some of the things that caught my eye. There we go. And we'll do that. Need the table of numbers here to feed in from the FDE. So here's the scores that we're going to be looking at, the relative FPC scores from the FDE. And what the first thing I saw, I included a four variables in my model, those first four batting statistics, and I wanted to just make sure I had the right components in my, in my analysis. So on the left axis here is the model-driven probability of being in the Hall of Fame. Now what...excuse me, that's the y axis, on the x axis is whether or not the person actually is in the Hall of Fame. And so my misclassification areas are these two sections here. And I noted that there were some players down here more than I was kind of expecting. So I was exploring and we might explore variables that I didn't yet include, like stolen bases. And so I'll pop those in for color and size. And as you can see, it seemed pretty clear to me that stolen bases is definitely a factor that the Hall of Fame voters were taking into account. These are... so the color and size are relative to the number of stolen bases over their career. And this is what drove me to create that base unit average statistic that I then used. So adding in...as I was exploring those models I as I described, I started off with four statistics and then added in that BUA statistic. This is my x axis now. And then I added in fielding statistics and we what we have here is a parallel plot, where the y axis is again...is a probability of the model suggesting the player should be in the Hall of Fame and each of the lines now is a player. And so the color represents their Hall of Fame status. Red is yes, there were already admitted, and blue is no. 
And so I like this plot because it allows me to look at to see who's moving. If I can see the impact of those additional variables in the model. And of course the first thing that caught my eye was this guy here, that how it popped up from being a not really to adding the stolen base component, and we can see that he's a high probability being elected to Hall of Fame and so belonging, depending on how you look at it. And it's Ricky Henderson, who happens to be the career leader of stolen bases. Now another player, and just looking at the defensive side of things, is Kirby Puckett, whose initial statistics suggest that, based on the initial model, that he makes it; he's qualified sufficiently just across the line. But then, you know, back if you add in the stolen base component, yeah, he actually doesn't seem to qualify any longer. And then finally, we put in the fact that he's, he's a really good fielder, he won a number of golden gloves playing the center field for the twins, we see that he's back in the good graces of the Hall of Fame committee, and rightfully voted in. This is kind of a messy model. Not messy model and you did a lot of stuff going on here. So, I ended up adding in my local data filter so I could kind of look at each position individually. And here for first base, it's a lot easier to see that we have the, the folks in red and then in blue. Now we've got somebody here, this is Todd Helton who, at least in all the models that we were looking at suggest that he should be admitted to Hall of Fame and he's still eligible. So he's still waiting to the call. But someone like Dick Allen, there's also blue, not in. His numbers, at least based on the the summary stats, the FDE statistics that we're using and the models suggest he shouldn't...he belongs in the Hall. And there are other folks who are red, down in the bottom, like Jim Thome, who the models suggest he doesn't really belong, but he was voted in. So, different ways of exploring those different relationships among, as we add in those predictors. Now, like Don, I wanted to get a sense of, well, who's, who was snubbed and who might have been gifted or at least had, you know, non statistically oriented components to his consideration. And so I, like Don, running a number of models and settled on four models that I was, I liked and did the best job of... predictive job, and like Don, rather than just using age in my FDE as the x axis, I also based it on a cumulative percent played appearances. And so that would...having these two different variants gave me a number of models to look at. And so I drilled down to just the folks who across all eight models, are in the Hall of Fame, but none of the models suggest they should be. And that's this line here. There's 31 of those. And the reverse side I have in green here, the folks who the models in either of the buckets...the majority the models in either bucket of age versus Kimball diff percent of plate appearances suggest they do belong in the Hall of Fame, but they're not. So I pulled all these folks out and just like Don wanted to, just compare what what are their trajectories look like and is there...are they close at least, or is there something else going on here? And so you can see from the this is the on on base percentage plus slugging percentage, OPS, again. 
It certainly looks like, in red and the plus signs, that the folks who were snubbed performed a lot better on this metric, and as it turns out, every other offensive stat metric better than the gifted folks, the folks who are in, but the model suggests shouldn't be. And that made me think, Well, is it, is it just the offensive stats that are and maybe the fielding is where the, the, the folks who were in already shine? And based on what at least fielding percentage, it actually suggests that there that still is the case, where... actually this is this snubbed folks. The, the gifted folks still look like they were... they don't necessarily belong as much as the these snubbed folks do. It was only on the range factor component where the tide reversed. And so you end up seeing the gifted folks outweigh the snubbed folks who performed better. That's another different take, much like Don's, that you can use to evaluate just what the components are included in your model. A lot of different ways we can look at the data here. So just wrapping up because I'm sure some of you are just burning to know who is snubbed and who is gifted among those folks. These are some of the folks that were snubbed, at least among the position players and, like Don mentioned for some of his pitchers, there's a few of these folks who are banned from baseball, so they're not exactly snubbed, so. you probably recognize some of these. And then these are some of the players who were gifted, or at least it the criteria of their statistics alone is...it may not have been what got them in the Hall of Fame. Right, so just wrapping up where we've been, we've been able to take those player career trajectories of their performance on...pick a metric and put that into the functional data explorer and get out numerical summaries that capture the essence of those curves. And then, in turn, use those statistics those scores that we get to be able to put those in our traditional statistic techniques that we're familiar with. And so now we can change that question from how you model or quantify career trajectory and revise it to a question of what do I want to explore with these FPC scores I've got? So we hope you enjoyed talking about baseball and just that interaction to baseball and JMP and FDE. And hope you feel empowered to go and take the FDE tool that's available in JMP Pro to address questions with data like who your favorite player is and why, and have the means of backing it up. Thanks for joining us. Take care. donmccormack Okay, so how do we deal with these cases where we need to look at somebody's career trajectory? Are there other metrics where we can make these comparisons, so that we could tell these really fine gradations apart? So as I as I alluded to earlier, what we could do is we could we could certainly we could we could look at absolutely any point along the along the person's career trajectory with any amount of gradation that we want to. And I did that. I took 100 data points, 100 values between zero and one, start of the career, end of the career, and I summed up over all those values. And I did this...the nice thing about this technique is that I can do it for multiple metrics. So, so now what we're looking at here is we are looking at, we're looking at a plot of all four metrics. We can plot them all on one graph. We're going to go back again to that group of folks that were that were snubbed, these folks here. 
So that's so...so if we take a look at these folks, we see that they had a low...by the way, 100 in this case because there were 100 observations. hundred home runs per nine, you want that low; percent batters faced over the minimum, low; and then the strikeouts over nine innings, you want on the high side. You'll notice that that's kind of the trajectory that folks follow. Now then, the interesting thing about this point is, that what I can do is, I can use any criteria that I want to. So for example, let's say I'm going to look at...I'm going to consider all my players and I only want to consider those people who had A WHIP that was below, in this case, 100...so better than that...that's actually that's...even make it better than that. Let's say 90 or below. Okay, so let's look at those folks who, you know, at least have the average number of strikeouts per nine innings, and maybe their batters per...percent batters faced over 100 is at a minimum. And so, and I'll disregard home runs for nine here. I also, you could also standardize and normalize by the number of seasons and I've done that exactly. So what I want to do is I want to look at those players that maybe only have 10 season equivalents, where a season equivalent is based on what was the average player season like. All right. And then finally, what kind of workload they had over their, their entire career. And let's say we want somebody who had at least 80%, let's make a little bit more stricter, let's say, let's say, about the same workload. And again, we can use different criteria to weed out those folks who we don't think we should consider and those folks who we do think we consider and then using those criteria... I also want to say let's let's take a look at those folks that are not in the Hall of Fame. So here we go. Now we have a list of people who are worth considering. And you'll notice that they're they're quite a few folks folks that probably shouldn't surprise you. These are folks that are either not in the hall yet because they're still playing or just have been disregarded know, Chris Sale, for example, is still pitching. Curt Schilling, for obvious reasons is not the hall. Johan Santana, why, why isn't he in the hall? He was actually part of that group that that that were snubbed. So the nice thing about using these FDEs is that you can take them, turn them into your career trajectories, and then use an additional metric to be able to determine hall worthiness and non Hall worthiness.  
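The final screening step amounts to filtering a table of career totals on several criteria at once; a pandas sketch with made-up totals (column names hypothetical) is below.

import pandas as pd

totals = pd.DataFrame({
    "player":             ["p1", "p2", "p3"],
    "whip_total":         [88.0, 104.0, 86.0],
    "so9_total":          [120.0, 95.0, 101.0],
    "bf_over_min_total":  [97.0, 106.0, 99.0],
    "season_equivalents": [12.0, 9.0, 15.0],
    "in_hall_of_fame":    [False, False, True],
})

candidates = totals[(totals["whip_total"] <= 90)
                    & (totals["so9_total"] >= 100)
                    & (totals["bf_over_min_total"] <= 100)
                    & (totals["season_equivalents"] >= 10)
                    & (~totals["in_hall_of_fame"])]
print(candidates["player"].tolist())          # -> ['p1']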
Wenjun Bao, Chief Scientist, Sr. Manager, JMP Life Sciences, SAS Institute Inc Fang Hong, Dr., National Center for Toxicological Research, FDA Zhichao Liu, Dr., National Center for Toxicological Research, FDA Weida Tong, Dr., National Center for Toxicological Research, FDA Russ Wolfinger, Director of Scientific Discovery and Genomics, JMP Life Sciences, SAS   Monitoring the post-marketing safety of drug and therapeutic biologic products is very important to the protection of public health. To help facilitate the safety monitoring process, the FDA has established several database systems including the FDA Online Label Repository (FOLP). FOLP collects the most recent drug listing information companies have submitted to the FDA. However, navigating through hundreds of drug labels and extracting meaningful information is a challenge; an easy-to-use software solution could help.   The most frequent single cause of safety-related drug withdrawals from the market during the past 50 years has been drug-induced liver injury (DILI). In this presentation we analyze 462 drug labels with DILI indicators using JMP Text Explorer. Terms and phrases from the Warnings and Precautions section of the drug labels are matched to DILI keywords and MedDRA terms. The XGBoost add-in for JMP Pro is utilized to predict DILI indicators through cross validation of XGBoost predictive models by the term matrix. The results demonstrate that a similar approach can be readily used to analyze other drug safety concerns.   Auto-generated transcript...   Speaker Transcript wenjba It's my pleasure to talk about obtaining high quality information from the FDA drug labeling system here at JMP Discovery. Today I'm going to cover four parts. First, I'll give some background on drug post-marketing monitoring and the efforts from the FDA, other regulatory agencies and industry. Then I'm going to use a drug label data set to analyze the text with Text Explorer in JMP, use the XGBoost add-in for JMP to analyze the DILI information, and then give the conclusions. The XGBoost tutorial by Dr. Russ Wolfinger is also presented at this JMP Discovery Summit, so please go to his tutorial if you're interested in XGBoost. According to the FDA's description of the drug development process, it can be divided into five stages. The first two stages, discovery and research and preclinical, mainly involve animal studies and chemical screening, and the later three stages involve humans. JMP has several products, including JMP Genomics, JMP Clinical, JMP and JMP Pro, that cover every stage of drug development. JMP Genomics is the omics system that can be used for omics and clinical biomarker selection, JMP Clinical is specific to clinical trials and post-marketing monitoring of drug safety and efficacy, and JMP Pro can be used for drug data cleaning, mining, target identification, formulation development, DOE, QbD, bioassay and so on. So JMP can be used at every stage of drug development. In drug development there is a most frequent single cause, called DILI, that can stop a clinical trial; a drug can be rejected for approval by the FDA or another regulatory agency, or be recalled once it is on the market. This is the most frequent single cause, DILI, and you can find the information in the FDA guidance and other scientific publications. So what is DILI?
This actually is drug-induced liver injury, called DILI and you have FDA, back almost more than 10 years ago in 2009, they published a guide for DILI, how to evaluation and follow up, FDA offers multiple years of the DILI training for the clinical investigator and those information can still find online today. And they have the conferences, also organized by FDA, just last year. And of course for the DILI, how you define the subject or patient could have a DILI case, they have Hy's Law that's included in the FDA guidance. So here's an example for the DILI evaluation for the clinical trial, here in the clinical trial, in the JMP Clinical by Hy's Law. So the Hy's Law is the combination condition for the several liver enzymes when they elevate to the certain level, then you would think it would be the possible Hy's Law cases. So you have potentially that liver damages. So here we use the color to identify the possible Hy's Law cases, the red one is a yes, blue one is a no. And also the different round and the triangle were from different treatment groups. We also use a JMP bubble plot to to show the the enzymes elevations through the time...timing...during the clinical trial period time. So this is typical. This is 15 days. Then you have the subject, starting was pretty normal. Then they go kind of crazy high level of the liver enzyme indicate they are potentially DILI possible cases. So, the FDA has two major databases, actually can deal with the post-marketing monitoring for the drug safety. One is a drug label and which we will get the data from this database. Another one is FDA Adverse Event Reporting System, they then they have from the NIH and and NCBI, they have very actively built this LiverTox and have lots of information, deal with the DILI. And the FDA have another database called Liver Toxic Knowledge Base and there was a leading by Dr. Tong, who is our co are so in this presentation. They have a lot of knowledge about the DILI and built this specific database for public information. So drug label. Everybody have probably seen this when you get prescription drug. You got those wordy thing paper come with your drug. So they come with also come with a teeny tiny font words in that even though it's sometimes it's too small to read, but they do contain many useful scientific information about this drug Then two potions will be related to my presentation today, would be the sections called warnings and precautions. So basically, all the information about the drug adverse event and anything need be be warned in these two sections. And this this drug actually have over 2000 words describe about the warnings and precautions. And fortunately, not every drug has that many side effects or adverse events. Some drugs like this, one of the Metformin, actually have a small section for the warning and precautions. So the older version of the drug label has warnings and precautions in the separate sections, and new version has them put together. So this one is in the new version they put...they have those two sections together. But this one has much less side effects. So JMP and the JMP clinical have made use by the FDA to perform the safety analysis and we actually help to finalize every adverse event listed in the drug labels. So this is data that got to present today. So we are using the warning and precaution section in the 462 drug labels that extracted by the FDA researchers and I just got from them. And the DILI indicator was assigned to each drug. 1 is yes and the zero is no. 
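As a rough illustration of the Hy's Law screen described above: JMP Clinical provides this as a built-in report, but the idea can be sketched in Python. The column names below are hypothetical, and the thresholds are the commonly cited reading of Hy's Law (aminotransferase at least 3x the upper limit of normal together with bilirubin at least 2x, without a cholestatic explanation), not the exact logic of the JMP Clinical report.

```python
import pandas as pd

# Hypothetical lab data: one row per subject, values already expressed
# as multiples of the upper limit of normal (xULN).
labs = pd.DataFrame({
    "subject": ["S01", "S02", "S03"],
    "alt_xuln": [5.2, 1.1, 3.4],   # alanine aminotransferase
    "ast_xuln": [4.8, 0.9, 2.1],   # aspartate aminotransferase
    "bili_xuln": [2.5, 1.0, 0.8],  # total bilirubin
    "alp_xuln": [1.2, 1.1, 2.5],   # alkaline phosphatase
})

# Flag possible Hy's Law cases: ALT or AST >= 3xULN, bilirubin >= 2xULN,
# and ALP below 2xULN (no cholestatic pattern).
labs["possible_hys_law"] = (
    (labs[["alt_xuln", "ast_xuln"]].max(axis=1) >= 3)
    & (labs["bili_xuln"] >= 2)
    & (labs["alp_xuln"] < 2)
)
print(labs[["subject", "possible_hys_law"]])
```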
So from this distribution, you can see that about 164 drugs have potential DILI cases and 298 do not. The original format of the drug label data is XML, and JMP can import many of those files at once. The DILI keyword list was a multi-year effort by the FDA: experts read hundreds of drug labels and decided which terms could indicate potential DILI cases, and they arrived at about 44 words or terms to serve as keywords. You may also have heard of MedDRA, the Medical Dictionary for Regulatory Activities. It has several levels of standardized terms, and the most popular level, the preferred terms, is what I'm going to use today. If we pull everything in the Warnings and Precautions sections together, we have over 12,000 terms. You can see that "patients" and "may" are dominant, and they should not be related to the medical information in this case. Once we remove those, no other word is quite so dominant in the word cloud, but there are still many medically unrelated words, like "use" and "reported," that we could remove from the analysis list. In Text Explorer we can put them into the stop words, and we would normally also use the other Text Explorer techniques — stemming, tokenizing, regex, recoding, and manual deletion — to clean up the list. But with 12,000 terms that could be very time consuming. Since we already have a list of the terms we are interested in, we want to take advantage of it. So what we're going to do, and what I'll show in the demo, is use only the DILI keywords plus the MedDRA preferred terms to generate the terms and phrases for the prediction. Here is the example using only the DILI keywords. In the list you can see a count next to each term showing how many times it appears in the Warnings and Precautions sections, and the word cloud gives a more colorful, graphical view for recognizing patterns. Then we add the medically related MedDRA terms, which brings us from 12,000 terms down to 1,190 terms covering the DILI keywords and the medical preferred terms. We think this is a good term list to start the analysis with. Next, in Text Explorer, we can save the document term matrix: a 1 means the document contains the term, and a 0 means it does not. Then, for XGBoost, we make k-fold columns — three k-fold columns, each with five folds. We use the XGBoost tree model, which is an add-in for JMP Pro, with the DILI indicator as the target variable and, as predictors, the DILI keywords plus the MedDRA preferred terms that appear more than 20 times. Then we run cross-validated XGBoost with 300 iterations.
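A minimal sketch of the restricted document-term matrix idea — matching label text only against a fixed list of DILI keywords and MedDRA preferred terms — using Python's scikit-learn rather than JMP's Text Explorer. The two label texts and the four-term vocabulary below are hypothetical stand-ins for the 462 labels and the 1,190-term list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical Warnings and Precautions texts.
warnings_text = [
    "Hepatotoxicity and jaundice have been reported in patients ...",
    "May cause dizziness; no hepatic failure events were observed ...",
]
# Tiny stand-in for the DILI keyword + MedDRA preferred term list.
vocabulary = ["hepatotoxicity", "jaundice", "hepatic failure", "liver injury"]

# binary=True yields the 0/1 document-term matrix described above;
# ngram_range=(1, 2) lets two-word preferred terms match as phrases.
vec = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2),
                      lowercase=True, binary=True)
dtm = vec.fit_transform(warnings_text)
print(vec.get_feature_names_out())
print(dtm.toarray())
```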
Now we got statistical performance metrics, we get term importance to DILI, and we get, we can use the prediction profiler for interactions and also we can generate and the save the prediction formula for new drug prediction. So I'm going to the demo. So this is a sample table we got in the in JMP. So you have a three columns. Basically you have the index, which is a drug ID. Then you have the warnign and precaution, it could have contain much more words that it's appeared, so basically have all the information for each drug. Now you have a DILI indicator. So we do the Text Explorer first. We have analysis, you go to the Text Explorer, you can use this input, which is a warning and precaution text and you would you...normally you can do different things over here, you can minimize characters, normally people go to 2 or do other things. Or you could use the stemming or you could use the regex and to do all kind of formula and in our limitation can be limited. For example, you can use a customize regex to get the all the numbers removed. That's if only number, you can remove those, but since we're going to use a list, we'll not touch any of those, we can just go here simply say, okay, So it come up the whole list of this, everything. So now I'm going to say, I only care about oh, for this one, you can do...you can show the word cloud. And we want to say I want to center it and also I want to the color. So you see this one, you see the patient is so dominant, then you can say, okay this definitely...not the... should not be in the in analysis. So I just select and right click add stop word. So you will see those being removed and no longer showed in your list and no longer show in the word cloud. So now I want to show you something I think that would speed up the clean up, because there's so many other words that could be in the system that I don't need. So I actually select and put everything into the stop word. So I removed everything, except I don't know why the "action" cannot be removed. And but it's fine if there's only one. So what I do is I go here. I said manage phrase, I want to import my keywords. Keyword just have a... very simple. The title, one column data just have all the name list. So I import that, I paste that into local. This will be my local library. And I said, Okay. So now I got only the keyword I have. OK, so now this one will be...I want to do the analysis later. And I want to use all of them to be included in my analysis because they are the keywords. So I go here, the red triangle, everything in the Text Explorer, hidden functions, hidden in this red triangle. So I say save matrix. So I want to have one and I want 44 up in my analysis. I say okay. So you will see, everything will get saved to my... to the column, the matrix. So now I want to what I want to add, I want to have the phrase, one more time. I also want to import those preferred terms. into the my database, my local data. Then also, I want to actually, I want to locally to so I say, okay. So now I have the mix, both of the the preferred terms from the MedDRA and also my keywords. So you can see now the phrases have changed. So that I can add them to my list. The same thing to my safe term matrix list and get the, the, all the numbers...all the terms I want to be included. And the one thing I want to point out here is for these terms and they are...we need to change the one model format. This is model type is continuing. I want to change them to nominal. I will tell you why I do that later. 
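A minimal sketch of the cross-validated XGBoost step in Python (the talk itself uses the XGBoost add-in for JMP Pro with three k-fold columns). The data here are a synthetic stand-in for the 462-label term matrix, and the hyperparameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier  # pip install xgboost

# Synthetic stand-in for the binary document-term matrix and DILI indicator.
X, y = make_classification(n_samples=462, n_features=200, n_informative=20,
                           weights=[0.65, 0.35], random_state=1)

model = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
print(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
```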
So now I have, I can go to the XGBoost, which is in the add in. We can make...k fold the columns that make sure I can do the cross validation. I can use just use index and by default is number of k fold column to create is three and the number of folds (k) is within each column is five, we just go with the default. Say, okay, it will generate three columns really quickly. And at the end, you are seeing fold A, B, C, three of them. So we got that, then we have... Another thing I wanted to do is in the... So we can We can create another phrase which has everything...that have have everything in...this phrase have everything, including the keywords and PT, but I want to create one that only have the only have only have the the preferred term, but not have the keyword, so I can add those keywords into the local exception and say, Okay. So those words will be only have preferred terms, but not have the keywords. So this way I can create another list, save another list of the documentation words than this one I want to have. So have 1000, but this term has just 20. So what they will do is they were saved terms either meet... have at least show up more than 20 times or they reach to 1000, which one of them, they will show up in the my list. So now I have table complete, which has the keywords and also have the MedDRA terms which have more than 20, show more than 20 times, now also have ??? column that ready for the analysis for the XGBoost. So now what I can do is go to the XGBoost. I can go for the analysis now. So what I'm going to do show you is I can use this DILI indicator, then the X response is all my terms that I just had for the keyword and the preferred words. Now, I use the three validation then click OK to run. It will take about five minutes to run. So I already got a result I want to show you. So you have... This is what look like. The tuning design. And we'll check this. You have the actual will find a good condition for you to to to do so. You can also, if you have as much as experience like Ross Wolfinger has, he will go in here, manually change some conditions, then you probably get the best result. But for the many people like myself don't have many experienced in XGBoost, I would rather use this tuning design than just have machine to select for me first, then I can go in, we can adjust a little bit, it depend on what I need to do here. So this is a result we got. You can use the...you can see here is different statistic metrics for performance metrics for this models and the default is showed only have accuracy and you can use sorting them by to click the column. You can sorting them and also it has much more other popular performance metrics like MCC, AUC, RMSE, correlation. They all show up if you click them. They will show up here. So whatever you need, whatever measurement you want to do, you can always find here. So now I'm going to use, say I trust the validation accuracy, more than anything else for this case. So I want to do is I want to see just top model, say five models. So what here is I choose five models. Then I go here, say I want to remove all the show models. So you will see the five models over here and then you can see some model, even though the, like this 19 is green, it doesn't the finish to the halfway. So something wrong, something is not appropriate for this model. I definitely don't want to use that one, so others I can choose. Say I want to choose this 19, I want to remove that one. So I can say I want to remove the hidden one. 
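As a rough analogue of the add-in's tuning design — letting the software propose candidate hyperparameter settings and keeping the best by cross-validated accuracy — here is a hedged Python sketch using randomized search. The candidate ranges are assumptions for illustration, not the add-in's actual tuning design.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier  # pip install xgboost

# Synthetic stand-in data; in practice this is the saved term matrix.
X, y = make_classification(n_samples=462, n_features=200, n_informative=20,
                           random_state=1)

# Illustrative candidate ranges for a few common XGBoost settings.
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(2, 8),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}
search = RandomizedSearchCV(XGBClassifier(), param_dist, n_iter=20,
                            scoring="accuracy", cv=5, random_state=1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```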
So you can do whatever you need there. If you compare these metrics, they're actually not very different, so I want to rely on the graphs to help me choose the best-performing model. Once you've chosen a good one — say I like model 12 — you can ask for the profiler. This is a very powerful tool, and I think it's quite unique to JMP; not many tools have this function. It lets you look at individual predictors interactively and see how they change the result. For example, these two terms show up most frequently in the DILI cases, and you can see their slopes are quite steep, which means changing them affects the final prediction quite a bit. When hepatitis and jaundice are both zero, the probability of DILI being 1 is very low — a low chance of a possible DILI case. If you move one of them to 1, the predicted chance goes up, and if you move both, it goes higher still. So you have a way to see which predictors really drive the result. Some of the others, even though they are keywords, are pretty flat, which means changing them does not affect the result much. We also get a list of the most important features for the prediction; you can see that jaundice and a few others are quite important. And what about new data coming in? Here you can say "save prediction formula," and you can see it actively working on that; at the end of the table you will see the prediction columns. Remember, the first and second drugs were predicted to be DILI cases, and the third, fourth, and fifth were close to zero. If we go back to the DILI indicator, we find that the first five were predicted correctly. So if you don't have the indicator when new data come in, you don't have to read the whole label — you run the model, look at the prediction, and you pretty much know whether it is a DILI case or not. That ends my demo, and now I'll give the conclusion. We used Text Explorer to extract the DILI keywords and MedDRA terms using Stop Words and Phrase Management, without manual selection, deletion, and recoding; we used the visualization and created a document term matrix for prediction. For machine learning we used XGBoost modeling: we can quickly run XGBoost to find the best model, examine the prediction profiler, and save the prediction formula to predict new cases. Thank you, and I'll stop here.
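A small Python sketch of the last two steps described above — ranking term importance and scoring new, unlabeled drug labels the way a saved prediction formula would. The data and term names are synthetic stand-ins, not the actual DILI term matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier  # pip install xgboost

# Synthetic stand-in for the labeled term matrix (columns = terms).
X, y = make_classification(n_samples=462, n_features=50, n_informative=10,
                           random_state=1)
terms = [f"term_{i}" for i in range(X.shape[1])]   # hypothetical term names

model = XGBClassifier(n_estimators=300, max_depth=4).fit(X, y)

# Analogue of the term-importance list: rank terms by importance.
top = np.argsort(model.feature_importances_)[::-1][:10]
print([terms[i] for i in top])

# Analogue of "save prediction formula": score new, unlabeled labels.
X_new = X[:5]                              # pretend these are new labels
print(model.predict_proba(X_new)[:, 1])    # predicted probability of DILI = 1
```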
Michael Crotty, JMP Senior Statistical Writer, SAS Marie Gaudard, Statistical Consultant, Statistical Consultant Colleen McKendry, JMP Technical Writer, JMP   The need to model data sets involving imbalanced binary response data arises in many situations. These data sets often require different handling than those where the binary response is more balanced. Approaches include comparing different models and sampling methods using various evaluation techniques. In this talk, we introduce the Imbalanced Binary Response add-in for JMP Pro that facilitates comparing a set of modeling techniques and sampling approaches. The Imbalanced Classification add-in for JMP Pro enables you to quickly generate multiple sampling schemes and to fit a variety of models in JMP Pro. It also enables you to compare the various combinations of sampling methods and model fits on a test set using Precision-Recall ROC, and Gains curves, as well as other measures of model fit. The sampling methods range from relatively simple to complex methods, such as the synthetic minority oversampling technique (SMOTE), Tomek links, and a combination of the two. We discuss the sampling methods and demonstrate the use of the add-in during the talk.   The add-in is available here: Imbalanced Classification Add-In - JMP User Community.     Auto-generated transcript...   Speaker Transcript Michael Crotty Hello. Thank you for tuning into   our talk about the imbalanced classification add in that allows you to compare sampling techniques and models in JMP Pro.   I'm Michael Crotty. I'm one of these statistical writers in the documentation team at JMP and my co-presenter today is Colleen McKendry, also in the stat doc team. And this is work that we've collaborated on with Marie Gaudard.   So here's a quick outline of our talk today. We will look at the purpose of the add in that we created, some background on the imbalanced classification problem and how you obtain a classification model in that situation.   We'll look at some sampling methods that we've included in the add in and that are popular for the imbalanced classification problem.   We'll look at options that are available in the add in   and talk about how to obtain the add in, and then Colleen will show an example and a demo of the add in.   In the slides that are available on the Community, there's also references and an appendix that has additional background information.   So the purpose of our add in, the imbalanced classification add in, it lets you apply a variety of sampling techniques that are designed for imbalanced data.   You can compare the results of applying these techniques, along with various predictive models that are available in JMP Pro.   And you can compare those models and sampling technique fits using precision recall curves, ROC curves, and Gains curves, as well as other measures.   This allows you to choose a threshold for classification using the curves.   And you can also apply the Tomek, SMOTE and SMOTE plus Tomek sampling techniques directly to your data, which enables you to then use existing JMP platforms and   on on that newly sampled data and fine tune the modeling options, if you don't like the mostly default method options that we've chosen.   And just one note, the Tomek, SMOTE and SMOTE plus Tomek sampling techniques can be used with nominal and ordinal, as well as continuous predictor variables.   So some background on the imbalanced data problem.   
So in general, you could have a multinomial response, but we will focus on the response variable being binary, and the key point is that the number of observations at one response level is much greater than the number of observations had the other response level.   And we'll call these response levels the majority and minority class levels, respectively. So the minority level, most of the time, is the level of interest that you're interested in predicting and detecting. This could be like detecting fraud or the presence of a disease or credit risk.   And we want to predict class membership based on regression variables.   So to do that we developed a predictive model that assigns probabilities of membership into the minority class and then we choose a threshold value that optimizes   various criteria. This could be misclassification rate, true positive rate, false positive rate, you name it. And then we classify an observation, who's into the minority class, if the predicted probability of membership to the minority class exceeds the chosen threshold value.   So how do we obtain a classification model?   We have lots of different platforms in JMP that can make a prediction for a binary variable, binary outcome   when in the presence of regression variables, and we need a way to compare those models. Well, there are some traditional measures, like classification accuracy, are not all that appropriate for imbalanced data. And just as a extreme example, you could consider the case of a 2% minority class.   I could give you 98% accuracy, just by classifying all the observations as majority cases. Now this would not be a useful model and you wouldn't want to use it,   because you're not predicting...you're not correctly predicting any of your target cases to minority cases but just overall accuracy, you'd be at 98%, which sounds pretty good.   So this led people to explore other ways to measure classification accuracy in a imbalanced classification model. One of those is the precision recall curve.   They're often used with imbalanced data and they plot the positive predictive value or precision against the true positive rate recall.   And because the precision takes majority instances into account, the PR curve is more sensitive to class imbalance than an ROC curve.   As such, a PR curve is better able to highlight differences in models for the imbalanced data. So the PR curve is what shows up first in our report for our add in.   Another way to handle imbalanced classification data is to use sampling methods that help to model the minority class.   And in general, these are just different ways to impose more balance on the distribution of the response, and in turn, that helps to better delineate the boundaries between the majority and minority class observations. So in in our add in we have seven different sampling techniques.   We won't talk too much about the first four and we'll focus on the last three, but very quickly, no weighting means what it sounds like. We won't do any...won't make any changes and that's   essentially in there to provide a baseline to what you would do if you didn't do any type of sampling method to account for the imbalance.   Weighting will overweight the minority cases so that the sum of the weights of the majority class and the minority class are the same.   Random undersampling will randomly exclude majority cases to get to a balanced case and random oversampling will randomly replicate   minority cases again to get to a balanced state.   
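A quick numerical illustration of the 2% minority example above, and of why a precision-recall-style summary is more informative than raw accuracy for imbalanced data. This is a generic Python sketch, not the add-in's code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, average_precision_score

rng = np.random.default_rng(1)
y_true = (rng.random(5000) < 0.02).astype(int)   # roughly 2% minority class

# "Classify everything as majority": high accuracy, but zero recall.
y_all_majority = np.zeros_like(y_true)
print(accuracy_score(y_true, y_all_majority))    # about 0.98
print(recall_score(y_true, y_all_majority))      # 0.0 - no minority cases found

# A PR-style summary (average precision) works on scores and is far more
# sensitive to how well the minority class is ranked.
y_score = rng.random(5000)                       # a useless random scorer
print(average_precision_score(y_true, y_score))  # near the 2% baseline
```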
And then we'll talk more about the next three more advanced methods in the following slides.   So first of the advanced methods is SMOTE, which stands for synthetic minority oversampling technique.   And this is basically a more sophisticated form of oversampling, because we are adding more minority cases to our data.   We do that by generating new observations that are similar to the existing minority class observations, but we're not simply replicating them like in oversampling.   So we use the Gower distance function and perform K nearest neighbors on each minority class observation and then observations are generated to fill in the space that are defined by those neighbors.   And in this graphic, you can see if we've got this minority case here in red. We've chosen the three nearest neighbors.   And we'll randomly choose one of those. It happens to be this one down here, and then we generate a case, another minority case that is somewhere in this little shaded box. And that's in two dimensions. If you had   n dimensions of your predictors, then that shaded area would be an n dimensional space.   But one key thing to point out is that you can choose the number of nearest neighbors that you   randomly choose between, and you can also choose how many times you'll perform this   this algorithm per minority case.   The next sampling method is Tomek links. And what this method does is it tries to better define the boundary between the minority and majority classes. To do that, it removes observations from the majority class that are close to minority class observations.   Again, we use to Gower distance to find Tomek links and Tomek link is a pair of nearest neighbors that fall into different classes. So one majority and one minority that are nearest neighbors to each other.   And to reduce the overlapping of these instances, one or both members of the pair can be removed. In the main option of our add in, the evaluate models option, we remove only the majority instance. However, in the Tomek option, you can use either form of removal.   And finally, the last sampling method is SMOTE plus Tomek. This combines the previous two sampling methods.   And the way it combines them is it applies this mode algorithm to generate new minority observations and then once you've got your original data, plus a bunch of generated new minority cases,   tt applies to Tomek algorithm to find pairs of nearest neighbors that fall into different classes. And in this method both observations in the Tomek pair are removed.   So the imbalanced classification add in has four options when you install it that all show up as submenu items under the add ins menu.   The first one is the evaluate models option, that allows you to fit a variety of models using a variety of sampling techniques. The next three are just standalone dialogues to just do those three sampling techniques that we just talked about.   So in the evaluate models option of the add in, it provides an imbalanced classification report that facilitates comparison of the model and sampling technique combinations.   It shows the PR curve and ROC curves, as well as the Gains curves, and for the PR and ROC curves, it shows the area under the curve, which generally, the more area under each of those curves, the better a model is fitting.   It provides the plot of predicted probabilities by class that helps you get a picture of how each model is fitting.   
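For readers who want to experiment outside JMP, the three advanced sampling methods have direct counterparts in the Python package imbalanced-learn. Note the differences from the add-in: imbalanced-learn's SMOTE uses Euclidean distance on numeric predictors (SMOTENC handles mixed types), whereas the add-in uses Gower distance to support nominal and ordinal predictors, and TomekLinks removes only the majority member of each link by default. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

# Synthetic imbalanced data (~2% minority).
X, y = make_classification(n_samples=2000, n_features=6,
                           weights=[0.98, 0.02], random_state=1)

# SMOTE: synthesize minority cases among k nearest neighbors.
Xs, ys = SMOTE(k_neighbors=5, random_state=1).fit_resample(X, y)

# Tomek links: drop majority members of cross-class nearest-neighbor pairs.
Xt, yt = TomekLinks().fit_resample(X, y)

# SMOTE + Tomek: oversample first, then clean the class boundary.
Xc, yc = SMOTETomek(random_state=1).fit_resample(X, y)
print(len(y), len(ys), len(yt), len(yc))
```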
And it also provides a techniques and thresholds data table, and that table contains a script that allows you to reproduce the report   that is produced the first time you run the add in. And we want to emphasize that if you run this and you want to save your results without rewriting the entire   modeling and sampling methods algorithm, you can save this techniques and thresholds table and that will allow you to save your results and reproduce the report.   So now we'll look at the dialogue for the evaluating models option. It allows you to choose from a number of models and sampling techniques.   You can put in what your binary class variable is and all your X predictors, and then   we, in order to fit all the models and and   evaluate them on the on a test set, we randomly divide the data into training validation and test sets. You can provide up...you can set the proportions that will go into each of those sets.   There's a random seed option if you'd like to reproduce the results. And then there are SMOTE options   that I alluded to before, where you can choose the number of nearest neighbors, from which you select one to be the nearest neighbor used to generate a new case, and replication of each minority case is how many times you repeat the algorithm for each minority observation.   Again, there are three other sampling option   options in the add in and those correspond to Tomek, SMOTE and SMOTE plus Tomek. In the Tomek sampling option, it's going to add two columns to your data table that can be used as weights for the predict...for any predictive model that you want to do.   The first column removes only the majority nearest neighbor in the link and the other removes both members of the Tomek link, so you have that option.   SMOTE observations will add synthetic observations to your data table.   And it will also it will provide a source column so that you can identify which   observations were added. And SMOTE plus Tomek add synthetic observations and the weighting column that removes both members of the Tomek link.   And the weighting column from the Tomek sampling and SMOTE plus Tomek,   it's just an indicator column that you can use as a weight in a JMP modeling platform. It's just a 1 if it's included, and a 0 if it should be excluded.   Most of the three other sampling option dialogues look basically the same.   One option that's on them and not on the evaluate models option dialogue is show intermediate tables. This option appears for SMOTE and SMOTE plus Tomek.   And basically, it allows you to see data tables that were used in the construction of the SMOTE observations. In general, you don't need to see it, but if you want to better understand how those observations are being generated, you can take a look at those intermediate tables.   And they're all explained in the documentation.   Again, you can obtain the add in through the Community,   through this the page for this talk on the Discovery Summit Americas 2020   part of the Community. And as I mentioned just a second ago, there's documentation available within the add in. Just click the Help button.   And now it is time for Colleen to show an example in a demo of the add in. Colleen McKendry Thanks Michael. I'm going to demo the add in now, and to do the demo, I'm going to use this mammography demo data.   And so the mammography data set is based on a set of digitized film mammograms used in a study of microcalcifications in mammographic images.   And in this data, each record is classified as either a 1 or a 0. 
1 represents that there's calcification seen in the image, and a 0 represents that there is not.   In this data set, the images where you see calcification, those are the ones you're interested in predicting and so the class level one is the class level that you're interested in.   In the full data set, there are six continuous predictors and about 11,000 observations.   But in order to reduce the runtime in this demo, we're only going to use a subset of the full data set. And so it's going to be about half the observations. So about 5500 observations.   And the observations that are classified as 1, the things that you're interested in, they represent about 2.31% of the total observations, both in the full data set and in the demo data set that we're using. And so we have about a 2% minority proportion.   And now I'm going to switch over to JMP   to   So I have the mammography demo data.   And we're going to open and I've already installed the add in. So I have the imbalanced classification add in in my drop down menu and I'm going to use the evaluate models option.   And so here we have the launch window, and we're going to specify the binary class variable, your predictor variables, we're going to select all the models and all the techniques and we're going to specify   a random seed.   And click OK.   And so while this is running, I'm going to explain what's actually happening in the background. So the first thing that the add in does is that it splits the data table into a training data set and a test data set.   And so you have two separate data tables and then within the training data table those observations are further split into training and validation observations and the validation is used in the model fitting.   And so once you have those two data sets,   there are indicator variables...indicator columns that are added to the training data table for each of the sampling techniques that you specify, except for those that have involve SMOTE.   And so those columns are added and are used as weighting columns and they just specify whether the observation is to be included in the analysis or not.   If you specify a sampling technique with SMOTE, then there are additional rows that are added to the data table. Those are the generated observations.   So once your columns and your rows are generated then for each model, each model is fit to each sampling technique. And so if you select all of them   like we just did here, there are a total of 42 different models that are being fit. And so, that's all what's happening right now. In   the current demo, we have 42 models being fit and once the models are fit, then the relevant information is gathered and put together in a results report. And that report,   which will hopefully pop up soon, here it is, that report is shown here. And you also get a techniques and thresholds table and a summary table.   And so we're going to take a look at what you get when you run the add in. So first we have the training set. And you can see that here are the weighting columns, the weight columns that are added. And these are the columns that are added for the predicted probabilities for those observations.   Then we have the test set. This doesn't contain any of those weighting columns, but it does have the predicted probabilities for the test set observations.   We have the results report   and the techniques and thresholds data table. 
And so Michael mentioned this in   the talk earlier, but this is important because this is the thing that you would like to save if you want to save your results and view your results again   without having to rerun the whole script. And so this data table is what you would save and it contains scripts that will reproduce   the results report and the summary table, which is the last thing that I have to show. And so this is just contains summaries for each sampling technique and model combination and their AUC values.   So now to look at the actual results window, at the top we have dialogue specifications. And so this contains the information that you specified in the launch window.   So if you forget anything that you specified, like your random seed or what proportions you assign, you can just open that and take a look.   And we also have the binary class distribution. So, the distribution of the class variable across the training and the test set. And this is designed so that the proportion should be the same, which they are in this case at 2.3.   And then we also have missing threshold. So this isn't super important, but it just gives   an indication of if a value of the class variable has a missing prediction value, then that's shown here.   For the actual results, we have these tabbed graphs. And so we have the precision recall curves, the ROC curves, and the cummulative Gains curves. And for the PR curves and the ROC curves, we have the corresponding AUC values as well.   We also have these graphs of the predicted probabilities by class. And those are more useful when you're only viewing a few of them at a time, which we will later on.   And then we have a data filter that connects all these graphs and results together.   So for our actual results for the analysis, we can take a look now. So first I'm going to sort these.   So you can already see that the ROC curve and the PR curve, there's a lot more differentiation between the curves in the PR curve than there is in the ROC curve.   And if we select the top, say, five, these all actually have an AUC of .97.   And you can see that they're all really close to each other. They're basically on top of each other. It would be really hard to determine which model is actually better, which one you should pick   And so that's where, particularly with imbalanced data, the precision recall curves are really important. So if we switch back over, we can see that these models that had the highest AUC values for the ROC curves,   they're really spread out in the precision recall curve. And they're actually not...they don't have the highest AUC values for the PR curve.   So maybe that there...maybe there's a better model that we can pick.   So now I'm going to look and focus on the top two, which are boosted tree Tomek and SVM Tomek, and I'm going to do that using the data filter.   And then we just want to look at those are going to show and include.   So now we have the curves for just these two models and the blue curve is the boosted tree and the red curve is SVM.   And so you can see in these curves that they kind of overlap each other across different values of the true positive rate. And so you could use these curves   to choose which model you want to use in your analysis, based on maybe what an acceptable true positive rate would be. So we can see this if I add some reference lines. Excuse my hands that you will see as I type this.   Okay, so say that these are some different true positive rates that you might be interested in. 
So if, for example, for whatever data set you have, you wanted a true positive rate of .55.   You could pick your threshold to make that the true positive rate. And then in this case,   for that true positive rate, the boosted tree Tomek model has a higher precision. And so you could you could pick that model.   However, if you wanted your true positive rate to be something like .85, then the SVM model might be a better pick because it has a higher precision for that specific true positive rate.   And then if you had a higher true positive rate of .95, you would flip again and maybe you would want to pick the boosted tree model.   So that's how you can use these curves to pick which model is best for your data.   And now we're going to look at these graphs again, now that there are only a few of them. And this just shows the distribution of predicted probabilities for each class for the models that we selected. So in this particular case, you can see that in SVM there are majority   probabilities throughout the kind of the whole range of predicted probabilities, where boosted tree does kind of a better job of keeping them at the lower end.   And so that's it for this particular demo, but before we're done, I just wanted to show one more thing. And so that was an example of how you would use the evaluate   models option. But say you just wanted to use a particular sampling technique. And you can do that here. So the launch window looks much the same. And you can assign your binary class, your predictors, and click OK.   And this generates a new data table and you have your   indicator column.   Your indicator column, which just shows whether the observation should be included in the analysis or not.   And then because it was SMOTE plus Tomek you also have all these SMOTE generated observations.   So now you have this new data table and you can use any type of model with any type of options that you may want and just use this column as your weight or frequency column and go from there. And that is the end of our demo for the imbalanced classification add in. Thanks for watching.
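The threshold-picking idea demonstrated above — choose the operating point on the precision-recall curve that meets a target true positive rate — can be sketched in Python as follows. The model and data below are synthetic stand-ins, not the mammography study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.977, 0.023],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=1)
scores = GradientBoostingClassifier().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

prec, rec, thr = precision_recall_curve(y_te, scores)
target_tpr = 0.85
# thresholds align with prec[:-1] and rec[:-1]; among the points that meet
# the target recall, keep the one with the best precision.
eligible = np.where(rec[:-1] >= target_tpr)[0]
best = eligible[np.argmax(prec[eligible])]
print("threshold:", thr[best], "precision:", prec[best], "recall:", rec[best])
```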
Kamal Kannan Krishnan, Graduate Student, University of Connecticut Ayush Kumar, Graduate Student, University of Connecticut Namita Singh, Graduate Student, University of Connecticut Jimmy Joseph, Graduate Student, University of Connecticut   Today all service industries, including the telecom face a major challenge with customer churn, as customers switch to alternate providers due to various reasons such as competitors offering lower cost, combo services and marketing promotions. With the power of existing data and previous history of churned customers, if company can predict in advance the likely customers who may churn voluntarily, it can proactively take action to retain them by offering discounts, combo offers etc, as the cost of retaining an existing customer is less than acquiring a new one.  The company can also internally study any possible operational issues and upgrade their technology and service offering. Such actions will prevent the loss of revenue and will improve the ranking among the industry peers in terms of number of active customers. Analysis is done on the available dataset to identify important variables needed to predict customer churn and individual models are built. The different combination of models is ensembled, to average and eliminate the shortcomings of individual models.  The cost of misclassified prediction (for False Positive and False Negative) is estimated by putting a dollar value based on Revenue Per User information and cost of discount provided to retain the customer.     Auto-generated transcript...   Speaker Transcript Namita Hello everyone I'm Namita, and I'm here with my teammates Ayush, Jimmy and Kamal from University of Connecticut to present our analysis on predicting telecom churn using JMP. The data we have chosen is from industry that keeps us all connected, that is the telecom and internet service industry. So let's begin with a brief on the background. The US telecom industry continues to witness intense competition and low customer stickiness due multiple reasons like lower cost, combo promotional offers, and service quality. So to align to the main objective of preventing churn, telecom companies often use customer attrition analysis as their key business insights. This is due to the fact that cost of retaining an existing customer is far less than acquiring a new one. Moving on to the objective, the main goal here is to predict in advance the potential customers who may attrite. And then based on analysis of that data ,recommend customized product strategies to business. We have followed the standard SEMMA approach here. Now let's get an overview of the data set. It consists of total 7,043 rows of customers belonging to different demographics (single, with dependents, and senior) and subscribing to different product offerings like internet service, phone lines, streaming TV, streaming movies and online security. There are about 20 independent variables; out of it, 17 are categorical and three are continuous. The dependent target variable for classification is customer churn. And the churn rate for baseline model is around 26.5%. Goal is now to pre process this data and model it for future analysis. That's it from my end over to you, Ayush. Ayush Kumar Thanks, Namita. I'm Ayush. In this section, I'll be talking about the data exploration and pre processing. In data exploration, we discovered interesting relationships, for instance, variables tenure and monthly charges both were positively correlated to total charges. 
These three variables we analyzed using scatter plot matrix in JMP, which validated the relationship. Moreover, by using explore missing values functionality, we observed that total charges column had 11 missing values. The missing values were taken care of as a total charges column was excluded due to multicollinearity. After observing the histograms of the variables using exclude outlier functionality, we concluded that the data set had no outliers. The variable called Customer ID had 7,043 unique values which would not add any significance to the target variable. So customer ID was excluded. We were also able to find interesting pattern among the variables. Variables such a streaming TV and streaming movies convey the same information about the streaming behavior. These variables were grouped into a single column streaming to by using our formula in JMP. The same course of action was taken for the variables online backup and online security. We ran logistic regression and decision tree in JMP to find out the important variables. From the effects summary, it was observed that tenure, contract type, monthly charges, streaming to, multiple line service, and payment method showed significant log worth and very important variables in determining the target. The effects on ??? also helped us to narrow down a variable count to 12 statistically significant variables, which formed the basis for further modeling. We use value of ??? functionality and moved Yes of our target variable upwards. Finally, the data was split into training validation and test in 16 20 20 ratio using formula random method. Over to you now, Kamal. Kamal Krishnan Sorry, I am Kamal. I will explain more about the different models built in JMP using the data set. We in total built eight different types of model. On each type of model, we tried various input configuration and settings to improve the results of mainly sensitivity. As our target was to reduce the number of false negatives in the classification. JMP is very user friendly to redo the models by changing the configurations. It was easy to store the results whenever a new iteration of the model is done in JMP and then compare outputs in order to select the optimized model from each type. JMP allowed us to even change the cutoff values from default 0.5 to others and observed the prediction results. This slide shows the results of selected model from eight different type of models. First, as our top target variable journeys categorical we built logistic regression. Then we build decision tree, KNN, ensemble models like Bootstrap forest and boosted tree. Then we built machine learning models like neural networks. JMP allowed us to set the random seed in models like neural networks and KNN. This helped us to get the same outputs we needed. Then we built naive Bayes model. JMP allowed us to study the impact of various variables through prediction profiler. We can point and click on to change the values in the range and see how it impacts the target variable. By changing the prediction profiler in naive bayes, we observed that increase in tenure period helps in reducing the churn rate. On the contrary, increase in monthly charges increases the churn rate. Finally, we did ensembel of different combination of models to average and eliminate the shortcomings of individual models. We found that in ensembling neural network and naive bayes has higher sensitivity among ???. This ends the model description. Over to you, Jimmy. JJoseph Thank you, Kamal. 
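A hedged Python sketch of the preprocessing described above — collapsing the two streaming columns into one, dropping the collinear total charges column, and making the 60/20/20 training/validation/test split. The column names follow the familiar telecom churn layout but are assumptions here, and the four rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical slice of the churn table.
df = pd.DataFrame({
    "StreamingTV": ["Yes", "No", "Yes", "No"],
    "StreamingMovies": ["No", "No", "Yes", "Yes"],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75],
    "Churn": ["Yes", "No", "Yes", "No"],
})

# Group the two streaming behaviors into a single column.
df["Streaming"] = np.where((df["StreamingTV"] == "Yes")
                           | (df["StreamingMovies"] == "Yes"), "Yes", "No")

# Drop TotalCharges (collinear with tenure and MonthlyCharges).
df = df.drop(columns=["TotalCharges", "StreamingTV", "StreamingMovies"])

# 60/20/20 train/validation/test split.
train, rest = train_test_split(df, train_size=0.6, random_state=1)
valid, test = train_test_split(rest, train_size=0.5, random_state=1)
print(len(train), len(valid), len(test))
```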
In this section we compare the models and take a deeper dive into each model's details. The major parameters used to compare the models are the cost of misclassification in dollars, the sensitivity versus accuracy chart, the lift ratio, and the area under the curve values. The cost of misclassification data is shown in the top right corner of the slide. The costs of false positives and false negatives were determined using average monthly charges: the cost of a false negative — the model predicts no churn for a customer who actually leaves — works out to about $85, and the cost of a false positive to about $14, reflecting the 20% discount offered to retain a customer who would have stayed. The cost comparison chart clearly indicates that naive Bayes has the lowest cost. Moving on to the total accuracy chart, accuracy ranges between 74% and 81%, with not much variation across most of the models. Lift, a measure of the probability of finding a success record compared to the baseline model, varies between 1.99 and 3.11. The AUC of the ROC curve is another measure of model strength, and as the chart indicates, all the models did about equally well in this category. The sensitivity and accuracy chart measures each model's success at predicting customer churn accurately; it captures two things — how many churners the model can correctly identify, and how often its predictions are accurate. This measure was used as the major parameter for deciding the best-performing model, and naive Bayes did well here. Based on the various metrics, and considering the cost of failed predictions, naive Bayes came out as the best and most parsimonious model for predicting customer churn on this data set: it has the lowest misclassification cost, high sensitivity, and reasonably good total accuracy. Its main drawback is the lack of an underlying statistical model, but it is data driven and easily explainable. Moving on to the conclusions: the most significant variables in the data set are the contract type and the customer's tenure. From the modeling, we observed that churn is high for 1) customers without dependents, 2) customers who pay a high price for their phone services and have low satisfaction with high-end services, and 3) customers who stick to a single original line of service and can easily switch to competitors. Based on those findings, the recommendations are 1) targeted customer promotions focused on income generation, 2) pushing long-term contracts with additional incentives, and 3) building product combos focused on customer needs. In conclusion, we used JMP to do the analysis and build predictive models on a limited data set; it is very effective and powerful for this kind of analysis. Please reach out to us if you have any further questions. Thank you.
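A small sketch of how the dollar cost of misclassification can be computed from a confusion matrix using the per-error costs quoted above. The labels and predictions below are made up for illustration; the $85 and $14 figures are the ones cited in the talk.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical test-set labels and predictions (1 = churn).
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

cost_fn = 85.0   # missed churner: roughly one month of average revenue lost
cost_fp = 14.0   # unneeded retention offer: about a 20% discount
total_cost = fn * cost_fn + fp * cost_fp
print(tn, fp, fn, tp, total_cost)
```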
Steve Hampton, Process Control Manager, PCC Structurals Jordan Hiller, JMP Senior Systems Engineer, JMP   Many manufacturing processes produce streams of sensor data that reflect the health of the process. In our business case, thermocouple curves are key process variables in a manufacturing plant. The process produces a series of sensor measurements over time, forming a functional curve for each manufacturing run. These curves have complex shapes, and blunt univariate summary statistics do not capture key shifts in the process. Traditional SPC methods can only use point measures, missing much of the richness and nuance present in the sensor streams. Forcing functional sensor streams into traditional SPC methods leaves valuable data on the table, reducing the business value of collecting this data in the first place. This discrepancy was the motivator for us to explore new techniques for SPC with sensor stream data. In this presentation, we discuss two tools in JMP — the Functional Data Explorer and the Model Driven Multivariate Control Chart — and how together they can be used to apply SPC methods to the complex functional curves that are produced by sensors over time. Using the business case data, we explore different approaches and suggest best practices, areas for future work and software development.     Auto-generated transcript...   Speaker Transcript Jordan Hiller Hi everybody. I'm Jordan Hiller, senior systems engineer at JMP, and I'm presenting with Steve Hampton, process control manager at PCC Structurals. Today we're talking about statistical process control for process variables that have a functional form.   And that's a nice picture right there on the title   slide. We're talking about statistical process control, when it's not a single number, a point measure, but instead, the thing that we're trying to control has the shape of a functional curve.   Steve's going to talk through the business case, why we're interested in that in a few minutes. I'm just going to say a few words about methodology.   We reviewed the literature in this area for the last 20 years or so. There are many, many papers on this topic. However, there doesn't really appear to be a clear consensus about the best way to approach this statistical   process control   when your variables take the form of a curve. So we were inspired by some recent developments in JMP, specifically the model driven multivariate control chart introduced in JMP 15 and the functional data explorer introduced in JMP 14.   Multivariate control charts are not really a new technique they've been around for a long time. They just got a facelift in JMP recently.   And they use either principal components or partial least squares to reduce data, to model and reduce many, many process variables so that you can look at them with a single chart. We're going to focus on the on the PCA case, we're not really going to talk about partial   the   partial least squares here.   Functional Data Explorer is the method we use in JMP in order to work with data in the shape of a curve, functional   data. And it uses a form of principal components analysis, an extension of principal components analysis for functional data.   So it was a very natural kind of idea to say what if we take our functional curves, reduce and model that using the functional data explorer.   
The result of that is functional principal components and just as you you would add regular principal components and push that through a model driven multivariate control chart,   what if we could do that with a functional principal components? Would that be feasible and would that be useful?   So with that, I'll turn things over to Steve and he will introduce the business case that we're going to discuss today. 1253****529 All right. Thank you very much. Jordan.   Since I do not have video, I decided to let you guys know what I look like.   There's me with my wife Megan and my son Ethan   with last year's pumpkin patch. So I wanted to step into the case study with a little background on   what I do, and so you have an idea of where this information is coming from. I work in investment casting for precision casting...   Investment Casting Division.   Investment casting involves making a wax replicate of what you want to sell, putting it into a pattern assembly,   dipping it multiple times in proprietary concrete until you get enough strength to be able to dewax that mold.   And we fire it to have enough strength to be able to pour metal into it. Then we knock off our concrete, we take off the excessive metal use for the casting process. We do our non destructive testing and we ship the part.   The drive for looking at improved process control methods is the fact that   Steps 7, 8, and 9 take up 75% of the standing costs because of process variability in Steps 1-6. So if we can tighten up 1-6,   most of ??? and cost go there, which is much cheaper, much shorter, then there is a large value add for the company and for our customers in making 7, 8, and 9 much smaller.   So PCC Structurals. My plant, Titanium Plant, makes mostly aerospace components. On the left there you can see a fan ??? that is glowing green from some ??? developer.   And then we have our land based products, which right there's a N155 howitzer stabilizer leg.   And just to kind of get an idea where it goes. Because every single airplane up in the sky basically has a part we make or multiple parts, this is an engine sections ???, it's about six feet in diameter, it's a one piece casting   that goes into the very front of the core of a gas turbine engine. This one in particular is for the Trent XWB that powers the Airbus A350   jets.   So let's get into JMP. So the big driver here is, as you can imagine, with something that is a complex as an investment casting process for a large part, there is tons of   data coming our way. And more and more, it's becoming functional as we increase the number of centers, we have and we increase the number of machines that we use. So in this case study, we are looking at   data that comes with a timestamp. We have 145 batches. We have our variable interest which is X1.   We have our counter, which is a way that I've normalized that timestamp, so it's easier to overlay the run in Graph Builder and also it has a little bit of added   niceness in the FTP platform. We have our period, which allows us to have that historic period and a current period that lines up with the model driven multivariate control chart platform,   so that we can have our FDE   only be looking at the historic so it's not changing as we add more current data. So this is kind of looking at this if you were in using this in practice, and then the test type is my own validation   attempts. And what you'll see here is I've mainly gone in and tagged thing as bad, marginal or good. 
So red is bad, marginal is purple, and green is good and you can see how they overlay.   Off the bat, you can see that we have some curvey   ??? curves from mean. These are obviously what we will call out of control or bad.   This would be what manufacturing called a disaster because, like, that would be discrepant product. So we want to be able to identify those   earlier, so that we can go look at what's going on the process and fix it. This is what it looks like   breaking out so you can see that the bad has some major deviation, sometimes of mean curve and a lot of character towards the end.   The marginal ones are not quite as deviant from the mean curves but have more bouncing towards the tail and then good one is pretty tight. You can see there's still some bouncing. So this is where the   the marginal and the good is really based upon my judgment, and I would probably fail an attribute Gage R&R based on just visually looking at this. So   we have a total of 33 bad curves, 45 marginal and 67. And manually, you can just see about 10 of them are out. So you would have an option if you didn't want to use a point estimate, which I'll show a little bit later that doesn't work that great, of maybe making...   control them by points using the counter. And how you do that would be to split the bad table by counter, put it into an individual moving range control chart through control chart building and then you would get out,   like 3500 control charts in this case, which you can use the awesome ability to make combined data tables to turn that that list summary from each one into its own data table that you can then link back to your main data table and you get a pretty cool looking   analysis that looks like this, where you have control limits based upon the counters and historic data and you can overlay your curves. So if you had an algorithm that would tag whenever it went outside the control limits, you know, that would be an option of trying to   have a control....   a control chart functionality with functional data. But you can see, especially I highlighted 38 here, that you can have some major deviation and stay within the control limits. So that's where this FDE   platform really can shine, in that it can identify an FPC that corresponds with some of these major deviations. And so we can tag the curves based upon those at FPCs.   And we'll see that little later on. So,   using the FDE platform, it's really straightforward. Here for this demonstration, we're going to focus on a step function with 100 knots.   And you can see how the FPCs capture the variability. So the main FPC is saying, you know, beginning of the curve, there's...that's what's driving the most variability, this deviation from the mean.   And setup is X1 and their output, counters. Our input, batch number and then I added test type. So we can use that as some of our validation in FPC table and the model driven multivariate control chart and the period so that only our historic is what's driving the FDE fit.   And so   just looking at the fit is actually a pretty important part of making sure you get correct   control charting later on, is I'm using this P Step   Function 100 knots model. You can see, actually, if I use a B spline and so with Cubic 20 knots, it actually looks pretty close to my P spline.   
But from the BIC you can actually see that I should be going to more knots. So if I do that, now we start to see it overfitting, really focusing on the isolated peaks, and it will cause you to have an FDE model that doesn't look right and causes you to not be as sensitive in your model driven multivariate control chart.
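For readers who want to try the per-counter control chart workaround Steve describes, a minimal JSL sketch is below. The column names (X1, Counter, Period) follow the case study; the table handle, the period values, and the output table name are assumptions for illustration only, not the presenter's actual script.

// Hypothetical sketch of the per-counter IR-chart workaround described above.
dt = Current Data Table();

// Keep only the historic rows for setting limits (period values assumed to be "Historic"/"Current")
dt << Select Where( :Period == "Historic" );
hist = dt << Subset( Output Table( "Historic Curves" ), Selected Rows( 1 ) );

// One chart per counter position via the By role; with a single continuous Y,
// Control Chart Builder defaults to an Individual & Moving Range chart.
hist << Control Chart Builder(
	Variables( Y( :X1 ) ),
	By( :Counter )
);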
Monday, October 12, 2020
Jordan Hiller, JMP Senior Systems Engineer, JMP Mia Stephens, JMP Principal Product Manager, JMP   For most data analysis tasks, a lot of time is spent up front — importing data and preparing it for analysis. Because we often work with data sets that are regularly updated, automating our work using scripted repeatable workflows can be a real time saver. There are three general sections in an automation script: data import, data curation, and analysis/reporting. While the tasks in the first and third sections are relatively straightforward — point-and-click to achieve the desired result and capture the resulting script — data curation can be more challenging for those just starting out with scripting. In this talk we review common data preparation activities, discuss the JSL code necessary to automate the process, and provide advice for generating JSL code for data curation via point-and-click.     The Data Cleaning Script Assistant Add-in discussed in this talk can be found in the JMP File Exchange.     Auto-generated transcript...   Speaker Transcript mistep Welcome to JMP Discovery Summit. I'm Mia Stephens and I'm a JMP product manager, and I'm here with Jordan Hiller, who is a JMP systems engineer. And today we're going to talk about automating the data curation workflow. And we're going to split our talk into two parts. I'm going to kick us off and set the stage by talking about the analytic workflow and where data curation fits into this workflow. And then I'm going to turn it over to Jordan for the meat, the heart of this talk. We're going to talk about the need for reproducible data curation. We're going to see how to do this in JMP 15. And then you're going to get a sneak peek at some new functionality in JMP 16 for recording data curation steps and the actions that you take to prepare your data for analysis. So let's think about the analytic workflow. And here's one popular workflow. And of course, it all starts with defining what your business problem is, understanding the problem that you're trying to solve. Then you need to compile data. And of course, you can compile data from a number of different sources and pull these data into JMP. And at the end, we need to be able to share results and communicate our findings with others. Probably the most time-consuming part of this process is preparing our data for analysis, or curating our data. So what exactly is data curation? Well, data curation is all about ensuring that our data are useful in driving analytic discoveries. Fundamentally, we want to be able to solve a problem with the data that we have. This is largely about data organization, data structure, and cleaning up data quality issues. If you think about common problems with data, they generally fall within four buckets. We might have incorrect formatting, incomplete data, missing data, or dirty or messy data. And to talk about these types of issues and to illustrate how we identify them within our data, we're going to borrow from our course, STIPS. And if you're not familiar with STIPS, STIPS is our free online course, Statistical Thinking for Industrial Problem Solving, and it's set up in seven discrete modules. Module 2 is all about exploratory data analysis. And because of the interactive and iterative nature of exploratory data analysis and data curation, the last lesson in this module is data preparation for analysis. And this is all about identifying quality issues within your data and steps you might take to curate your data.
So let's talk a little bit more about the common issues. Incorrect formatting: what do we mean by incorrect formatting? Well, this is when your data are in the wrong form or the wrong format for analysis. This can apply to your data table as a whole. So, for example, you might have your data in separate columns, but for analysis you need your data stacked in one column. This can apply to individual variables. You might have the wrong modeling type or data type, or you might have data on dates or times that's not formatted that way in JMP. It can also be cosmetic. You might choose to move response variables to the beginning of the data table, rename your variables, or group factors together to make it easier to find them in the data table. Incomplete data is about having a lack of data. And this can be on important variables, so you might not be capturing data on variables that can ultimately help you solve your problem, or on combinations of variables. Or it could mean that you simply don't have enough observations; you don't have enough data in your data table. Missing data is when values for variables are not available. And this can take on a variety of different forms. And then finally, dirty or messy data is when you have issues with observations or variables. So your data might be incorrect; the values are simply wrong. You might have inconsistencies in terms of how people were recording data or entering data into the system. Your data might be inaccurate, you might not have a capable measurement system, there might be errors or typos. The data might be obsolete: you might have collected the information on a facility or machine that is no longer in service. It might be outdated: the process might have changed so much since you collected the data that the data are no longer useful. The data might be censored or truncated. You might have columns that are redundant to one another (they have the same basic information content) or rows that are duplicated. So dirty and messy data can take on a lot of different forms. So how do you identify potential issues? Well, when you take a look at your data, you start to identify issues. And in fact, this process is iterative, and when you start to explore your data graphically and numerically, you start to see things that might be issues that you might want to fix or resolve. So a nice starting point is to start by just scanning the data table. When you scan your data table, you can often see some obvious issues. And for this example, we're going to use some data from the STIPS course called Components, and the scenario is that a company manufactures small components and they're trying to improve yield. And they've collected data on 369 batches of parts with 15 columns. So when we take a look at the data, we can see some pretty obvious issues right off the bat. If we look at the top of the data table at these nice little graphs, we can see the shapes of distributions. We can see the values. So, for example, for batch number you see a histogram. And batch number is something you would think of as being an identifier, rather than something that's continuous. So this can tell us that the data are coded incorrectly. When we look at number scrapped, we can see the shape of the distribution. We can also see that there's a negative value there, which might not be possible. We see a histogram for process with two values, and this can tell us that we need to change the modeling type for process from continuous to nominal.
You can see more when you take a look at the column panel. So, for example, batch number and part number are both coded as continuous. These are probably nominal. And if you look at the data itself, you can see other issues. So, for example, humidity is something we would think of as being continuous, but you see a couple of observations that have the value N/A. And because JMP sees text, the column is coded as nominal, so this is something that you might want to fix. We can see some issues with supplier: there are a couple of missing values and some typographical errors. And notice temperature; all of the dots indicate that we're missing values for temperature in these rows. So this is an issue that we might want to investigate further. So you can identify a lot of issues just by scanning the data table, and you can identify even more potential issues when you visualize the data one variable at a time. A really nice starting point, and I really like this tool, is the column viewer. The column viewer gives you numeric summaries for all of the variables that you've selected. So for example, here I'm missing some values. And you can see for temperature that we're missing 265 of the 369 values. So this is potentially a problem if we think that temperature is an important factor. We can also see potential issues with values that are recorded in the data table. So, for example, scrap rate and number scrapped both have negative values. And if this isn't physically possible, this is something that we might want to investigate back in the system that we collected the data in. Looking at some of the calculated statistics, we can also see other issues. So, for example, batch number and part number really should be categorical. It doesn't make sense to have the average batch number or the average part number. So this tells you you should probably go back to the data table and change your modeling type. Distributions tell us a lot about our data and potential issues. We can see the shapes of distributions, the centering, the spread. We can also see typos. Customer number here: the particular problem is that there are four or five major customers and some smaller customers. If you're going to use customer number in an analysis, you might want to use recode to group some of those smaller customers together into maybe an "other" category. We have a bar chart for humidity, and this is because we have that N/A value in the column. We might not have seen that when we scanned the data table, but we can see it pretty clearly here when we look at the distribution. We can clearly see the typographical errors for supplier. And when we look at continuous variables, again, you can look at the shape, centering, and spread, but you can also see some unusual observations within these variables. So, after looking at the data one variable at a time, a natural progression is to explore the data two or more variables at a time. So for example, if we look at scrap rate versus number scrapped in the Graph Builder, we see an interesting pattern. We see these bands, and it could be that there's something in our data table that helps us to explain why we're seeing this pattern. In fact, if we color by batch size, it makes sense to us. So where we have batches with 5,000 parts, there's more of an opportunity for scrap parts than for batches of only 200. We can also see that there's some strange observations at the bottom.
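The colored scatter view of scrap rate versus number scrapped that Mia describes here can also be launched from a script. This is a rough JSL equivalent written for illustration; the column names (Number Scrapped, Scrap Rate, Batch Size) are assumptions taken from the transcript rather than the actual table.

// A rough JSL version of the Graph Builder view described above
Graph Builder(
	Variables(
		X( :Number Scrapped ),
		Y( :Scrap Rate ),
		Color( :Batch Size )   // coloring by batch size explains the banding
	),
	Elements( Points( X, Y, Legend( 1 ) ) )
);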
In fact, these are the observations that had negative values for the number scrapped, and these really stand out here in this graph. And when you add a column switcher or data filter, you can add some additional dimensionality to these graphs. So I can look at pressure, for example, instead of... Well, I can look at pressure or switch to dwell. What I'm looking for here is a sense of the general relationship between these variables and the response. And I can see that pressure looks like it has a positive relationship with scrap rate. And if I switch to dwell, I can see there's probably not much of a relationship between dwell and scrap rate, or temperature either. So these variables might not be as informative in solving the problem. But look at speed; speed has a negative relationship. And I've also got some unusual observations at the top that I might want to investigate. So you can learn a lot about your data just by looking at it. And of course, there are more advanced tools for exploring outliers and missing values that are really beyond the scope of this discussion. And as you get into the analyze phase, when you start analyzing your data or building models, you'll learn much more about potential issues that you have to deal with. And the key is that as you are taking a look at your data and identifying these issues, you want to make notes of them. Some of them can be resolved as you're going along, so you might be able to reshape and clean your data as you proceed through the process. But you really want to make sure that you capture the steps that you take, so that you can repeat them later if you have to repeat the analysis, or if you want to repeat the analysis on new data or other data. And at this point is where I'm going to turn it over to Jordan to talk about reproducible data curation and what this is all about. Jordan Hiller Alright, thanks, Mia. That was great. And we learned what you do in JMP to accomplish data curation by point and click. Let's talk now about making that reproducible. The reason we worry about reproducibility is that your data sets get updated regularly with new data. If this was a one-time activity, we wouldn't worry too much about the point and click. But when data gets updated over and over, it is too labor-intensive to repeat the data curation by point and click each time. So it's more efficient to generate a script that performs all of your data curation steps, and you can execute that script with one click of a button and do the whole thing at once. So in addition to efficiency, it documents your process. It serves as a record of what you did, so you can refer to it later and remind yourself what you did; and for people who come after you and are responsible for this process, it's a record for them as well. For the rest of this presentation, my goal is to show you how to generate a data curation script with point and click only. We're hoping that you don't need to do any programming in order to get this done. That program code is going to be extracted and saved for you, and we'll talk a little bit about how that happens. So there are two different sections: what you can do now in JMP 15 to obtain a data curation script, and what you'll be doing once we release JMP 16 next year. In JMP 15 there are some data curation tasks that generate their own reusable JSL scripting code. You just execute your point and click, and then there's a technique to grab the code. I'm going to demonstrate that.
So tools like recode, generating a new formula column with a calculation, and reshaping data tables; the reshaping tools are in the Tables menu: there's Stack, Split, Join, Concatenate, and Update. All of these tools in JMP 15 generate their own script after you execute them by point and click. There are other common tasks that do not generate their own JSL script, and to make it easier to accomplish these tasks and make them reproducible, I built the Data Cleaning Script Assistant add-in. It helps with the following tasks, mostly column stuff: changing the data types of columns, the modeling types, changing the display format, renaming, reordering, and deleting columns from your data table, and also setting column properties such as spec limits or value labels. So the Data Cleaning Script Assistant is what you'll use to assist you with those tasks in JMP 15. We are also going to give you a sneak preview of JMP 16, and we're very excited about new features in the log in JMP 16; I think it's going to be called the enhanced log mode. The basic idea is that in JMP 16 you can just point and click your way through your data curation steps as usual. The JSL code that you need is generated and logged automatically. All you need to do is grab it and save it off. So super simple and really useful; excited to show that to you. Here's a cheat sheet for your reference. In JMP 15 these are the tasks on the left, common data curation tasks; it's not an exhaustive list. And the middle column shows how you accomplish them by point and click in JMP. The method for extracting the reusable script is listed on the right. I'm not going to cover everything in here, but this is for your reference later. Let's get into a demo, and I'll show how to address some of those issues that Mia identified with the components data table. I'm going to start in JMP 15. And the first thing that we're going to talk about are some of those column problems: changing the data types, the modeling types, that kind of thing. Now, if you were just concerned with point and click in JMP, what you would ordinarily do for, let's say, humidity (this is the column, you'll remember, that has some text in it and is coming in mistakenly as a character column) is to fix it by right clicking, getting into the column info, and addressing those changes there. This is one of those JMP tasks that doesn't leave behind usable script in JMP 15. So for this, we're going to use the Data Cleaning Script Assistant instead. So here we go. It's in the add-ins menu because I've installed it; you can install it too. Data Cleaning Script Assistant; the tool that we need for this is Victor the Cleaner. This is a graphical user interface for making changes to columns, so we can address data types and modeling types here. We can rename columns, we can change the order of columns and delete columns, and then save off the script. So let's make some changes here. For humidity, that's the one with the N/A values that caused it to come in as text. We're going to change it from a character variable to a numeric variable, and we're going to change it from nominal to continuous. We also identified that batch number needs to get changed to nominal; part number as well needs to get changed to nominal; and the process, which is a number right now, should also be nominal. The facility column has only one value, fab tech, so that's not useful for me. Let's delete the facility column. I'm going to select it here by clicking on its name and click Delete.
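For readers who prefer to see the end result in code, here is a hand-written JSL sketch of the column fixes Victor performs in this step. It is not the exact script the add-in saves; the column names come from the demo, and the syntax is standard column messages.

// Hand-written equivalent of the Victor column cleanup described above
dt = Current Data Table();

// Humidity came in as character because of "N/A" text; make it numeric continuous
Column( dt, "Humidity" ) << Data Type( Numeric ) << Set Modeling Type( "Continuous" );

// Identifiers and the two-level process variable should be nominal
Column( dt, "Batch Number" ) << Set Modeling Type( "Nominal" );
Column( dt, "Part Number" ) << Set Modeling Type( "Nominal" );
Column( dt, "Process" ) << Set Modeling Type( "Nominal" );

// Facility has a single value, so it carries no information
dt << Delete Columns( "Facility" );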
Here are a couple of those cosmetic changes that Mia mentioned. Scrap rate is at the end of my table; I want to move it earlier. I'm going to move it to the fourth position, after customer number. So we select it and use the arrows to move it up in the order to directly after customer number. The last change that I'm going to make is to take the pressure variable and rename it. The engineers in my organization call this column PSI, so that's the name I want to give that column. Alright, so that's all the changes that I want to make here. I have some choices to make. I get to decide whether the script gets saved to the data table itself (that would make a little script section over here in the upper left panel) or to its own window; let's save it to a script window. You can also choose whether or not the cleaning actions you specified are executed when you click OK. Let's keep the execution and click OK. So now you'll see all those changes are made. Things have been rearranged, column properties have changed, etc. And we have a script. We have a script to accomplish that. It's in its own window, and this little program will be the basis; we're going to build our data curation script around it. Let's save this. I'm going to save it to my desktop, and I'm going to call it v15 curation script. Changing modeling types, changing data types, renaming things, reordering things: these all came from Victor. I'm going to document this in my code. It's a good idea to leave little comments in your code so that you can read it later. I'm going to leave a note that says this is from the Victor tool, let's say, from DCSA, for Data Cleaning Script Assistant, Victor. So that's a comment. The two slashes make a comment line in your program; that means the program interpreter won't try to execute it as program code. It's recognized as just a little note, and you can see it in green up there. It's a good idea to leave yourself little comments in your script. All right, let's move on. The next curation task that I'm going to address is this supplier column. Mia told us how there were some problems in here that need to be addressed. We'll use the recode tool for this. Recode is one of the tools in JMP 15 that leaves behind its own script; you just have to know where to get it. So let's do our recode and grab the script: right click, recode. And we're going to fix these data values. I'm going to start from the red triangle. Let's start by converting all of that text to title case; that cleaned up this lowercase hersch value down here. Let's also trim extra white space, extra space characters. That cleaned up the leading space in this Anderson. Okay. And so all the changes that you make in the recode tool are recorded in this list, and you can cycle through and undo them and redo them and cycle through that history if you like. All right, I have just a few more changes to make. I'll make them manually. Let's group together the Hersches, group together the Coxes, group together all the Andersons. Trutna and Worley are already correct. The last thing I'm going to do is address these missing values; we'll assign them to their own category of Missing. That is my recode process. I'm done with what I need to do. If I were just pointing and clicking, I would go ahead and click Recode and I'd be done. But remember, I need to get this script. So to do that, I'm going to go to the red triangle.
Down to the script section, and let's save this script to a script window. Here it is, saved to its own script window, and I'm just going to paste that section to the bottom of my curation script in progress. So let's see. I'm just going to grab everything from here; I don't even really have to look at it, right? I don't have to be a programmer. Control C, and just paste it at the bottom. And let's leave ourselves a note that this is from the recode red triangle. Alright, and I can close this window; I no longer need it. And save these updates to my curation script. So that was recode and the way that you get the code for it. All right, then the next task that we're going to address is calculating a yield. Oh, I'm sorry; what I'm going to do first is actually execute that recode. Now that I've saved the script, let's execute the recode. And there it is, the recoded supplier column. Perfect. All right, let's calculate a yield column. This is a little bit redundant, I realize, since we already have the scrap rate, but for purposes of discussion let me show you how you would calculate a new column and extract its script. This is another place in JMP 15 where you can easily get the script if you know where to look. So, making our yield column: new column, double click up here, rename it from Column 16 to Yield, and let's assign it a formula. To calculate the yield, I need to find how many good units I have in each batch, so that's going to be the batch size minus the number scrapped. That's the number of good units I have in every batch. I'm going to divide that by the total batch size, and here is my yield column. Yes, you can see that yield here is .926 and scrap rate is .074, 1 minus yield. So good, the calculation is correct. Now that I've created that yield column, let's grab its script. And here's the trick: right click, Copy Columns. Back in my curation script I'll leave a note that this came from right click, Copy Columns, and paste. And there it is: add a new column to the data table, it's called Yield, and here's its formula. Now, I said you don't need to know any programming; I guess here's a very small exception. You've probably noticed that there are semicolons at the end of every step in JSL. That separates different JSL expressions, and if you add something new to the bottom of your script, you're going to want to make sure that there's a semicolon in between. So I'm just typing a semicolon. The Copy Columns function did not add the semicolon, so I have to add it manually. All right, good. So that's our yield column. The next thing I'd like to address is this: my processes are labeled 1 and 2. That's not very friendly; I want to give them more descriptive labels. We're going to call Process Number 1 production and Process Number 2 experimental. We'll do that with value labels. Value labels are an example of column properties. There's an entire list of different column properties that you can add to a column. These are things like the units of measurement; if you want to change the order of display in a graph, you can use value ordering; if you want to add control limits or spec limits or a historical sigma for your quality analysis, you can do that here as well. Alright. So all of these are column properties, metadata that we add to the columns. And we're going to need to use the Data Cleaning Script Assistant to access the JSL script for adding these column properties. So here's how we do it. First, we add the column properties, as usual, by point and click. I'm going to add my value labels.
Process Number 1 we're going to call production; add. Process Number 2 we're going to call experimental. And by adding that value label column property, I now get nice labels in my data table: instead of seeing Process 1 and Process 2, I see production and experimental. Then, to get the script, we go to the add-ins menu, Data Cleaning Script Assistant, and choose the property copier. A little message has popped up saying that the column property script has been copied to the clipboard, and then we'll go back to our script in progress, leave a note that this is from the DCSA property copier, and then paste; Control V to paste. There is the script that we need to assign those two value labels. It's done. Very good. Okay, I have one more data curation step to go through, something else that we'll need the Data Cleaning Script Assistant for. We want to consider only, let's say, the rows in this data table where vacuum is off. Right, so there are 313 of those rows, and I just want to get rid of the rows in this data table where vacuum is on. So the way you do it by point and click is selecting those, much as I did right now, and then running the table subset command. In order to get usable code, we're going to have to use the Data Cleaning Script Assistant once again. So here's how to subset this data table to only the rows where vacuum is off. First, under the Rows menu, under the row selection submenu, we'll use this Select Where command in order to get some reusable script for the selection. We're going to select the rows where vacuum is off. And before clicking OK to execute that selection, again I will go to the red triangle, save script to the script window, Control A, Control C to copy that, and let's paste that at the bottom again with a note: from Rows, Select Where. Control V. So there's the JSL code that selects the rows where vacuum is off. Now I need, one more time, to use the Data Cleaning Script Assistant to get the selected rows. Oh, let us first actually execute the selection. There it is. Now with the rows selected, we'll go up again to add-ins, Data Cleaning Script Assistant, subset selected rows. I'm being prompted to name my new data table that has the subset of the data. Let's call it vacuum off; that's my new data table name. Click OK, another message that the subset script has been copied to the clipboard, and so we paste it to the bottom. There it is. And this is now our complete data curation script to use in JMP 15, so let's just run through what it's like to use it in practice. I'm going to close the data table that we've been working on, the one we've been doing our curation on. Let's close it and revert back to the messy state. Make sure I'm in the right version of JMP. All right. Yes, here it is, the messy data. And let's say some new rows have come in, because it's a production environment and new data is coming in all the time. I need to replay my data curation workflow: run script. It performed all of those operations. Note the value labels. Note that humidity is continuous. Note that we've subset to only the rows where vacuum is off. The entire workflow is now reproducible with a JSL script. So that's what you need to keep in mind for JMP 15: some tools you can extract the JSL script from directly; for others, you'll use my add-in, the Data Cleaning Script Assistant. And now we're going to show you just how much fun and how easy this is in JMP 16. I'm not going to work through the entire workflow over again, because it would be somewhat redundant, but let's just go through some of what we went through.
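To make the assembled result concrete, here is a condensed, hand-written sketch of what a v15-style curation script might look like once the pieces above are pasted together. It is an illustration only: the Match() recode is a manual stand-in for the script the Recode red triangle saves, and the column names and values (Supplier spellings, Vacuum == "Off") are assumptions based on the demo.

// From DCSA Victor: column type fixes (see the earlier sketch)
dt = Current Data Table();

// From the Recode red triangle (hand-written equivalent): clean up Supplier
For Each Row(
	:Supplier = Match( Trim( :Supplier ),
		"hersch", "Hersch",
		"anderson", "Anderson",
		"", "Missing",
		Trim( :Supplier )   // leave already-clean values alone
	)
);

// From right click, Copy Columns: new formula column, yield = good units / batch size
dt << New Column( "Yield", Numeric, "Continuous",
	Formula( (:Batch Size - :Number Scrapped) / :Batch Size )
);

// From the DCSA property copier: value labels for Process
Column( dt, "Process" ) << Set Property( "Value Labels", {1 = "Production", 2 = "Experimental"} );

// From Rows, Select Where, plus DCSA subset selected rows: keep rows where vacuum is off
dt << Select Where( :Vacuum == "Off" );
dt << Subset( Output Table( "Vacuum Off" ), Selected Rows( 1 ) );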
Here we are in JMP 16, and I'm going to open the log. The log looks different in JMP 16, and you're going to see some of those differences presently. Let's open the messy components data. Here it is. And you'll notice in the log that it has a section that says I've opened the messy data table, and down here is the JSL script that accomplishes what we just did. So this is like a running log that automatically captures all of the script that you need. It's not complete yet; there are new features still being added to it, and I assume that will be ongoing. But already this enhanced log feature is very, very useful, and it covers most of your data curation activities. I should also mention that, right now, what I'm showing you is the early adopter version of JMP, early adopter version 5, so when we fully release the production version of JMP 16 early next year, it's probably going to look a little bit different from what you're seeing right now. Alright, so let's continue and go through some of those data curation steps again. I won't go through the whole workflow, because it would be redundant; let's just do some of them. I'll go through some of the things we used to need Victor for. In JMP 16 we will not need the Data Cleaning Script Assistant. We just do our point and click as usual. So, humidity: we're going to change it from character to numeric and from nominal to continuous and click OK. Here's what that looks like in the structured log: it has captured that JSL. All right, back to the data table. We are going to change the modeling type of batch number and part number and process from continuous to nominal. That's done; that has also been captured in the log. We're going to delete the facility column, which has only one value: right click, Delete Columns. That's gone. And we rename pressure to PSI. OK, so those were all of the things that we did in Victor in JMP 15. Here in JMP 16, all of those are leaving behind JMP script that we can just copy and reuse down here. Beautiful. All right, just one more step I will show you: the subset to where vacuum is off. Much, much simpler here in JMP 16. All we need to do is select all the off vacuums; I don't even need to use the Rows menu, I can just right click one of those offs and select matching cells, and that selects the 313 rows where vacuum is off. And then, as usual, to subset to only the selected rows: Tables, Subset, and we're going to create a new table called vacuum off that has only our selected rows and keeps all the columns. Here we go. That's it. We just performed all of those data curation steps. Here's what it looks like in the structured log. And now, to make this a reusable, reproducible data curation script, all that we need to do is come up to the red triangle and save the script to a script window. I'm going to save this to my desktop as a v16 curation script. And here it is, the whole script. So let's close all the data in JMP 16 and just show you what it's like to rerun that script. Here I am back in the home window for JMP 16. Here's my curation script. You'll notice that the first line is that open command, so I don't even need to open the data table; it's going to happen in line right here. When there's new data that comes in and this file has been updated, all that I need to do to do my data curation steps is run the script. And there it is.
All the curation steps and the subset to the 313 rows. So that is using the enhanced log in JMP 16 to capture all your data curation work and change it into a reproducible script. Alright, here's that JMP 15 cheat sheet to remind you once again: this is what you need to know in order to extract the reusable code when you're in JMP 15 right now, and you won't have to worry about this so much once we release JMP 16 in early 2021. So to conclude: Mia showed you how you achieve data curation in JMP. It's an exploratory and iterative process where you identify problems and fix them by point and click. When your data gets updated regularly with new data, you need to automate that workflow in order to save time, and also to document your process and leave yourself a trail of breadcrumbs for when you come back later and look at what you did. The process of automation is translating your point and click activities into a reusable JSL script. We discussed how in JMP 15 you're going to use a combination of built-in tools and tools from the Data Cleaning Script Assistant to achieve these ends. And we also gave you a sneak preview of JMP 16 and how you can use the enhanced log to automatically, passively capture your point and click data curation activities and leave behind a beautiful, reusable, reproducible data curation script. All right, that is our presentation. Thanks very much for your time.
Sports analytics tools are becoming more frequently used to help athletes enhance their skills and body strength, perform better, and prevent injury. The ACL tear is one of the most common and dangerous injuries in basketball. This injury occurs most frequently in jumping, landing, and pivoting, due to the rapid change of direction and/or sudden deceleration involved. Recovering from an ACL injury is a brutal process that can take months – even years – and the injury significantly decreases the player's performance after recovery. The goal of this project is to find the relationship between fatigue and different angle measurements in the hips, knees, and back, as well as the force applied to the ground, in order to minimize ACL injury risk. Seven different sensors were attached to a test subject while he performed the countermovement jump for 10 trials on each leg, before and after 2 hours of vigorous exercise. The countermovement jump was chosen for its ability to assess ACL injury risk quite well through force and flexion of different body parts. Several statistical tools, such as Control Chart Builder, multivariate correlation, and variable clustering, were utilized to discover general insights into the differences between the before- and after-fatigue states for each exercise (which relate to an increased ACL injury risk). The JMP Multivariate SPC platform provided further biomechanical, time-specific information about how joint flexions differ before and after fatigue at specific time points, giving a more in-depth understanding of how the different joint contributions change when fatigued. The end-to-end experimental and analysis approach can be extended across different sports to prevent injury.   (view in My Videos)   Auto-generated transcript:
Stanley Siranovich, Principal Analyst, Crucial Connection LLC   Much has been written in both the popular press and in the scientific journals about the safety of modern vaccination programs. To detect possible safety problems in U.S.-licensed vaccines, the CDC and the FDA have established the Vaccine Adverse Event Reporting System (VAERS). This database system now covers 20 years, with several data tables for each year. Moreover, these data tables must be joined to extract useful information from the data. Although a search and filter tool (WONDER) is provided for use with this data set, it is not well suited for modern data exploration and visualization. In this poster session, we will demonstrate how to use JMP Statistical Discovery Software to do Exploratory Data Analysis for the MMR vaccine over a single year using platforms such as Distribution, Tabulate, and Show Header Graphs. We will then show how to use JMP Scripting Language (JSL) to repeat, simply and easily, the analysis for additional years in the VAERS system.     Auto-generated transcript...   Speaker Transcript Stan Siranovich Good morning everyone. Today we're going to do an exploratory data analysis of the VAERS database. Now let's do a little background on what this database is. VAERS, spelled V-A-E-R-S, is an acronym for Vaccine Adverse Event Reporting System. It was created by the FDA and the CDC. It gets about 30,000 updates per year and it's been public since 1990, so there's quite a bit of data in it. And it was designed as an early warning system to look for effects of vaccines that have not previously been reported. Now these are adverse events, not side effects; that is, they haven't been linked to the vaccination yet. It's just something that happened after the vaccination. Now let's talk about the structure: for each year there are several tables, and we'll be working with two of them, VAERS VAX and VAERS DATA. Now there is a tool for examining the online database, and it goes by the acronym of WONDER. It is a traditional search tool where you navigate the different areas of the database, select the type of data that you want, click the drop down, and after you do that a couple of times, or a couple of dozen times, you send in the query and, without too much latency, get a result back. But for doing exploratory data analysis and some visualizations, there's a slight problem with that, and that is that you have to know what you want to get in the first place, or at least have a very good idea. So that's where JMP comes in. And as I mentioned, we're going to do an EDA and some visualization on a specific set of data, that is, data for the MMR vaccine for measles, mumps, and rubella. And we're going to do it for the most recent full year available, which is 2019. So let me move to a new window. Okay, the first thing we did, which I omitted here, was to download the CSVs and open them up in JMP. Now I want to select my data, and JMP makes it very easy. After I get the window open, I simply go to Rows, Row Selection, and Select Where, and down here is a picture showing that I want the VAX_TYPE to equal MMR. Now there are some other options here besides equals, which we'll talk about in a second. And after we click the button and we've selected those rows, the next thing we want to do is decide on which data we want. So I've highlighted some of the columns, and in a minute or so you'll see why. And then when I do that... oh, before we go there, let's note row nine and row 18 right here. Notice we have MMRV and MMR. MMRV is a different vaccine.
And if we wanted to look at that also, we could have selected "contains" here from the drop down. But that's not what we want to do. So we click OK and we get our table. Now what we want to do is join that VAERS VAX table, which contains data about the vaccine, such as the manufacturer, the lot and so forth, with the VAERS DATA table, which contains data on the effects of the vaccine; it's got things like whether or not the patient had allergies, whether or not the patient was hospitalized, number of hospital days, that sort of thing. And it also contains demographic data such as age and sex. So what we want to do is a join: we simply go to Tables, Join, we select the VAERS VAX and VAERS DATA tables, and we want to join them on the VAERS ID. And again, JMP makes it pretty easy. We just click the column in each of the separate tables and put them here in the match window, and after that we go over to the table windows and select the columns that we want. And this is what our results table looks like. Now let me reduce that and open up the JMP table. There we go, and I'll expand that. For the purposes of this demonstration I just selected these columns here. We've got the VAERS ID, which is the identification, obviously; the type, which is all MMR; and it looks like Merck is the manufacturer, with a couple of unknowns scattered through here. And I selected VAX LOT, because that would be important: if there's something the matter with one lot, you want to be able to see that. This looks like cage underscore year, but that is calculated age in years. There are several age columns and I just selected one. And I selected sex because we'd like to know whether males are more affected than females or vice versa. And HOSPDAYS is the number of days in the hospital, if they had an adverse event severe enough to put them into the hospital. And NUMDAYS is the number of days between vaccination and the appearance of the adverse event, and it looks like we have quite a range right here. So let's get started on our analysis, using Show Header Graphs. So I'm going to click on that, show header graphs, and we get some distributions and some other information up here. We'll skip the ID and see that the VAX_TYPE is all MMR; there are no others there. And the vaccine manufacturer, yes, it's either Merck & Co. Inc. or unknown, and one nice feature is that we can click on the bar and it will highlight the rows for us, and click away and it's unhighlighted. Moving on to VAX_LOT, we have quite a bit of information squeezed into this tiny little box here. First of all, we have the top five lots by occurrence in our data table: here they are, and here's how many times they appear. And it also tells us that we have 413 other lots included in the table; plus five, by my calculation that's something like 418 individual lots. Now we go over to the calculated age in years, and we see most of our values are in the zero bin, which makes sense because it is a vaccination, and we'll just make a note of that. And we go over to the sex column, and it looks like we have significantly more females than males. Now, that tells us right away that if we want to do side by side group comparisons, we're going to have to randomly select from the females so that they equal the males. And we also have some unknowns here, quite a few unknowns; we simply note that and move on. And we see hospital days, and we see NUMDAYS.
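The subset-and-join steps Stan walks through here could also be scripted. Below is a rough JSL sketch; the table names and the join options shown are assumptions for illustration, while the column names (VAX_TYPE, VAERS_ID) follow the talk.

// Hypothetical JSL version of the subset and join described above
vax = Data Table( "2019VAERSVAX" );
vdata = Data Table( "2019VAERSDATA" );

// Keep only MMR records (Rows > Row Selection > Select Where, then Subset)
vax << Select Where( :VAX_TYPE == "MMR" );
mmr = vax << Subset( Output Table( "MMR Vax" ), Selected Rows( 1 ) );

// Join the vaccine records to the demographic/outcome table on the report ID
joined = mmr << Join(
	With( vdata ),
	By Matching Columns( :VAERS_ID = :VAERS_ID ),
	Drop Multiples( 0, 0 ),
	Include Nonmatches( 0, 0 )
);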
Now here's another really nice feature. Let's say we'd like more details and we want to do a little bit of exploration to see how the age is distributed. We simply right click and select open in distribution. And here we are in the distribution window, with quite a bit of information here. For our purposes right now, we don't really need much here about the quantiles, so let's click close. It's still taking up some space, so let's go down here, select outline close orientation, and go with vertical. And we're left with a nice, easy to read window. It's got some information in there: we of course see our distribution down here, and we've got a box and whisker plot up here. There's not a whole lot of time to go into that; it just displays the data in a different way. And we see from our summary statistics that the mean happens to be 16.2, with a standard deviation of 20.6. Not an ideal situation. So if we want to do anything more with that, we may want to split the ages into two groups, one for the bulk of the values down here and one for the skewed data along the right, and examine them separately. I will minimize that window. And we can do the same with hospital days and number of days; let me just do that real quick. And here we see the same sorts of data, and I won't bother clicking through that and reducing it. But we might note also that we have a mean of 6.7 and a standard deviation of 13.2, again not a very ideal situation, and we simply make note of that. And I will close that. Now let's say we want to do a little bit more exploratory analysis; something caught our eye and all that. That is simple to do here. We don't have to go back to the online database and select through everything, click the drop downs, or whatever. We can simply come up here to Analyze and Fit Y by X. So let's say that we would like to examine the relationship between, oh, hospital days, the number of days spent in the hospital, and calculated age in years. We simply do that. We have two continuous variables, so we're going to get a bivariate plot out of that. We click OK, and we get another nice display of the data. And yes, we can see that the mean is down around 5 or 6, which is a good thing, better than 10 or 12. We can, for purposes of reference, go up here to the red triangle and select fit mean, and we get the mean right here. And we notice there's quite a few outliers. Let's say we want to examine them right now and decide whether or not we want to delve into them a little bit further. So if we hover over one of our outlier points, or any of the points for that matter, we get this pop up window, and it tells us that this particular data point represents row 868. The calculated age is in the one year bucket, and this patient happened to spend 90 days in the hospital. Now we could right click and color this row or put some sort of marker in there. I won't bother doing that, but I will move the cursor over here into the window, and we see this little symbol up in the right hand corner; click that and it pins it. We can, of course, repeat that and get the detail for further examination. I found this to be quite handy when giving presentations to groups of people, when you'd like to call attention to one particular point. That's a little bit overbearing, so let's right click and select font (not edit). And the font window comes up, and we see we're using a 16 point font.
Let's, I don't know, go down to 9. That's a little bit better, and it gives us more room if we'd like to call attention to some of the other outliers. So in summary, let me bring up the PowerPoint again. In summary, we were able to import and reshape two large data tables from a large, online, government-maintained database. We were able to subset the tables, join the tables, and select our output data, all seamlessly. And we were able to generate summaries and distributions, pointing out areas that may be of interest for more detailed analysis. And of course, that was all seamless and all occurred within the same software platform. Now, I supply some links right over here to the various data sites. This is the main site, which has all the documentation; the government did quite a good job there. And here is the actual data itself in the zip...
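The two drill-in analyses shown above are easy to repeat by script once the joined table is active. A short sketch follows; the column names (CAGE_YR, HOSPDAYS) come from the VAERS files mentioned in the talk, and the exact options are assumptions.

// Distribution of calculated age, as opened from the header graph in the demo
Distribution( Continuous Distribution( Column( :CAGE_YR ) ) );

// Fit Y by X with two continuous variables launches Bivariate; add the mean line for reference
Bivariate( Y( :HOSPDAYS ), X( :CAGE_YR ), Fit Mean() );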
Ruth Hummel, JMP Academic Ambassador, SAS Rob Carver, Professor Emeritus, Stonehill College / Brandeis University   Statistics educators have long recognized the value of projects and case studies as a way to integrate the topics in a course. Whether introducing novice students to statistical reasoning or training employees in analytic techniques, it is valuable for students to learn that analysis occurs within the context of a larger process that should follow a predictable workflow. In this presentation, we’ll demonstrate the JMP Project tool to support each stage of an analysis of Airbnb listings data. Using Journals, Graph Builder, Query Builder and many other JMP tools within the JMP Project environment, students learn to document the process. The process looks like this: Ask a question. Specify the data needs and analysis plan. Get the data. Clean the data. Do the analysis. Tell your story. We do our students a great favor by teaching a reliable workflow, so that they begin to follow the logic of statistical thinking and develop good habits of mind. Without the workflow orientation, a statistics course looks like a series of unconnected and unmotivated techniques. When students adopt a project workflow perspective, the pieces come together in an exciting way.       Auto-generated transcript (portions of the captioning were lost; gaps are marked with ellipses):   Speaker Transcript So welcome everyone. My name is ... Ambassador with JMP. I am now a retired professor of Business ... between a student and a professor working on a project. ... engage students in statistical reasoning, teach that ... to that, current thinking is that students should be learning about reproducible workflows, ... elementary data management. And, again, viewing statistics as ... wanted to join you today on this virtual call. Thanks for having ... and specifically in Manhattan, and you'd asked us so so you ... And we chose to do the Airbnb renter perspective. So we're ... expensive. So we started filling out...you gave us ... separate issue, from your main focus of finding a place in ... you get...if you get through the first three questions, you've ... know, is there a part of Manhattan, you're interested in? ... repository that you sent us to. And we downloaded the really ... thing we found, there were like four columns in this data set ... figured out so that was this one, the host neighborhood. So ... figured out that the first two just have tons of little tiny ... Manhattan. So we selected Manhattan. And then when we had ... that and then that's how we got our Manhattan listings. So ... data is that you run into these issues like why are there four ... restricted it to Manhattan, I'll go back and clean up some ... data will describe everything we did to get the data, we'll talk ... know I'm supposed to combine them based on zip, the zip code, ... columns, it's just hard to find the ... them, so we knew we had to clean that up. All right, we also had ... journal of notes. In order to clean this up, we use the recode ... Exactly. Cool. Okay, so we we did the cleanup ... Manhattan tax data has this zip code. So I have this zip code ... day of class, when we talked about data types.
And notice in the ... the...analyze the distribution of that column, it'll make a funny ... Manhattan doesn't really tell you a thing. But the zip code clean data in ... just a label, an identifier, and more to the point, when you want to join or merge ... important. It's not just an abstract idea. You can't merge ... nominal was the modeling type, we just made sure. ... about the main table is the listings. I want to keep ... to combine it with Manhattan tax data. Yeah. Then what? Then we need to ... tell it that the column called zip clean, zip code clean... Almost. There we go. And the column called zip, which ... Airbnb listing and match it up with anything in ... them in table every row, whether it matches with the other or ... main table, and then only the stuff that overlaps from the second ... another name like, Air BnB IRS or something? Yeah, it's a lot ... do one more thing because I noticed these are just data tables scattered around ... running. Okay. So I'll save this data table. Now what? And really, this is the data ... anything else, before we lose track of where we are, let's ... or Oak Team? And then part of the idea of a project ... thing. So if you grab, I would say, take the ... two original data sets, and then my final merged. Okay Now ... them as tabs. And as you generate graphs and ... even when I have it in these tabs. Okay, that's really cool. ... right, go Oak Team. Well, hi, Dr. Carver, thanks so ... you would just glance at some of these things, and let me know if ... we used Graph Builder to look at the price per neighborhood. And ... help it be a little easier to compare between them. So we kind ... have a lot of experience with New York City. So we plotted ... stand in front of the UN and take a picture with all the ... saying in Gramercy Park or Murray Hill. If we look back at the ... thought we should expand our search beyond that neighborhood to ... just plotted what the averages were for the neighborhoods but ... the modeling, and to model the prediction. So if we could put ... expected price. We started building a model and what we've ... factors. And so then when we put those factors into just a ... more, some of the fit statistics you've told us about in class. ... but mostly it's a cloud around that residual zero line. So ... which was way bigger than any of our other models. So we know ... reasons we use real data. Sometimes, this is real. This is ... looking? Like this is residual values. ... is good. Ah, cool. Cool. Okay, so I'll look for ... is sort of how we're answering our few important questions. And ... was really difficult to clean the data and to join the data. ... wanted to demonstrate how JMP in combination with a real world ... Number one in a real project, scoping is important. We want to ... hope to bring to the to the group.
Pitfall number two, it's vital to explore the ... the area of linking data combining data from multiple ... recoding and making sure that linkable ... reproducible research is vital, especially in a team context, especially for projects that may ... habits of guaranteeing reproducibility. And finally, we hope you notice that in these ... on the computation and interpretation falls by the ...
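The zip-code linkage that the team describes (matching modeling types and data types before a join) is the kind of step that benefits from a script. The sketch below is hypothetical: the table names, column names, and join options are illustrative, not taken from the actual project files.

// Hypothetical sketch of the zip-code cleanup and join discussed in the project
listings = Data Table( "Manhattan Listings" );
tax = Data Table( "Manhattan Tax Data" );

// Zip codes are identifiers, not quantities: store them as character/nominal
// in both tables so the join keys actually match.
listings << New Column( "Zip Code Clean", Character, "Nominal",
	Formula( Char( :Zipcode ) )
);

// Join the listings to the tax table on the cleaned zip code
listings << Join(
	With( tax ),
	By Matching Columns( :Zip Code Clean = :Zip ),
	Drop Multiples( 0, 0 ),
	Include Nonmatches( 0, 0 )
);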
Nascif Neto, Principal Software Developer, SAS Institute (JMP Division) Lisa Grossman, Associate Test Engineer, SAS Institute (JMP division)   The JMP Hover Label extensions introduced in JMP 15 go beyond traditional details-on-demand functionality to enable exciting new possibilities. Until now, hover labels exposed a limited set of information derived from the current graph and the underlying visual element, with limited customization available through the use of label column properties. This presentation shows how the new extensions let users implement not only full hover label content customization but also new exploratory patterns and integration workflows. We will explore the high-level commands that support the effortless visual augmentation of hover labels by means of dynamic data visualization thumbnails, providing the starting point for exploratory workflows known as data drilling or drill down. We will then look into the underlying low-level infrastructure that allows power users to control and refine these new workflows using JMP Scripting Language extension points. We will see examples of "drill out" integrations with external systems as well as how to build an add-in that displays multiple images in a single hover label.     Auto-generated transcript...   Speaker Transcript Nascif Abousalh-Neto Hello and welcome. This is our JMP Discovery presentation, from details on demand to wandering workflows, getting to know JMP hover label extensions. Before we start on the gory details, we always like to talk about the purpose of a new feature introduced in JMP. So in this case, we're talking about hover label extensions. And why do we even have hover labels in the first place? Well, I always like to go back to the visual information seeking mantra from Ben Shneiderman, which he synthesized as: overview first, zoom and filter, and then details on demand. Well, hover labels are all about details on demand. So let's say I'm looking at this bar chart on this new data set, and in JMP, up to JMP 14, as you hover over a particular bar in your bar chart, it's going to pop up a window with a little bit of textual data about what you're seeing here. Right. So you have labeled information, calculated values, just text, very simple. It gives you your details on demand. But what if you could decorate this with visualizations as well? So for example, if you're looking at that aggregated value, you might want to see the distribution of the values behind that calculation. Or you might want to see a breakdown of the values behind that aggregated value. This is what we're going to let you do with this new feature. But on top of that, it's the famous "wait, there is more." This new visualization basically allows you to go on and start a visual exploratory workflow. If you click on it, you can open it up in its own window, which can also have its own visualization, which you can also click to get even more detail. And so you go down that technique called drill down, and eventually you might get to a point where you're decorating a particular observation with information you're getting from maybe even Wikipedia, in that case. I'm not going to go into a lot of details; we're going to learn a lot about all that pretty soon. But first, I also wanted to talk a little bit about the design decisions behind the implementation of this feature.
Because we wanted to have something that was very easy to use, that didn't require programming or lots of time reading the manual, and we knew that would satisfy 80% of the use cases. But for those 20% of really advanced use cases, or for those customers who know their JSL and just want to push the envelope on what JMP can do, we also wanted to make available something that you could do through programming, on top of the context of ??? on those visual elements. So we decided to go with an architectural pattern called plumbing and porcelain, something we borrowed from the Git source code control application. Basically, you have one layer that is very rich and, because it's very rich, very complex; it gives you access to all that information and allows you to customize what happens when the visualization is generated or when you click on that visualization. And on top of that, we built a layer that is more limited and purpose driven, but is very easy to use and requires no coding at all. That's the porcelain layer, and that's the one Lisa is going to talk about now. Over to you, Lisa. I'm going to stop sharing and Lisa is going to take over. Lisa Grossman Okay, so we are going to take a high-level look at some of the features and the kinds of customizations that make the graphic ??? So let us first go through some of the basics. By default, when you hover over a data point or an element in your graph, you see information displayed for the X and Y roles used in the graph, as well as any drop-down roles such as overlay, and if you choose to manually label a column in the data table, that will also appear in the hover label. Here we have an example of a labeled expression column that contains an image, and we can see that the image is then populated in the hover label in the back. To add a graphlet to your hover label, you have the option of selecting some predefined graphlet presets, which you can access via the right mouse menu under Hover Label. These presets have dynamic graph role assignments and derive their roles from the variables used in your graph. Presets are also preconfigured to be recursive, which supports drilling down. And for preset graphlets that have categorical columns, you can specify which columns to filter by using the Next in Hierarchy column property in your data table. So now I'm going to demo real quick how to make a graphlet preset. I'm going to bring up the penguins data table that we're going to be using, open up Graph Builder, and make a bar chart here. Then, right-clicking under Hover Label, you can see that there is a list of different presets to choose from, but we're going to use Histogram for this example. Now that we have set our preset, if you hover over a bar, you can see that a histogram graphlet pops up in your hover label. It is also filtered based on our bar here, which is the island Biscoe. And the great thing about graphlets is that if I hover over this other bar, I can see another graphlet, and so now you can easily compare these two graphlets to see the distribution of bill lengths for both the islands Dream and Biscoe.
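To make the connection to scripting concrete: clicking one of those histogram thumbnails (as Lisa shows next) opens a window equivalent to the short Graph Builder script below, a histogram of the measurement with a Local Data Filter restricted to the hovered bar. The column names are the ones used in the demo and may differ in your copy of the penguins table; this is a sketch of what the preset generates for you, not code you need to write.

// Roughly what the launched histogram graphlet amounts to: the hovered level becomes a local data filter.
Graph Builder(
	Variables( X( :bill_length_mm ) ),
	Elements( Histogram( X ) ),
	Local Data Filter( Add Filter( columns( :island ), Where( :island == "Biscoe" ) ) )
);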
And then you can take it a step further and click on the thumbnail of the graphlet, and it will launch a Graph Builder instance in its own window. It's totally interactive, so you can open up the control panel of Graph Builder and customize the graph further. And as you can see, there's a local data filter already applied to this graph, and it is filtered by Biscoe, which is the thumbnail I launched; so that is how the graphlets are filtered. One last thing: if I hover over these histogram bars, you can see that the histogram graphlet continues on, which shows how these graphlet presets are preconfigured to be recursive. So, closing these and returning to our PowerPoint. I only showed the example of the histogram preset, but there are a number that you can go and play with. These graphlet presets help us answer the question of what is behind an aggregated visual element. The scatter plot preset shows you the exact values, whereas the histogram, box plot, or heat map presets show you the distribution of your values. If you want to break down your graph by another category, then you might be interested in using a bar, pie, treemap, or line preset. And if you'd like to examine the raw data of the table, then you can use the Tabulate preset. But if you'd like to further customize your graphlet, you have the option to do so with paste graphlet. Paste graphlet can be achieved in three easy steps. First you build the graph that you want to use as a graphlet, and we want to note here that it does not have to be one built from Graph Builder. Then, from the little red triangle menu, you save the script of the graph to your clipboard. And then, returning to your base or top graph, you right-click and, under Hover Label, there will be a Paste Graphlet option. That's really all there is to it. We also want to note that paste graphlet has static role assignments and is not recursive, since you are creating these graphlets to drill down one level at a time. But if you'd like to create a visualization with multiple drill-downs, you have the option to do so by nesting paste graphlet operations together, starting from the bottom layer and going up to your top or base layer. This is what we would consider our Russian doll example, and I can demo how you can achieve that. So we'll pull up our penguins data table again, start with Graph Builder, and build our very top layer: a bar chart. Then let's build our second layer: a pie chart with species. And for our very last layer, let's do a scatter plot. OK, so now I have all three layers of what we will use to nest, so I will save the script of the scatter plot to my clipboard. Then, on the pie, I right-click and choose Paste Graphlet. Now when you hover, you can see that the scatter plot is in there and it is filtered by the species in this pie. I'm going to close this just for clarity, and now we can do the same thing to the pie: save its script, because it already has the scatter plot embedded, then go over to our bar and paste graphlet again. And now we have a workflow
that you can click and hover over and you can see all three layers that pop up when you're hovering over this bar. So that's how you would do your nested paste graphlets. And so we do want to point out that there are some JMP analytical platforms that already have pre integrated graphlets available. So these platforms include the functional data explorer, process screening, principal components, and multivariate control charts, and process capabilities. And we want to go ahead and quickly show you an example using the principal components. Lost my mouse. There we go. So I launch our table again and open up principal components. And let's do run this analysis. And if I open up the outlier analysis and hover over one of these points, boom, I can see that these graphlets are already embedded into this platform. So we highly suggest that you go and take a look at these platforms and play around with it and see what you like. And so that was a brief overview of some quick customizations you can do with hover label graphlets and I'm going to pass this presentation back to Nascif so he can move you through the plumbing that goes behind all of these features. Nascif Abousalh-Neto Thank you, Lisa. Okay, let's go back to my screen here. And we... I think I'll just go very quickly over her slides and we're back to plumbing, and, oh my god, what is that? This is the ugly stuff that's under the sink. But that's where you have all the tubing and you can make things really rock, and let me show them by giving a quick demo as well. So here Lisa was showing you the the histogram... the hover label presets that you have available, but you can also click here and launch the hover label editor and this is the guy where you have access to your JSL extension points, which is where you make, which is how those visualizations are created. Basically what happens is that when you hover over, JMP is gone to evaluate the JSL block and capture that as an in a thumbnail and put that thumbnail inside your hover label. That's pretty much, in a nutshell, how it goes. And the presets that you also have available here in the hover label, right, they basically are called generators. So if I click here on my preset and I go all the way down, you can see that it's generating the Graph Builder using the histogram element. That's how it does its trick. Click is a script that is gonna react to when you click on that thumbnail, but by default (and usually people stick with the default), if you don't have anything here, it's just, just gonna launch this on its own window, instead of capturing and scale down a little image. In here on the left you can see two other extension points we haven't really talked much about yet. But we will very soon. So I don't want to get ahead of myself. So, So let's talk about those extension points. So we created not just one but three extension points in JMP 15. And they are, they're going to allow you to edit and do different functionality to different areas of your hover label. So textlets, right, so let's say for example you wanted to give a presentation after you do your analysis, but you want to use the result of that analysis and present it to an executive in your company or maybe we've an end customer that wants a little bit more of detail in in a way that they can read, but you would like make that more distinct. So textlet allows you to do that. 
But since you're interfacing with data, you also want that to be not a fixed block of text but something dynamic, based on the data you're hovering over. So to define a textlet, you go back to that hover label editor and you can define JSL variables, or not; but if you want it to be dynamic, typically what you do is define a variable that's going to hold the content you want to display, and then you decorate that value using HTML notation. That is how you can select the font, select background and foreground colors, make it italic, and basically make the text as pretty or as rich as you need it to be. The next hover label extension is the one we call gridlet. If you remember, the original, or current, JMP hover label is basically a grid of name-value pairs. On the left you have the names, the equivalent of your column names, and on the right you have the values, which might be just a column cell for a particular row if it's a marked point; but if it's aggregated, like a bar chart, it's going to be a mean or a median, something like that. The default content, like Lisa said before, is derived both from whatever labeled columns you have in your data table and from whatever role assignments you have in your graph. So if it's a bar chart, you have your X, you have your Y, you might have an overlay variable, and everything that at some point contributes to the creation of that visual element. Well, with gridlets you now have pretty much total control of that little display. You can remove entries; it's very common that people don't want to see the very first row, which has the label and the number of rows. Some people find that redundant, and they can take it out. You can add something that is completely under your control; basically it's going to evaluate a JSL script to figure out what you want to display there. One use case I found was when someone wanted an aggregated value for a column that was not in the visualization; some people call those hidden columns or hidden calculations. Now you can do that, and have an aggregation over the same rows as the rest of the data being displayed in that visualization. You can rename entries. We usually add the summary statistic to the left of anything that comes from a calculated Y column; if you don't like that, now you can remove it or replace it with something else. And you can adjust details like changing the numeric precision, or making text bold, italic, or red. You can even, for example, make a value red and bold only if it is above a particular threshold, so that as I move over here, if the value is over the average of my data, I make it red and bold to call attention to it, and that happens automatically for you. And finally, graphlets, which we believe are going to be the most useful and most used extension. They certainly draw the most attention, because you have a whole image inside your tooltip. We've been seeing examples with data visualizations, but it's an image, so it can be a picture as well. It can be something you're downloading from the internet on the fly by making a web call; that's how I got the image of this little penguin. It's coming straight from Wikipedia; as you hover over, we download it, scale it, and put it here.
Or you can, for example, that's a very recent use case, someone had a database of pictures in the laboratory and they have pictures of the samples they were analyzing and they didn't want to put them on the data table because the data table would be too large. Well, now you can just get a column, turn that column into a file name, read from the file name, and boom, display that inside your tool tip. So when you're doing your analysis, you know, exactly, exactly what you're looking at. And just like graph...gridlets, we're talking about clickable content. So again, for example, if I wanted and I showed that when I click on this little thumbnail here, I can open a web page. So you can imagine that even as a way to integrate back with your company. Let's say you have web services that they're supported in your company, and you want to, at some point, maybe click on an image to make a call to kind of register or capture some data. Go talking for a web call to that web service. Now that's something you can do. So I like to call, we talk about drill in and drill down, that would be a drill out. That's basically JMP talking to the outside world using data content from your exploration. So let's look at those things in the little bit more detail. So those those visualizations that we see here inside the hover label, they are basically... that's applied to any visualization. Actually it's a combination of a graph destination and the data subset. So in the Graph Builder, for example, you'll say, I want the bar chart of islands by on my x axis and on my y axis, I want to show the average of the body mass of the penguins on that island. Fine. How do you translate that to a graphlet, right? Well, basically when you select the preset or when you write in your code if you want to do it, but the preset is going to is going to use our graph template. So basically, some of the things are going to be predefined like that. The bar element, although if you're writing it your own, you could even say I want to change my visualization depending on my context. That's totally possible. And you're going to fill that template with a graph roles and values and table data, table metadata. So, for example, let's say I have a preset of doing that categorical drill down. I know it's going to be a bar chart. I don't know what a bar chart is going to be, what's going to be on my y or my x axis. That's going to come from the current state of my baseline graph, for example, I'm looking at island. So I know I want to do a bar chart of another category. So that's when the next in hierarchy and the next column comes into play. I'm making that decision on the fly, based on the information that user is giving me and the graph that's being used. For example, if you look here at the histogram, it was a bar chart of island by body mass. This is a histogram of body mass as well. If I come here to the graph and change this column and then I go back and hover, this guy is going to reflect my new choice. That's this idea of getting my context and having a dynamic graph. The other part of the definition of visualization is the data subset. And we have a very similar pattern, right. We have...LDF is local data filter. So that's a feature that we already had in JMP, of course, right. And basically, I have a template that is filled out from my graph roles here. It's like if it was a bar chart, which means my x variable is going to be a grouping variable of island. 
I know I wanted to have a local data filter on island, and I want it to select this particular value so that it matches the value I was hovering over. This happens both when you're creating the hover label and when you're launching from it; but when we create the hover label, this is invisible. We basically create a hidden window to capture the image, so you never see that window. When you launch it, though, the local data filter is there, and as Lisa has shown, you can interact with it and even make changes to it so that you can progress your visual exploration on your own terms. So I've been talking about context a lot. This is actually something you need to be familiar with to develop your own graphlets. We call it the hover label execution context. You'll find information about it in our documentation; if you remember your JSL, it's basically a local block with lots of local variables that we define for you, and those variables capture all kinds of information that might be useful for someone writing a graphlet or a gridlet or a textlet. It's available for all of those extension points. Typically these are variables whose names start with an underscore, to prevent collisions with your data table column names, so they're kind of like reserved names. You'll see here that this is code that comes from one of our presets; by the way, that code is available to you through the hover label editor, so you can study it and see how it works. Here we're trying to find a new column to use in our new graph; it's that idea of being dynamic and reactive to the context. This function is going to look into the data table for that metadata: a list of measurement columns, so if the baseline graph is looking at body mass, body mass is going to be in this value; and a list of my groupings, so if it was a bar chart of island by body mass, we're going to have island here. Those are lists of column names. And then we also have the numeric values; anything that's calculated is going to be available to you. Maybe, like I said, you want to make a logical decision based on the value being above or below a threshold, so that you can color a particular line red or make it bold; you're going to use values that we provide to you. We also provide things that allow you to go back to the data table and fetch data yourself, like the row index of the first row in the list of rows that your visual element is covering; that's available to you as well. And there's even more data, like the where clause that corresponds to the local data filter you're executing in the context of, and the drill depth, which lets you keep track of how many times you have clicked on a thumbnail and opened a new visualization, and so on. For example, when we're talking about recursive visualizations, every recursion needs an exit condition. So here, for example, is how one of our presets calculates its exit condition: if I don't have anything more to show, I return empty, meaning no visualization; or if I would only show you one value, or my drill depth is greater than one, meaning I was drilling until I got to a point where there is only one value to show and some visualizations don't make sense, I can return empty as well.
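Here is a rough sketch of the kind of guard that sits at the top of a recursive graphlet Picture script. The names beginning with an underscore are stand-ins for the hover label execution context variables Nascif is describing (the real names are listed in the Hover Label Editor and the documentation), and the hard-coded columns are only for illustration, so read this as the shape of the logic rather than the preset's actual code.

// Illustrative only: context variable names and columns are placeholders.
If( local:_drillDepth > 1 & N Items( local:_rowList ) <= 1,
	Empty(),                                     // nothing left to break down, so suppress the graphlet
	Graph Builder(                               // otherwise draw the next-level breakdown
		Variables( X( :species ), Y( :body_mass_g ) ),   // a preset would pick X dynamically from the context
		Elements( Bar( X, Y ) )
	)
);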
That's just an example of the kinds of decisions that you can make your code using the hover label execution context. Now, I just wanted to kind of gives you a visual representation of how all those things come together again using the preset example. When you're selecting a preset, you're basically selecting the graph template, which is going to have roles that are going to be fulfilled from the graph roles that are in your hover label execution context. And so that's your data, your graph definition. And that date graph definition is going to be combined with the subset of observations resulting from the, the local data filter that was also created for you behind the scenes, based on the visual element you're hovering over. So when you put those things together, you have a hover label, we have a graphlet inside. And if you click on that graphlet, it launches that same definition in here and it makes the, the local data filter feasible as well. When, like Lisa was saying, this is a fully featured life visualization, not just an image, you can make changes to this guy to continue your exploration. So now we're talking, you should think in terms of, okay, now I have a feature that creates visualizations for me and allow me to create one visualization from another. I'm basically creating a visual workflow. And it's kind of like I have a Google Assistant or an Alexa in JMP, in the sense that I can...JMP is making me go faster by creating, doing visualizations on my behalf. And they might be, also they might be not, just an exploration, right. If you're happy with them, they just keep going. If you're not happy with them, you have two choices and maybe it's easier if I just show it to you. So like I was saying, I come here, I select a preset. Let's say I'm going to get a categoric one bar chart. So that gives me a breakdown on the next level. Right. And if I'm happy with that, that's great. Maybe I can launch this guy. Maybe I can learn to, whoops... Maybe I can launch another one for this feature. At the pie charts, they're more colorful. I think they look better in that particular case. But see, now I can even do things like comparing those two bar charts side by side. And let's...but let's say that if I keep doing that and it isn't a busy chart and I keep creating visualizations, I might end up with lots of windows, right. So that's why we created some modifiers to...(you're not supposed to do that, my friend.) You can just click. That's the default action, it will just open another window. If you alt-click, it launches on the previous last window. And if you control-click it launches in place. What do I mean by that? So, I open this window and I launched to this this graphlet and then I launched to this graphlet. So let's say this is Dream and Biscoe and Dream and Biscoe. Now I want to look at Torgersen as well. Right. And I want to open it. But if I just click it opens on its own window. If I alt-click, (Oh, because that's the last one. I hope. I'm sorry. So let me close this one.) Now if I go back here in I alt-click on this guy. See, it replaced the content of the last window I had open. So this way I can still compare with visualizations, which I think it's a very important scenario. It's a very important usage of this kind of visual workflow. Right. But I can kind of keep things under control. And I don't just have to keep opening window after window. And the maximum, the real top window management feature is if I do a control-click because it replaces the window. 
And then, then it's a really a real drill down. I'm just going on the same window down and down and now it's like okay, but what if I want to come back. Or if you want to come back and just undo. So you can explore with no fear, not going to lose anything. Even better though, even the windows you launch, they have the baseline graph built in on the bottom of the undo stack. So I can come here and do an undo and I go back to the visualizations that were here before. So I can drill down, come back, branch, you can do all kinds of stuff. And let's remember, that was just with one preset. Let's do something kind of crazy here. We've been talking, we've been looking at very simple visualizations. But this whole idea actually works for pretty much any platform in JMP. So let's say I want to do a fit of x by y. And I want to figure out how...now, I'm starting to do real analytics. How those guys fit within the selection of the species. Right. So I have this nice graph here. So I'm going to do that paste graphlet trick and save it to the clipboard. And I'm going to paste it to the graphlet now. So as you can see, we can use that same idea of creating a context and apply that to my, to my analysis as well. And again, I can click on those guys here and it's going to launch the platform. As long as the platform supports local data filters, (I should have given this ???), this approach works as well. So it's for visualizations but in...since in JMP, we have this spectrum where the analytics also have a visual component, so works with our analytics as well. And I also wanted to show here on that drill down. This is my ??? script. So I have the drill down with presets all the way, and I just wanted to go to the the bottom one where I had the one that I decorated with this little cute penguin. But what I wanted to show you is actually back on the hover label editor. Basically what I'm doing here, I'm reading a small JSL library that I created. I'm going to talk about that soon, right, and now I can use this logic to go and fetch visualizations. In this case I'm fetching it from Wikipedia using a web call. And that visualization comes in and is displayed on my visualization. It's a model dialogue. But also my click script is a little bit different. It's not just launching the guy; it's making a call to this web functionality after getting a URL, using that same library as well. So what exactly is it going to do? So when I click on the guy, it opens a web page with a URL derived from data from my visualization and this can be pretty much anything JSL can do. I just want to give us an example of how this also enables you integration with other systems, even outside of JMP. Maybe I want to start a new process. I don't know. All kinds of possibilities. That I apologize. So So there are two customized...advanced customization examples, I should say, that illustrate how you can use graphlets as a an extensible framework. They're both on the JMP Community, you can click here if you get the slides, but one is called the label viewer. I am sorry. And basically what it does is that when you hover over a particular aggregated graph, it finds all the images on the graph...on the data table associated with those rows and creates one image. And that's something customers have asked for a while. I don't want to see just one guy. I want to see if you have more of them, all of them. Or, if possible, right. So when you actually use this extension, and you click on...actually no, I don't have it installed so... 
And the wiki reader, which was the other one, is the one I just showed you. What I was saying is that when you click and launch on this particular image, it launches a small application that allows you to page through the different images in your data table, with a filter that you can control and so on. This is one that was done completely in JSL on top of this framework. So just to close up, what did we learn today? I hope you found that it's now very easy to add visualizations; you can visualize your visualizations, if you will. It's very easy to add those data visualization extensions using the porcelain features. You get not just richer detail in your thumbnails, but a new exploratory visual workflow, which you can customize to meet your needs by using either paste graphlet, if you want something easy, or JSL through the hover label editor. We're both very curious to see how you are going to use this in the field, so if you come up with some interesting examples, please let us know and send us a screenshot in the JMP Community. That's all we have today. Thank you very much. And when we give this presentation, we're going to be here for Q&A. So, thank you.
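As a footnote to the drill-out idea described in this talk: the Click script of a graphlet can be as simple as a one-line web call. Web() is the standard JSL function for opening a URL in the default browser; the context variable used to build the URL here is a made-up placeholder for whichever hovered value you want to look up.

// Hypothetical Click script for a drill-out: open a web page whose URL is derived from the hovered value.
Web( "https://en.wikipedia.org/wiki/" || Char( local:_hoverLabel ) );

The same pattern can, with appropriate setup, call a web service inside your organization instead, for example through New HTTP Request, which is one way the "drill out" integrations mentioned above could be wired up.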
Jeremy Ash, JMP Analytics Software Tester, JMP   The Model Driven Multivariate Control Chart (MDMVCC) platform enables users to build control charts based on PCA or PLS models. These can be used for fault detection and diagnosis of high dimensional data sets. We demonstrate MDMVCC monitoring of a PLS model using the simulation of a real world industrial chemical process, the Tennessee Eastman Process. During the simulation, quality and process variables are measured as a chemical reactor produces liquid products from gaseous reactants. We demonstrate fault diagnosis in an offline setting. This often involves switching between multivariate control charts, univariate control charts, and diagnostic plots. MDMVCC provides a user-friendly way to move between these plots. Next, we demonstrate how MDMVCC can perform online monitoring by connecting JMP to an external database. Measuring product quality variables often involves a time delay before measurements are available, which can delay fault detection substantially. When MDMVCC monitors a PLS model, the variation of product quality variables is monitored as a function of process variables. Since process variables are often more readily available, this can aid in the early detection of faults. Example Files Download and extract streaming_example.zip.  There is a README file with some additional setup instructions that you will need to perform before following along with the example in the video.  There are also additional fault diagnosis examples provided. Message me on the community if you find any issues or have any questions.       Auto-generated transcript...   Speaker Transcript Jeremy Ash Hello, I'm Jeremy Ash. I'm a statistician in JMP R&D. My job primarily consists of testing the multivariate statistics platforms in JMP, but I also help research and evaluate methodology, and today I'm going to be analyzing the Tennessee Eastman process using some statistical process control methods in JMP. I'm going to be paying particular attention to the Model Driven Multivariate Control Chart platform, which is a new addition to JMP. I'm really excited about this platform, and these data provided a new opportunity to showcase some of its features. First, I'm assuming some knowledge of statistical process control in this talk. The main thing you need to know about is control charts. If you're not familiar with these, they are charts used to monitor complex industrial systems to determine when they deviate from normal operating conditions. I'm not going to have much time to go into the methodology behind Model Driven Multivariate Control Chart, so I'll refer you to these other great talks, which are freely available, for more details. I should also mention that Jianfeng Ding was the primary developer of Model Driven Multivariate Control Chart, in collaboration with Chris Gotwalt, and Tanya Malden and I were testers. So the focus of this talk will be using multivariate control charts to monitor a real-world chemical process. Another novel aspect of this talk will be using control charts for online process monitoring; this means we'll be monitoring data continuously as it's added to a database and detecting faults in real time. So I'm going to start with the obligatory slide on the advantages of multivariate control charts. Why not just use univariate control charts? There are a number of excellent options in JMP, and univariate control charts are excellent tools for analyzing a few variables at a time.
However, quality control data sets are often high dimensional, and the number of charts that you need to look at can quickly become overwhelming. So multivariate control charts summarize a high-dimensional process in just a few charts, and that's a key advantage. But that's not to say that univariate control charts aren't useful in this setting; you'll see throughout the talk that fault diagnosis often involves switching between multivariate and univariate control charts. Multivariate control charts give you a sense of the overall health of a process, while univariate control charts allow you to look at specific aspects, so the information is complementary, and one of the main goals of Model Driven Multivariate Control Chart was to provide some tools that make it easy to switch between those two types of charts. One disadvantage of univariate control charts is that observations can appear to be in control when they're actually out of control in the multivariate sense. Here I have two IR control charts, for oil and density, and these two observations in red are in control; but oil and density are highly correlated, and these observations are outliers in the multivariate sense. In particular, observation 51 severely violates the correlation structure. So multivariate control charts can pick up on these types of outliers when univariate control charts can't. Model Driven Multivariate Control Chart uses projection methods to construct its control charts. I'm going to start by explaining PCA, because it's easy to build up from there. PCA reduces the dimensionality of your process variables by projecting into a low-dimensional space. This is shown in the picture to the right: we have p process variables and n observations, and we want to reduce the dimensionality of the process to a, where a is much less than p. To do this we use the P loading matrix, which provides the coefficients for linear combinations of our X variables; these give the score variables shown in the equations on the left. T times P transpose gives you predicted values for your process variables from the low-dimensional representation, and there's some prediction error; the score variables are selected in a way that minimizes this squared prediction error. Another way to think about it is that you're maximizing the amount of variance explained in X. PLS is more suitable when you have a set of process variables and a set of quality variables and you really want to ensure that the quality variables are kept in control; but these quality variables are often expensive or time-consuming to collect, and a plant can be making out-of-control quality for a long time before a fault is detected. PLS models allow you to monitor your quality variables as a function of your process variables, and you can see here that PLS will find score variables that maximize the variance explained in the Y variables. The process variables are often cheaper and more readily available, so PLS models can allow you to detect quality faults early and can make process monitoring cheaper. From here on out, I'm just going to focus on PLS models, because that's more appropriate for our example. So PLS partitions your data into two components. The first component is your model component; this gives you the predicted values. Another way to think about this is that your data have been projected into a model plane defined by your score variables, and T² charts will monitor variation in this model plane.
The second component is your error component. This is the distance between your original data and the predicted data, and squared prediction error charts, or SPE charts, will monitor variation in this component. We also provide an alternative, distance to model X plane (DModX), which is just a normalized version of SPE. The last concept that's important to understand for the demo is the distinction between historical and current data. Historical data are typically collected when the process is known to be in control. These data are used to build the PLS model and define normal process variation, and this allows a control limit to be obtained. Current data are assigned scores based on the model but are independent of the model. Another way to think about this is that we have a training and a test set, and the T² control limit is lower for the training data because we expect lower variability for observations used to train the model, whereas there's greater variability in T² when the model generalizes to a test set. Fortunately, there's some theory that's been worked out for the variance of T² that allows us to obtain control limits based on some distributional assumptions. In the demo we'll be monitoring the Tennessee Eastman process, so I'm going to present a short introduction to these data. This is a simulation of a chemical process developed by Downs and Vogel, two chemists at Eastman Chemical, and it was originally written in Fortran, but there are wrappers for it in MATLAB and Python now. The simulation was based on a real industrial process, but it was manipulated to protect proprietary information. The simulation covers the production of two liquid products from gaseous reactants; F is a byproduct that will need to be siphoned off from the desired product. The Tennessee Eastman process is pervasive in the literature on benchmarking multivariate process control methods. So this is the process diagram. It looks complicated, but it's really not that bad, so I'm going to walk you through it. The gaseous reactants A, D, and E flow into the reactor here; the reaction occurs, and product leaves as a gas. It's then cooled and condensed into a liquid in the condenser. Then we have a vapor-liquid separator that will remove any remaining vapor and recycle it back to the reactor through the compressor, and there's also a purge stream here that will vent byproduct and an inert chemical to prevent them from accumulating. The liquid product is then pumped through a stripper, where the remaining reactants are stripped off, and the final purified product leaves here in the exit stream. The first set of variables being monitored are the manipulated variables. These look like bow ties in the diagram; I think they're actually meant to be valves. The manipulated variables mostly control the flow rate through different streams of the process. These variables can be set to specific values within limits and have some Gaussian noise, and they can be sampled at any rate; we're using the default three-minute sampling interval. Some examples of the manipulated variables are the flow rate of the reactants into the reactor, the flow rate of steam into the stripper, and the flow of coolant into the reactor. The next set of variables are the measurement variables.
These are shown as circles in the diagram, and they're also sampled at three-minute intervals; the difference is that the measurement variables can't be manipulated in the simulation. Our quality variables will be the percent composition of the two liquid products; you can see the analyzer measuring the composition here. These variables are collected with a considerable time delay, so we're looking at the product in this stream because these variables can be measured more readily than the product leaving in the exit stream. And we'll also be building a PLS model to monitor our quality variables by means of our process variables, which have substantially less delay and a faster sampling rate. Okay, so that's the background on the data. In total there are 33 process variables and two quality variables. The process of collecting the variables is simulated with a series of differential equations, so this is just a simulation, but you can see that a considerable amount of care went into modeling this as a real-world process. So here's an overview of the demo I'm about to show you. We'll collect data on our process and then store these data in a database. I wanted to have an example that was easy to share, so I'll be using a SQLite database, but this workflow is relevant to most types of databases; most databases support ODBC connections. Once JMP connects to the database, it can periodically check for new observations and update the JMP table as they come in. And then, if we have a Model Driven Multivariate Control Chart report open with automatic recalc turned on, we have a mechanism for updating the control charts as new data come in. The whole process of adding data to a database will likely be going on on a separate computer from the computer doing the monitoring, so I have two sessions of JMP open to emulate this. Both sessions have their own journal, and the materials are provided on the Community. The first session will add simulated data to the database, and it's called the streaming session; the second session will update reports as data come into the database, and I'm calling that the monitoring session. One thing I really liked about the Downs and Vogel paper was that they didn't provide a single metric to evaluate the control of the process. I have a quote from the paper here: "We felt that the trade-offs among possible control strategies and techniques involved much more than a mathematical expression." So here are some of the goals they listed in their paper which are relevant to our problem: maintain the process variables at desired values, minimize variability of the product quality during disturbances, and recover quickly and smoothly from disturbances. We will assess how well our process achieves these goals using our monitoring methods. Okay, so to start off, I'm in the monitoring session journal, and I'll show you our first data set. The data table contains all the variables I introduced earlier: the first set are the measurement variables, the next set are the composition variables, and the last set are the manipulated variables. The first script attached here will fit a PLS model; it excludes the last hundred rows as a test set. And just as a reminder, this model is predicting our two product composition variables as a function of our process variables, but PLS modeling is not the focus of the talk, so I've already fit the model and output score columns here.
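As a quick reference for the charts that follow, these are the standard formulas behind the quantities Jeremy describes; they follow the usual PCA/PLS monitoring literature, and the exact scaling the platform uses may differ in detail.

$$X \approx \hat{X} = T P^{\top}, \qquad E = X - \hat{X}$$

$$T^2_i = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_a^2}, \qquad \mathrm{SPE}_i = \sum_{j=1}^{p} \left( x_{ij} - \hat{x}_{ij} \right)^2, \qquad \mathrm{DModX}_i \propto \sqrt{\mathrm{SPE}_i},$$

where $t_{ia}$ is the score of observation $i$ on component $a$, $s_a^2$ is the variance of that score in the historical data, $A$ is the number of model components, and $p$ is the number of process variables.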
And if we look at the column properties, you can see that there's an MDMCC Historical Statistics property that contains all the information about the model that you need to construct the multivariate control charts. One of the reasons Model Driven Multivariate Control Chart was designed this way is that, imagine you're a statistician and you want to share your model with an engineer so they can construct control charts: all you need to do is provide the data table with these formula columns. You don't need to share all the gory details of how you fit your model. So next I will use the score columns to create our control charts. On the left I have two control charts, T² and SPE. There are 860 observations that were used to estimate the model, and these are labeled as historical; then I have 100 observations that were held out as a test set. You can see in the limit summaries down here that I performed a Bonferroni correction for multiple testing based on the historical data. I did this up here in the red triangle menu, where you can set the alpha level to anything you want. I did this correction because the data are known to be from normal operating conditions, so we expect no observations to be out of control, and after this multiplicity adjustment there are zero false alarms. On the right are the contribution proportion heat maps. These indicate how much each variable contributes to the out-of-control signal; each observation is on the Y axis, and the contributions are expressed as proportions. You can see in both of these plots that the contributions are spread pretty evenly across the variables. And at the bottom I have a score plot. Right now we're just plotting the first score dimension versus the second score dimension, but you can look at any combination of the score dimensions using these drop-down menus or this arrow. Okay, so now that we're oriented to the report, I'm going to switch over to the monitoring session, which will stream data into the database. In order to do anything for this example, you'll need to have a SQLite ODBC driver installed. It's easy to do; you can just follow this link here. I don't have time to talk about this, but I created the SQLite database I'll be using in JMP, and I have instructions on how to do this and how to connect JMP to the database on my Community web page. This example might be helpful if you want to try this out on data of your own. I've already created a connection to this database, and I've shared the database on the Community. So I'm going to take a peek at the data tables in Query Builder; I can do that with a table snapshot. The first data set is the historical data; I've used this to construct the PLS model, and there are 960 observations that are in control. The next data table is the monitoring data table. This just contains the historical data at first, but I'll gradually add new data to it, and this is what our multivariate control chart will be monitoring. And then I've simulated the new data already and added it to this data table here; you can see it starts at timestamp 961, and there are another 960 observations, but I've introduced a fault at some time point. I wanted to have something easy to share, so I'm not going to run my simulation script and add to the database that way. I'm just going to take observations from this new data table and move them over to the monitoring data table using some JSL with SQL statements.
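For readers following along with the example files, here is a minimal sketch of what such a transfer can look like in JSL. The DSN, table names, and column handling are simplified placeholders rather than the script shipped with the talk, and character columns would need quoting in a real INSERT statement; the database functions themselves (Open Database, Create Database Connection, Execute SQL) are the standard JSL ones.

// Sketch only: DSN and table names are hypothetical; see the Community example for the real script.
dtNew = Open Database( "DSN=TEP;", "SELECT * FROM new_data", "new_data" );
dbc = Create Database Connection( "DSN=TEP;" );
biteSize = 20;                                   // rows added per "bite"
For( start = 1, start <= N Rows( dtNew ), start += biteSize,
	For( i = start, i <= Min( start + biteSize - 1, N Rows( dtNew ) ), i++,
		vals = "";                               // build one INSERT statement per row
		For( j = 1, j <= N Col( dtNew ), j++,
			vals ||= Char( dtNew[i, j] ) || If( j < N Col( dtNew ), ",", "" )
		);
		Execute SQL( dbc, "INSERT INTO monitoring VALUES (" || vals || ");" );
	);
	Wait( 2 );                                   // slow the stream so the chart updates are visible
);
Close Database Connection( dbc );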
And this is just a simple example emulating the process of new data coming into a database somehow. You might not actually do this with JMP, but it's an opportunity to show how you can do it with JSL. Next I'll show you the script we'll use to stream in the data. This is a simple script, so I'm just going to walk you through it real quick. The first set of commands will open the new data table from the SQLite database; it opens up in the background, so I have to deal with the window. Then I'm going to take pieces from this new data table and move them to the monitoring data table. I'm calling the pieces "bites," and the bite size is 20. Then this will create a database connection, which will allow me to send the database SQL statements, and this last bit of code will iteratively construct SQL statements that insert new data into the monitoring data table. So I'm going to initialize, and show you the first iteration of this loop. This is just a simple INSERT INTO statement that inserts the first 20 observations. I'll comment that out so it runs faster. And there's a wait statement down here; this will just slow down the stream so that we have enough time to see the progression of the data in the control charts. If I didn't have this, the streaming example would just be over too quickly. Okay, so I'm going to switch back to the monitoring session and show you some scripts that will update the report. I'll move this over to the right so you can see the report and the scripts at the same time. This "read from monitoring data" script is a simple script that checks the database every 0.2 seconds and adds new data to the JMP table, and since the report has automatic recalc turned on, the report will update whenever new data are added. I should add that, realistically, you probably wouldn't use a script that just iterates like this; you would probably use Task Scheduler on Windows or Automator on Macs to schedule the runs. And then the next script here will push the report to JMP Public whenever the report is updated. I was really excited that this is possible in JMP. It enables any computer with a web browser to view updates to the control chart; you can even view the report on your smartphone. So this makes it easy to share results across organizations. You could also use JMP Live if you wanted the reports to be on a restricted server. And then this script will recreate the historical data in the data table in case you want to run the example multiple times. Okay, so let's run the streaming script and look at how the report updates. You can see the data are in control at first, but then a fault is introduced and there's a large out-of-control signal; there's a plant-wide control system that's been implemented in the simulation, which brings the system to a new equilibrium. I'll give this a second to finish. And now that I've updated the control chart, I'm going to push the results to JMP Public. On my JMP Public page I have at first the control chart with the data in control at the beginning, and this should be updated with the addition of the new data. So if we zoom in on when the process first went out of control, it looks like that was sample 1125. I'm going to color that and label it so that it shows up in other plots. In the SPE plot it looks like this observation is still in control; which chart will catch faults earlier depends on your model
and how many factors you've chosen. We can also zoom in on that time point in the contribution plot, and you can see that when the process first goes out of control, there are a large number of variables contributing to the out-of-control signal; but when the system reaches a new equilibrium, only two variables have large contributions. So I'm going to remove these heat maps so that I have more room in the diagnostics section, and I've made everything pretty large so that the text shows up on your screen. If I hover over the first point that's out of control, you get a peek at the top 10 contributing variables. This is great for quickly identifying which variables are contributing the most to the out-of-control signal. I can also click on that plot and append it to the diagnostics section, and you can see that there are a large number of variables contributing to the out-of-control signal. Let me zoom in here a little bit. If one of the bars is red, that means the variable is out of control in a univariate control chart, and you can see this by hovering over the bars. I'm going to open a couple of those. These graphlets are IR charts for the individual variables with three-sigma control limits. You can see that for the stripper pressure variable, the observation is out of control in the univariate control chart, but the variable is eventually brought back under control by our control system, and that's true for most of the large contributing variables. I'll also show you one of the variables where the observation is in control. So once the control system responds, many variables are brought back under control and the process reaches a new equilibrium, but there's obviously a shift in the process. To identify the variables that are contributing to the shift, one thing you can look at is a mean contribution plot. If I sort this and look at the variables that contribute the most, it looks like just two variables have large contributions, and both of these are measuring the flow rate of reactant A in stream 1, which is coming into the reactor. These are measuring essentially the same thing, except one is a measurement variable and one is a manipulated variable. And you can see in the univariate control chart that there's a large step change in the flow rate, in this one as well, and this is the step change that I programmed into the simulation. So these contributions allow us to quickly identify the root cause. I'm going to present a few other alternate methods to identify the same cause of the shift. The reason is that in real data, process shifts are often more subtle, and some of the tools may be more useful in identifying them than others; we will consistently arrive at the same conclusion with these alternate methods, so this will show some of the ways these methods are connected. Down here I have a score plot, which can provide supplementary information about shifts in the T² plot. It's more limited in its ability to capture high-dimensional shifts, because only two dimensions of the model are visualized at a time; however, it can provide a more intuitive view of the process, as it visualizes it in a low-dimensional representation. In fact, one of the main reasons why multivariate control charts are split into T² and SPE in the first place is that this provides enough dimensionality reduction to easily visualize the process in a scatter plot.
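Earlier in this passage, Jeremy opens IR charts for individual contributing variables from inside the report. The same chart can also be launched directly; a one-line sketch is below, where the column name is a stand-in for whichever process variable you want to inspect.

// Individual and Moving Range chart for one process variable, with the default three-sigma limits.
Control Chart Builder( Variables( Y( :Stripper Pressure ) ) );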
So we want to identify the variables that are causing the shift. I'm going to color the points before and after the shift so that they show up in the score plot. Typically we would look through all combinations of the six factors, but that's a lot of score plots to look through, so something that's very handy is the ability to cycle through all the combinations quickly with this arrow down here; we can look through the factor combinations and find one where there's large separation. And if we want to identify where the shift first occurred in the score plots, we can connect the dots and see that the shift occurred around sample 1125 again. Another useful tool, if you want to identify the score dimensions where an observation shows the largest separation from the historical data without looking through all the score plots, is the normalized score plot. So I'm going to select a point after the shift and look at the normalized score plot. I'm actually going to choose another one, because I want to look at dimensions five and six. These plots show the magnitude of the score in each dimension, normalized so that the dimensions are on the same scale. And since the mean of the historical data is zero for each score dimension, the dimensions with the largest magnitude will show the largest separation between the selected point and the historical data. It looks like dimensions five and six show the greatest separation here, so I'm going to move to those. There's large separation here between our shifted data and the historical data, and the score plot visualization can also be more interpretable because you can use the variable loadings to assign meaning to the factors. Here we have too many variables to see all the labels for the loading vectors, but you can hover over them and see them, and if I look in the direction of the shift, the two variables that were the cause show up there as well. We can also explore differences between subgroups in the process with the group comparisons tool. To do that, I'll select all the points before the shift and call that the reference group, and everything after and call that the group I'm comparing to the reference. This contribution plot will give me the variables that are contributing the most to the difference between these two groups, and you can see that this also identifies the variables that caused the shift. The group comparisons tool is particularly useful when there are multiple shifts in a score plot or when you can see more than two distinct subgroups in your data. In our case, as we're comparing a group in our current data to the historical data, we could also just select the data after the shift and look at a mean contribution score plot, which will give us the average contributions of each variable to the scores in the orange group. And since large scores indicate a large difference from the historical data, these contribution plots can also identify the cause. These use the same formula as the contribution formula for T², but now we're just using the two factors from the score plot. Okay, let me find my PowerPoint again. So real quick, I'm going to summarize the key features of Model Driven Multivariate Control Chart that were shown in the demo. The platform is capable of performing both online fault detection and offline fault diagnosis.
There are many methods provided in the platform for drilling down to the root cause of the faults. I'm showing here some plots from the popular book Fault Detection and Diagnosis in Industrial Systems; throughout the book, the authors demonstrate how one needs to use multivariate and univariate control charts side by side to get a sense of what's going on in the process. One particularly useful feature of Model Driven Multivariate Control Chart is how interactive and user friendly it is to switch between these types of charts. So that's my talk. Here's my email if you have any further questions, and thanks to everyone who tuned in to watch this.
Meijian Guan, JMP Research Statistician Developer, SAS   Single-cell RNA-sequencing technology (scRNA-seq) provides a higher resolution of cellular differences and a better understanding of the function of an individual cell in the context of its microenvironment. Recently, it has been used to combat COVID-19 by characterizing transcriptional changes in individual immune cells. However, it also poses new challenges in data visualization and analysis due to its high dimensionality, sparsity, and varying heterogeneity across cell populations. JMP Project is a new way to organize data tables, reports, scripts as well as external files. In this presentation, I will show how to create an integrated Basic scRNA-seq workflow using JMP Project that performs standard exploration on a scRNA-seq data set. It first selects a set of high variable genes using a dispersion or a variance-stabilizing transformation (VST) method. Then it further reduces data dimension and sparsity by performing a sparse SVD analysis. It then generates an interactive report that consists of data overview, variable gene plot, hierarchical clustering, feature importance screening, and a dynamic violin plot on individual gene expression levels. In addition, it utilizes the R integration feature in JMP to perform t-SNE or UMAP visualizations on the cell populations if appropriate R packages are installed.     Auto-generated transcript...   Speaker Transcript Meijian Guan All right. Um, hi, everyone. Thank you so much for attending this presentation. I'm so happy that I have this opportunity to share the work I have   have been doing with JMP Life Science group and SAS Institute. So today's topic is going to be building a single-cell RNA-sequencing workflow with JMP Project. So this is a new feature I developed for JMP Genomics 10. If you don't know what is JMP Genomics, I will give you a brief   overview about it and the JMP project is a new feature, released on 14 and it's very nice tool can help you to organize your reports. So we took advantage of this new platform and organized a single-cell RNA-sequencing   workflow into it. So first of all, I just want to give you a little bit background about JMP, JMP Genomics is   one of the products from JMP family is built on top of SAS and JMP Pro. So it's taking advantage of both products which makes a very powerful analytical tool.   So it's designed for genomic data, so it can read in different types of genomic data, it can do preprocessing, it can handle next generation sequencing that analysis.   It is really good at differential gene expression and biomarker discovery, and many scientists using it for crop and livestock breeding. So it's a very powerful tool. I encourage everyone to check it out if you are doing anything related to genomics.   And next thing I want to share with you is the single-cell RNA sequencing. Many of you may not be very familiar with it.   So this is a relatively new technology used to examine that on a level from individual cells.   And comparing to the traditional RNA sequencing technology which is survey, the average expression level of a group of cells.   This, this new technology provides a higher resolution of cellular differences and it gives you a better understanding of the function of the individual cell in the context of its micro environment.   And it can help to do a lot of stuff like uncover new and rare cell populations, track trajectories of cell development, and identify differentially expresed genes between cell types. 
So it has very wide application.   One application recently is scientists using it to combat Covid 19 so it because it can be used to   characterizing transcriptional changes in immune cells and how to develop the vaccines and treatment. Also in addition to that, it's widely used in cancer research and widely used in immunology and in many other research fields. Um, so it's very powerful tool, but   it does have some challenges to analyze that data, so that's why we put together this workflow.   Just wanted to give you an overview of the top line of the single-cell RNA sequencing. So the first thing you need is to get a sample,   either from human or from animals. It could be a tumor or lab sample. And then you can isolate those samples into individual cells.   So after you isolate, you can do sequencing on every individual cell for all the genes you have. For example, in humans, we have about 30,000 genes. So the final product will look like this in our read count table. We have genes...30,000 genes in rows and we have   about sometimes half million cells as columns. So as you can see, meeting these very large data set has very high dimensions. Also you can notice the zeros in the table because   Because of the technical or biological limitations, there's no way we can detect every single gene in every single cell. So it's not uncommon to see 90% of cells actually are   zeros. So it's very sparse. Sparsity is another challenge when you analyze single cell RNA sequencing data.   But after you do preprocessing, cleaning up, and do dimension reduction, you can apply regular, like clustering and principal components, differential gene expression analysis on this data. So those will be mentioned in my workflow.   And I already mentioned this out that that I noticed challenges, including high dimensionality, high sparsity, and also there are varying heterogeneity across cell populations.   Technical noises and reproducibility, since there are so many different sequencing protocols so many different analytical packages.   In R or Python or other tools, it's very hard for you to follow exact steps to analyze your data. And if you mixed up the steps and didn't do things in correct order,   you may not be able to get a reproducible results. So that's one of the problems that we tried to solve here.   Just want to show you an example of single-cell RNA-sequencing data. This data will be used in my demonstration and it's a reduced blood sample data or ppm. See that I said we have cells in rows and genes in columns so it's   about 8000 columns and 100   rows, which would mean cells. And you can see those zeros, pretty much everywhere. I, I believe it's more than 90% of sparsity in this specific data set.   So what's in our new single-cell RNA-sequencing workflow in JMP Genomics 10? So for this workflow, we tried to build it   for those people who do not have very good technical background or not, do not have time to learn how to code and all those statistics. So in this workflow we put those steps in the right order for users to automatically execute the other steps in the workflow. And we also   provide a very interactive reports to help users navigate with us and change the parameters and check outs different selections.   So what's in this workflow, including data import progress, preprocessing and we have a variable gene selection method, which is the backbone of this workflow actually.   So it for variable gene selection that the goal for this method is to reduce a dimension of the genes.   
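As a rough sketch of the two selection criteria described next (these are the definitions commonly used in the single-cell literature, not necessarily the exact JMP Genomics implementation): the dispersion method ranks each gene by a normalized variance-to-mean ratio, while VST ranks genes by their variance after standardizing against a loess-fitted mean-variance trend:

\[
d_g \;=\; \frac{\sigma_g^{2}}{\mu_g},
\qquad
s_g^{2} \;=\; \frac{1}{n-1}\sum_{c=1}^{n}\left(\frac{x_{gc}-\mu_g}{\hat{\sigma}(\mu_g)}\right)^{2},
\]

where μ_g and σ_g² are the mean and variance of gene g across the n cells, x_gc is the count for gene g in cell c, and σ̂(μ_g) is the standard deviation predicted for that mean by the fitted trend. The top-ranked genes (for example, the 2,000 requested in the dialog) are kept for downstream analysis.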
Because for humans sample, we have 30,000 genes and not all of them are informative. So we try to pick the most informative ones.   dispersion method and   variability stabilizing transformation method based on lowest regression. So, I will not go into the details, but these two methods are widely used the research community and I'm pretty happy that we were able to reproduce them.   And we also apply sparse SVD to further reduce the dimensions. And so we also applied hierarchical clustering and a k means clustering.   We have feature importance screening using a boosting forest method in JMP and if you have R packages installed, we are directly call out to T-SNE and UMAP visualization   which is very popular using a single-cell RNA-sequencing analysis. And we also provided some dynamic visualizations including violin plot, ridgeline plot, dop plot; we also do differential gene expression. And so all the reports will be organized in a very integrated reports with JMP project.   So next I will do a demo. There are two goals in this demo. First one is to classify the cell populations in this PBMC data set. We try to find what are the cell types in this data set. The second goal is just to identify differentially expressed jeans across subtypes and conditions.   So first of all, let's go to JMP Genomics starter. So JMP Genomics interface looks quite different from regular JMP but   it's pretty easy to navigate. If you want to find the workflows basics   and basic single-cell RNA-sequencing workflow lives here, you click that you can bring up this interface. So the interface is pretty intuitive. I'll say you just provide a data set.   And you specify the QC options. What, what kinds of genes or cells do you want to remove for your analysis and variable gene selections. Which one method that you want to use, right. If you select a VST, you can also specify the number of genes you want to keep, 2000 or 3000   And the clustering options, right, how many principal components you want to use for the clustering and   either you wants hierarchical or k means clustering algorithms. And the more options, we have marker genes   to help you to add a list of marker genes you want to use to identify the cell populations, which is very handy tool here. And you can launch ANOVA and differential expression analysis. So this is a separate report.   I will not discuss this in this talk. So another thing we had is experiment example, right. If you add that basically you can provide any information related to start design like treatment information,   sex information. So this is   the simulated data here. I would just want to show you how we're gonna compare the gene expression levels and different measurements between groups.   And finally, we have embedding options which can call out to t-SNE or UMAP R packages if you have them installed. You can change different parameters for this to our algorithms.   So after you specify all those options, just go to run and then you have the report that looks like this one. So this is a   tabular report. There are a total of seven tabs in this report. I organized them in the order that you want to   how many genes in   in the cells, how many read counts or what's the percentage of mitochondria gene counts in your data and the correlations between this three measurements.   And we, you notice this left side, we have the action box. You can expand it and find options in this, in this box you can do many things with it.   
In in this tab, specifically, you can split the graph, based on the conditions you provided in the experimental design file. For example, we can do a treatment and we split   Drug1, Drug2, placebo. Then you can see if there's any difference between different groups, right, and we can unsplit if you want to go back to our original plot.   And the second tab is variable gene selection, which is the backbone of this workflow. The red dots mean those genes I selected for subsequent analysis and these   gray dots are the genes that will be discarded in analysis. And if you expand action box, you can see, since we use the VST, we specified 2000 genes   in this analysis. But if you change your mind, you can, you can, whenever you change your mind, you can type in a different number of genes and then click OK. So all the tabs will be refreshed as based on this new number.   So after you have a list of variable genes, what you are going to do is to further reduce the dimensions by performing sparse SVD analysis, which is equivalent to principal component analysis.   So after you apply SVD analysis, you can plot out the top two SVDs or principal components. Try to check the global structure of your data set.   So in this case, we can see there are two big groups in this data set, which is interesting. And also we provide a 3D plot to help you to further explore   your data. Sometimes there's, there are some insights that you cannot, that you cannot identify in a 2D plot; 3D plots sometimes can really provide additional value.   And we have those SVDs, depending on how many you selected (20 or 30), you can use them to   perform clustering. In this case we selected hierarchical clustering and we find nine clusters in your data set.   In addition to dendogram, we also offer a constellation plot   which I really like because this plot is similar to t-SNE or UMAP. It gives you a better idea about the distance between different groups, right.   For example, there are three groups, big groups, three clusters   Kind of distinct from other groups. And if you want to see where are they in the global structure here, this is the top three clusters I detect. Look at 3D plots, we can see   all those highlighted ones are over here. And then go to 2D again, this is one of two big clusters, you know, that I said I highlighted so it's interactivity really help you to   observe and visualize your data in multiple ways. And we also provide a parallel plot to help you to further identify the different patterns across different groups.   And the next tab is embedding, which means t-SNE and UMAP parts if you have R packaging installed. I will   call out to R, run the analysis, and bring back the data and visualize it in JMP. So here is a t-SNE plot. We have nine clusters very nicely being separated.   On the bottom is exactly the same plot but this time I colored them with the marker genes you provided; we have 14 marker genes. So you can, using this feature, switch to click through   to see where these genes are expressed, right. For example, there's a GNLY gene, highly expressed in this little cluster and we are wondering, what's our data? We select and go back and now we see all, most of them are from cluster nine, cluster eight.   So GNLY is a gene for NK cells. This is a marker gene for NK cells. So now we have idea about what a group of this cell is, right.   And also we have action buttons here, help you to do more things. If you want to switch to UMAP, if you prefer that, you can do it. 
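For readers curious what the call-out to R looks like in general, JMP's R integration in JSL follows the pattern sketched below. This is a minimal, hypothetical example assuming the Rtsne package is installed and that the retained SVD scores are in the current data table; the actual JMP Genomics workflow wraps these steps for you.

// Minimal JSL sketch of a round trip to R for a t-SNE embedding (hypothetical)
scores = Current Data Table() << Get As Matrix;   // numeric matrix of the retained SVD/PC scores
R Init();                                         // start the R session
R Send( scores );                                 // make the matrix available to R as 'scores'
R Submit( "library(Rtsne); set.seed(1); emb <- Rtsne(scores, dims = 2)$Y" );
emb = R Get( emb );                               // bring the two-column embedding back to JSL
R Term();                                         // close the R session
// emb can then be added as two new columns and plotted in Graph Builder, colored by cluster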
Now the plots are associated with UMAP. It's the exact same thing, but UMAP gives you a little bit better separation and can preserve more global structure in the visualization. We also provide some ways to help you remove cells that might be contaminated or have quality problems. For example, if we don't like a group of cells here, we can remove them from the visualization to make it cleaner, but you can always bring them back. And again, we can split the plots with the split graph button. This time we can split by gender, female and male, so we can compare the gene expression levels across the gender groups, which is pretty useful sometimes. And we can unsplit. The next tab provides more tools to visualize gene expression levels in the nine groups of cells. The first plot is a violin plot. Again, we have a feature switcher to help you go through all those genes in the different clusters. Depending on how tall each part of the graph is and what its density looks like, you can clearly see where those genes are highly expressed. For example, using the gene GNLY again, you can see it's highly expressed in cluster eight. In the middle, the second plot we provide is a ridgeline plot. A ridgeline plot organizes the clusters on the Y axis and the gene expression level on the X axis, but it basically shows you a similar thing, depending on what you prefer. For GNLY, again, we can see that cluster eight has GNLY highly expressed but not the other clusters. At the bottom we have another plot called a dot plot. This is a new plot we just added to this report. In addition to showing you the gene expression levels, the dot plot can also show you the percentage of the cells expressing that gene. For example, take a look at the PPBP gene. We can see that 100% of the cells in cluster seven express this PPBP gene, so this gene is the marker gene for PPBP cells. Now it's very clear that cluster seven is one type of blood cell, a PPBP cell. If you look at others, for example cluster two, only 12% of the cells express this gene. So there might be some contamination, but that group of cells is definitely not PPBP cells. This plot shows you both the expression level and the expression percentage in each cluster, which offers additional information. The next tab is also very useful. It's called feature screening. What it does is fit a boosting forest algorithm and then use the genes to predict the clusters, so the most important genes, the ones that contribute to the separation of the cells, are ranked in this table. The correct way to view these genes is to open the action box, select the top genes you want to visualize, maybe the top 35, and click OK. The next tab will then show only the 35 genes you selected. Those are the most informative genes; they can explain why those different groups of cells are separated. So again, you can click through the feature switcher and try to see the patterns. And if you notice, a lot of these genes, LYZ, CST3 and NKG7, are already in the marker genes we provided, which means this feature screening method is really successful at picking up the most important genes in your data set.
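The feature screening above is built into JMP Genomics. As a rough stand-in in plain JMP (not the same implementation), the Predictor Screening platform ranks predictors by their contribution in a bootstrap forest; a hypothetical call with the cluster label as the response and a few of the gene columns named above as predictors might look like this:

// Hypothetical sketch: rank genes by how well they predict cluster membership
// (Predictor Screening uses a bootstrap forest; the JMP Genomics report may use a different forest method)
dt = Current Data Table();
dt << Predictor Screening(
	Y( :Cluster ),                      // assumed name of the cluster label column
	X( :GNLY, :LYZ, :CST3, :NKG7 )      // in practice, all of the selected variable genes
);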
Another way you can do visualization is through the GTEx database. GTEx is a tissue-specific database that tells you which genes are expressed in which tissues of the human body. We can directly send the gene list to the database; you just click OK, we open the GTEx website, and it provides you a heat map with the top 35 genes. Now you can see where they are expressed across human tissues and organs, which is a very convenient way to see additional information. So now, with those marker genes being used, you can probably identify what groups of cells they are. There's one function here called recode. When you open it, you can recode those cluster numbers into actual cell names. For example, we already know that eight is NK cells. I already have names for every single one of them, so I just type them in: Monocytes; two is DC cells; three is FCGR3A+ monocytes; these are Naive CD4+ T cells; group five is Memory CD4+ T cells; group six is CD8+ T cells; seven is PPBP, as we already saw; and nine is B cells. With those entered, we click recode. Since all the plots and tabs are connected, you can now see that all the numbers have changed into actual cell names. It just helps you explore your data more easily: you know what those cells are, and you can do some more exploration in your plots, including the clustering plots, where you can see the cluster names have been changed into the actual cell names. So that's it for today's topic. If you have any questions, you can send me an email or leave a message on the JMP Community. Thank you so much for your time.
John Cromer, Sr. Research Statistician Developer, JMP   While the value of a good visualization in summarizing research results is difficult to overstate, selection of the right medium for sharing with colleagues, industry peers and the greater community is equally important. In this presentation, we will walk through the spectrum of formats used for disseminating data, results and visualizations, and discuss the benefits and limitations of each. A brief overview of JMP Live features sets the stage for an exciting array of potential applications. We will demonstrate how to publish JMP graphics to JMP Live using the rich interactive interface and scripting methods, providing examples and guidance for choosing the best approach. The presentation culminates with a showcase of a custom JMP Live publishing interface for JMP Clinical results, including the considerations made in designing the dialog, the mechanics of the publishing framework, the structure of JMP Live reports and their relationship to the JMP Clinical client reports and a discussion of potential consumption patterns for published reviews.     Auto-generated transcript...   Speaker Transcript John Cromer Hello everyone, Today I'd like to talk about two powerful products that extend JMP in exciting ways. One of them, JMP Clinical, offers rich visualization, analytical and data management capabilities for ensuring clinical trial safety and efficacy. The other, JMP Live, extends these visualizations to a secure and convenient platform that allows for a wider group of users to interact with them from a web browser. As data analysis and visualization becomes increasingly collaborative, it is important that both creating and sharing is easy. By the end of this talk, you'll see just how easy it is. First, I'd like to introduce the term collaborative visualization. Isenberg, et al., defines it as the shared use of computer supported interactive visual representations of data on more than one person with a common goal of contribution to join information processing activities. As I'll later demonstrate, this definition captures the essence of what JMP, JMP Clinical and JMP Live can provide. When thinking about the various situations in which collaborative visualization occurs, it is useful to consult the Space Time Matrix. In the upper left of this matrix, we have the traditional model of classroom learning and office meetings, with all participants at the same place at the same time. Next in the upper right, we have participants at different places interacting with the visualization at the same time. In the lower left, we have participants interacting at different times at the same location, such as in the case of shift workers. And finally, in the lower right, we have flexibility in both space and time with participants potentially located anywhere around the globe and interacting with the visualization at any time of day. So JMP Live can facilitate this scenario. A second way to slice through the modes of collaborative visualization is by thinking about the necessary level of engagement for participants. When simply browsing a few high-level graphs or tables, sometimes simple viewing can be sufficient. But with more complex graphics and for those in which the data connections have been preserved between the graphs and underlying data tables, users can greatly benefit by also having the ability to interact with and explore the data. 
This may include choosing a different column of interest, selecting different levels in a data filter, and exposing detailed data point hover text. Finally, authors who create visualizations often have a need to share them with others and, by necessity, will also have the ability to view, interact with and explore the data. JMP Live is well suited for viewing, interacting and exploring, and JMP and JMP Clinical for authors who require all of these abilities. A third way to think about formats and solutions is by the interactivity spectrum. Static reports, such as PDFs, are perhaps the simplest and most portable, but generally the least interactive. Interactive HTML, also known as HTML5, offers responsive graphics and hover text. JMP Live is built on an HTML5 foundation, but also offers server-side computations for regenerating the analysis. While the features of JMP Live will continue to grow over time, JMP offers even more interactivity. And finally, there are industry-specific solutions such as JMP Clinical, which are built on a framework of JMP and SAS and offer all of JMP's interactivity with some additional specialization. So when we lay these out on the interactivity spectrum, we can see that JMP Live fills the sweet spot of being portable enough for those with only a web browser to access, while offering many of the prime interactive features that JMP provides. So the product that I'll use to demonstrate creating a visualization is JMP Clinical. JMP Clinical, as I mentioned before, offers a way to conveniently assess clinical trial safety and efficacy. With several role-based workflows for medical monitors, writers, clinical operations and data managers, and three review templates, predefined or custom workflows can be conveniently reused on multiple studies, producing results that allow for easy exploration of trends and outliers. Several formats are available for sharing these results, from static reports and the in-product review viewer to, new to JMP Clinical, JMP Live reports. The product I'll use to demonstrate interacting on a shared platform is JMP Live. JMP Live allows users with only a web browser to securely and conveniently interact with the visualizations, and it lets you specify access restrictions for who can view both the graphics and the underlying data tables. With the ability to publish a local data filter and column switcher, the view can be refreshed in just a matter of seconds. Users can additionally organize their web reports through titles, descriptions and thumbnails, and leave comments that facilitate discussion between all interested parties. So explore the data on your desktop with JMP or JMP Clinical, publish to JMP Live with just a few quick steps, share the results with colleagues across your organization, and enrich the shared experience through communication and automation. So now I would like to demonstrate how to publish a simple graphic from JMP to JMP Live. I'm going to open the demographics data set from the sample study Nicardipine, which is included with JMP Clinical. I can do this either through the File > Open menu, where I can navigate to my data set, or with a short script: dt = Open(), followed by the path to my data table. So I'm going to click run script to open that data table. Okay. Now I'd like to create a simple visualization. Let's say I'd like a simple box plot. I'll click Graph, Graph Builder. Here I have a dialog for moving variables into roles. I'm going to move the study site identifier into the X role, age into Y, click box plot, and click Done.
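The scripted version of these steps, which the demo turns to next, might look roughly like the sketch below. The path placeholder and column names are taken from the narration and are assumptions; the presenter's actual script may differ in its options.

// Sketch: box plot of Age by Study Site Identifier wrapped with a local data filter
dt = Open( "<path to the Nicardipine demographics table>" );   // placeholder path
New Window( "Age by Study Site",
	Data Filter Context Box(
		H List Box(
			dt << Data Filter( Local, Add Filter( Columns( :Age, :Study Site Identifier ) ) ),
			dt << Graph Builder(
				Variables( X( :Study Site Identifier ), Y( :Age ) ),
				Elements( Box Plot( X, Y ) )
			)
		)
	)
);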
So here's one quick and easy way to create a visualization in JMP. Alternatively, I can do the same thing with the script. And so this block of code I have here, this encapsulates a data filter and a Graph Builder box plot into a data filter context box. So I'm going to run this block of code. And here you see, I have some filters and a box plot. Now, notice how interactive this filter is and the corresponding graph. I can select a different lower bound for age; I can type in a precise value, let's say, I'd like to exclude those under 30 and suppose I am interested in only the first 10 study side identifiers. OK. So now I'd like to share this visualization with some of my colleagues who don't have JMP but they have JMP Live. So one way to publish this to JMP Live is interactively through the file published menu. And here I have options for for my web report. Can see I have options for specifying a title, description. I can add images. I can choose who to share this report with. So at this point, I could publish this, but I'd like to show you how to do so using the script. So I have this chunk of code where I create a new web report object. I add my JMP report to the web report object. I issue the public message to the web report, and then I automatically open the URL. So let me go ahead and run that. You can see that I'm automatically taken to JMP Live with a very similar structure as my client report. My filter selections have been preserved. I can make filter selection changes. For example, I can move the lower bound for age down and notice also I have detailed data point hover text. I have filter-specific options. And I also have platform-specific options. So any time you see these menus. You can further explore those to see what options are available. Alright, so now that you've seen how to publish a simple graphic from JMP to JMP Live. How about a complex one, as in the case of a JMP Clinical report. So what I'm going to do is open a new review. I will add the adverse events distribution report to this review. I will run it with all default settings. And now I have my adverse events distribution report, which consists of column switchers for demographic grouping and stalking, report filters, an adverse events counts graph, tabulate object for counts and some distributions. Suppose I'm interested in stacking my adverse events by severity. I've selected that and now I have my stoplight colors that I've set for my adverse events for mild, moderate and severe. At this point I'm...I'd like to share these results with a colleague who maybe in this case has JMP, but there are certain times where they prefer to work through a web browser to to inspect and take a look at the visualizations. So this point, I will click this report level create live report button. I will... ...and that...and now I have my dialogue, I can choose to publish to either file or JMP Live. I can choose whether to publish the data tables or not, but I would always recommend to publish them for maximum interactivity. I can also specify whether to allow my colleagues to download the data tables from JMP Live. In addition to the URL, you can specify whether to share the results only with yourself, everyone at your organization or with specific groups. So for demonstration purposes, I will only publish for myself. I'll click OK. Got a notification to say that my web report has been published. Over on JMP Live, I have a very similar structure. 
At my report filters, my column switchers with my column, a column of interest preserved. You can see my axes and legends and colors have also carried over. Within this web report, I can easily collapse or expand particular report sections, and many of the sections off also offer detailed data point hover text and responsive updates for data filter changes. Another thing I'd like to point out is this Details button in the upper right of the live report, where I can get detailed creation information, a list of the data tables that republished, as well as the script. And because I've given users the ability to download these tables and scripts, these are download buttons for those for that purpose. I can also leave comments from my colleagues that they can then read and take further action on, for example, to follow up on an analysis. All right, so from my final demo, I would simply like to extend my single clinical report to a review consisting of two other reports enrollment patterns, and findings bubble plot. So I'm going to run these reports. Enrollment patterns plots patient enrollment over the course of a study by things like start date of disposition event, study day and study site identifier. Findings bubble plot, I will run on the laboratory test results domain. And this report features a prominent animated bubble plot, in which you can launch this animation. You can see how specific test results change over the course of a study. You can pause the animation. You can scroll to specific, precise values for study day and you can also hover over data points to reveal the detailed information for each of those points. create live report for review. I have a...have the same dialogue that you've seen earlier, same options, and I'm just going to go ahead and publish this now so you can see what it looks like when I have three clinical reports bundled together and in one publication. So when this operation completes, you will see that will be taken to an index page corresponding to report sections. And each thumbnail on this page corresponds to report section in which we have our binoculars icon on the lower left, that indicates how many views each page had. I have a three dot menu, where you can get back to that details view. If you click Edit, from here you can also see creation information and a list of data tables and scripts. And by clicking any of these thumbnails, I can get down to the report, the specific web report of interest. So just because this is one of my favorite interactive features, I've chosen to show you the findings bubble plot on JMP Live. Notice that it has carried over our study day, where we left off on the client, on study day 7. I can continue this animation. You can see study day counting up and you can see how our test results change over time. I can pause this again. I can get to a specific study day. I can do things like change bubble size to suit your preference. Again, I have data point hover text, I can select multiple data points and I have numerous platform specific options that will vary, but I encourage you to take a look at these anytime you see this three dot menu. So to wrap up, let me just jump to my second-last slide. So how was all this possible? Well, behind the scenes, the code to publish a complex clinical report is simply a JSL script that systematically analyzes a list of graphical report object references and pairs them with the appropriate data filters, column switchers, and report sections into a web report object. 
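At its core, that pattern reduces to a few JSL messages, the same ones described for the simple example earlier: create a web report object, add reports to it, publish, and open the URL. This is a minimal sketch of that pattern, not the presenter's actual publishing framework:

// Minimal JSL sketch of publishing open reports to JMP Live
webreport = New Web Report();       // container for one or more report sections
webreport << Add Report( gb );      // gb = a reference to an open platform report (e.g., Graph Builder)
url = webreport << Publish();       // publish to the connected JMP Live server and return the URL
Web( url );                         // open the published report in a browser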
The JSL publish command takes care of a lot of the work for you, for bundling the appropriate data tables into the web report and ensuring that the desired visibility is met. Power users who have both products can use the download features on JMP Live to conveniently share to conveniently adjust the changes ...to to... make changes on their clients and to update their... the report that was initially published, even if they were not the original authors. And then the cycle can continue, of collaboration between those on the client and those on JMP Live. So, as you can see, both creating and sharing is easy. With JMP and JMP Clinical, collaborative visualization is truly possible. I hope you've enjoyed this presentation, and I look forward to any questions that you may have.  
Mike Anderson, SAS Institute, SAS Institute Anna Morris, Lead Environmental Educator, Vermont Institute of Natural Science Bren Lundborg, Wildlife Keeper, Vermont Institute of Natural Science   Since 1994, the Vermont Institute of Natural Science’s (VINS) Center for Wild Bird Rehabilitation (CWBR), has been working to rehabilitate native wild birds in the northeastern United States. One of the most common raptor patients CWBR treats is the Barred Owl. Barred Owls are fairly ubiquitous east of the Rocky Mountains. Their call is the familiar “Who cooks for you, who cooks for you all.” They have adapted swiftly to living alongside people and, because of this, are commonly presented to CWBR for treatment. As part of a collaboration with SAS, technical staff from JMP and VINS have been analyzing the admission records from the rehabilitation center. Recently we have used a combination of Functional Data Analysis, Bootstrap Forest Modeling, and other techniques to explore how climate and weather patterns can affect the number of Barred Owls that arrive at VINS for treatment — specifically for malnutrition and related ailments. We found that a combination of temperature and precipitation patterns results in an increase in undernourished Barred Owls being presented for treatment. This session will discuss our findings, how we developed them, and potential implications in the broader context of climate change in the Northeastern United States.       Auto-generated transcript...   Speaker Transcript Mike Anderson Welcome, everyone, and thank you for joining us. My name is Anna Morris and I'm the lead environmental Educator at the Vermont Institute of Natural Science or VINS in Quechee, Vermont. I'm Bren Lundborg, wildlife keeper at VINS center for wildlife rehabilitation and I'm Mike Anderson JMP systems engineer at SAS. We're excited to present to you today our work on the effects of local weather patterns on the malnutrition and death rates of wild barred owls in Vermont. This study represents 18 years of data collected on wild owls presented for care at one avian rehabilitation clinics and unique collaboration between our organization and the volunteer efforts of Mike Anderson at JMP. Let's first get to know our study species, the barred owl, with the help of a non releasable rehabilitative bird serving as an education ambassador at the VINS nature center. Yep, this owl was presented for rehabilitation in Troy, New Hampshire in 2013 and suffered eye damage from a car collision from which she was unable to recover. Barred owls like this one are year-round residents of the mixed deciduous forests of New England, subsisting on a diet that includes mammals, birds, reptiles, amphibians, fish and a variety of terrestrial and aquatic invertebrates. However, the prey they consume differs seasonally, with small mammals composing a larger portion of the diet in the winter. Their hunting styles differ in winter as well, due to the presence of snowpack, which can shelter small mammals from predation. Barred owls are known to use the behavior of snow punching or pouncing downward through layers of snow to catch prey detected auditorially. Here's a short video demonstrating this snow punching behavior. I've seen in that quick clip barred owls can be quite tolerant of human altered landscapes, with nearly one quarter of barred owl nests utilizing human structures. 
There are also the most frequently observed owl species by members of the public in Vermont, according to the citizen science project, iNaturalist, with 468 research grade observations of wild owls logged. As such, barred owls are commonly presented to wildlife rehabilitation clinics by people who discover injured animals. The Vermont Institute of Natural Sciences Center for wild bird rehabilitation or CWBR is the federal and state licensed wildlife rehabilitation facility located in Quechee, Vermont. All wild living avian species that are legal to rehabilitate in the state are submitted as patients to CWBR and we received an average of 405 patients yearly from 2001 to 2019, representing 193 bird species. 90% of patients presented at CWBR come from within 86 kilometers of the facility. Of the patients admitted during the 18 year period of the study, 11% were owls of the order Strix varia, comprising six species, with barred owls being the most common. However, year to year, the number of barred owls received as patients by CWBR has varied widely compared to another commonly received species, the American Robin. Certain years, such as the winter of 2018 to 2019 had been anecdotally considered big barred owl years by CWBR staff and other rehabilitation centers in the Northeastern US for the large number presented as patients. One explanation proposed by local naturalists attributes the big year phenomenon to shifts in weather patterns. When freeze/thaw cycles occur over short time scales, these milder, wetter winters are thought to pose challenges to barred owls relying on snow plunging for prey capture. Specifically the formation of a layer of ice on top of the snow can prevent owls from capturing prey using this snow plunging technique as the owls may not be able to penetrate this ice layer. In order to feed...I lost my place. In order to feed the animals may therefore use alternative hunting locations or styles or suffer from weakness due to malnutrition, which could lead to adverse interactions with humans, resulting in injury. This study was undertaken to determine if a relationship exists between higher than average winter precipitation and the number of barred owls presented during those years at CWBR for rehabilitation. Though there are several possible explanations for the variation in the number of patients associated with regional weather, we sought to determine if there was support for the ice layer hypothesis by further investigating whether barred owls presented during wetter winters exhibited malnutrition as part of the intake diagnosis in greater proportion than in dryer winters. This would suggest that obtaining food was a primary difficulty, leading to the need for rehabilitation, rather than a general population increase, which would likely lead to a proportional increase in all intake categories. Initially we expected that there would be a fairly simple time series analysis relationship to this. We went and looked at the original data for the admissions and just to compare as, as Bren said, just to compare the data between the barred owls and the American robins, you can see for bad years, which I've marked here in blue, except for the the gray one which is actually had a hurricane involved, we can see there's a very strong periodic signal associated with the robins. We can see that the year-round resident barred owls should have something resembling a fairly steady intake rate, but we see some significant changes in that year to year. 
Looking at the contingency analysis, we can see that the green bands, the starvation, correlates fairly nicely with those years where we have big barred owl years. Again, pointing out 2008, 2015, 2019, these being ski season years instead, which I'll make clear in a moment. 2017 doesn't show up, but it does have a big band of unknown trauma and cause, and that was from a difference in how they were triaging the incoming animals that year. The one, the one trick to working with this is that we needed to use functional data analysis to be able to take the year over year trends and turn them into a signal that we can analyze effectively against weather patterns and other data that we were able to find. Looking here, it's fairly easy to see that those years that we would call bad years have a very distinctive dogear...dogear...dogleg type pattern. You can see 2008, 2017, 2019, 2015. Again, Most importantly, those signals tend to correlate most strongly with this first Eigen function in our principal component analysis. You can see quite clearly here that component one does a great job with discriminating between the good years and the bad years with that odd hurricane year right in the middle where it should be. You can also look at the profiler for FPC one and you can see that as we raise and lower that profiler, we see that dogleg pattern become more pronounced. The next question is, is how do we get the data for that kind of an analysis? How do we get the weather data that we think is important? Well, it turns out that there's a great organization that's a ski resort about 20 miles away from here that has been collecting data from as far back as the 50s. And they've also been working with naturalists and conservation efforts, providing their ecological or their environmental data to researchers for different projects, and they gave us access to their database. This is an example of base mountain temperature at Killington, Vermont, and you can see that the the bad years, again colored in blue here, tend to have a flatter belly in their low temperature. You can see for instance, looking at 2007, the first one in the upper left corner, you can see that there's a steep drop down, followed by a steep incline back up. Whereas 2008, which is one of the bad years for for owl admissions, we have a fairly flat, and if not maybe in a slightly inverted peak in the middle. And that's fairly consistent, with the exception of maybe 2015, throughout the other throughout the other the other years. So I took all of that data and used functional data explorer to get the principal components for our responses. We end up having, therefore, a functional component on the response and a functional component on the factors. This is an example of one of those for the...what turns out to be one of the driving factors of this analysis, and you can see it does a very nice job of pulling out the principal components. The one we're going to be interested in in a moment is this Eigenfunction4. It doesn't look like much right now, but it turns out to be quite important. So let's put all this together. I use the combination of generalized regression, along with the autovalidation strategy that was pioneered by Gotwalt and Ramsay a few years ago to build a model of this of the behavior. We can see we get a fairly good actual by predictive plot for that. We get a nice r square around 99% and looking at the reduced model, we see that we have four primary drivers, the cumulative rain that shows up. That makes sense. 
We can't have rain without...we can't have ice without rain. Also a temperature factor, we need temperature to have a strong...to have ice. But also we have the sum of the daily snowfall or the daily snowfall. That's a max total snowfall per year, and the sum of the daily...the daily rainfall as well. And taking all of this, we can put together start to put together a picture of what bad barred owl years look like from a data driven standpoint. We can see fairly clearly. I'm going to show you first again what a bad barred...what a bad year looks like from the standpoint of the of the of the the admission rates. And we can see here. Let me show you what a bad, bad year looks like. That's a bad year; that's a good year, fairly dramatic difference. Now we're going to have to pay fairly close attention to the... We're gonna have to pay fairly close attention to the other factors to see because it's a very subtle change in the the temperatures, in the rain falls that trigger this good year/bad year. It's it's kind of interesting how how tiny the effects are. So first, this is the total snowfall per year. And we're going to pay attention to the slope of this curve for a good year and then for a bad year. Fairly tiny change, year over year. So it's a it's a subtle change, but that subtle change is one of the big drivers. We need to have a certain amount of snowfall present in order to facilitate the snow diving. The other thing, if we look at rain, we're going to look at the belly of this rainfall right here, around, around week 13 in the in the ski season. There's a good year. And there's a bad year. Slightly more rain earlier in the year, and with a flatter profile going into spring. And again, looking at the cumulative rain over the season, a good year tends to be a little bit drier than a bad year. And lastly, most importantly, the temperature. This one is actually fairly...this is that belly effect that we were seeing before. We see in early years or in good years that we have that strong decline down and strong climb out in the temperature, but for bad years we get just slightly more bowlshaped effect overall. And I'm going to turn it over to Bren to talk about what that means in terms of barred owl malnutrition. Malnutrition has a significant negative impact upon survival of both free ranging owls and those receiving treatment at a rehabilitation facility. Detrimental effects include reduced hunting success, lessened ability to compete with other animals or predator species for food, and reduced immunocompetence. Some emaciated birds are found too weak to fly and are at high risk for complications such as refeeding syndrome during care. For birds in care, the stress of captivity, as well as healing from injuries such as fractures and traumatic brain injuries can double the caloric needs of the patient, thus putting further metabolic stress on an already malnourished bird. Additionally, scarcity or unavailability of food may push owls closer to human populated areas, leading to increased risk for human related causes of mortality. Vehicle strikes are the most common cause of intakes for barred owls in all years and hunting near roads and human occupied habitats increases that risk. In the winter of 2018 to 2019, reports of barred owls hunting at bird feeders and stalking domestic animals, such as poultry, were common. 
Hunting at bird feeders potentially increases exposure to pathogens, as they are common sites of disease transmission, it may lead to higher rates of infectious diseases such as salmonellosis and trichomoniasis. Difficult winters also provide extra challenges for first year barred owls. Clutching barred owls are highly dependent on their parents and will remain with them long after being able to fly and hunt. And once parental support ends, they are still relatively inexperienced hunters facing less prey availability and harsher conditions in their first winter. Additionally, the lack of established territories may lead them to be more likely to hunt near humans, predisposing them to risks such as vehicle collision related injuries. Previous research on a close relative of the barred owl, the northern spotted owl of the Pacific Northwest, shows a decline in northern spotted owl in fecundity and survival associated with cold, wet weather in winter and early spring. In Vermont, the National Oceanic and Atmospheric Administration has projected an increase in winter precipitation of up to 15% by the middle of the 21st century, which may have specific impacts on populations of barred owls and their prey sources. The findings of this study provide important implications for the management of barred owl populations and those of related species in the wake of a change in climate. Predicted changes to regional weather patterns in Vermont and New England forecast that cases of malnourished barred owls will only increase in frequency over the next 20 to 30 years as we continue to see unusually wet winters. Barred owls, currently listed by the International Union for Conservation of Nature as a species of least concern with a population trend that is increasing, will likely not find themselves threatened with extinction rapidly. However, ignoring this clear threat to local populations may cascade through the species at large and exacerbate the effects of other conservation concerns, such as accidental poisoning and nest site loss. These findings also highlight the need for protocols to be established on the part of wildlife rehabilitators and veterinarians for the treatment of severe malnourishment in barred owls, such as to avoid refeeding syndrome, and provide the right balance of nutrients for recovery from an often lethal condition. Rehabilitation clinics would benefit from a pooling of knowledge and resources to combat this growing issue. Finally, this study shows yet another way in which climate change is currently affecting the health of wildlife species around us. Individual and community efforts to reduce human impacts on the climate will not be sufficient to reduce greenhouse gas emissions at the scale necessary to halt or reverse the damage that has been done. Action on the part of governments and large corporations must be taken, and individuals and communities have the responsibility to continue to demand that action. We would like to thank the staff and volunteers at the Vermont Institute of Natural Science, as well as at JMP, who helped collect and analyze the data presented here, especially Gray O'Tool. We'd also like to thank the Killington Ski Resort for providing us with the detailed weather data. Thank you.  
Roland Jones, Senior Reliability Engineer, Amazon Lab126 Larry George, Engineer who does statistics, Independent Consultant Charles Chen SAE MBB, Quality Manager, Applied Materials Mason Chen, Student, Stanford University OHS Patrick Giuliano, Senior Quality Engineer, Abbott Structural Heart   The novel coronavirus pandemic is undoubtedly the most significant global health challenge of our time. Analysis of infection and mortality data from the pandemic provides an excellent example of working with real-world, imperfect data in a system with feedback that alters its own parameters as it progresses (as society changes its behavior to limit the outbreak). With a tool as powerful as JMP it is tempting to throw the data into the tool and let it do the work. However, using knowledge of what is physically happening during the outbreak allows us to see what features of the data come from its imperfections, and avoid the expense and complication of over-analyzing them. Also, understanding of the physical system allows us to select appropriate data representation, and results in a surprisingly simple way (OLS linear regression in the ‘Fit Y by X’ platform) to predict the spread of the disease with reasonable accuracy. In a similar way, we can split the data into phases to provide context for them by plotting Fitted Quantiles versus Time in Fit Y by X from Nonparametric density plots. More complex analysis is required to tease out other aspects beyond its spread, answering questions like "How long will I live if I get sick?" and "How long will I be sick if I don’t die?". For this analysis, actuarial rate estimates provide transition probabilities for Markov chain approximation to SIR models of Susceptible to Removed (quarantine, shelter etc.), Infected to Death, and Infected to Cured transitions. Survival Function models drive logistics, resource allocation, and age-related demographic changes. Predicting disease progression is surprisingly simple. Answering questions about the nature of the outbreak is considerably more complex. In both cases we make the analysis as simple as possible, but no simpler.     Auto-generated transcript...   Speaker Transcript Roland Jones Hi, my name is Roland Jones. I work for Amazon Lab 126 is a reliability engineer.   When myself and my team   put together our abstracts for the proposal at the beginning of May, we were concerned that COVID 19 would be old news by October.   At the time of recording on the 21st of August, this is far from the case. I really hope that by the time you watch this in October, there will...things will be under control and life will be returning to normal, but I suspect that it won't.   With all the power of JMP, it is tempting to throw the data into the tool and see what comes out. The COVID 19 pandemic is an excellent case study   of why this should not be done. The complications of incomplete and sometimes manipulated data, changing environments, changing behavior, and changing knowledge and information, these make it particularly dangerous to just throw the data into the tool and see what happens.   Get to know what's going on in the underlying system. Once the system's understood, the effects of the factors that I've listed can be taken into account.   Allowing the modeling and analysis to be appropriate for what is really happening in the system, avoiding analyzing or being distracted by the imperfections in the data.   It also makes the analysis simpler. 
The overriding theme of this presentation is to keep things as simple as possible, but no simpler.   There are some areas towards the end of the presentation that are far from simple, but even here, we're still working to keep things as simple as possible.   We started by looking at the outbreak in South Korea. It had a high early infection rate and was a trustworthy and transparent data source.   Incidentally, all the data in the presentation comes from the Johns Hopkins database as it stood on the 21st of August when this presentation was recorded.   This is a difficult data set to fit a trend line to.   We know that disease naturally grows exponentially. So let's try something exponential.   As you can see, this is not a good fit. And it's difficult to see how any function could fit the whole dataset.   Something that looks like an exponential is visible here in the first 40 days. So let's just fit to that section.   There is a good exponential fit. Roland Jones What we can do is partition the data into different phases and fit functions to each phase separately.   1, 2, 3, 4 and 5.   Partitions were chosen where the curve seem to transition to a different kind of behavior.   Parameters in the fit function were optimized for us in JMP' non linear fit tool. Details of how to use this tool are in the appendix.   Nonlinear also produced the root mean square error results, the sigma of the residuals.   So for the first phase, we fitted an exponential; second phase was logarithmic; third phase was linear; fourth phase, another logarithmic; fifth phase, another linear.   You can see that we have a good fit for each phase, the root main square error is impressively low. However, as partition points were specifically chosen where the curve change behavior, low root mean square area is to be expected.   The trend lines have negligible predictive ability because the partition points were chosen by looking at existing data. This can be seen in the data present since the analysis, which was performed on the 19th of June.   Where extra data is available, we could choose different partition points and get a better fit, but this will not help us to predict beyond the new data.   Partition points do show where the outbreak behavior changes, but this could be seen before the analysis was performed.   Also no indication is given as to why the different phases have a different fit function.   This exercise does illustrate the difficulty of modeling the outbreak, but does not give us much useful information on what is happening or where the outbreak is heading. We need something simpler.   We're dealing with a system that contains self learning.   As we as society, as a society, learn more about the disease, we modify behavior to limited spread, changing the outbreak trajectory.   Let's look into the mechanics of what's driving the outbreak, starting with the numbers themselves and working backwards to see what is driving them.   The news is full of COVID 19 numbers, the USA hits 5 million infections and 150,000 deaths. California has higher infections than New York. Daily infections in the US could top 100,000 per day.   Individual numbers are not that helpful.   Graphs help to put the numbers into context.   The right graphs help us to see what is happening in the system.   Disease grows exponentially. One person infects two, who infect four, who infect eight.   Human eyes differentiate poorly between different kinds of curves but they differentiate well between curves and straight lines. 
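That point can be written down directly: if infections grow at a roughly constant daily rate r, the growth is exponential, and taking logs turns it into a straight line whose slope is that rate:

\[
N(t) = N_0\,e^{r t}
\quad\Longrightarrow\quad
\ln N(t) = \ln N_0 + r\,t .
\]

So exponential growth or decline plots as a straight line on a log scale, with slope +r or −r, which is far easier to judge by eye than curvature on a linear scale.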
Plotting on a log scale changes the exponential growth and exponentially decline into straight lines.   Also on the log scale early data is now visible where it was not visible on the linear scale. Many countries show one, sometimes two plateaus, which were not visible   in the linear graph. So you can see here for South Korea, there's one plateau, two plateaus and, more recently, it's beginning to grow for third time.   How can we model this kind of behavior?   Let's keep digging.   The slope on the log infections graph is the percentage growth.   Plotting percentage growth gives us more useful information.   Percentage growth helps to highlight where things changed.   If you look at the decline in the US numbers, the orange line here, you can see that the decline started to slacken off sometime in mid April and can be seen to be reversing here in mid June.   This is visible but it's not as clear in the infection graphs. It's much easier to see them in the percentage growth graph.   Many countries show a linear decline in percentage growth when plotted on a log scale. Italy is a particularly fine example of this.   But it can also be seen clearly in China,   in South Korea,   and in Russia, and also to a lesser extent in many other countries.   Why is this happening?   Intuitively, I expect that when behavior changes, growth would drop down to a lower percent and stay there, not exponentially decline toward zero.   I started plotting graphs on COVID 19 back in late February, not to predict the outbreak, but because I was frustrated by the graphs that were being published.   After seeing this linear decline in percentage growth, I started paying an interest in prediction.   Extrapolating that percentage growth line through linear regression actually works pretty well as a predictor, but it only works when the growth is declining. It does not work at all well when the growth is increasing.   Again, going back to the US orange line, if we extrapolate from this small section here, where it's increasing which is from the middle of June to the end...to the beginning of July,   we can predict that we will see 30% increase by around the 22nd of July, that will go up to 100% weekly growth by the 20th...26th of August, and it will keep on growing from there, up and up and up and up.   Clearly, this model does not match reality.   I will come back to this exponential decline in percentage growth later. For now, let's keep looking at the, at what is physically going on as the disease spreads.   People progress from being susceptible to disease to being infected to being contagious   to being symptomatic to being noncontagious to being recovered.   This is the Markoff SIR model. SIR stands for susceptible, infected, recovered. The three extra stages of contagious, symptomatic and noncontagious helped us to model the disease spread and related to what we can actually measure.   Note the difference between infected and contagious. Infected means you have the disease; contagious means that you can spread it to others. It's easy to confuse the two, but they are different and will be used in different ways, further into this analysis.   The timing shown are best estimates and can vary greatly. Infected to symptomatic can be from three to 14 days and for some infected people,   they're never symptomatic.   The only data that we have access to is confirmed infections, which usually come from test results, which usually follow from being symptomatic.   
Even if testing is performed on non-symptomatic people, there's about a five-day delay from being infected to having a positive test result. So we're always looking at old data. We can never directly observe the true number of people infected. So the disease progresses through individuals from top to bottom in this diagram. We have a pool of people that are contagious; that pool is fed by people that are newly infected becoming contagious, and the pool is drained by people that are contagious becoming non-contagious. The disease spreads through the population from left to right. New infections are created when susceptible people come into contact with contagious people and become infected. The newly infected people join the queue waiting to become contagious, and the cycle continues. This cycle is controlled by transmission, how likely a contagious person is to infect a susceptible person per day, and by reproduction, the number of people that a contagious person is likely to infect while they are contagious. This whole cycle revolves around the number of people contagious and the transmission or reproduction. The time individuals stay contagious should be relatively constant unless COVID-19 starts to mutate. The transmission can vary dramatically depending on social behavior and the size of the susceptible population. Our best estimate is that the days contagious averages out at about nine. So we can estimate people contagious as the number of people confirmed infected in the last nine days. In some respects, this is an underestimate because it doesn't include people that are infected but not yet symptomatic, or that are asymptomatic, or that don't yet have a positive test result. In other respects, it's an overestimate because it includes people who were infected a long time ago but are only now being tested as positive. It's an estimate. From the estimate of people contagious, we can derive the percentage growth in contagious. It doesn't matter if the people contagious is an overestimate or underestimate. As long as the percentage error in the estimate remains constant, the percentage growth in contagious will be accurate. Percentage growth in contagious matters because we then use it to derive transmission. The derivation of the equation relating the two can be found in the appendix. Note that this equation allows you to derive transmission, and then reproduction, from the percentage growth in contagious, but it cannot tell you the percentage growth in contagious for a given transmission. This can only be found by solving numerically. I have outlined how to do this using JMP's Fit Model tool in the appendix. Reproduction and transmission are very closely linked, but reproduction has the advantage of ease of understanding. If it is greater than one, the outbreak is expanding, out of control. Infections will continue to grow and there will be no end in sight. If it is less than one, the outbreak is contracting, coming under control. There are still new infections, but their number will gradually decline until they hit zero. The end is in sight, though it may be a long way off. The number of people contagious is the underlying engine that drives the outbreak. People contagious grows and declines exponentially. We can predict the path of the outbreak by extrapolating this growth or decline in people contagious. Here we have done it for Russia, for Italy and for China.
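The contagious-pool bookkeeping described above can be sketched in a few lines. The rolling nine-day sum comes straight from the talk; the closed-form link from growth rate to transmission and reproduction used here (transmission ≈ growth + 1/days contagious, reproduction = transmission × days contagious) is a common SIR-style simplification assumed for illustration, not necessarily the derivation in the presenter's appendix, and the case counts are hypothetical.

```python
# Sketch of the estimates described above: people contagious as the rolling
# 9-day sum of new confirmed infections, and transmission/reproduction derived
# from its growth rate. The growth -> transmission formula is an assumed
# SIR-style simplification, not the presenter's appendix derivation.
import numpy as np
import pandas as pd

DAYS_CONTAGIOUS = 9
new_cases = pd.Series([120, 135, 150, 170, 185, 210, 230, 260, 290,
                       320, 350, 390, 430, 470, 520, 570, 630])  # hypothetical

contagious = new_cases.rolling(DAYS_CONTAGIOUS).sum().dropna()
daily_growth = (contagious.iloc[-1] / contagious.iloc[0]) ** (1 / (len(contagious) - 1)) - 1

transmission = daily_growth + 1 / DAYS_CONTAGIOUS     # per contagious person per day
reproduction = transmission * DAYS_CONTAGIOUS         # people infected while contagious
print(f"growth={daily_growth:.1%}/day, transmission={transmission:.3f}, R={reproduction:.2f}")
```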
Remember the interesting observation from earlier: the percentage growth in infections declines exponentially. Here's why. If reproduction is less than one and constant, people contagious will decline exponentially towards zero. People contagious drives the outbreak. The percentage growth in infections is proportional to the number of people contagious. So if people contagious declines exponentially, the percentage growth in infections will also decline exponentially. Mystery solved. The slope of people contagious plotted on a log scale gives us the contagious percentage growth, which then gives us transmission and reproduction through the equations on the last slide. Notice that there's a weekly cycle in the data. This is particularly visible in Brazil, but it's also visible in other countries as well. This may be due to numbers getting reported differently at the weekends, or to people being more likely to get infected at the weekend. Either way, we'll have to take this seasonality into account when using people contagious to predict the outbreak. Because social behavior is constantly changing, transmission and reproduction change as well, so we can't use the whole of the data to estimate reproduction. We chose 17 days as the period over which to estimate reproduction. We found that one week was a little too short to filter out all of the noise, two weeks gave better results, and two and a half weeks was even better. Having the extra half week evened out the seasonality that we saw in the data. There is a time series forecast tool in JMP that will do all of this for us, including the seasonality, but because we're performing the regression on small sections of the data, we didn't find the tool helpful. Here are the derived transmission and reproduction numbers. You can see that they can change quickly. It is easy to get confused by these numbers. South Korea is showing a significant increase in reproduction, but it's doing well. The US, Brazil, India and South Africa are doing poorly, but seem to have a reproduction of around one or less. This is a little confusing. To help reduce the confusion around reproduction, here's a little bit of calculus. Driving a car, the gas pedal controls acceleration. To predict where the car is going to be, you need to know where you are, how fast you're traveling, and how much you're accelerating or decelerating. In a similar way, to know where the pandemic is going to be, we need to know how many infections there are, which is the equivalent of distance traveled. We need to know how fast the infections are expanding, or how many people are contagious, both of which are the equivalent of speed. And we need to know how fast the people contagious is growing, which is the transmission or reproduction, which is the equivalent of acceleration. There is a slight difference. Distance grows linearly with speed, and speed grows linearly with acceleration. Infections do grow linearly with people contagious, but people contagious grows exponentially with reproduction. There is a slight difference, but the principle's the same. The US, Brazil, India and South Africa have all traveled a long distance: they have high infections, and they're traveling at high speed: they have high contagious. Even a little bit of acceleration has a very big effect on the number of infections. South Korea, on the other hand, is not going fast; it has low contagious.
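A minimal sketch of the regression just described: fit a straight line to the log of people contagious over the most recent 17 days, read the slope as the daily contagious growth rate, and extrapolate. The counts below are made up.

```python
# Sketch of the 17-day window regression: the slope of log(people contagious)
# is the daily contagious growth rate, and 17 days spans the weekly reporting
# cycle about 2.5 times, damping the weekend seasonality. Hypothetical values.
import numpy as np

contagious_17 = np.array([5200, 5100, 5150, 4980, 4900, 4790, 4850, 4700,
                          4620, 4560, 4610, 4480, 4400, 4330, 4380, 4250, 4180.0])
days = np.arange(contagious_17.size)

slope, intercept = np.polyfit(days, np.log(contagious_17), 1)
print(f"contagious growth ~ {100 * slope:.1f}% per day")    # negative => declining
forecast_14d = np.exp(intercept + slope * (days[-1] + 14))  # extrapolate 2 weeks
print(f"projected contagious in 14 days: {forecast_14d:,.0f}")
```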
So has the headroom to respond to the blip in acceleration and get things back under control without covering much distance   Also, when the number of people contagious is low, adding a small number of new contagious people produces a significant acceleration. Countries that have things under control are prone to these blips in reproduction.   You have to take all three factors into account   (number of infections, people contagious and reproduction) to decide if a country is doing well or doing poorly.   Within JMP there are a couple of ways to perform the regression to get the percentage growth of contagious. There's the Fit Y by X tool and there's the nonlinear tool. I have details on how to use both these tools in the appendix. But let's compare the results they produce.   The graphs shown compare the results from both tools. The 17 data points used to make the prediction are shown in red.   The prediction line from both tools are just about identical, though there are some noticeable differences in the confidence lines.   The confidence lines for the non linear, tool are much better. The Fit Y by X tool transposes that data into linear space before finding the best fit straight line.   This results in a lower cost...in the lower conference line pulling closer to the prediction line after transposing back into the original space.   Confidence lines are not that useful when parameters that define the outbreak are constantly changing. Best case, they will help you to see when the parameters have definitely changed.   In my scripts, I use linear regression calculated in column formulas, because it's easy to adjust with variables. This allows the analysis to be adjusted on the fly without having to pull up the tool in JMP.   I don't currently use the confidence lines in my analysis. So I'm working on a way to integrate them into the column formulas.   Linear regression is simpler and produces almost identical results. Once again, keep it simple.   We have seen how fitting an exponential to the number of people contagious can be used to predict whether people contagious will be in the future, and also to derive transmission.   Now that we have a prediction line for people contagious, we need to convert that back into infections.   Remember new infections equals people contagious and multiplied by transmission.   Transmission is the probability that a contagious person will infected susceptible person per day.   The predicted graphs that results from this calculation are shown. Note that South Korea and Italy have low infections growth.   However, they have a high reproduction extrapolated from the last 17 days worth of data. So, South Korea here and Italy here, low growth, but you can see them taking off because of that high reproduction number.   The infections growth becomes significance between two and eight weeks after the prediction is made.   For South Korea, this is unlikely to happen because they're moving slowly and have the headroom to get things back under control.   South Korea has had several of these blips as it opens up and always manages to get things back under control.   In the predicted growth percent graph on the right, note how the increasing percentage growth in South Korea and this leads will not carry on increasing indefinitely, but they plateau out after a while.   Percentage growth is still seen to decline exponentially, but it does not grow exponentially.   It plateaus out.   So to summarize,   the number of people contagious is what drives the outbreak.   
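Converting a contagious forecast back into infections, as described above, can be sketched like this. It uses the relationship quoted in the talk (new infections per day ≈ people contagious × transmission), with the same simplified transmission link as the earlier sketch; every number is a hypothetical placeholder.

```python
# Sketch of turning a contagious forecast back into cumulative infections.
# The transmission formula is the same assumed simplification as before.
import numpy as np

contagious_now = 4180.0       # from the regression sketch above
daily_growth   = -0.014       # slope of log(contagious), per day
transmission   = daily_growth + 1 / 9       # assumed SIR-style simplification
cum_infections = 250_000.0    # confirmed infections to date (hypothetical)

for day in range(1, 29):                       # extrapolate four weeks ahead
    contagious = contagious_now * np.exp(daily_growth * day)
    cum_infections += contagious * transmission
print(f"projected cumulative infections in 4 weeks: {cum_infections:,.0f}")
```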
This metric is not normally reported, but it's close to the number of new infections over a fixed period of time.   New infections in the past week is the closest regular reported proxy, the number of people contagious. This is what we should be focusing on, not the number of infections or the number of daily new infections.   Exponential regression of people contagious will predict where the contagious numbers are likely to be in the future.   Percentage growth in contagious gives us transmission and reproduction.   The contagious number and transmission number can be combined to predict the number of new infections in the future.   That prediction method assumes the transmission and reproduction are constant, which they aren't. They change their behavior.   But the predictions are still useful to show what will happen if behavior does not change or how much behavior has to change to avoid certain milestones.   The only way to close this gap is to come up with a way to mathematically model human behavior.   If any of you know how to do this, please get in touch. We can make a lot of money, though only for short amount of time.   This is the modeling. Let's check how accurate it is by looking at historical data from the US.   As mentioned, the prediction works well when reproduction's constant but not when it's changing.   If we take a prediction based on data from late April to early May, it's accurate as long as the prediction number stays at around the same level of 1.0   The reproduction number stays around 1.0.   After the reproduction number starts rising, you can see that the prediction underestimates the number of infections.   The prediction based on data from late June to mid July when reproduction was at its peak as states were beginning to close down again,   that prediction overestimates the infections as reproduction comes down.   The model is good at predicting what will happen if behavior stays the same but not when behavior is changing.   How can we predict deaths?   It should be possible to estimate the delay between infection and death.   And the proportion of infections that result in deaths and then use this to predict deaths.   However, changes in behavior such as increasing testing and tracking skews the number of infections detected.   So to avoid this skew also feeding into the predictions for deaths, we can use the exact same mathematics on deaths that we used on infections. As with infections, the deaths graph shows accurate predictions when deaths reproduction is stable.   Note that contagious and reproduction numbers for deaths don't represent anything real.   This method works because because deaths follow infections and so follow the same trends and the same mathematics. Once again, keep it simple.   We have already seen that the model assumes constant reproduction. It also does not take into account herd immunity.   We are fitting an exponential, but the outbreak really follows the binomial distribution.   Binomial and a fitted exponential differ by less than 2% with up to 5% of the population infected. Graphs demonstrating this are in the appendix.   When more than 5% of the population is no longer susceptible due the previous infection or to vaccination, transmission and reproduction naturally decline.   So predictions based on recent reproduction numbers will still be accurate, however long-term predictions based on an old reproduction number with significantly less herd immunity will overestimate the number of infections.   
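Because the deaths forecast reuses the infections machinery unchanged, one natural structure is a single function applied to either series. This is a structural sketch only, with placeholder data and assumed window lengths.

```python
# The deaths forecast reuses the infections machinery: wrap the rolling-pool
# estimate and log-linear regression in one function and feed it either series.
import numpy as np
import pandas as pd

def forecast(series: pd.Series, window: int = 9, fit_days: int = 17, ahead: int = 14) -> float:
    """Rolling 'pool' estimate, log-linear fit over fit_days, extrapolate ahead days."""
    pool = series.rolling(window).sum().dropna().iloc[-fit_days:]
    x = np.arange(len(pool))
    slope, intercept = np.polyfit(x, np.log(pool), 1)
    return float(np.exp(intercept + slope * (x[-1] + ahead)))

daily_cases  = pd.Series(np.linspace(900, 600, 60))   # placeholder data
daily_deaths = pd.Series(np.linspace(40, 25, 60))
print(forecast(daily_cases), forecast(daily_deaths))  # same math, both series
```

As the talk notes, the "contagious" and "reproduction" values produced for deaths don't represent anything physical; the method works only because deaths follow infections and therefore follow the same trends.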
On the 21st of August, the US had per capita infections of 1.7%. If only 34% of infected people have been diagnosed as infected, and there is data that indicates that this is likely, we are already at the 5% level where herd immunity begins to have a measurable effect. At 5% it reduces reproduction by about 2%. What can the model show us? Reproduction tells us whether the outbreak is expanding (greater than 1, which is the equivalent of accelerating) or contracting (less than 1, the equivalent of decelerating). The estimated number of people contagious tells us how bad the outbreak is, how fast we're traveling. Per capita contagious is the right metric to choose appropriate social restrictions. The recommendations for social restrictions listed on this slide are adapted from those published by the Harvard Global Health Institute; there's a reference in the appendix. What they recommend is: when there are fewer than 12 people contagious per million, test and trace is sufficient. When we get up to 125 contagious per million, rigorous test and trace is required. At 320 contagious per million, we need rigorous test and trace and some stay-at-home restrictions. Greater than 320 contagious per million, stay-at-home restrictions are necessary. At the time of writing, the US had 1,290 contagious per million, down from 1,860 at the peak in late July. It's instructive to look at the per capita contagious in various countries and states when they decided to reopen. China and South Korea had just a handful of people contagious per million. Europe was in the tens of people contagious per million, except for Italy. The US had hundreds of people contagious per million when it decided to reopen. We should not really have reopened in May. This was an emotional decision, not a data-driven decision. Some more specifics about the US reopening. As I said, the per capita contagious in the US at the time of writing was 1,290 per million, with a reproduction of 0.94. With this per capita contagious and reproduction, it will take until the ninth of December to get below 320 contagious per million. The lowest reproduction during the April lockdown was 0.86.
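For reference, the per-capita-contagious bands quoted above (adapted from the Harvard Global Health Institute guidance the presenter cites) could be encoded in a small hypothetical helper like this:

```python
# Hypothetical helper encoding the restriction bands quoted in the talk.
def recommended_restrictions(contagious_per_million: float) -> str:
    if contagious_per_million < 12:
        return "test and trace"
    if contagious_per_million < 125:
        return "rigorous test and trace"
    if contagious_per_million < 320:
        return "rigorous test and trace + some stay-at-home restrictions"
    return "stay-at-home restrictions"

print(recommended_restrictions(1290))   # the U.S. figure quoted for 21 August
```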
PATRICK GIULIANO, Senior Quality Engineer, Abbott Charles Chen, Continuous Improvement Expert, Statistics-Aided Engineering (SAE), Applied Materials Mason Chen, High School Student, Stanford Online High School   Cooked foods such as dumplings are typically prepared without precise process control on cooking parameters. The objective of this work is to customize the cooking process for dumplings based on various dumpling product types. During the cooking process in dumpling preparation, the temperature of the water and time duration of cooking are the most important factors in determining the degree to which dumplings are cooked (doneness). Dumpling weight, dumpling type, and batch size are also variables that impact the cooking process. We built a structured JMP DSD platform with special properties to build a predictive model on cooking duration. Internationally recognized ISO 22000 Food Safety Management and the Hazard Analysis Critical Control Point (HACCP) schemas were adopted. JMP Neural Fit techniques using modern data mining algorithms were compared to RSM. Results demonstrated the prevalence of larger main effects from factors such as: boiling temperature, product type, dumpling size/batch as well as interaction effects that were constrained by the mixture used in the dumpling composition. JMP Robust Design Optimization, Monte Carlo Simulation and HACCP Control Limits were employed in this design/analysis approach to understand and characterize the sensitivity of dumpling cooking factors on the resulting cooking duration. The holistic approach showed the synergistic benefit of combining models with different projective properties, where recursive partition-based AI models estimate interaction effects using classification schema and classical (Stepwise) regression modeling provides the capability to interpret interactions of 2nd order, and higher, including potential curvature in quadratic terms. This paper has demonstrated a novel automated dumpling cooking process and analysis framework which may improve process throughout, lower the cost of energy, and reduce the cost of labor (using AI schema). This novel methodology has the potential to reshape thinking on business cost estimation and profit modeling in the food-service industry.     Auto-generated transcript...   Speaker Transcript Patrick Giuliano All right. Well, welcome everyone.   Thank you all for taking the time to watch this presentation.   Preparing the Freshest Steamed Dumplings.   And my name is Patrick Giuliano and my co authors are Mason Chen and Charles Chen from Applied Materials, as well as Yvanny Chang.   Okay, so today I'm going to tell you about how I harnessed...my team and I harness the power of JMP to really understand about   dumpling cooking.   And so that the general problem statement here is that most foods like dumplings are made without precise control of cooking parameters.   And the taste of a dumpling, and as well as other outputs to measure how good a dumpling is, is adversely affected by improper cooking time and this is intuitive to everyone who's enjoyed food so   we needn't to talk too much about that. But   Sooner or later AI and robotics will be an important part of the food industry.   
And our recent experience with Covid 19 has really highlighted that and and so I'm going to talk a little bit about how we   can understand the dumpling process better using a very multi faceted modeling approach, which uses many of JMP's modeling capabilities, including robust Monte Carlo design optimization.   So why dumplings?   Well dumplings are very easy to cook.   And by cooking them, of course, we kill any foreign particles that may be living on them.   And cooking can involve very limited human interaction.   So of course with that, the design and the process space related to cooking is very intuitive and extendable   and we can consider the physics associated with this problem and try to use JMP to help us really understand and investigate the physics better.   AI is really coming sooner or later because of Covid 19, of course, and   why would robotic cooking of dumplings be coming? Well   And also other questions might be, what are the benefits? What are the challenges of cooking dumplings in an automated way in a robotic setting?   And of course, this could be a challenge because actually robots don't have the nose to smell. And so because of that, that's a big reason why, in addition to   an advanced and multifaceted modeling approach, it's important to consider some other structured criteria.   And later in this presentation, I'm going to talk a little bit, a little bit about the HACCP criteria and how we integrated that in order to solve our problem in a more structured way.   Okay, so   before I dive into a lot of the interesting JMP analysis, I'd like to briefly provide an introduction into heat transfer physics, food science and how different heat transfer mechanisms affect the cooking of dumplings.   So as you can see in this slide, there's a   Q   at the top of the diagram and the upper right and that Q is referred to...it refers to the heat flux density, which is the amount of energy that   flows through a unit area per unit time, and the direction of temperature change.   From the point of view of physics, proteins and raw and boiled meat differ in their amounts of energy. An activation energy barrier has to be overcome in order to turn raw meat protein structure into a denatured or compactified structure as shown here.   in this picture at the left.   So the first task of the cook, when boiling meat in terms of physics, is to increase the temperature throughout the volume of the piece   at least   To reach the temperature of the denaturation.   Later, I'm going to talk about the most interesting finding of this particular phase of the experiment where we discovered that there was a temperature cut off.   And and intuitively, you would think that below a certain temperature dumplings would be cooked...wouldn't be cooked properly, they would be to soggy, and above a certain temperature, perhaps they would also be too soggy or they may be burned or crusty.   One other final note about the physics here is that at the threshold for boiling, the surface temperature of the water fluctuates and bubbles will rise to the surface of the boiler   and break apart and collapse and that can make it difficult to gather...capture and...excuse me...to capture accurate readings of temperature.   So that leads us into some...what are some of the tools that we used to conduct an experiment?   Well,   of course, we used a boiling cooker and that's very important.   
Of course, we used a something to measure temperature and for this we used an infrared thermometer and we used a timer, of course, and we used a mass balance to weigh the dumpling and all the constituents going into the dumpling.   We might consider something called Gare R&R in future studies and where we may quantify the repeatability and reproducibility of our of our measurement tools.   In this experiment, we didn't, but that is very important, because this helps us maximize the precision of our model estimates by minute...minimizing the noise components associated with our measurement process.   And those noise components could not only be   a fact...a factor of say that accuracy tolerance for the gauge, but they they could also be due to how the person interacts with the with the measurement itself.   And, and, in particular, I'm going to talk a little bit about the challenge with measuring boiling and cooking temperature at high at high temperature.   Okay so briefly,   this...we set this experiment up as a designed experiment. And so we had to decide on the tools first.   We had to decide on how we would make the dumpling. So we needed a manufacturing process and appropriate   role players in that process. And then we had to design a structured experiment. And to do that we use the definitive screening design   and looked at some characteristics of the design to ensure that the design was performing optimally for our experiment.   Next we executed the experiment.   And then   we measured and recorded the response.   And of course, finally, the fun part,   we got to effectively interpret the data and JMP.   And these are graphs that the right here that are showing scatter plot matrices generated in JMP, just using the graph function.   And these actually give us an indication of the uniformity that prediction space. I'll talk a little bit of more more about that later...then next...in the coming slides.   Okay, so here's our data. Here's our data collection plan and at the right is the response that we measured, which is a dumpling rising time or cooking time.   We collected 18 runs in a DSD, which we generated in JMP using the DSD platform and in under the DOE menu.   And we collected information on the mass of the meat, the mass of the veggies going in, the mass of the...the type of the meat, rather, the composition of the vegetables (being either cabbage or mushroom),   and of course the total weight, the sizes of the batch that we cooked, the number of dumplings per batch, and the water temperature.   So this slide just highlights some of the the amazing power of a DSD and and I won't go into this too much, but DSDs are very lauded for their flexible and powerful modeling characteristics.   And they allow the great potential for conducting Blitz screening and optimise optimization in a single experiment.   This chart at the at the right is a correlation matrix generated in JMP, in its designed diagnostics platform of the DOE menu and it and it's it's particularly powerful for showing   the extent of aliasing or confounding among all the factor effects in your model. And what this graphic shows clearly is that   by the darkest blue, there's no correlation and, as the as the correlation increases and we get it to shades of gray, and then finally, as we get to very positive correlation, we get to shades of red. 
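The aliasing check behind that correlation map is straightforward to reproduce outside JMP: build the interaction and quadratic columns from the coded factor settings and look at their absolute pairwise correlations. The tiny coded matrix below is made up for illustration; it is not the actual 18-run dumpling design, so its correlations will not match the plot.

```python
# Sketch of the aliasing check shown in the correlation map: given coded factor
# settings, build two-factor-interaction and quadratic columns and inspect the
# |correlation| between all terms (dark = near 0, red = near 1 in the JMP plot).
import itertools
import pandas as pd

X = pd.DataFrame({                      # made-up coded runs, -1/0/+1
    "water_temp": [-1, 1, -1, 1, 0, -1, 1, 0],
    "meat":       [-1, -1, 1, 1, -1, 0, 0, 1],
    "batch":      [ 1, -1, -1, 1, 0, 0, -1, 1],
})
cols = X.copy()
for a, b in itertools.combinations(X.columns, 2):
    cols[f"{a}*{b}"] = X[a] * X[b]      # two-factor interactions
for a in X.columns:
    cols[f"{a}^2"] = X[a] ** 2          # quadratic terms

alias_map = cols.corr().abs()           # |correlation| between all terms
print(alias_map.round(2))
```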
So what we're really seeing is that main effects are completely uncorrelated with each other, which is what we like to see, and main effects are uncorrelated with two-factor interactions and with quadratic effects, which are up in this right quadrant. The quadratic effects are actually only partially correlated with each other, and then you have these higher-order interaction terms, which are really only partially correlated with interaction effects. These types of characteristics make this particular design superior to the typical Resolution III and Resolution IV fractional factorial designs that we used to be taught before DSDs. Okay. So just quickly discussing a little bit about the design diagnostics. Here's a similar correlation plot, except the factors have actually been given their particular names after running this DSD design. And this is just a gray-and-white version of a correlation matrix to help you see the extent of orthogonality, or the lack of it, among the factors. And what you can see in our experiment is that we actually did observe a little bit of confounding between batch size and meat, unsurprisingly, and then, of course, between meat and the interaction between meat and the vegetables that are in the dumpling. And note that we imposed one design constraint here, which we did observe some confounding with, which is the very intuitive constraint that the total mass of the dumpling is the sum of each of the components of the dumpling itself. So why are we doing this? Why are we assessing this quote-unquote uniformity with the scatterplot matrix here, and what is it telling us? Well, in order to maximize prediction capability throughout the space of the predictors of rising time, in this case, we want to find the combinations of the factors that minimize the white areas, because the white areas are where the prediction accuracy is thought to be weaker. And this is why we take the design and put it into a scatterplot matrix. This is analogous to the homogeneity-of-error assumption in ANOVA, where we look for the space of prediction to be equally probable, or the equal variance assumption in linear regression. We want this space to be equally probable across the range of the predictors. So in this experiment, in order to reduce the number of factors that we're looking at, first we used our understanding of the engineering and the physics of the problem. For identification, we identified six independent variables, or variables that were least confounded with each other, and we proceeded with the analysis on the basis of these primary variables. Okay. So the first thing we did is we took our generated design and used stepwise regression to try to simplify the model and identify only the active factors in the model. Here you can use forward selection, backward selection, or mixed, together with a stopping criterion, in order to determine the model that explains the most variation in your response. And I can also model meat type as discrete numeric, and in this way I can use this labeling to make the factor coding correspond to the meat type being the shrimp or the pork, which we used.
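As a rough stand-in for the stepwise step described above, here is a minimal forward-selection sketch that greedily adds the term that most improves BIC and stops when nothing helps. The data are synthetic and the factor names are placeholders; JMP's stepwise platform offers more stopping rules than this.

```python
# Minimal forward-selection sketch in the spirit of the stepwise step above:
# add the candidate term that most improves BIC, stop when no term improves it.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.uniform(-1, 1, size=(18, 5)),
                 columns=["water_temp", "meat", "veggie", "batch", "weight"])
y = 3 * X["water_temp"] + 1.5 * X["meat"] + rng.normal(0, 0.3, 18)

def bic(cols):
    """BIC of an ordinary least squares fit with intercept plus the given columns."""
    A = np.column_stack([np.ones(len(X))] + [X[c] for c in cols])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    rss = float(resid @ resid)
    n, k = len(y), A.shape[1]
    return n * np.log(rss / n) + k * np.log(n)

selected, remaining = [], list(X.columns)
while remaining:
    scores = {c: bic(selected + [c]) for c in remaining}
    best = min(scores, key=scores.get)
    if selected and scores[best] >= bic(selected):
        break                                  # no term improves BIC: stop
    selected.append(best)
    remaining.remove(best)
print("selected terms:", selected)
```

In the talk, the adjusted R² check flagged the overfit risk of stepwise fits on only 18 runs; the same caution applies to this sketch.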
So what kind of a stopping rule can you use in the framework of this type of a regression model? Well,   unfortunately, when I ran this model again and again, I wasn't really able to reproduce it exactly. And model reproduction can be somewhat inconsistent, since the fit...this type of a fitting schema involves a computational procedure to iterate to a solution.   And so therefore, in this stepwise regression, the overfit risk is is typically higher.   And oftentimes if there's any curvature in the model   or there are two factor interactions, for example, the, the explanatory variants across...is shared across both of those factors where you can't tease apart that variability associated with one or the other.   And so what we can see here clearly, based on the adjusted R squared, is that we're getting a very good fit, and probably a fit that's too good.   Meaning that we can't predict in the future based on them on the fit to this particular model.   Okay.   So here's where it gets pretty interesting. So   one of the things that we did first off, after running the stepwise is that we assigned independent uniform inputs to each of the factors in the model.   And this is a sort of Monte Carlo implementation in JMP.   A different kind of Monte Carlo implementation and and and   it's a what's what's what's important to understand in this particular framework is that the difference between the main effect and and the total effect can indicate the extent of interaction.   hat that this the extent of interaction associated with a particular factor in the model. And so this is showing that in particular, water temperature and meat,   in addition to being most explanatory in terms of total effect, may likely interact with other factors in this particular model.   And what what you see, of course, is that we identified water temperature, meat, and and and the meat type as our top predictors, using the Paredo plot for transformed estimates.   The other thing I'd like to highlight here before I launch into some of the other slides is the sensitivity indicate indicator that we can invoke here and   under the profiler after we assign independent uniform inputs,   we can colorize the profile profiler to indicate the strength of the relationship between each of the input input factors and the and the response. And we can also use   the sensitivity indicator, which is represented by these purple triangles, to show us the sensitivity or the you can say the strength of the relationship   similar to the linear regression coefficient would indicate the strength, where the taller the triangle and the steeper the relationship,   the stronger either in the positive or the negative direction and the wider and flatter the triangle, the less strong that relationship that's that factor plays.   Okay.   So we went about reducing our model and using some engineering sense and using the stepwise platform.   And what we get   is a this is a just a snapshot of our model fit from the originating from the DSD design and it has RSM structured as curvature. And you can see this is an interaction plot which shows the extent of interaction among all the factors in a in a pairwise way.   And we've indicated where some of the interactions are present and what those interactions look like.   
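The Monte Carlo step described above, assigning independent uniform inputs and pushing them through the fitted model, can be sketched as follows. The prediction formula, factor ranges, and threshold are hypothetical stand-ins for the JMP profiler model.

```python
# Sketch of the independent-uniform-inputs Monte Carlo step: draw each input
# uniformly over its experimental range, push the draws through a fitted
# prediction formula, and look at the distribution of predicted rise time.
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
water_temp = rng.uniform(80, 95, n)      # deg C, assumed experimental range
meat_mass  = rng.uniform(4, 8, n)        # g per dumpling (assumed)
batch_size = rng.uniform(6, 12, n)       # dumplings per batch (assumed)

def predicted_rise_time(t, m, b):
    """Hypothetical fitted surface: strong temperature effect plus an interaction."""
    return 260 - 2.1 * (t - 80) + 6.0 * m + 3.5 * b - 0.15 * (t - 80) * m

y = predicted_rise_time(water_temp, meat_mass, batch_size)
print(f"mean={y.mean():.0f}s, sd={y.std():.0f}s, "
      f"P(rise time > 300 s)={np.mean(y > 300):.1%}")
```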
So this is a model that really, we can get more of a handle on   Okay, so   I think one other thing to mention is that the design constraint that we imposed in is is similar to what you might consider a mixture design, where all the components add together and the constraint has to sum to 100%.   Okay, so here's just a high level picture of the profiler and we we can adjust or modulate each of the input factors and then   observe the, the impact on the response and and we did this in a very manual way   just to get gain some intuition into how the model is performing   And of course to optimize our cooking time what we confirmed was that the time has to be faster, of course, the variants associated with the cooking time should be lower.   And the throughput the throughput and the power savings should be optimized, maximized. And those are two additional   responses that we derived based on cooking time.   Okay, so here's where we get into the optimization more fully into that optimization   of the of the cooking process. And so as I mentioned before, we designed or we created two additional response variables that are connected to the physics, where we have maximum throughput and that depends on in how many dumpling...   I'm sorry, depends on how many dumplings, but also weight and time.   And power savings, which is the which is the product of the power consumed and the time for cooking, which is an energy component.   And so in order to engage in this optimization process,   we need to introduce error associated with each of the input factors and that's represented by these distributions at the bottom here.   And and we also need to consider that the practical HACCP control window and of course measurement device capability, which is something that we would like to look at in future studies.   And so here's just a, a nice picture of the HACCP control plan that we use and this is follows very similar to something like a failure modes and effect analysis in the quality profession. And it's just a structured approach to   experimentation or manufacturing process development and where key variables are identified,   and key owners and then what criteria are being measured against and how that criteria being validated. And so HACCP is actually common in the food science industry and it stands for Hazard Analysis Critical Control Point monitoring.   And I think   in addition to all of these preparation activities, mainly I was involved in the context of this experiment as a data collector and and data integrity is a very important thing. And so   transcribing data appropriately is is definitely is definitely important.   So all the HACCP control points need to be incorporated into the Monte Carlo simulation range and ultimately the HACCP tolerance range can be used to derive the process performance requirement.   Okay, so   we consider a range of inputs where Monte Carlo can confirm that this expected range is practical for the cooking time.   We want to consider a small change in each of the input factors at each at each level each at each HACCP component level and   and this is determined by the control point range. Based on the control point range, we can determine the delta x and the delta in each of the inputs from the delta y response time.   And we can continue to increase the delta x incrementally iteratively,   while hoping that the the increase is small enough so that the change in y is small enough to meet the specification. 
And usually in industry that's a design tolerance and in this case, it's our HACCP control point control parameter range or control parameter limit.   And if if that iterative procedure fails, then we have to make the increment and X smaller and we call this procedure tolerance allocation. Okay.   We did this sort of manually and using sort of as our own special recipe. Although this can be done in a more automated way in JMP and   and in this case, you can see we have more all of our responses. So using multiple response optimization, as well as which would involve using invoking the desirability functions and maximizing the desirability under the prediction profiler   as well as a Gaussian process modeling,   also available under the prediction profiler.   Okay. So next in the vein of, you know, using tools to complement each other and try to understand further understand our product...our process and our experiment, we use the, the neural modeling capability under the analyze menu, under the prediction modeling tools.   And   We, we tried to utilize it to facilitate our prediction.   This model uses a TanH function, which can be more powerful to detect curvature in non linear effects,   but it's sort of a black box and it doesn't really tie back to the physics. So   while it's also robust to non normal responses and somewhat robust to aliasing confounding.   it, it has its limitations, particularly with a small sample size, such as that we have here, and you can actually see that the r squared between the training and validation sets are not particularly   the same or they vary so this model isn't particularly consistent for the purposes of prediction.   Finally, we used the the partition platform in JMP to run recursive partitioning on our   response time response.   And and this model is is definitely relatively easy to interpret in general, but I think particularly for our experiment because we can see that for the rising time we have this temperature cut off at about 85 degrees C,   and that and as well as some temperature differentiation with respect to maximum throughput, but in particular is 85 degrees cut off is most...is very interesting.   The R squared note in this model is about .7, at least with respect to the rising temp response, which is pretty good for this type of a model, considering the small sample size.   And   what's most interesting with respect the to this cut off is that below 85 C, the water really wasn't boiling. There wasn't   much bubbling, no turbulence. The reading was very stable. However, as we increased the temperature, the water started to circulate, turbulence in the water caused non uniform temperature, cavitation bubble collapse, steam rising, and it's basically an unstable temperature environment.   In this type of environment convection dominates rather than conduction.   And steam also blocks light of the infrared thermometer, which also increases increases the uncertainty associated with the temperature measurement.   And and the steam itself presents a burn risk which, in addition to safety, it may impact how the operator adjusts the thermometer and puts the the adjust the distance in which the operator places the thermometer, which is very important for accuracy of measurement.   So, and this, in fact, was why we capped our design at 95 C because it was really impossible to measure water temperature accurately any more above that.   Okay.   So what are ...where have we arrived here? 
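Before the summary, here is a small sketch of the partition idea just described: a shallow regression tree on rise time whose first split lands near the temperature where the boiling behavior changes. The data are synthetic and deliberately built so the cutoff appears near 85 °C; they are not the experimental table.

```python
# Sketch of the recursive-partition step: a shallow regression tree on rise
# time whose first split falls near the ~85 C behavior change. Synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(4)
temp = rng.uniform(75, 95, 200)
batch = rng.integers(6, 13, 200)
# Hypothetical behavior: much longer rise times below ~85 C, plus noise
rise_time = np.where(temp < 85, 420, 300) + 4 * batch + rng.normal(0, 15, 200)

X = np.column_stack([temp, batch])
tree = DecisionTreeRegressor(max_depth=2).fit(X, rise_time)
print(export_text(tree, feature_names=["water_temp", "batch_size"]))
```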
Well,   in summary, we...in this experiment we use DSD (DOE) to collect the data only.   Then we use stepwise regression to narrow down the important effects, but we didn't go too deep into the stepwise regression. And we use common sense to minimize the number of factors in our experiment as well as engineering expertise.   We also use independent uniform inputs,   which is very powerful   for giving us an idea of the magnitude of effects using, for example by colorizing the profile or by looking at the rank of the effects and also by looking at the difference between the main effect and the total effect to give us an indication of interaction present in the model.   We also added sensitivity indicators under the profiler to help us quantify our global sensitivity for the purposes of the Monte Carlo optimization   schema that we employed.   The main effects in the model really, temperature, of course, and physics really explained explains why temperature's the number one factor as, as I've shared in our findings.   And in addition, from between 80-90 degrees C, what we see from the profilers that we observed sort of a rapid transition and an increase in the sensitivity of the relationship between rising time and temperature which is, of course, consistent with our experimental observations.   Secondly, with respect to the the effects of factors interacting with each other and because there are two different types of physics really interacting, basic physics...physics modes interacting are convection and conduction,   the stepwise on the DSD is a good starting point, because it gives us a continuous model with no transformation   With no   Advanced neural or black box type transformation, wo we can at least get a good handle on on global sensitivity to begin with.   And our neural models in our partition models couldn't show us this, particularly given the small sample size in our experiment.   And finally, we use Monte Carlo simulate, robust Monte Carlo simulation   in our own framework. And we also did a little bit of multiple response optimization on rising time and throughput in power consumption versus our important factors. And through this experiment, we began to really qualify and further our understanding of the importance of   the most important factors in this experiment using a multi disciplinary modeling approach.   Finally, I will share some references here for you for of interest. Thank you very much for your time and
Shamgar McDowell, Senior Analytics and Reliability Engineer, GE Gas Power Engineering   Faced with the business need to reduce project cycle time and to standardize the process and outputs, the GE Gas Turbine Reliability Team turned to JMP for a solution. Using the JMP Scripting Language and JMP’s built-in Reliability and Survival platform, GE and a trusted third party created a tool to ingest previous model information and new empirical data which allows the user to interactively create updated reliability models and generate reports using standardized formats. The tool takes a task that would have previously taken days or weeks of manual data manipulation (in addition to tedious copying and pasting of images into PowerPoint) and allows a user to perform it in minutes. In addition to the time savings, the tool enables new team members to learn the modeling process faster and to focus less on data manipulation. The GE Gas Turbine Reliability Team continues to update and expand the capabilities of the tool based on business needs.       Auto-generated transcript...   Speaker Transcript Shamgar McDowell Maya Angelou famously said, "Do the best you can, until you know better. Then when you know better, do better." Good morning, good afternoon, good evening. I hope you're enjoying the JMP Discovery Summit, you're learning some better way ways of doing the things you need to do. I'm Shamgar McDowell, senior reliability and analytics engineer at GE Gas Power. I've been at GE for 15 years and have worked in sourcing, quality, manufacturing and engineering. Today I'm going to share a bit about our team's journey to automating reliability modeling using JMP. Perhaps your organization faces a similar challenge to the one I'm about to describe. As I walk you through how we approach this challenge, I hope our time together will provide you with some things to reflect upon as you look to improve the workflows in your own business context. So by way of background, I want to spend the next couple of slides, explain a little bit about GE Gas Power business. First off, our products. We make high tech, very large engines that have a variety of applications, but primarily they're used in the production of electricity. And from a technology standpoint, these machines are actually incredible feats of engineering with firing temperatures well above the melting point of the alloys used in the hot section. A single gas turbine can generate enough electricity to reliably power hundreds of thousands of homes. And just to give an idea of the size of these machines, this picture on the right you can see there's four adult human beings, which just kind of point to how big these machines really are. So I had to throw in a few gratuitous JMP graph building examples here. But the bubble plot and the tree map really underscore the global nature of our customer base. We are providing cleaner, accessible energy that people depend upon the world over, and that includes developing nations that historically might not have had access to power and the many life-changing effects that go with it. So as I've come to appreciate the impact that our work has on everyday lives of so many people worldwide, it's been both humbling and helpful in providing a purpose for what I do and the rest of our team does each day. So I'm part of the reliability analytics and data engineering team. 
Our team is responsible for providing our business with empirical risk and reliability models that are used in a number of different ways by internal teams. So in that context, we count on the analyst in our team to be able to focus on engineering tasks, such as understanding the physics that affect our components' quality and applicability of the data we use, and also trade offs in the modeling approaches and what's the best way to extract value from our data. These are, these are all value added tasks. Our process also entails that we go through a rigorous review with the chief engineers. So having a PowerPoint pitch containing the models is part of that process. And previously creating this presentation entailed significant copying and pasting and a variety of tools, and this was both time consuming and more prone to errors. So that's not value added. So we needed a solution that would provide our engineers greater time to focus on the value added tasks. It would also further standardize the process because those two things greater productivity and ability to focus on what matters, and further standardization. And so to that end, we use the mantra Automate the Boring Stuff. So I wanted to give you a feel for the scale of the data sets we used. Often the volume of the data that you're dealing with can dictate the direction you go in terms of solutions. And in our case, there's some variation but just as a general rule, we're dealing with thousands of gas turbines in the field, hundreds of track components in each unit, and then there's tens of inspections or reconditioning per component. So in in all, there's millions of records that we're dealing with. But typically, our models are targeted at specific configurations and thus, they're built on more limited data sets with 10,000 or fewer records, tens of thousands or fewer records. The other thing I was going to point out here is we often have over 100 columns in our data set. So there are challenges with this data size that made JMP a much better fit than something like an Excel based approach to doing this the same tasks. So, the first version of this tool, GE worked with a third party to develop using JMP scripting language. And the name of the tool is computer aided reliability modeling application or CARMA, with a c. And the amount of effort involved with building this out to what we have today is not trivial. This is a representation of that. You can see the number of scripts and code lines that testified to the scope and size of the tool as it's come to today. But it's also been proven to be a very useful tool for us. So as its time has gone on, we've seen the need to continue to develop and improve CARMA over time. And so in order to do this, we've had to grow and foster some in-house expertise in JSL coding and I oversee the work of developers that focus on this and some related tools. Message on this to you is that even after you create something like CARMA, there's going to be an ongoing investment required to maintain and keep the app relevant and evolve it as your business needs evolve. But it's both doable and the benefits are very real. A survey of our users this summer actually pointed to a net promoter score of 100% and at least 25% reduction in the cycle time to do a model update. So that's real time that's being saved. And then anecdotally, we also see where CARMA has surfaced issues in our process that we've been able to address that otherwise might have remained hidden and unable to address. 
And I have a quote, it's kind of long, but I wanted to just pass on this caveat on automation from Bill Gates, who knows a thing or two about software development: "The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency." So that's the end of the quote, but this is just a great reminder that automation is not a silver bullet that will fix a broken process; we still need people to do that today. Okay, so before we do a demonstration of the tool, I just wanted to give a high-level overview of the tool and the inputs and outputs in CARMA. The user has to point the tool to the input files. So over here on the left, you see we have an active models file, which is essentially the already approved models, and then we have empirical data. Then in the user interface, the user does some modeling activities, and the outputs are running models, that is, updates to the active models, and a PowerPoint presentation. And we'll also look at that. As background for the data I'll be using in the demo, I just wanted to pass on that I started with the locomotive data set, which, as we'll see, JMP provides as sample data. That gives one population, and then I also added in two additional populations of models. And the big message here I wanted to pass on is that what we're going to see is all made-up data. It's not real; it doesn't represent the functionality or the behavior of any of our parts in the field; it's just all contrived. So keep that in mind as we go through the results, but it should give us a way to look at the tool, nonetheless. So I'm going to switch over to JMP for a second, and I'm using JMP 15.2 for the demo. This data set is simplified compared to what we normally see, but like I said, it should exercise the core functionality in CARMA. So first, I'm just going to go to the Help menu, Sample Data, and you'll see the Reliability and Survival menu here. So that's where we're going. One of the nice things about JMP is that it has a lot of different disciplines and functionality and specialized tools for them, and so for my case with reliability, there's a lot here, which also lends to the value of using JMP as a home for CARMA. But I wanted to point you to the locomotive data set and just show you... this originally came out of a textbook, referenced here, Applied Life Data Analysis. In that, there's a problem that asks you what the risk is at 80,000 exposures, and we're going to model that today in our data set, in what we've called an oxidation model; essentially CARMA will give us an answer. Again, a really simple answer, but I was just going to show you that you can get it the same way by clicking in the analysis menu. So we go down to Analyze, Reliability and Survival, Life Distribution. Put the time and censor columns where they need to go. We're going to use Weibull, and just the two-parameter fit, so it creates a fit for that data. Two parameters I was going to point out are the beta, 2.3, and then what's called a Weibull alpha here; in our tool, it'll be called eta, and it's 183. Okay, so we see how to do that here. Now I just want to jump over and look at a couple of the other files, the input files, so I will pull those up. Okay, this is the model file. I mentioned I made three models, and these are the active models that we're going to be comparing the data against.
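For readers who want to see the life-distribution math outside JMP, here is a sketch: a maximum-likelihood fit of a two-parameter Weibull to right-censored data, followed by the fraction expected to fail by 80 (in the same thousands-of-exposures units). The failure and suspension times below are made up, so the estimates will not reproduce the beta of about 2.3 and eta of 183 quoted above; with those reported values, the same CDF works out to roughly 13-14%, consistent with the risk figure given a little later in the demo.

```python
# Sketch of a censored Weibull fit: maximize the likelihood over (beta, eta)
# using exact failure times and right-censored suspensions, then evaluate the
# CDF at 80. The data here are a made-up stand-in for the locomotive table.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

fail_times = np.array([23., 45., 60., 72., 88., 95., 110., 130., 150., 170.])
cens_times = np.full(20, 135.0)           # units still running at 135 (censored)

def neg_loglik(params):
    beta, eta = params
    if beta <= 0 or eta <= 0:
        return np.inf
    ll = weibull_min.logpdf(fail_times, beta, scale=eta).sum()   # failures
    ll += weibull_min.logsf(cens_times, beta, scale=eta).sum()   # suspensions
    return -ll

res = minimize(neg_loglik, x0=[1.5, 100.0], method="Nelder-Mead")
beta_hat, eta_hat = res.x
risk_at_80 = weibull_min.cdf(80.0, beta_hat, scale=eta_hat)
print(f"beta={beta_hat:.2f}, eta={eta_hat:.0f}, risk at 80 = {risk_at_80:.1%}")
```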
You'll see that oxidation is the first one, I mentioned that, and then you want...one also, in addition to having model parameters, it has some configuration information. This is just two simple things here (combustion system, fuel capability) I use for examples, but there's many, many more columns, like it. But essentially what CARMA does, one of the things I like about it is when you have a large data set with a lot of different varied configurations, it can go through and find which of those rows of records applies to your model and do the sorting real time, and you know, do that for all the models that you need to do in the data set. And so that's what we're going to use that to demonstrate. Excuse me. Also, just look, jump over to the empirical data for a minute. And just a highlight, we have a sensor, we have exposures, we have the interval that we're going to evaluate those exposures at, modes, and then these are the last two columns I just talked about, combustion system and fuel capability. Okay, so let's start up CARMA. As an add in, so I'll just get it going. And you'll see I already have it pointing to the location I want to use. And today's presentation, I'm not gonna have time to talk through all the variety of features that are in here. But these are all things that can help you take and look at your data and decide the best way to model it, and to do some checks on it before you finalize your models. For the purposes of time, I'm not going to explain all that and demonstrate it, but I just wanted to take a minute to build the three models we talked about create a presentation so you can see that that portion of the functionality. Excuse me, my throat is getting dry all the sudden so I have to keep drinking; I apologize for that. So we've got oxidation. We see the number of failures and suspensions. That's the same as what you'll see in the text. Add that. And let's just scroll down for a second. That's first model added Oxidation. We see the old model had 30 failures, 50 suspensions. This one has 37 and 59. The beta is 2.33, like we saw externally and the ADA is 183. And the answer to the textbook question, the risk of 80,000 exposures is about 13.5% using a Weibull model. So that's just kind of a high level of a way to do that here. Let's look at also just adding the other two models. Okay, we've got cracking, I'm adding in creep. And you'll see in here there's different boxes presented that represent like the combustion system or the fuel capability, where for this given model, this is what the LDM file calls for. But if I wanted to change that, I could select other configurations here and that would result in changing my rows for FNS as far as what gets included or doesn't. And then I can create new populations and segment it accordingly. Okay, so we've gotten all three models added and I think, you know, we're not going to spend more time on that, just playing with the models as far as options, but I'm gonna generate a report. And I have some options on what I want to include into the report. And I have a presentation and this LDM input is going to be the active models, sorry, the running models that come out as a table. All right, so I just need to select the appropriate folder where I want my presentation to go And now it's going to take a minute here to go through and and generate this report. This does take a minute. 
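The configuration matching CARMA performs, finding which empirical rows belong to each model before counting failures and suspensions, can be sketched with ordinary data-frame filtering. The column names, censor coding, and values below are all hypothetical; the real tool handles many more configuration columns.

```python
# Hypothetical sketch of the record-matching step: keep the empirical rows
# whose configuration columns match a model's configuration, then count
# failures and suspensions. Censor coding (0 = failure) is an assumption.
import pandas as pd

models = pd.DataFrame([
    {"model": "Oxidation", "combustion_system": "A", "fuel_capability": "Gas"},
    {"model": "Cracking",  "combustion_system": "B", "fuel_capability": "Dual"},
])
data = pd.DataFrame({
    "exposures": [34, 72, 110, 45, 90],
    "censor":    [0, 1, 1, 0, 1],
    "combustion_system": ["A", "A", "B", "B", "A"],
    "fuel_capability":   ["Gas", "Gas", "Dual", "Dual", "Gas"],
})

for _, m in models.iterrows():
    subset = data[(data["combustion_system"] == m["combustion_system"]) &
                  (data["fuel_capability"] == m["fuel_capability"])]
    n_fail = int((subset["censor"] == 0).sum())
    n_susp = int((subset["censor"] == 1).sum())
    print(m["model"], n_fail, "failures,", n_susp, "suspensions")
```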
But I think what I would just contrast it to is the hours that it would take normally to do this same task, potentially, if you were working outside of the tool. And so now we're ready to finalize the report. Save it. And save the folder and now it's done. It's, it's in there and we can review it. The other thing I'll point out, as I pull up, I'd already generated this previously, so I'll just pull up the file that I already generated and we can look through it. But there's, it's this is a template. It's meant for speed, but this can be further customized after you make it, or you can leave placeholders, you can modify the slides after you've generated them. It's doing more than just the life distribution modeling that I kind of highlighted initially. It's doing a lot of summary work, summarizing the data included in each model, which, of course, JMP is very good for. It, it does some work comparing the models, so you can do a variety of statistical tests. Use JMP. And again, JMP is great at that. So that, that adds that functionality. Some of the things our reviewers like to see and how the models have changed year over year, you have more data, include less. How does it affect the parameters? How does it change your risk numbers? Plots of course you get a lot of data out of scatter plots and things of that nature. There's a summary that includes some of the configuration information we talked about, as well as the final parameters. And it does this for each of the three models, as well as just a risk roll up at the end for for all these combined. So that was a quick walkthrough. The demo. I think we we've covered everything I wanted to do. Hopefully we'll get to talk a little more in Q&A if you have more questions. It's hard to anticipate everything. But I just wanted to talk to some of the benefits again. I've mentioned this previously, but we've seen productivity increases as a result of CARMA, so that's a benefit. Of course standardization our modeling process is increased and that also allows team members who are newer to focus more on the process and learning it versus working with tools, which, in the end, helps them come up to speed faster. And then there's also increased employee engagement by allowing engineers to use their minds where they can make the biggest impact. So I also wanted to be sure to thank Melissa Seely, Brad Foulkes, Preston Kemp and Waldemar Zero for their contributions to this presentation. I owe them a debt of gratitude for all they've done in supporting it. And I want to thank you for your time. I've enjoyed sharing our journey towards improvement with you all today. I hope we have a chance to connect in the Q&A time, but either way, enjoy the rest of the summit.