Brian Corcoran, JMP Director of Research and Development, SAS
Dieter Pisot, JMP Principal Application Developer, SAS
Eric Hill, JMP Distinguished Software Developer, SAS

You know the value of sharing insights as they emerge. JMP Live — the newest member of the JMP product family — reconceptualizes sharing by taking the robust statistics and visualizations in JMP and extending them to the web, privately and securely. If you'd like a more iterative, dynamic and inclusive path to showing your data and making discoveries, join us. We'll answer the following questions: What is JMP Live? How do I use it? How do I manage it? For background information on the product, see the video from Discovery Summit Tucson 2019 and the JMP Live product page.

JMP Live Overview (for Users and Managers) – Eric Hill
- What is JMP Live?
- Why use JMP Live?
- Interactive publish and replace
- What happens behind the scenes when you publish
- Groups, from a user perspective
- Scripted publishing: stored credentials, API key, replacing reports

Setup and Maintenance (for JMP Live Administrators) – Dieter Pisot
- Administering users and groups
- Limiting publishing
- Setting up JMP Live: Windows services and .env files
- Upgrading and applying a new license
- Using Keycloak single sign-on

Installing and Setting up the Server (for IT Administrators) – Brian Corcoran
- Choosing architectural configurations based on expected usage
- Understanding SSL certificates and their importance
- Installing the JMP Live database component
- Installing the JMP Pro and JMP Live components on a separate server
- Connecting JMP Live to the database
- Testing the installed configuration to make sure it is working properly
Dieter Pisot, JMP Principal Systems Engineer, SAS
Stan Koprowski, JMP Senior Systems Engineer, SAS

Data changes, and so do your JMP Live reports. Typical data changes involve either additional observations or modifications to the columns of data, and both necessitate updates to published reports. In the first scenario, an existing report might need to be recalculated to reflect the new observations, or rows, used in the report. In the second, you restructure the underlying data by adding or removing columns of information used in the report. In both situations you must update your report on a regular basis. In this paper we will provide practical examples of how to organize JSL scripts that replace an existing JMP Live report with a current version. Prior to the live demonstration, we will discuss key security protocols, including protecting the credentials needed to connect to JMP Live. The code presented is designed to be reused and shared with anyone who needs to publish or replace a JMP Live report on a predefined schedule, such as hourly, daily, weekly or monthly. With some basic JSL knowledge you can easily adapt it to automate updates to any of your other existing JMP Live reports. Not a coder? No worries, we've got your back. We will also provide a JMP add-in that uses a wizard-based approach to schedule the publishing of a new report, or of a replacement report, for those with little JSL knowledge.
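The session's examples are JSL; purely as a language-neutral sketch of the credential-protection point (Python, with a hypothetical environment variable name, and the actual JMP Live publish call deliberately left as a placeholder because it is version-specific), a scheduled job can read the API key from the environment so that it never appears in the script that gets shared or stored with the scheduler.

```python
import os
import sys


def load_api_key(var_name: str = "JMP_LIVE_API_KEY") -> str:
    """Return the API key from the environment; never hard-code it in the script.

    JMP_LIVE_API_KEY is a hypothetical variable name chosen for this sketch;
    use whatever secret store or variable name your site supports.
    """
    key = os.environ.get(var_name)
    if not key:
        sys.exit(f"{var_name} is not set; aborting the scheduled publish.")
    return key


if __name__ == "__main__":
    api_key = load_api_key()
    # Placeholder: hand the key to the JSL publish/replace script that the
    # scheduler (e.g., Windows Task Scheduler or cron) launches. The publish
    # call itself depends on your JMP / JMP Live version and is not shown.
    print("Credential loaded from the environment; ready to publish.")
```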
Zhiwu Liang, Principal Scientist, Procter & Gamble
Pablo Moreno Pelaez, Group Scientist, Procter & Gamble

Car detailing is a tough job. Transforming a car from a muddy, rusty, pet-fur-filled box on wheels into a like-new, clean and shiny ride takes a lot of time, specialized products and a skilled detailer. But what does the customer really appreciate in such a detailed cleaning and restoring job? Are shiny rims most important for satisfaction? Interior smell? A shiny waxed hood? It is critical for a car detailing business to know the answers to these questions to optimize the time spent per car, the products used, and the level of detailing needed at each point of the process. With the objective of maximizing customer satisfaction and optimizing the resources used, we designed a multi-stage customer design of experiments. We identified the key vectors of satisfaction (or failure), defined the levels for those, and approached the actual customer testing in adaptive phases, augmenting the design in each of them. This poster will take you through the thinking, designs, iterations and results of this project. What makes customers come back to their car detailer? Come see the poster and find out!

Speaker Transcript

Zhiwu Liang: Hello, everyone. I'm Zhiwu Liang, statistician at the Brussels Innovation Center for the Procter & Gamble company. I'm working for the R&D department. Hello.

Pablo Moreno Pelaez: Yep. So I'm Pablo Moreno Pelaez. I'm working right now in Singapore in the R&D department for Procter & Gamble. We wanted to introduce this poster, where we share a case study in which we wanted to figure out what makes a car detailing job great. As you know, Procter & Gamble is the very famous company for car detailing. No, just a joke. We had to anonymize what was actually done, so this is the way we wanted to share this case study, putting it in the context of a car detailing job. What we wanted to figure out here is what the key customer satisfaction factors were, for which we then built a design that we tested with some of those customers, to figure out how to build the model and how to optimize the detailing job for the car: how do we minimize the use of some of our ingredients, and how do we minimize the time we take for some of the tasks that it takes to do the detailing job?

So if you go to the next slide. The first thing we took a look at is: what are the different vectors that a customer looks at when they take the car to get detailed, to get it clean and shiny, and go back home with basically a brand-new car? They are looking at clean attributes, they're looking at shine attributes, and they are looking at the freshness of the car. From a cleaning point of view, we looked at the exterior cleaning, the cleaning of the rims and the cleaning of the interior; the shine of the overall body, the rims and the windows; and of course the overall freshness of the interior. We then wanted to build this by modifying these attributes in different ways and combining the different finishes that a potential car detailing job would give you. We wanted to estimate, and be able to build the model to calculate, what the overall satisfaction, and also the satisfaction with cleaning and the satisfaction with shine, would be when modifying those different vectors. This will allow us in the future to use the model to estimate:
okay, can we reduce the time that we spend on the rims, because it's not important? Can we reduce the time that we spend on the interior, or reduce the amount of products that we use for freshness, if those are not important? So really, to then optimize how we spend the resources on delivering the car detailing jobs. So, on the next slide you can see a little bit of the phases of the study.

Zhiwu Liang: Yeah, so as Pablo said, as the car detailing company we are very focused on consumer satisfaction. So for this particular job, what we have to do is identify the key factors which drive the consumer's overall satisfaction and their clean and shine satisfaction. In order to do that, we separated our study design and data collection experiments into three steps. First, we do the pilot, which is designed with five different scenarios, using five cars, to set up the different levels of each of our factors. At that point we set all five of the factors Pablo previously described at two levels, one low and one high. Then we recruited 20 consumers to evaluate all of the five cars in a different order. The main objective of this pilot is to check the methodology, to check whether the questions we ask are understood by consumers and answered correctly, and also to define the proper range of each factor.

After that, we go to phase one, which extends our design space to seven factors. Some factors keep the low and high levels as in the pilot; some extend to low, medium and high, because we think it is more relevant to the consumer to include more levels in those factors. And since we have more factors, from the custom design point of view you generate more experimental runs in the study, so in total we have 90 runs of car settings, and we ask each of the panelists to evaluate still five, but using a different order or different combination according to the custom design. When the consumers need to evaluate five out of the 90, as I said, we have to use the balanced incomplete block design technique, and we use 120 customers, each of them evaluating five cars.

So, in total, with the data we collect from these 120 customers, we run the model and identify the main effects and the interactions in our model. Through that we remove the unimportant factors and go to phase two, using the final identified six factors, and of course adding more levels for some factors, because we saw from the phase one study that low was not low enough, and the middle did not really match our consumer satisfaction. So we adjusted some factor levels for the low, middle and high, inserting them into the current design space. The phase two design of experiments is augmented from phase one, so that we get different settings for the 90 different cars, then ask 120 consumers to evaluate five each, in different combinations. Through that we can identify the best factor settings, the ones that give the optimal solution for consumer satisfaction and for clean and shine satisfaction. So as you can see here, we run the models using our six factor settings, each of which has played some role in consumer satisfaction, in terms of cleaning or shine satisfaction. For the overall satisfaction, clearly we can see that clean rims, shiny windows and a clean interior are the key drivers.
So if consumers see the rims clean and the windows shiny, normally they agree they were satisfied with our car detailing job. We also identified a significant interaction: exterior clean and interior clean combined together contribute differently to the overall satisfaction than to the clean satisfaction and the shine satisfaction models. We identified very significant impact factors for clean: clearly, all of the clean factors relate to clean satisfaction, and for shine, all of the shine factors relate to shine satisfaction. But there is still a different perspective: clean is focused on the rims and shine is focused on the windows. So from this validation, we can have better settings for all of the relevant factors, which helps us define the new projects that achieve the best consumer satisfaction based on all of the factor settings, I think.
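The phase described above assigns each of 120 panelists five of the 90 car settings using a balanced incomplete block design. Purely as a rough outside-JMP illustration of the balance goal (a simple rotation in Python with the talk's numbers, not the balanced incomplete block or custom design JMP would actually generate), the sketch below checks that every run ends up evaluated about equally often.

```python
import numpy as np

# Illustrative numbers taken from the talk: 90 design runs (car settings),
# 120 panelists, 5 evaluations each. This simple rotation is NOT a true
# balanced incomplete block design; it only illustrates the balance goal
# (each run seen a nearly equal number of times).
n_runs, n_panelists, k = 90, 120, 5

rng = np.random.default_rng(2020)
order = rng.permutation(n_runs)          # randomize the run order once
assignments = {}
for p in range(n_panelists):
    idx = [(p * k + j) % n_runs for j in range(k)]   # rotate through the runs
    assignments[p] = order[idx].tolist()

# Check balance: how often is each run evaluated?
counts = np.bincount(np.concatenate(list(assignments.values())), minlength=n_runs)
print(counts.min(), counts.max())        # 6 and 7 -> near-equal replication
```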
Phil Kay, JMP Senior Systems Engineer, SAS

People and organizations make expensive mistakes when they fail to explore their data. Decision makers cause untold damage through ignorance of statistical effects when they limit their analysis to simple summary tables. In this presentation you will hear how one charity wasted billions of dollars in this way. You will learn how you can easily avoid these traps by looking at your data from many angles. An example from media reports on "best places to live" will show why you need to look beyond headline results, and how simple visual exploration - interactive maps, trends and bubble plots - gives a richer understanding. All of this will be presented entirely through JMP Public, showcasing the latest capabilities of JMP Live.

In September 2017 the New York Times reported that Craven was the happiest area of the UK. Because this is an area that I know very well, I decided to take a look at the data. What I found was much more interesting than the media reports and was a great illustration of the small sample fallacy.

This story is all about the value of being able to explore data in many different ways, and how you can explore these interactive analyses and source the data through JMP Public. Hence "see fer yer sen," which translates from the local Yorkshire dialect as "see for yourself."

If you want to find out more about this data exploration, read these two blog posts: The happy place? and Crisis in Craven? An update on the UK happiness survey.

These and more of the interactive reports used in this presentation can be found in JMP Public.
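The small sample fallacy mentioned above can be illustrated with a quick simulation (Python, made-up numbers, not the UK happiness survey data): even when every area has the same true mean, the areas with the fewest respondents dominate both ends of a "happiest places" ranking simply because their averages are noisier.

```python
import numpy as np

rng = np.random.default_rng(1)
n_areas = 300
# Sample sizes vary widely by area, as in a national survey.
sizes = rng.integers(30, 2000, size=n_areas)
true_mean, sd = 7.5, 1.8              # identical true happiness everywhere

area_means = np.array([rng.normal(true_mean, sd, n).mean() for n in sizes])

rank = np.argsort(area_means)
top10, bottom10 = rank[-10:], rank[:10]
print("median sample size overall:", int(np.median(sizes)))
print("median sample size of 'top 10 happiest':", int(np.median(sizes[top10])))
print("median sample size of 'bottom 10':", int(np.median(sizes[bottom10])))
# The extremes are dominated by small-sample areas even though nothing differs.
```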
Laura Lancaster, JMP Principal Research Statistician Developer, SAS
Jianfeng Ding, JMP Senior Research Statistician Developer, SAS
Annie Zangi, JMP Senior Research Statistician Developer, SAS

JMP has several new quality platforms and features – modernized process capability in Distribution, CUSUM Control Chart and Model Driven Multivariate Control Chart – that make quality analysis easier and more effective than ever. The long-standing Distribution platform has been updated for JMP 15 with a more modern and feature-rich process capability report that now matches the capability reports in Process Capability and Control Chart Builder. We will demonstrate how the new process capability features in Distribution make capability analysis easier with an integrated process improvement approach. The CUSUM Control Chart platform was designed to help users detect small shifts in their process over time, such as gradual drift, where Shewhart charts can be less effective. We will demonstrate how to use the CUSUM Control Chart platform and how to use average run length to assess chart performance. The Model Driven Multivariate Control Chart (MDMCC) platform, new in JMP 15, was designed for users who monitor large numbers of highly correlated process variables. We will demonstrate how MDMCC can be used in conjunction with the PCA and PLS platforms to monitor multivariate process variation over time, give advance warning of process shifts and suggest probable causes of process changes.
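For readers who want the idea behind the CUSUM chart in concrete terms, here is the standard tabular CUSUM recursion in Python (a textbook sketch with made-up target, reference value k and decision interval h, not JMP's implementation): a small sustained shift accumulates until one of the cumulative sums crosses its decision limit, which is exactly the kind of drift a Shewhart chart can miss.

```python
import numpy as np

rng = np.random.default_rng(7)
target, sigma = 10.0, 1.0
k, h = 0.5 * sigma, 5.0 * sigma          # usual reference value and decision interval
# 30 in-control points, then a small +0.75*sigma sustained shift
x = np.concatenate([rng.normal(target, sigma, 30),
                    rng.normal(target + 0.75 * sigma, sigma, 30)])

c_plus = c_minus = 0.0
for i, xi in enumerate(x, start=1):
    c_plus = max(0.0, xi - (target + k) + c_plus)     # accumulates upward shifts
    c_minus = max(0.0, (target - k) - xi + c_minus)   # accumulates downward shifts
    if c_plus > h or c_minus > h:
        print(f"CUSUM signal at observation {i}")
        break
```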
The purpose of this poster presentation is to display COVID-19 morbidity and mortality data available online from Our World in Data, whose contributors ask the key question: "How many tests to find one COVID-19 case?" We use SAS JMP to help answer the question. Smoothing test data from Our World in Data yields seven-day moving average, or SMA(7), total tests per thousand in five countries for which coronavirus test data are reported: Belgium, Italy, South Korea, the United Kingdom and the United States. Similarly, seven-day moving average, or SMA(7), total cases per million were derived using the Time Series Smoothing option. Coronavirus tests per case were calculated by dividing smoothed total tests by smoothed total cases and multiplying by a factor of 1,000. These ratios of smoothed tests to smoothed cases were themselves smoothed. Additionally, Box-Jenkins ARIMA(1,1,1) time series models were fitted to smoothed total deaths per million to graphically compare smoothed case-fatality rates with smoothed tests-per-case ratios.

Auto-generated transcript

Douglas Okamoto: In our poster presentation we display COVID-19 data available from Our World in Data, whose database sponsors ask the question: why is data on testing important? We use JMP to help us answer the question. Seven-day moving averages are calculated from January 21 to July 21 for daily per capita COVID-19 tests and coronavirus cases in seven countries: the United States, Italy, Spain, Germany, Great Britain, Belgium and South Korea. Coronavirus tests per case were calculated by dividing smoothed tests by smoothed cases and multiplying by a factor of 1,000.

Daily COVID-19 test data yield smoothed tests per thousand in Figure 1. Testing in the United States, in blue, trends upward, with two tests per thousand daily on July 21st, 10 times more than South Korea, in red, which trends downward. The x-axis in Figure 1 is normalized to days since the moving average reached one or more tests per thousand.

In Figure 2, smoothed coronavirus cases per million in Europe and South Korea trend downward after peaking months earlier than the US, in blue, which averaged 2,200 cases per million on July 21st, with no end in sight. The x-axis is normalized to the number of days since the moving average reached 10 or more cases per million.

Combining results from Figure 1 and Figure 2, smoothed COVID-19 tests per case in Figure 3 show South Korean testing, in red, peaking at 685 tests per case in May, 38 times the US performance, in blue, of 22 tests per case in June. Since the x-axis is dated, Figure 3 represents a time series. The reciprocal of tests per case, cases per test, is a measure of test positivity: one in 22, or 4.5%, positivity in the US compares with 0.15% positivity in South Korea and 0.5 to 1.0% in Europe. At a March 30 WHO press briefing, Dr. Michael Ryan suggested a positivity rate less than 10%, or even better, less than 3%, as a general benchmark of adequate testing.

JMP was used to fit Box-Jenkins time series models to smoothed tests per case in the US from March 13 to April 25; values from April 26 to May 9 were forecast from a fitted autoregressive integrated moving average, or ARIMA(1,1,1), model. In Figure 4, the time series of smoothed tests per case from mid-March onward shows a rise in the number of US tests per case during the 14-day forecast period, not a decline as predicted.
In summary, 10 or more tests were performed per case, providing adequate testing in the United States. COVID-19 testing in Europe and South Korea was more than adequate, with hundreds of tests per case. Equivalently, the positivity rate, or number of cases per test, was less than 10% in the US, whereas positivity in Europe and South Korea was well under 3%. When our poster was submitted, the US totaled 4 million coronavirus cases, more than the European countries and South Korea combined. The US continues to be plagued by state-by-state disease outbreaks. Thank you.
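The arithmetic described in the poster can be sketched outside JMP as follows (pandas, with toy data standing in for the Our World in Data series; the column names and numbers are hypothetical): seven-day moving averages of tests and cases, their ratio as tests per case, and its reciprocal as a positivity rate.

```python
import numpy as np
import pandas as pd

# Toy daily series for one country; real inputs would come from Our World in Data.
rng = np.random.default_rng(0)
days = pd.date_range("2020-03-01", periods=60, freq="D")
df = pd.DataFrame({
    "date": days,
    "tests_per_thousand": np.linspace(0.2, 2.0, 60) + rng.normal(0, 0.05, 60),
    "cases_per_million": np.linspace(50, 220, 60) + rng.normal(0, 10, 60),
})

# SMA(7): seven-day simple moving averages
df["sma_tests"] = df["tests_per_thousand"].rolling(7).mean()
df["sma_cases"] = df["cases_per_million"].rolling(7).mean()

# Tests per case = smoothed tests / smoothed cases * 1,000
# (the factor of 1,000 reconciles per-thousand tests with per-million cases)
df["tests_per_case"] = df["sma_tests"] / df["sma_cases"] * 1000
df["positivity"] = 1 / df["tests_per_case"]            # cases per test

print(df[["date", "tests_per_case", "positivity"]].tail())
```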
Christian Stopp, JMP Systems Engineer, SAS
Don McCormack, Principal Systems Engineer, SAS

Generations of fans have argued as to who the best Major League Baseball (MLB) players have been and why, oft citing whichever performance measures best supported their case. Whether the measures were statistics of a particular season (e.g., most home runs) or cumulative over a career (e.g., lifetime batting average), such statistics do not fully relate a player's performance trajectory. As the arguments progress, it would be beneficial to capture the inherent growth and decay of player performance over one's career and distill that information with minimal loss. JMP's Functional Data Explorer (FDE) has opened doors to new ways of analyzing series data and capturing 'traces' associated with these functional data for analysis. We will explore FDE's application in examining player career performance based on historical MLB data. With the derived scores we will see how well we can predict whether a player deserves their plaque in the Hall of Fame…or is deserving and has been overlooked, as well as compare these predictions with those based solely on the statistics of yore. We'll confirm Ted Williams really was the greatest MLB hitter of all time. What, you disagree?! Must be a Yankees fan…

Auto-generated transcript

Christian: So thank you, folks, for joining us here today at the JMP Discovery Summit, the virtual version. My name is Christian Stopp. I am a JMP systems engineer, and I'm joined today by my colleague Don McCormack, who's a principal systems engineer for JMP as well. You probably got here because you saw the title of the talk, and either you're a baseball fan who saw it was about Major League Baseball players and wanted in, or you saw it was about Functional Data Explorer and you wanted to learn a little bit more about how to employ Functional Data Explorer in different settings. So we're going to marry those two topics today, Don and I, and I'm going to gear my conversation a little more for the baseball fans first. Just as baseball players and baseball fans have these common conversations, you might think about how your favorite player does relative to other players, and you might have these conversations with your friends, hopefully kept, you know, polite, about who your favorite player is and why. That's kind of how I imagined this infamous conversation between Alex Rodriguez and Varitek going: just comparing notes about who their favorite player was. For me, my origin started off, like Don's, with just having a love for baseball and being interested in the baseball statistics that you'll find on the back of the bubble gum cards we used to collect. And as you have these conversations about who your favorite player is, you might note that players differ with respect to how good they are, but also with respect to things like how they age, where they peak, and where the performance starts to fall off over time. So if you're thinking, maybe like me, about the career trajectories of these players, you might ask: well, how do I capture or model that performance over time? Now, if you're oddly like me, you decide that you want to pursue statistics so that you can do exactly that.

But I would encourage you to skip that route, be smarter than me, and just use a tool like Functional Data Explorer to help you turn those statistical curves into numbers to use for your endeavors. For those of you who are a little less familiar with baseball, what we'll be seeing is data reflecting measures of baseball performance. I'm going to be speaking about position players, and position players bat. One of the metrics of their batting prowess is on-base percentage plus slugging percentage, or OPS. On the y-axis I've got that measure for a couple of different players as they age; the blue is Babe Ruth and the red is Ted Williams. You get a sense from these trajectories that they both appear to have about the same quality of performance over most of their careers, but where they peak might seem to be at a little older age for Ted Williams as opposed to Babe Ruth. And Babe Ruth, it looks like he maybe needed just a little bit of time to get up to speed to reach that level, if you're just looking at this plot without any other knowledge. This is just two players of the thousands, or tens of thousands, that you might be comparing, and you can imagine there's a lot of variability in these characteristics of their career trajectories. There's also clearly variability within a player's trajectory, too. So I might use the smoothing function of Graph Builder here to smooth out the noise associated with those curves a little bit, to get a better sense of the signal about that player's trajectory. It turns out that that smoothing is very similar to what goes on in the process that Functional Data Explorer employs. So here I've got Functional Data Explorer, and again my metric is on-base percentage plus slugging percentage, OPS. Just as we were comparing these player trajectories, now FDE, Functional Data Explorer, is smoothing out those player curves, as you can see, and then extracting information about what's common across those curves. What you get in return for doing that, for every player, are scores associated with that player's performance. These scores describe the player's career trajectory in a nice little quantitative way for us to take away and use in other analyses, like we'll be doing. You can see that a little bit here; these are Hank Aaron's scores. And in the profiler that you can access in Functional Data Explorer, you can look at the trajectory here for that player's OPS over age, change those values to reflect what that player's scores are, and replicate their career trajectory with those scores. Right, so that's a little bit about FDE and how to employ it here. You'll see Don and me talking about these statistics that we're now equipped with, these player scores that we get out of Functional Data Explorer from the curves that we started off with. And we're going to use those; some of what we're doing is predicting Hall of Fame status.
And not only whether those who are in the Hall of Fame belong there, but, more interestingly, who are the players in the Hall of Fame who maybe shouldn't be because the stats don't support it, or players the Hall of Fame committee seems to have snubbed. So we'll talk a little bit about the different metrics that we used and how we revised them, and then about taking those career trajectories, using FDE, getting the scores out, and doing the prediction like we normally would with other data. If you haven't followed baseball, the Hall of Fame eligibility requirements are that a player had to play at least 10 seasons, and you have to wait five years before you're eligible. Then you have 10 years during which you're eligible and folks can vote you in. So there are a couple of players we'll see that are still waiting for the call. The Hall's selection criteria are primarily around how well the player performed, but they also take into account other things that the data source we're using, the Lahman database, doesn't include, so they're hard to measure. So we just stick with analyses that reflect their statistical prowess on the field. And of course, after 150 years of baseball players playing baseball, you might recognize that they're playing in different eras. So we want to make sure that we're comparing the players to their peers: we're going to take the year that they played into account, and the position that they played, since different requirements would typically be associated with different positions. And then different leagues have different rules; we'll weigh that in, too. That's where I'm gonna stop. Don's gonna kick over to pitching and then I'll come back and talk about position players.

Don McCormack: So like Christian said, I'm going to talk a little bit about pitching, but before I do that, I would like to illustrate some of those initial points that Christian mentioned. These are good data analytic practices, things that really need to be done regardless of what modeling technique you use, and it turns out that they are good things to do before you model your data using FDE. I'm going to talk specifically about cleaning the data, about normalizing the data so you can compare people equally, and then finally modeling the data. As an illustration, what you see on the screen right now is three very different pitchers who are all in the Hall of Fame. The red line is Nolan Ryan, a very long career, about a 27-year career. The green line, the middle line, that's Hoyt Wilhelm. Some of you younger folks might not know who Hoyt Wilhelm is; he pitched starting in the early 50s through '72, I believe. Fairly long career; spanned multiple eras. He was mostly a reliever, but not a reliever like the relievers you might know today. He's a guy who, when he went out to relieve, you know, he might pitch six innings. So very, very atypical of the relievers today. The blue line is Trevor Hoffman, great closer for the San Diego Padres, but again, a very different pitcher. So the question is, what do we do, how do we get this data ready and set up in such a way that we can compare all three of these people equally?

The first thing I mentioned is that we want to clean up the data. And by the way, I'm going to use four different metrics: WHIP (walks and hits per inning pitched), strikeouts per nine, home runs per nine, and a metric I've created called percent batters faced over the minimum, where I've just taken the number of batters a pitcher has faced, divided by the total outs that they've gotten, and subtracted one. The idea here is that if every batter that was faced made an out, then that ratio would be a perfect one. Okay, I'm going to look at those four metrics. I've got different criteria in terms of how I define my normalization and how I am screening outliers, and I'm going to include a PowerPoint deck for you to look at to get the details, but I'm not going to talk about them here for the sake of time. So the first thing I'm going to do is clean up the data. You'll notice, for example, that the very first year Nolan Ryan pitched, three innings pitched, a very, very high WHIP. And a couple of seasons in here, I think Trevor Hoffman pitched a low amount. So I'm going to start by excluding that data. That's nice; it's shrunk the range, and it's always good to get the outliers out of the data before you do the analysis. One other step I want to mention is that when I used FDE on this data, the platform allows you to do some additional outlier screening where, even if you have multiple columns that you're using, you're not screening out the entire row; you're only screening out the values for that given input, which is a very nice feature. So I used that as well because, even with my initial screening, there were still a few anomalies that I needed to get rid of. So that's cleaning the data. Normalizing is the second step. By normalization, what I've done is basically normalize on the x-axis and normalize on the y-axis. What we're looking at here is the number of seasons. Each one of these seasons is taken as a separate whole entity, but we all know that in some seasons some pitchers throw more innings than in other seasons. So rather than looking at seasons as my entities, I'm going to look at the cumulative percent of career outs. I know that at the end of a season a pitcher has made so many cumulative career outs, and that's a certain proportion of their whole, or total, career outs. So I'm going to use that to scale my data. The great thing about that is you'll notice that now all three pitchers are on the same x scale; everything is scaled from zero to one. From the standpoint of FDE analysis, that is a really nice thing to have. And then finally, I want to scale on the y-axis as well. All I've done is divide the WHIP by the average WHIP for the pitcher type and for the era that they pitched in, so I have a relative WHIP. Now, the other nice thing about using these relative values is that I know where my line in the sand is: I know that a pitcher that has a relative WHIP of one is an average pitcher. So in this case, I'm going to be looking for those guys that throw with WHIPs under one, and you'll notice that all three of these pitchers, for most of their careers, were under that line at one. Now, the final thing I'm going to do is use FDE to model that trajectory. There are two problems with using the data as is.
One is that it's pretty bumpy, and it would be really hard to estimate what the career trajectory is with all of these ups and downs. The second thing is that eventually I want to use the metric that I've generated from FDE, this trajectory, to come up with some overall career estimate. So rather than looking at my seasons or my cumulative percent as discrete entities, I want to be able to model that over the entire continuous career, and we'll see that a little bit later on. So I am going to replace my relative WHIP with this conditional FDE estimate. Now, you might have seen me flip back between those two and say, oh boy, what a huge difference between the two; is that really doing a good job? It's kind of hard to tell from that graph. So what I want to do is actually show you what that looks like. Here I've pulled up the discrete values. This is Nolan Ryan, by the way. Looking at the discrete measurements for Nolan Ryan along with the curve for his conditional FDE estimate, you'll see that it doesn't follow the same jagged, bumpy path, but it does a good job estimating what his career trajectory is. In general his WHIP was high at first; he walked a lot of people, was a very, very wild pitcher, much more wild in the beginning part of his career, believe it or not. But as his career went on, that got better. And you'll see this in any of the pitchers that I picked. For example, let's go to Hoyt Wilhelm. Here's Wilhelm. Again, it doesn't capture the absolute highs and lows, but it does a good job at modeling the general direction of where his career went. Okay, so let's use that. I only have a limited amount of time; I wish I had more, because there are just some neat things I could show you. But I'm going to start with what I call the snubbed. I used FDE on those four metrics I mentioned, I used those as inputs along with the pitcher type, and I tried a whole bunch of predictive modeling techniques. The two that worked the best for me were naive Bayes and discriminant analysis. I used those two modeling techniques to tell me who should be in and who shouldn't be in, and what we're looking at here is those pitchers where both the naive Bayes and the discriminant analysis said yes, but the Hall of Fame said no. So these are the snubbed. You'll notice in this case...let me switch to this. This is the relative WHIP; let's go with the conditional WHIP. Let me go ahead and put that reference line back in there at one, and you'll see that, for the most part, these are pitchers who spent the bulk of their careers under that one line. Now, the other thing you might think, looking at this data, is: wow, it would be really hard to tell these players apart. How do I compare them now if I were to put, let's say, a few pitchers that are in the Hall in this list, too? It would be hard to separate them just by eyeballing them, because for some parts of their careers they would be better than others, and they would switch in other parts of their careers. How do I deal with this on a career level?

So as I mentioned earlier, one of the nice things about Functional Data Explorer is that I can take that data, create a career trajectory, and estimate a whole bunch of data points along that career trajectory. I did that: I broke up careers into 100 units, and I summed over all those hundred units for each one of my curves. So basically what I got is something like an area under the curve: if it were above that one line, I would add; below the line, I would subtract. And if we look at total career trajectories...this is actually a larger list, approximately 1,300 or 1,400 pitchers, so absolutely everyone who was Hall eligible, 10 years or more. Let's really quickly go into a couple of things we can do with this. Let me start out by looking at the players that were snubbed. These are my players that were snubbed. Okay, so these are 100 values, so the line in the sand here would be 100, because I've got 100 different values I've measured. You'll notice that, for the most part, these players were above 100. Here's the list of the players that didn't make it. If you take a look at these players, you'll notice there are a couple of guys in here that are obvious: people like Curt Schilling and Roger Clemens, who for non-career reasons, some of the other criteria that Christian mentioned, are not in there. But there are some guys, for example Clayton Kershaw, who are still not done with their careers. And there certainly are other people you might consider that are Hall eligible. So let's look at that, too. Let's look at those folks who are Hall eligible but have not been in the Hall: BJ Ryan; again, Curt Schilling is in there; Johan Santana, not sure why he didn't make it into the Hall; Smokey Joe Wood, a pitcher from the early part of the 1900s; and so on. So the ability of FDE to let me extract values from anywhere along their career trajectory is an extra tool for estimating some additional criteria in terms of who belongs in the Hall and who doesn't. So, enough said about the pitchers. I'm gonna turn it back to Christian so we can talk a little bit more about the position players.

Christian: Excellent. Thank you, Don. Okay, so Don was talking about the pitchers, and I'll be looking at the position players, and there are two different components that go into that. You have your batting prowess as well as your fielding prowess, and I took a little different take than Don did with respect to looking at the statistics and then building models. I ended up starting off with just four of the more common batting statistics, and those are the first four on the list here, some of what you'd find on the backs of baseball cards. Then, as I was progressing, as we'll see, I needed something to capture stolen bases, because the first four really don't do that at all. So I created a metric I call the base unit average, which brings in other base runner movements to give credit to the batter for those things.
And then fielding, of course, is a factor as well, as we'll see, so I included a couple of metrics for fielding. Like Don mentioned earlier, I wanted to make sure I compared apples to apples, so I'm looking at those statistics with reference to position, league and year. And like Don, I wanted to make sure I weighted the smaller sample sizes appropriately so they weren't gumming up the system. So I ended up weighting players' performance by the number of plate appearances relative to the average for that league and year at a particular lineup slot, that is, how many plate appearances that slot should get over the course of the season. So that's how they're weighted. Right, so let's see what that looks like. We're going to go back and visit Ted Williams again here. We've got Ted Williams' career on the left here, and these are the raw scores. It looks like he had a really poor season here, but once you take the relative component of that, you can see it's actually an average season; as with Don's line, it's still above that average line of one. So it was just kind of a poor season by Ted Williams' own standards. And we saw earlier that these two peaks for Ted Williams might have looked like his peak performance, but it turns out those are seasons where he had smaller numbers of plate appearances due to his going off to the Korean War. That impacted his scores, and I weighted them back toward the average because of the smaller sample sizes. So those are the types of data. I'm going to focus on just the relative statistics in my conversation here, and on some of the things that caught my eye. There we go. We need the table of numbers here to feed in from the FDE, so here are the scores that we're going to be looking at, the relative FPC scores from the FDE. For the first thing I saw: I included four variables in my model, those first four batting statistics, and I wanted to make sure I had the right components in my analysis. On the y-axis here is the model-driven probability of being in the Hall of Fame, and on the x-axis is whether or not the person actually is in the Hall of Fame. So my misclassification areas are these two sections here. I noted that there were more players down here than I was expecting, so I explored variables that I hadn't yet included, like stolen bases, and popped those in for color and size. As you can see, it seemed pretty clear to me that stolen bases is definitely a factor that the Hall of Fame voters were taking into account; the color and size are relative to the number of stolen bases over their careers. This is what drove me to create that base unit average statistic that I then used. As I was exploring those models, as I described, I started off with four statistics and then added in that BUA statistic; this is my x-axis now. Then I added in fielding statistics, and what we have here is a parallel plot, where the y-axis is again the probability of the model suggesting the player should be in the Hall of Fame, and each of the lines is a player. The color represents their Hall of Fame status: red is yes, they were already admitted, and blue is no.

I like this plot because it allows me to see who's moving, so I can see the impact of those additional variables in the model. Of course the first thing that caught my eye was this guy here, who popped up from a "not really" once we added the stolen base component, and we can see that he has a high probability of being elected to the Hall of Fame, or of belonging there, depending on how you look at it. It's Rickey Henderson, who happens to be the career leader in stolen bases. Another player, looking at the defensive side of things, is Kirby Puckett, whose statistics suggest, based on the initial model, that he makes it; he's qualified sufficiently, just across the line. But if you add in the stolen base component, he actually doesn't seem to qualify any longer. Then finally, when we put in the fact that he was a really good fielder, he won a number of Gold Gloves playing center field for the Twins, we see that he's back in the good graces of the Hall of Fame committee, and rightfully voted in. This is kind of a messy model, well, not a messy model, but there's a lot of stuff going on here. So I ended up adding a local data filter so I could look at each position individually. Here, for first base, it's a lot easier to see the folks in red and in blue. Now we've got somebody here, Todd Helton, where all the models that we were looking at suggest that he should be admitted to the Hall of Fame, and he's still eligible, so he's still waiting for the call. But someone like Dick Allen, who's also blue, is not in; his numbers, at least based on the summary stats, the FDE statistics that we're using, and the models, suggest he belongs in the Hall. And there are other folks in red, down at the bottom, like Jim Thome, who the models suggest doesn't really belong, but he was voted in. So there are different ways of exploring those relationships as we add in those predictors. Now, like Don, I wanted to get a sense of who was snubbed and who might have been gifted, or at least had, you know, non-statistically oriented components to his consideration. So I, like Don, ran a number of models and settled on four that I liked and that did the best predictive job, and like Don, rather than just using age in my FDE as the x-axis, I also based it on a cumulative percent of plate appearances. Having these two different variants gave me a number of models to look at. So I drilled down to just the folks who, across all eight models, are in the Hall of Fame but none of the models suggest they should be. That's this line here; there are 31 of those. And the reverse side I have in green here: the folks where the majority of the models in either bucket, age versus cumulative percent of plate appearances, suggest they do belong in the Hall of Fame, but they're not in. So I pulled all these folks out and, just like Don, wanted to compare what their trajectories look like: are they at least close, or is there something else going on here? And you can see from this, this is on-base percentage plus slugging percentage, OPS, again.
It certainly looks like, in red and the plus signs, the folks who were snubbed performed a lot better on this metric, and as it turns out on every other offensive metric, than the gifted folks, the folks who are in but the models suggest shouldn't be. That made me think: well, is it just the offensive stats, and maybe fielding is where the folks who are already in shine? Based on at least fielding percentage, it's still the case that the gifted folks don't necessarily look like they belong as much as these snubbed folks do. It was only on the range factor component that the tide reversed, and you end up seeing the gifted folks outperform the snubbed folks. That's another take, much like Don's, that you can use to evaluate what components are included in your model. There are a lot of different ways we can look at the data here. So, just wrapping up, because I'm sure some of you are burning to know who was snubbed and who was gifted among those folks: these are some of the folks that were snubbed, at least among the position players, and, like Don mentioned for some of his pitchers, there are a few of these folks who are banned from baseball, so they're not exactly snubbed. You probably recognize some of these. And then these are some of the players who were gifted, or at least their statistics alone may not have been what got them into the Hall of Fame. So, just wrapping up where we've been: we've been able to take those player career trajectories of their performance on whatever metric you pick, put that into Functional Data Explorer, and get out numerical summaries that capture the essence of those curves. Then, in turn, we use those scores in the traditional statistical techniques that we're familiar with. And so now we can change the question from how to model or quantify a career trajectory to a question of what we want to explore with these FPC scores. So we hope you enjoyed talking about baseball, and the intersection of baseball, JMP and FDE, and hope you feel empowered to go take the FDE tool that's available in JMP Pro to address questions with data, like who your favorite player is and why, and have the means of backing it up. Thanks for joining us. Take care.

Don McCormack: Okay, so how do we deal with these cases where we need to look at somebody's career trajectory? Are there other metrics where we can make these comparisons, so that we can tell these really fine gradations apart? As I alluded to earlier, we can look at absolutely any point along a person's career trajectory with any amount of gradation that we want. And I did that: I took 100 data points, 100 values between zero and one, start of the career to end of the career, and I summed over all those values. The nice thing about this technique is that I can do it for multiple metrics. So now what we're looking at here is a plot of all four metrics; we can plot them all on one graph. We're going to go back again to that group of folks that were snubbed, these folks here.

If we take a look at these folks, we see that they had low values; by the way, the line is at 100 in this case because there were 100 observations. Home runs per nine, you want that low; percent batters faced over the minimum, low; and strikeouts per nine innings, you want on the high side. You'll notice that's kind of the pattern these folks follow. Now, the interesting thing at this point is that I can use any criteria that I want. So for example, let's say I'm going to consider all my players, and I only want those people who had a WHIP that was below, in this case, 100, so better than average; actually, let's make it even better than that, say 90 or below. Okay, and let's look at those folks who at least have the average number of strikeouts per nine innings, and maybe their percent batters faced over the minimum is at 100 or below; and I'll disregard home runs per nine here. You can also standardize and normalize by the number of seasons, and I've done exactly that. So I want to look at those players that have, say, 10 season equivalents, where a season equivalent is based on what the average player's season was like. All right. And then finally, what kind of workload they had over their entire career. Let's say we want somebody who had at least 80%; let's make it a little stricter, say, about the same workload. Again, we can use different criteria to weed out the folks we don't think we should consider from the folks we do, and then, using those criteria, let's take a look at those folks that are not in the Hall of Fame. So here we go; now we have a list of people who are worth considering. You'll notice there are quite a few folks that probably shouldn't surprise you. These are folks that are either not in the Hall yet because they're still playing, or have just been disregarded. Chris Sale, for example, is still pitching. Curt Schilling, for obvious reasons, is not in the Hall. Johan Santana, why isn't he in the Hall? He was actually part of that group that was snubbed. So the nice thing about using these FDE estimates is that you can take them, turn them into career trajectories, and then use an additional metric to determine Hall worthiness and non-Hall worthiness.
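Functional Data Explorer is the JMP Pro tool used in the talk; purely as an outside-JMP sketch of the underlying idea (smooth each career onto a common 0-to-1 grid, then run ordinary PCA on the gridded curves to get one row of scores per player), the code below uses simulated career curves. It approximates functional PCA and is not the spline-basis model FDE actually fits; the resulting scores are what would then feed a classifier such as naive Bayes or discriminant analysis, as described above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
grid = np.linspace(0.0, 1.0, 50)      # normalized career (0 = debut, 1 = final out)

def career_curve(peak, height, noise=0.05):
    """Toy performance-vs-career curve: rise, peak, decline, plus noise."""
    curve = height * np.exp(-((grid - peak) ** 2) / 0.08)
    return curve + rng.normal(0, noise, grid.size)

# Simulate 200 players with different peak locations and peak heights.
players = np.vstack([career_curve(rng.uniform(0.3, 0.7), rng.uniform(0.6, 1.2))
                     for _ in range(200)])

pca = PCA(n_components=3)
scores = pca.fit_transform(players)    # one row of scores per player
print("variance explained:", np.round(pca.explained_variance_ratio_, 3))
# 'scores' would then be used as predictors of Hall of Fame status.
```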
Steve Hampton, Process Control Manager, PCC Structurals Jordan Hiller, JMP Senior Systems Engineer, JMP   Many manufacturing processes produce streams of sensor data that reflect the health of the process. In our business case, thermocouple curves are key process variables in a manufacturing plant. The process produces a series of sensor measurements over time, forming a functional curve for each manufacturing run. These curves have complex shapes, and blunt univariate summary statistics do not capture key shifts in the process. Traditional SPC methods can only use point measures, missing much of the richness and nuance present in the sensor streams. Forcing functional sensor streams into traditional SPC methods leaves valuable data on the table, reducing the business value of collecting this data in the first place. This discrepancy was the motivator for us to explore new techniques for SPC with sensor stream data. In this presentation, we discuss two tools in JMP — the Functional Data Explorer and the Model Driven Multivariate Control Chart — and how together they can be used to apply SPC methods to the complex functional curves that are produced by sensors over time. Using the business case data, we explore different approaches and suggest best practices, areas for future work and software development.     Auto-generated transcript...   Speaker Transcript Jordan Hiller Hi everybody. I'm Jordan Hiller, senior systems engineer at JMP, and I'm presenting with Steve Hampton, process control manager at PCC Structurals. Today we're talking about statistical process control for process variables that have a functional form.   And that's a nice picture right there on the title   slide. We're talking about statistical process control, when it's not a single number, a point measure, but instead, the thing that we're trying to control has the shape of a functional curve.   Steve's going to talk through the business case, why we're interested in that in a few minutes. I'm just going to say a few words about methodology.   We reviewed the literature in this area for the last 20 years or so. There are many, many papers on this topic. However, there doesn't really appear to be a clear consensus about the best way to approach this statistical   process control   when your variables take the form of a curve. So we were inspired by some recent developments in JMP, specifically the model driven multivariate control chart introduced in JMP 15 and the functional data explorer introduced in JMP 14.   Multivariate control charts are not really a new technique they've been around for a long time. They just got a facelift in JMP recently.   And they use either principal components or partial least squares to reduce data, to model and reduce many, many process variables so that you can look at them with a single chart. We're going to focus on the on the PCA case, we're not really going to talk about partial   the   partial least squares here.   Functional Data Explorer is the method we use in JMP in order to work with data in the shape of a curve, functional   data. And it uses a form of principal components analysis, an extension of principal components analysis for functional data.   So it was a very natural kind of idea to say what if we take our functional curves, reduce and model that using the functional data explorer.   
The result of that is functional principal components, and just as you would take regular principal components and push them through a model driven multivariate control chart, what if we could do that with functional principal components? Would that be feasible, and would it be useful? So with that, I'll turn things over to Steve, and he will introduce the business case that we're going to discuss today. Steve Hampton All right. Thank you very much, Jordan. Since I do not have video, I decided to let you guys know what I look like. That's me with my wife Megan and my son Ethan at last year's pumpkin patch. So I wanted to step into the case study with a little background on what I do, so you have an idea of where this information is coming from. I work in investment casting for Precision Castparts... Investment Casting Division. Investment casting involves making a wax replicate of what you want to sell, putting it into a pattern assembly, dipping it multiple times in proprietary concrete until you get enough strength to be able to dewax that mold. And we fire it to have enough strength to be able to pour metal into it. Then we knock off our concrete, we take off the excess metal used for the casting process, we do our nondestructive testing, and we ship the part. The drive for looking at improved process control methods is the fact that Steps 7, 8, and 9 take up 75% of the standing costs because of process variability in Steps 1-6. So if we can tighten up 1-6, so that most of the ??? and cost go there, which is much cheaper and much shorter, then there is a large value add for the company and for our customers in making 7, 8, and 9 much smaller. So, PCC Structurals. My plant, Titanium Plant, makes mostly aerospace components. On the left there you can see a fan ??? that is glowing green from some ??? developer. And then we have our land-based products, which right there is an N155 howitzer stabilizer leg. And just to get an idea of where it goes, because every single airplane up in the sky basically has a part we make, or multiple parts, this is an engine section ???. It's about six feet in diameter, it's a one-piece casting that goes into the very front of the core of a gas turbine engine. This one in particular is for the Trent XWB that powers the Airbus A350 jets. So let's get into JMP. The big driver here is, as you can imagine, with something that is as complex as an investment casting process for a large part, there are tons of data coming our way. And more and more, it's becoming functional as we increase the number of sensors we have and the number of machines that we use. So in this case study, we are looking at data that comes with a timestamp. We have 145 batches. We have our variable of interest, which is X1. We have our counter, which is a way that I've normalized that timestamp, so it's easier to overlay the runs in Graph Builder, and it also has a little bit of added niceness in the FDE platform. We have our period, which allows us to have that historic period and a current period that line up with the model driven multivariate control chart platform, so that we can have our FDE only be looking at the historic data, so it's not changing as we add more current data. So this is kind of how you would look at it if you were using this in practice. And then the test type is my own validation attempt. And what you'll see here is I've mainly gone in and tagged things as bad, marginal, or good. 
So red is bad, marginal is purple, and green is good, and you can see how they overlay. Off the bat, you can see that we have some curvy ??? curves away from the mean. These are obviously what we will call out of control, or bad. This would be what manufacturing calls a disaster, because that would be discrepant product. So we want to be able to identify those earlier, so that we can go look at what's going on in the process and fix it. This is what it looks like broken out, so you can see that the bad has some major deviations, sometimes from the mean curve, and a lot of character towards the end. The marginal ones are not quite as deviant from the mean curves but have more bouncing towards the tail, and then the good ones are pretty tight. You can see there's still some bouncing. So this is where the marginal and the good is really based upon my judgment, and I would probably fail an attribute Gage R&R based on just visually looking at this. So we have a total of 33 bad curves, 45 marginal, and 67 good. And manually, you can just see that about 10 of them are out. So, if you didn't want to use a point estimate (which I'll show a little bit later doesn't work that great), you would have the option of maybe controlling them point by point using the counter. And the way you would do that would be to split the table by counter, put it into an individual moving range control chart through Control Chart Builder, and then you would get out something like 3,500 control charts in this case. You can then use the awesome ability to make combined data tables to turn the limit summaries from each one into its own data table, which you can then link back to your main data table, and you get a pretty cool looking analysis that looks like this, where you have control limits based upon the counters and historic data and you can overlay your curves. So if you had an algorithm that would tag whenever a curve went outside the control limits, you know, that would be an option for trying to have control chart functionality with functional data. But you can see, especially with batch 38 that I've highlighted here, that you can have some major deviation and stay within the control limits. So that's where this FDE platform really can shine, in that it can identify an FPC that corresponds with some of these major deviations, and so we can tag the curves based upon those FPCs. And we'll see that a little later on. So, using the FDE platform is really straightforward. For this demonstration, we're going to focus on a step function with 100 knots. And you can see how the FPCs capture the variability. So the main FPC is saying, you know, at the beginning of the curve, that's what's driving the most variability, this deviation from the mean. And the setup is X1 as the output, counter as the input, and batch number as the ID, and then I added test type, so we can use that as some of our validation in the FPC table and the model driven multivariate control chart, and the period, so that only our historic data is what's driving the FDE fit. And so, just looking at the fit is actually a pretty important part of making sure you get correct control charting later on. I'm using this P-spline step function, 100 knots model. You can see, actually, if I use a B-spline, cubic with 20 knots, it actually looks pretty close to my P-spline.   
But from the BIC you can actually see that I should be going to more knots. If I do that, now we start to see it overfitting, really focusing on the isolated peaks, and that will cause you to have an FDE model that doesn't look right and causes you to not be as sensitive in your model driven multivariate control chart.
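For readers who want to try this on their own sensor data, here is a minimal JSL sketch of the launch just described. The role and column names (Y, X, ID; X1, Counter, Batch Number) are assumptions based on the columns Steve describes; the exact script is easiest to obtain by setting up the launch by point and click and saving the script from the red triangle menu.

// Minimal sketch, assuming columns named X1, Counter, and Batch Number as above.
Functional Data Explorer(
	Y( :X1 ),            // the sensor stream (output)
	X( :Counter ),       // the normalized timestamp (input)
	ID( :Batch Number )  // one functional curve per manufacturing run
);
// After choosing a fit (for example, the step function with 100 knots discussed
// here), save the FPC scores to a data table from the red triangle menu, then
// point the Model Driven Multivariate Control Chart platform at those score columns.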
Monday, October 12, 2020
Jordan Hiller, JMP Senior Systems Engineer, JMP Mia Stephens, JMP Principal Product Manager, JMP   For most data analysis tasks, a lot of time is spent up front — importing data and preparing it for analysis. Because we often work with data sets that are regularly updated, automating our work using scripted repeatable workflows can be a real time saver. There are three general sections in an automation script: data import, data curation, and analysis/reporting. While the tasks in the first and third sections are relatively straightforward — point-and click to achieve the desired result and capture the resulting script — data curation can be more challenging for those just starting out with scripting. In this talk we review common data preparation activities, discuss the JSL code necessary to automate the process, and provide advice for generating JSL code for data curation via point-and-click.     The Data Cleaning Script Assistant Add-in discussed in this talk can be found in the JMP File Exchange.     Auto-generated transcript...   Speaker Transcript mistep Welcome to JMP Discovery summit. I'm Mia Stephens and I'm a JMP product manager and I'm here with Jordan Hiller, who is a JMP systems engineer. And today we're going to talk about automating the data curation workflow. And we're going to split our talk into two parts. I'm going to kick us off and set the stage by talking about the analytic workflow and where data curation fits into this workflow. And then I'm going to turn it over to Jordan for the meat, the heart of this talk. We're going to talk about the need for reproducible data curation. We're going to see how to do this in JMP 15. And then you're going to get a sneak peek at some new functionality in JMP 16 for recording data curation steps and the actions that you take to prepare your data for analysis. So let's think about the analytic workflow. And here's one popular workflow. And of course, it all starts with defining what your business problem is, understanding the problem that you're trying to solve. Then you need to compile data. And of course, you can compile data from a number of different sources and pull these data in JMP. And at the end, we need to be able to share results and communicate our findings with others. Probably the most time-consuming part of this process is preparing our data for analysis or curating our data. So what exactly is data curation? Well, data curation is all about ensuring that our data are useful in driving analytics discoveries. Fundamentally, we want to be able to solve a problem with the day that we have. This is largely about data organization, data structure, and cleaning up data quality issues. If you think about problems or common problems with data, it generally falls within four buckets. We might have incorrect formatting, incomplete data, missing data, or dirty or messy data. And to talk about these types of issues and to illustrate how we identify these issues within our data, we're going to borrow from our course, STIPS And if you're not familiar with STIPS, STIPS is our free online course, Statistical Thinking for Industrial Problem Solving, and it's set up in seven discrete modules. Module 2 is all about exploratory data analysis. And because of the interactive and iterative nature of exploratory data analysis and data curation, the last lesson in this module is data preparation for analysis. And this is all about identifying quality issues within your data and steps you might take to curate your data. 
So let's talk a little bit more about the common issues. Incorrect formatting. So what do we mean by incorrect formatting? Well, this is when your data are in the wrong form or the wrong format for analysis. This can apply your data table as a whole. So, for example, you might have your data in separate columns, but for analysis, you need your data stacked in one column. This can apply to individual variables. You might have the wrong modeling type or data type or you might have date data, data on dates or times that's not formatted that way in JMP. It can also be cosmeti. You might choose to remove response variables to the beginning of the data table, rename your variables, group factors together to make it easier to find them with the data table. Incomplete data is about having a lack of data. And this can be on important variables, so you might not be capturing data on variables that can ultimately help you solve your problem or on combinations of variables. Or it could mean that you simply don't have enough observations, you don't have enough data in your data table. Missing data is when values for variables are not available. And this can take on a variety of different forms. And then finally, dirty or messy data is when you have issues with observations or variables. So your data might be incorrect. The values are simply wrong. You might have inconsistencies in terms of how people were recording data or entering data into the system. Your data might be inaccurate, might not have a capable measurement system, there might be errors or typos. The data might be obsolete. So you might have collected the information on a facility or machine that is no longer in service. It might be outdated. So the process might have changed so much since you collected the data that the data are no longer useful. The data might be censored or truncated. You might have columns that are redundant to one another. They have the same basic information content or rows that are duplicated. So dirty and messy data can take on a lot of different forms. So how do you identify potential issues? Well, when you take a look at your data, you start to identify issues. And in fact, this process is iterative and when you start to explore your data graphically, numerically, you start to see things that might be issues that you might want to fix or resolve. So a nice starting point is to start by just scanning the data table. When you scan your data table, you can see oftentimes some obvious issues. And for this example, we're going to use some data from the STIPS course called Components, and the scenario is that a company manufactures small components and they're trying to improve yield. And they've collected data on 369 batches of parts with 15 columns. So when we take a look at the data, we can see some pretty obvious issues right off the bat. If we look at the top of the data table, we look at these nice little graphs, we can see the shapes of distributions. We can see the values. So, for example, batch number, you see a histogram. And batch number is something you would think of being an identifier, rather than something that's continuous. So this can tell us that the data coded incorrectly. When we look at number scrapped, we can see the shape of the distribution. We can also see that there's a negative value there, which might not be possible. we see a histogram for process with two values, and this can tell us that we need to change the modeling type for process from continuous to nominal. 
You can see more when you when you take a look at the column panel. So, for example, batch number and part number are both coded as continuous. These are probably nominal And if you look at the data itself, you can see other issues. So, for example, humidity is something we would think of as being continuous, but you see a couple of observations that have value N/A. And because JMP see text, the column is coded as nominal, so this is something that you might want to fix. we can see some issues with supplier. There's a couple of missing values, some typographical errors. And notice, temperature, all of the dots indicate that we're missing values for temperature in these in these rows. So this is an issue that we might want to investigate further. So you identify a lot of issues just by scanning the data table, and you can identify even more potential issues when you when you visualize the data one variable at a time. A really nice starting point, and and I really like this tool, is the column viewer. The column viewer gives you numeric summaries for all of the variables that you've selected. So for example, here I'm missing some values. And you can see for temperature that we're missing 265 of the 369 values. So this is potentially a problem if we think the temperature is an important factor. We can also see potential issues with values that are recorded in the data table. So, for example, scrap rate and number scrap both have negative values. And if this isn't isn't physically possible, this is something that we might want to investigate back in the system that we collected the data in. Looking at some of the calculated statistics, we can also see other issues. So, for example, batch number and part number really should be categorical. It doesn't make sense to have the average batch number or the average part number. So this tells you you should probably go back to the data table and change your modeling type. Distributions tell us a lot about our data and potential issues. We can see the shapes of distributions, the centering, the spread. We can also see typos. Customer number here, the particular problem here is that there are four or five major customers and some smaller customers. If you're going to use customer number and and analysis, you might want to use recode to group some of those smaller customers together into maybe an other category. we have a bar chart for humidity, and this is because we have that N/A value in the column. And we might not have seen that when we scan the data table, but we can see it pretty clearly here when we look at the distribution. We can clearly see the typographical errors for supplier. And when we look at continuous variables, again, you can look at the shape, centering, and spread, but you can also see some unusual observations within these variables. So, after looking at the data one variable at a time, a natural, natural progression is to explore the data two or more variables at a time. So for example, if we look at scrap rate versus number scrap in the Graph Builder. We see an interest in pattern. So we see these these bands and it could be that there's something in our data table that helps us to explain why we're seeing this pattern. In fact, if we color by batch size, it makes sense to us. So where we have batches with 5000 parts, there's more of an opportunity for scrap parts than for batches of only 200. We can also see that there's some strange observations at the bottom. 
In fact, these are the observations that had negative values for the number of scrap and these really stand out here in this graph. And when you add a column switcher or data filter, you can add some additional dimensionality to these graphs. So I can look at pressure, for example, instead of... Well, I can look at pressure or switch to dwell. What I'm looking for here is I'm getting a sense for the general relationship between these variables and the response. And I can see that pressure looks like it has a positive relationship with scrap rate. And if I switch to dwell, I can see there's probably not much of a relationship between dwell and scrap rate or temperature. So these variables might not be as informative in solving the problem. But look at speed, speed has a negative relationship. And I've also got some unusual observations at the top that I might want to investigate. So you can learn a lot about your data just by looking at it. And of course, there are more advanced tools for exploring outliers and missing values that are really beyond the scope of this discussion. And as you get into the analyze phase, when you start analyzing your data or building models, you'll learn much much more about potential issues that you have to deal with. And the key is that as you are taking a look at your data and identifying these issues, you want to make notes of these issues. Some of them can be resolved as you're going along. So you might be able to reshape and clean your data as you proceed through the process. But you really want to make sure that you capture the steps that you take so that you can repeat the steps later if you have to repeat the analysis or if you want to repeat the analysis on new data or other data. And at this point is where I'm going to turn it over to to Jordan to talk about reproducible data curation and what this is all about. Jordan Hiller Alright thanks, Mia. That was great. And we learned what you do in JMP to accomplish data curation by point and click. Let's talk now about making that reproducible. The reason we worry about reproducibility is that your data sets get updated regularly with new data. If this was a one-time activity, we wouldn't worry too much about the point and click. But when data gets updated over and over, it is too labor-intensive to repeat the data curation by point and click each time. So it's more efficient to generate a script that performs all of your data curation steps, and you can execute that script with one click of a button and do the whole thing at once. So in addition to efficiency, it documents your process. It serves as a record of what you did. So you can refer to that later for yourself and remind yourself what you did, or for people who come after you and are responsible for this process, it's a record for them as well. For the rest of this presentation, my goal is to show you how to generate a data curation script with point and click only. We're hoping that you don't need to do any programming in order to get this done. That program code is going to be extracted and saved for you, and we'll talk a little bit about how that happens. So there are two different sections. There's what you can do now in JMP 15 to obtain a data curation script, and what you'll be doing once we release JMP 16 next year. In JMP 15 there are some data curation tasks that generate their own reusable JSL scripting code. You just execute your point and click, and then there's a technique to grab the code. I'm going to demonstrate that. 
So tools like recode, generating a new formula column with a calculation, and reshaping data tables (the tools in the Tables menu: stack, split, join, concatenate, and update) all generate their own script in JMP 15 after you execute them by point and click. There are other common tasks that do not generate their own JSL script, and to make it easier to accomplish those tasks and make them reproducible, I built the Data Cleaning Script Assistant add-in. It helps with the following tasks, mostly column stuff: changing the data types of columns, the modeling types, changing the display format, renaming, reordering, and deleting columns from your data table, and also setting column properties such as spec limits or value labels. So the Data Cleaning Script Assistant is what you'll use to assist you with those tasks in JMP 15. We are also going to give you a sneak preview of JMP 16. We're very excited about new features in the log in JMP 16; I think it's going to be called the enhanced log mode. The basic idea is that in JMP 16 you can just point and click your way through your data curation steps as usual. The JSL code that you need is generated and logged automatically. All you need to do is grab it and save it off. Super simple and really useful; I'm excited to show that to you. Here's a cheat sheet for your reference. In JMP 15, the tasks on the left are common data curation tasks; it's not an exhaustive list. The middle column shows how you accomplish them by point and click in JMP. The method for extracting the reusable script is listed on the right. I'm not going to cover everything in here, but this is for your reference later. Let's get into a demo, and I'll show how to address some of those issues that Mia identified with the components data table. I'm going to start in JMP 15. The first thing that we're going to talk about are some of those column problems: changing the data types, the modeling types, that kind of thing. Now, if you were just concerned with point and click in JMP, take humidity, for example. This is the column, you'll remember, that has some text in it, so it's mistakenly coming in as a character column. To fix that by point and click, you would ordinarily right click, get into the column info, and address those changes there. This is one of those JMP tasks that doesn't leave behind usable script in JMP 15. So for this, we're going to use the Data Cleaning Script Assistant instead. Here we go. It's in the add-ins menu because I've installed it; you can install it too. Data Cleaning Script Assistant: the tool that we need for this is Victor the cleaner. This is a graphical user interface for making changes to columns, so we can address data types and modeling types here. We can rename columns, we can change the order of columns and delete columns, and then save off the script. So let's make some changes here. For humidity, that's the one with the N/A values that caused it to come in as text. We're going to change it from a character variable to a numeric variable, and we're going to change it from nominal to continuous. We also identified that batch number needs to get changed to nominal; part number as well needs to get changed to nominal; and process, which is a number right now, should also be nominal. The facility column has only one value, fab tech, so that's not useful for me. Let's delete the facility column. I'm going to select it here by clicking on its name and click Delete. 
Here are a couple of those cosmetic changes that Mia mentioned. Scrap rate is at the end of my table. I want to move it earlier. I'm going to move it to the fourth position after customer number. So we select it and use the arrows to move it up in the order to directly after customer number. Last change that I'm going to make is I'm going to take the pressure variable and I'm going to rename it. My engineers in my organization called this column psi. So that's the name that I want to give that column. Alright, so that's all the changes that I want to make here. I have some choices to make. I get to decide whether the script gets saved to the data table itself. That would make a little script section over here in the upper left panel. Where to save it to its own window, let's save it to a script window. You can also choose whether or not the cleaning actions you specified are executed when you click ok. Let's let's keep the execution and click OK. So now you'll see all those changes are made. Things have been rearrange, column properties have changed, etc. And we have a script. We have a script to accomplish that. It's in its own window and this little program will be the basis. We're going to build our data curation script around it. Let's let's save this. I'm going to save it to my desktop. And I'm going to call this v15 curation script. changing modeling types, changing data types, renaming things, reordering things. These all came from Victor. I'm going to document this in my code. It's a good idea to leave little comments in your code so that you can read it later. I'm going to leave a note that says this is from the Victor tool. And let's say from DCSA, for data cleaning script assistant Victor. So that's a comment. The two slashes make a line in your program; that's a comment. That means that the program interpreter won't try to execute that as program code. It's recognized as just a little note and you can see it in green up there. Good idea to leave yourself little comments in your script. All right, let's move on. The next curation task that I'm going to address is a this supplier column. Mia told us how there were some problems in here that need to be addressed. We'll use the recode tool for this. Recode is one of the tools in JMP 15 that leaves behind its own script, just have to know where to get it. So let's do our recode and grab the script, right click recode. And we're going to fix these data values. I'm going to start from the red triangle. Let's start by converting all of that text to title case, that cleaned up this lower case Hersch value down here. Let's also trim extra white space, extra space characters. That cleaned up that that leading space in this Anderson. Okay. And so all the changes that you make in the recode tool are recorded in this list and you can cycle through and undo them and redo them and cycle through that history, if you like. All right, I have just a few more changes to make. I'll make the manually. Let's group together the Hershes, group together the Coxes, group together all the Andersons. Trutna and Worley are already correct. The last thing I'm going to do is address these missing values. We'll assign them to their own category of missing. That is my recode process. I'm done with what I need to do. If I were just point and clicking, I would go ahead and click recode and I'd be done. But remember, I need to get this script. So to do that, I'm going to go to the red triangle. 
Down to the script section and let's save this script to a script window. Here it is saved to its own script window and I'm just going to paste that section to the bottom of my curation script in process. So let's see. I'm just going to grab everything from here. I don't even really have to look at it. Right. I don't have to be a programmer, Control C, and just paste it at the bottom. And let's leave ourselves a note that this is from the recode red triangle. Alright, and I can close this window. I no longer need it. And save these updates to my curation scripts. So that was recode and the way that you get the code for it. All right, then the next task that we're going to address is calculating a yield. Oh, I'm sorry. What I'm going to do is I'm going to actually execute that recode. Now that I've saved the script, let's execute the recode. And there it is, the recoded supplier column. Perfect. All right, let's calculate a yield column. This is a little bit redundant, I realize we already have the scrap rate, but for purposes of discussion, let's show you how you would calculate a new column and extract its script. This is another place in JMP 15 where you can easily get the script if you know where to look. So making our yield column. New column, double click up here, rename it from column 16 to yield. And let's assign it a formula. To calculate the yield, I need to find how many good units I have in each batch, so that's going to be the batch size minus the number scrapped. So that's the number of good units I have in every batch. I'm going to divide that by the total batch size and here is my yield column. Yes, you can see that yield here is .926. Scrap rate is .074, 1 minus yield. So good. The calculation is correct. Now that I've created that yield column, let's grab its script. And here's the trick, right click, copy columns. from right click, copy columns. Paste. And there it is. Add a new column to the data table. It's called yield and here's its formula. Now, I said, you don't need to know any programming, I guess here's a very small exception. You've probably noticed that there are semicolons at the end of every step in JSL. That separates different JSL expressions and if you add something new to the bottom of your script, you're going to want to make sure that there's a semicolon in between. So I'm just typing a semicolon. The copy columns function did not add the semicolon so I have to add it manually. All right, good. So that's our yield column. The next thing I'd like to address is this. My processes are labeled 1 and 2. That's not very friendly. I want to give them more descriptive labels. We're going to call Process Number 1, production; and Process Number 2, experimental. We'll do that with value labels. Value labels are an example of column properties. There's an entire list of different column properties that you can add to a column. This is things like the units of measurement. This is like if you want to change the order of display in a graph, you can use value ordering. If you want to add control limits or spec limits or a historical sigma for your quality analysis, you can do that here as well. Alright. So all of these are column properties that we add, metadata that we add to the columns. And we're going to need to use the Data Cleaning Script Assistant to access the JSL script for adding these column properties. So here's how we do it. At first, we add the column properties, as usual, by point and click. I'm going to add my value labels. 
Process Number 1, we're going to call production. Add. Process Number 2, we're going to call experimental. And by adding that value label column property, I now get nice labels in my data table. Instead of seeing Process 1 and Process 2, I see production and experimental. Data Cleaning Script Assistant. We will choose the property copier. A little message has popped up saying that the column property script has been copied to the clipboard and then we'll go back to our script in process. from the DCSA property copier. And then paste, Control V to paste. There is the script that we need to assign those two value labels. It's done. Very good. Okay, I have one more data curation step to go through, something else that we'll need the Data Cleaning Script Assistant for. We want to consider only, let's say, the rows in this data table where vacuum is off. Right. So there are 313 of those rows. And I just want to get rid of the rows in this data table where vacuum is on. So the way you do it by point and click is is selecting those, much as I did right now, and then running the table subset command. In order to get usable code, we're going to have to use the Data Cleaning Script Assistant once again. So here's how to subset this data table to only the rows were vacuum is off. First, I'm going to use, under the row menu, under the row selection submenu, we'll use this Select Where command in order to get some reusable script for the selection. We're going to select the rows were vacuum is off. And before clicking okay to execute that selection, again, I will go to the red triangle, save script to the script window. Control A. Control C to copy that and let's paste that at again From rows. Select Where Control V. So there's the JSL code that selects the rows where vacuum is off. Now I need, one more time, need to use the Data Cleaning Script Assistant to get the selected rows. Oh, let us first actually execute the selection. There it is. Now with the row selected, we'll go up to again add ins, Data Cleaning Script Assistant, subset selected rows. I'm being prompted to name my new data table that has the subset of the data. Let's call it a vacuum, vacuum off. That's my new data table name. Click OK, another message that the subset script has been copied to the clipboard. And so we paste it to the bottom. There it is. And this is now our complete data curation script to use in JMP 15 and let's just run through what it's like to use it in practice. I'm going to close the data table that we've been working on and making corrections to doing our curation on. Let's close it and revert back to the messy state. Make sure I'm in the right version of JMP. All right. Yes, here it is, the messy data. And let's say some new rows have come in because it's a production environment and new data is coming in all the time. I need to replay my data curation workflow. run script. It performed all of those operations. Note the value labels. Note that humidity is continuous. Note that we've subset to only the rows where vacuum is off. The entire workflow is now reproducible with a JSL script. So that's what you need to keep in mind for JMP 15. Some tools you can extract the JSL script from directly; for others, you'll use my add in, the Data Cleaning Script Assistant. And now we're going to show you just how much fun and how easy this is in JMP 16. I'm not going to work through the entire workflow over again, because it would be somewhat redundant, but let's just go through some of what we went through. 
Here we are in JMP 16 and I'm going to open the log. The log looks different in JMP 16 and you're going to see some of those differences presently. Let's open the the messy components data. Here it is. And you'll notice in the log that it has a section that says I've opened the messy data table. And down here. Here is that JSL script that accomplishes what we just did. So this is like a running log that that automatically captures all of the script that you need. It's not complete yet. There are new features still being added to it. And I, and I assume that will be ongoing. But already this this enhanced log feature is very, very useful and it covers most of your data curation activities. I should also mention that, right now, what I'm showing to you is the early adopter version of JMP. It's early adopter version 5. So when we fully release the production version of JMP 16 early next year, it's probably going to look a little bit different from what you're seeing right now. Alright, so let's just continue and go through some of those data curation steps again. I won't go through the whole workflow, because it would be redundant. Let's just do some of them. I'll go through some of the things we used to need Victor for. In JMP 16 we will not need the Data Cleaning Script Assistant. We just do our point and click as usual. So, humidity, we're going to change from character to numeric and from nominal to continuous and click OK. Here's what that looks like in the structured log. It has captured that JSL. All right, back to the data table. We are going to change the modeling type of batch number and part number and process from continuous to nominal. That's done. That has also been captured in the log. We're going to delete the facility column, which has only one value, right click Delete columns. That's gone. PSI. OK, so those were all of the tool...all of the things that we did in Victor in JMP 15. Here in JMP 16, all of those are leaving behind JMP script that we can just copy and reuse down here. Beautiful. All right. Just one more step I will show you. Let's show the subset to vacuum is off. Much, much simpler here in JMP 16. All we need to do is select all the off vacuums; I don't even need to use the rows menu, I can just right click one of those offs and select matching cells, that selects the 313 rows where vacuum is off. And then, as usual, to perform the subset, to select to subset to only the selected rows, table subset and we're going to create a new table called vacuum off that has only our selected rows and it's going to keep all the columns. Here we go. That's it. We just performed all of those data curation steps. Here's what it looks like in the structured log. And now to make this a reusable, reproducible data curation script, all that we need to do is come up to the red triangle, save the script to a script window. I'm going to save this to my data...to my desktop as a v16 curation script. And here it is. Here's the whole script. So let's uh let's close all the data in JMP 16 and just show you what it's like to rerun that script. Here I am back in the home window for JMP 16. Here's my curation script. You'll notice that the first line is that open command, so I don't even need to open the data table. It's going to happen in line right here. All I need to do is, when there's new data that comes in and and this file has been updated, all that I need to do to do my data curation steps is run the script. And there it is. 
All the curation steps and the subset to the to the 313 rows. So that is using the enhanced log in JMP 16 to capture all your data curation work and change it into a reproducible script. Alright, here's that JMP 15 cheat sheet to remind you once again, these, this is what you need to know in order to extract the reusable code when you're in JMP 15 right now, and you won't have to worry about this so much once we release JMP 16 in early 2021. So to conclude, Mia showed you how you achieve data curation in JMP. It's an exploratory and iterative process where you identify problems and fix them by point and click. When your data gets updated regularly with new data, you need to automate that workflow in order to save time And also to document your process and to leave yourself a trail of breadcrumbs when you when you come back later and look at what you did. The process of automation is translating your point and click activities into a reusable JSL script. We discussed how in JMP 15 you're going to use a combination of both built in tools and tools from the Data Cleaning Script Assistant to achieve these ends. And we also gave you a sneak preview of JMP 16 and how you can use the enhanced log to just automatically passively capture your point and click data curation activities and leave behind a beautiful reusable reproducible data curation script. All right. That is our presentation, thanks very much for your time.  
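To make the finished product concrete, here is a minimal JSL sketch of the kind of curation script assembled in this talk. It is not the generated script itself: the supplier cleanup is written as a plain loop instead of the script the Recode red triangle produces, the file path and value spellings are assumptions, and details like the Value Labels syntax and Subset arguments are best copied from the property copier or the JMP 16 enhanced log rather than typed from memory.

// Sketch of a curation script for the components demo table (names assumed).
dt = Open( "components messy.jmp" );  // illustrative path

// Column fixes (the Victor / enhanced log portion)
Column( dt, "humidity" ) << Data Type( Numeric ) << Set Modeling Type( "Continuous" );
Column( dt, "batch number" ) << Set Modeling Type( "Nominal" );
Column( dt, "part number" ) << Set Modeling Type( "Nominal" );
Column( dt, "process" ) << Set Modeling Type( "Nominal" );
dt << Delete Columns( "facility" );
Column( dt, "pressure" ) << Set Name( "psi" );

// Supplier cleanup (a stand-in for the Recode-generated script)
For Each Row(
	:supplier = Trim( :supplier );
	:supplier = Match( Lowercase( :supplier ),
		"hersh", "Hersh",
		"cox", "Cox",
		"anderson", "Anderson",
		"", "Missing",
		:supplier  // leave already-correct values such as Trutna and Worley alone
	);
);

// New calculated column
dt << New Column( "yield", Numeric, "Continuous",
	Formula( (:batch size - :number scrapped) / :batch size )
);

// Value labels for process (check against the script the property copier produces)
Column( dt, "process" ) << Set Property( "Value Labels", {1 = "Production", 2 = "Experimental"} );

// Keep only the rows where vacuum is off
dt << Select Where( :vacuum == "Off" );
vacuumOff = dt << Subset( Selected Rows( 1 ) );
vacuumOff << Set Name( "vacuum off" );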
Sports analytics tools are increasingly used to help athletes enhance their skills and body strength to perform better and prevent injury. ACL tears are among the most common and dangerous injuries in basketball. This injury occurs most frequently in jumping, landing, and pivoting, due to the rapid change of direction and/or sudden deceleration involved. Recovering from an ACL injury is a brutal process; it can take months, even years, and can significantly decrease the player's performance after recovery. The goal of this project is to find the relationship between fatigue and different angle measurements in the hips, knees, and back, as well as the force applied to the ground, to minimize ACL injury risk. Seven different sensors were attached to a test subject while he performed the countermovement jump for 10 trials on each leg, before and after 2 hours of vigorous exercise. The countermovement jump was chosen because it assesses ACL injury risk quite well through force and flexion of different body parts. Several statistical tools, such as Control Chart Builder, multivariate correlation, and variable clustering, were used to uncover general differences between the before- and after-fatigue states for each exercise (fatigue being related to an increased ACL injury risk). The JMP Multivariate SPC platform provided further biomechanical, time-specific information about how joint flexions differ before and after fatigue at specific time points, giving a more in-depth understanding of how the different joint contributions change when fatigued. The end-to-end experimental and analysis approach can be extended across different sports to prevent injury.   (view in My Videos)   Auto-generated transcript:  
Ruth Hummel, JMP Academic Ambassador, SAS Rob Carver, Professor Emeritus, Stonehill College / Brandeis University   Statistics educators have long recognized the value of projects and case studies as a way to integrate the topics in a course. Whether introducing novice students to statistical reasoning or training employees in analytic techniques, it is valuable for students to learn that analysis occurs within the context of a larger process that should follow a predictable workflow. In this presentation, we’ll demonstrate the JMP Project tool to support each stage of an analysis of Airbnb listings data. Using Journals, Graph Builder, Query Builder and many other JMP tools within the JMP Project environment, students learn to document the process. The process looks like this: Ask a question. Specify the data needs and analysis plan. Get the data. Clean the data. Do the analysis. Tell your story. We do our students a great favor by teaching a reliable workflow, so that they begin to follow the logic of statistical thinking and develop good habits of mind. Without the workflow orientation, a statistics course looks like a series of unconnected and unmotivated techniques. When students adopt a project workflow perspective, the pieces come together in an exciting way.   Auto-generated transcript...   Speaker Transcript So welcome everyone. My name is ... Ambassador with JMP. I am now a retired professor of Business ... between a student and a professor working on a project. ... engage students in statistical reasoning, teach that ... to that, current thinking is that students should be learning about reproducible workflows, ... elementary data management. And, again, viewing statistics as ... wanted to join you today on this virtual call. Thanks for having ... and specifically in Manhattan, and you'd asked us so so you ... And we chose to do the Airbnb renter perspective. So we're ... expensive. So we started filling out...you gave us ... separate issue, from your main focus of finding a place in ... you get...if you get through the first three questions, you've ... know, is there a part of Manhattan, you're interested in? ... repository that you sent us to. And we downloaded the really ... thing we found, there were like four columns in this data set ... figured out so that was this one, the host neighborhood. So ... figured out that the first two just have tons of little tiny ... Manhattan. So we selected Manhattan. And then when we had ... that and then that's how we got our Manhattan listings. So ... data is that you run into these issues like why are there four ... restricted it to Manhattan, I'll go back and clean up some ... data will describe everything we did to get the data, we'll talk ... know I'm supposed to combine them based on zip, the zip code, ... columns, it's just hard to find the ... them, so we knew we had to clean that up. All right, we also had ... journal of notes. In order to clean this up, we use the recode ... Exactly. Cool. Okay, so we we did the cleanup ... Manhattan tax data has this zip code. So I have this zip code ... day of class, when we talked about data types. 
And notice in the ... the...analyze the distribution of that column, it'll make a funny ... Manhattan doesn't really tell you a thing. But the zip code clean data in ... just a label, an identifier, and more to the point, when you want to join or merge ... important. It's not just an abstract idea. You can't merge ... nominal was the modeling type, we just made sure. ... about the main table is the listings. I want to keep ... to combine it with Manhattan tax data. Yeah. Then what? Then we need to ... tell it that the column called zip clean, zip code clean... Almost. There we go. And the column called zip, which ... Airbnb listing and match it up with anything in ... them in table every row, whether it matches with the other or ... main table, and then only the stuff that overlaps from the second ... another name like, Air BnB IRS or something? Yeah, it's a lot ... do one more thing because I noticed these are just data tables scattered around ... running. Okay. So I'll save this data table. Now what? And really, this is the data ... anything else, before we lose track of where we are, let's ... or Oak Team? And then part of the idea of a project ... thing. So if you grab, I would say, take the ... two original data sets, and then my final merged. Okay. Now ... them as tabs. And as you generate graphs and ... even when I have it in these tabs. Okay, that's really cool. ... right, go Oak Team. Well, hi, Dr. Carver, thanks so ... you would just glance at some of these things, and let me know if ... we used Graph Builder to look at the price per neighborhood. And ... help it be a little easier to compare between them. So we kind ... have a lot of experience with New York City. So we plotted ... stand in front of the UN and take a picture with all the ... saying in Gramercy Park or Murray Hill. If we look back at the ... thought we should expand our search beyond that neighborhood to ... just plotted what the averages were for the neighborhoods but ... the modeling, and to model the prediction. So if we could put ... expected price. We started building a model and what we've ... factors. And so then when we put those factors into just a ... more, some of the fit statistics you've told us about in class. ... but mostly it's a cloud around that residual zero line. So ... which was way bigger than any of our other models. So we know ... reasons we use real data. Sometimes, this is real. This is ... looking? Like this is residual values. ... is good. Ah, cool. Cool. Okay, so I'll look for ... is sort of how we're answering our few important questions. And ... was really difficult to clean the data and to join the data. ... wanted to demonstrate how JMP in combination with a real world ... Number one in a real project, scoping is important. We want to ... hope to bring to the to the group. 
Pitfall number two, it's vital to explore the ... the area of linking data combining data from multiple ... recoding and making sure that linkable ... reproducible research is vital, especially in a team context, especially for projects that may ... habits of guaranteeing reproducibility. And finally, we hope you notice that in these ... on the computation and interpretation falls by the ...  
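For JSL users, the merge the team is describing can be captured in a short script. This is a minimal sketch only: the table and column names (Manhattan Listings, Manhattan Tax Data, zip code clean, zip) are taken from the conversation and are assumptions about the students' actual files, and the Airbnb IRS output name follows Dr. Carver's suggestion.

// Minimal sketch of the merge discussed above; names are assumed.
listings = Data Table( "Manhattan Listings" );
tax      = Data Table( "Manhattan Tax Data" );

// The key columns must share a data type, and, as discussed, nominal is the
// right modeling type for zip codes, since they are labels rather than amounts.
Column( listings, "zip code clean" ) << Set Modeling Type( "Nominal" );
Column( tax, "zip" ) << Set Modeling Type( "Nominal" );

// Keep every listing row and bring in only the tax rows that match (a left join).
merged = listings << Join(
	With( tax ),
	By Matching Columns( :zip code clean = :zip ),
	Include Nonmatches( 1, 0 ),  // all rows from listings, only matches from tax
	Drop Multiples( 0, 0 )
);
merged << Set Name( "Airbnb IRS" );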
Nascif Neto, Principal Software Developer, SAS Institute (JMP Division) Lisa Grossman, Associate Test Engineer, SAS Institute (JMP division)   The JMP Hover Label extensions introduced in JMP 15 go beyond traditional details-on-demand functionality to enable exciting new possibilities. Until now, hover labels exposed a limited set of information derived from the current graph and the underlying visual element, with limited customization available through the use of label column properties. This presentation shows how the new extensions let users implement not only full hover label content customization but also new exploratory patterns and integration workflows. We will explore the high-level commands that support the effortless visual augmentation of hover labels by means of dynamic data visualization thumbnails, providing the starting point for exploratory workflows known as data drilling or drill down. We will then look into the underlying low-level infrastructure that allows power users to control and refine these new workflows using JMP Scripting Language extension points. We will see examples of "drill out" integrations with external systems as well as how to build an add-in that displays multiple images in a single hover label.     Auto-generated transcript...   Speaker Transcript Nascif Abousalh-Neto Hello and welcome. This is a our JMP discovery presentation from details on demand to wandering workflows, getting to know JMP hover label extensions. Before we start on the gory details, we always like to talk about the purpose of a new feature introduced in JMP. So in this case, we're talking about hover labels extensions. And why do we even have hover labels in the first place. Well, I always like to go back to the visual information seeking mantra from Ben Shneiderman, which is he tried to synthesize overview first, zoom and filter, and then details on demand. Well hover labels are all about details on demand. So let's say I'm looking at this bar chart on this new data set and in JMP, up to JMP 14, as you hover over a particular bar in your bar chart, it's going to pop up a window with a little bit of textual data about what you're seeing here. Right. So you have labeled information, calculated values, just text, very simple. Gives you your details on demand. But what if you could decorate this with visualizations as well. So for example, if you're looking at that aggregated value, you might want to see the distribution of the values that got that particular calculation. Or you might want to see a breakdown of the values behind the that aggregated value. This is what we're gonna let you know with this new visualization, with this new feature. But on top of that, it's the famous, wait, there is more. This new visualization basically allows you to go on and start the visual exploratory workflow. If you click on it, you can open it up in its own window, which allows you to which can also have its visualization, which you can also click and get even more detail. And so you go down that technique called the drill down and eventually, you might get to a point where you're decorating a particular observation with information you're getting from maybe even Wikipedia in that case. Not going to go into a lot of details. We're going to learn a lot about all that pretty soon. But first, I also wanted to talk a little bit about the design decisions behind the implementation of this feature. 
Because we wanted to have something that was very easy to use that didn't require programming or, you know, lots of time reading the manual and we knew that would satisfy 80% of the use cases. But for those 20% of really advanced use cases or for those customers that know their JSL and they just want to push the envelope on what JMP can do, we also want to make available, something that you could do through programming. But basically, your top of the context of ??? on those visual elements. So we decided to go with architectural pattern called plumbing and porcelain, and that's something we got to git source code control application, which is basically you have a layer that is very rich and because it's very rich, very complex, which gives you access to all that information and allows you to customize things that are going to happen as far as generating the visualization or what happens when you click on that visualization And on top of that, we built a layer that is more limited, its purpose driven, but it's very, very easy to do and requires no coding at all. So that's the porcelain layer. And that's the one that Lisa is going to be talking about now. Up to you. Lisa. I'm going to stop sharing and Lisa is going to take over. Lisa Grossman Okay so we are going to take a high level look at some of the features and what kind of customization system, make the graphic ??? So, let us first go through some of the basics. So by default when you hover over a data point or an element in your graph. you see information displayed for the X and Y roles used in the graph, as well as any drop down roles such as overlay and if you choose to manually manually label a column in the data table, that will also appear as a hover label. So here we have an example of a label, the expression column tha contains an image. And so we can see that image is then populated in hover label in the back. And to add a graphlet to your hover label, you have the option of selecting some predefined graphlet presets, which you can access via the right mouse menu under hover label. Now these presets have dynamic graph role assignments and derive their roles from variables used in your graph. And presets are also preconfigured to be recursive and that will support drilling down. And for preset graphlets that have categorical columns, you can specify which columns to filter by, by using the next in hierarchy column property that's in your data table. And so now I'm going to demo real quick how to make a graphlet preset. So I'm going to bring up our penguins data table that we're going to be using. And I'm going to open up Graph Builder. And I'm going to make a bar chart here. And then right clicking under hover label, you can see that there is a list of different presets to choose from, but we're going to use histogram for this example. So now that we have set our preset, if you hover over a bar, now you can see that there's a histogram preset that pops up in your hover label. And it's also... it is also filtered based on our bar here, which is the island Biscoe. And the great thing about graphlets is if I hover over this bar, I can see another graphlet. And so now you can easily compare these two graphlets to see the distribution of bill lengths for both the islands Dream and Biscoe. 
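As an aside for scripters: what the preset builds under the hood is essentially a small Graph Builder script, as the hover label editor will show in the plumbing section below. A minimal sketch of the kind of graph you could also build yourself and reuse through Save Script and Paste Graphlet, assuming the penguins table columns bill length and island, looks something like this:

// A minimal sketch only; column names assume the penguins demo table.
Graph Builder(
	Size( 400, 300 ),
	Variables( X( :bill length ) ),
	Elements( Histogram( X, Legend( 1 ) ) ),
	// When launched from a hover label, the graphlet also carries a local data
	// filter for the element you hovered over, for example island == "Biscoe".
	Local Data Filter(
		Add Filter( columns( :island ), Where( :island == "Biscoe" ) )
	)
);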
And then you can take it a step further and click on the thumbnail of the graphlet and it will launch a Graph Builder instance in its own window and it's totally interactive so you can open up the control panel of Graph Builder and and customize this graph further. And then as you can see, there's a local data filter already applied to this graph, and it is filtered by Biscoe, which is the thumbnail I launched. So, that is how the graphlets are filtered by. And then one last thing is that if I hover over these these histogram bars, you can see that the histogram graphlet continues on, so that shows how these graphlet presets are pre configured to be recursive. So closing these and returning back to our PowerPoint. So I only showed the example of the histogram preset but there are a number that you can go and play with. So these graphlet presets help us answer the question of what is behind an aggregated visual element. So the scatter plot preset shows you the exact values, whereas the histogram, box plot or heat map presets will show you a distribution of your values. And if you wanted to break down your graph and look at your graph with another category, then you might be interested in using a bar, pie, tree map, or a line preset. And if you'd like to examine your raw data of the table, then you can use the tabulate preset. But if you'd like to further customize your graphlet, you do have the option to do so with paste graphlets. And so paste graphlet, you can easily achieve with three easy steps. So you would first build a graph that you want to use as a graphlet. And we do want to note here that it does not have to be one built from Graph Builder. And then from the little red triangle menu, you can save the script of the graph to your clipboard. And then returning to your base graph or top graph, you can right click and under hover label, there will be a paste graphlet option. And that's really all there is to it. And we want to also note that paste graphlet will have static role assignments and will not be recursive since you are creating these graph lets to drill down one level at a time. But if you'd like to create a visualization with multiple drill downs, then you can, you have the option to do so by nesting paste graphlet operations together, starting from the bottom layer going up to your top or base later. So, and this is what we would consider our Russian doll example, and I can demo how you can achieve that. So we'll pull up our penguins data table again. And we'll start with the Graph Builder and we'll we're going to start building our very top layer for this. So let's go ahead build that bar chart. And then let's go on to build our very second...our second layer. So let's do a pie with species. And then for our very last layer, let's do a scatter plot. OK, so now I have all three layers of our...of what we will use to nest and so I will go and save the script of the scatter plot to my clipboard. And then on the pie, I right click and paste graphlet. And so now when you hover, you can see that the scatter plot is in there and it is filtered by the species in this pie. So I'm going to close this just for clarity and now we can go ahead and do the same thing to the pie, save the script, because it already has the scatter plot embedded. So save that to our clipboard, go over to our bar, do the same thing to paste graphlet. And now we have... we have a workflow that is... 
that you can click and hover over and you can see all three layers that pop up when you're hovering over this bar. So that's how you would do your nested paste graphlets. And so we do want to point out that there are some JMP analytical platforms that already have pre integrated graphlets available. So these platforms include the functional data explorer, process screening, principal components, and multivariate control charts, and process capabilities. And we want to go ahead and quickly show you an example using the principal components. Lost my mouse. There we go. So I launch our table again and open up principal components. And let's do run this analysis. And if I open up the outlier analysis and hover over one of these points, boom, I can see that these graphlets are already embedded into this platform. So we highly suggest that you go and take a look at these platforms and play around with it and see what you like. And so that was a brief overview of some quick customizations you can do with hover label graphlets and I'm going to pass this presentation back to Nascif so he can move you through the plumbing that goes behind all of these features. Nascif Abousalh-Neto Thank you, Lisa. Okay, let's go back to my screen here. And we... I think I'll just go very quickly over her slides and we're back to plumbing, and, oh my god, what is that? This is the ugly stuff that's under the sink. But that's where you have all the tubing and you can make things really rock, and let me show them by giving a quick demo as well. So here Lisa was showing you the the histogram... the hover label presets that you have available, but you can also click here and launch the hover label editor and this is the guy where you have access to your JSL extension points, which is where you make, which is how those visualizations are created. Basically what happens is that when you hover over, JMP is gone to evaluate the JSL block and capture that as an in a thumbnail and put that thumbnail inside your hover label. That's pretty much, in a nutshell, how it goes. And the presets that you also have available here in the hover label, right, they basically are called generators. So if I click here on my preset and I go all the way down, you can see that it's generating the Graph Builder using the histogram element. That's how it does its trick. Click is a script that is gonna react to when you click on that thumbnail, but by default (and usually people stick with the default), if you don't have anything here, it's just, just gonna launch this on its own window, instead of capturing and scale down a little image. In here on the left you can see two other extension points we haven't really talked much about yet. But we will very soon. So I don't want to get ahead of myself. So, So let's talk about those extension points. So we created not just one but three extension points in JMP 15. And they are, they're going to allow you to edit and do different functionality to different areas of your hover label. So textlets, right, so let's say for example you wanted to give a presentation after you do your analysis, but you want to use the result of that analysis and present it to an executive in your company or maybe we've an end customer that wants a little bit more of detail in in a way that they can read, but you would like make that more distinct. So textlet allows you to do that. 
But since you're interfacing with data, you also want that to be not a fixed block of text, but something that's dynamic, based on the data you're hovering over. So to define a textlet, you go back to that hover label editor, and you can define JSL variables or not. But if you want it to be dynamic, typically what you do is define a variable that's going to have the content that you want to display, and then you decorate that value using HTML notation. So here is how you can select the font, you can select background colors, foreground colors, you can make it italic, and basically make the text as pretty or as rich as you need to. Then the next hover label extension is the one we call gridlet. If you remember the original, or the current, JMP hover label, it's basically a grid of name-value pairs. To the left you have names, the equivalent of your column names, and to the right you have the values, which might be just a column cell for a particular row if it's a marked plot. But if it's aggregated, like a bar chart, this is going to be a mean or an average or median, something like that. The default content here, like Lisa said before, is derived both from whatever labeled columns you have in your data table and also from whatever role assignments you have in your graph. So if it's a bar chart, you have your x, you have your y, you might have an overlay variable, and everything that at some point contributes to the creation of that visual element. Well, with gridlets you now have pretty much total control of that little display. You can remove entries. It's very common that people don't want to see the very first row, which has the label or the number of rows; some people find that redundant, and they can take it out. You can add something that is completely under your control. Basically, it's going to evaluate a JSL script to figure out what you want to display there. One use case I found was when someone wanted an aggregated value for a column that was not in the visualization; some people call those hidden columns or hidden calculations. Now you can do that, and have an aggregation over the same rows as the rest of the values being displayed in that visualization. You can rename. We usually add the summary statistic to the left of anything that comes from a calculated Y column. If you don't like that, now you can remove it or replace it with something else. And then you can do details like changing the numeric precision, or making text bold or italic or red; for example, you can make it red and bold if the value is above a particular threshold. So you can have something where, as I move over here, if the value is over the average of my data, I make it red and bold so I can call attention to that. And that will be automatic for you. And finally, graphlets. We believe that's going to be the most useful and most used one. It certainly draws more attention, because you have a whole image inside your tooltip. We've been seeing examples with data visualizations, but it's an image, so it can be a picture as well. It can be something you're downloading from the internet on the fly by making a web call. That's how I got the image of this little penguin; it's coming straight from Wikipedia. As you hover over, we download it, scale it, and put it here.
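As a rough sketch of the textlet idea just described (illustrative only: the value here is hard-coded rather than pulled from the hover label execution context, and the supported markup tags are the ones documented for JMP 15):

v = 4200;                                   // placeholder for an aggregated value from the hovered element
msg = "Mean body mass: <b>" || Char( Round( v, 0 ) ) || " g</b>";
If( v > 4000,
    msg = "<font color=\!"red\!">" || msg || "</font>"   // emphasize values above a threshold
);
msg;                                        // the last value evaluated becomes the hover label text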
Or you can, for example (that's a very recent use case): someone had a database of pictures in the laboratory, pictures of the samples they were analyzing, and they didn't want to put them in the data table because the data table would be too large. Well, now you can just take a column, turn that column into a file name, read from the file name, and boom, display that inside your tooltip. So when you're doing your analysis, you know exactly what you're looking at. And just like gridlets, we're talking about clickable content. So again, for example, I showed that when I click on this little thumbnail here, I can open a web page. So you can imagine that even as a way to integrate back with your company. Let's say you have web services that are supported in your company, and you want to, at some point, maybe click on an image to register or capture some data by making a web call to that web service. Now that's something you can do. So we talk about drill in and drill down; I like to call that a drill out. That's basically JMP talking to the outside world using data content from your exploration. So let's look at those things in a little bit more detail. Those visualizations that we see here inside the hover label (and this applies to any visualization) are actually a combination of a graph definition and a data subset. So in Graph Builder, for example, you'll say, I want a bar chart with island on my x axis, and on my y axis I want to show the average body mass of the penguins on that island. Fine. How do you translate that to a graphlet? Well, basically, when you select the preset (or when you write your own code if you want to), the preset is going to use a graph template. So some of the things are going to be predefined, like the bar element, although if you're writing your own you could even say, I want to change my visualization depending on my context. That's totally possible. And you're going to fill that template with graph roles and values and data table metadata. So, for example, let's say I have a preset for doing that categorical drill down. I know it's going to be a bar chart; I don't know what's going to be on my y or my x axis. That's going to come from the current state of my baseline graph. For example, I'm looking at island, so I know I want to do a bar chart of another category. That's when the Next in Hierarchy column property comes into play. I'm making that decision on the fly, based on the information the user is giving me and the graph that's being used. For example, if you look here at the histogram: the base graph was a bar chart of island by body mass, so this is a histogram of body mass as well. If I come here to the graph and change this column and then go back and hover, this guy is going to reflect my new choice. That's this idea of getting my context and having a dynamic graph. The other part of the definition of a visualization is the data subset. And we have a very similar pattern. LDF is local data filter; that's a feature we already had in JMP, of course. And basically, I have a template that is filled out from my graph roles here. So if it was a bar chart, my x variable is going to be a grouping variable of island.
I know I wanted to have a local data filter of island and that I want to select this particular value so that it matches the value I was hovering over. This happens both when you're creating the hover label and when you're launching the hover label, but when you create a hover label, this is invisible. We basically create a hidden window to capture that window so you'll never see that guy. But when you launch it, the local data filter is there and as Lisa has shown, you can interact with it and even make changes to that so that you can progress your your, your visual exploration on your own terms. So I've been talking about context, a lot. This is actually something that you should need to develop your own graphlets, you need to be familiar with. We call that hover label execution context. You're going to have information about that in our documentation and it's basically if you remember JSL, it's a local block. We've lots of local variables that we defined for you and those those variables capture all kinds of information that might be useful for someone to find in the graphlet or a gridlet or a textlet. It's available for all of those extension points. So typically, they're going to be variables that start with a nonpercent... Not a nonpercent...I'm sorry. To prevent collisions with your data table column names, so it's kinda like reserved names in a way. But basically, you'll see here that that's that's code that comes from one of our precepts. By the way, that code is available to you through the hover label editor, so you can study and see how it goes. Here we're trying to find a new column. To using our new graph, it's that idea of it being dynamic and to be reactive to the context. And this function is going to look into the data table for that metadata. My...a list of measurement columns. So if the baseline is looking at body mass, body mass is going to be here in this value and at a list of my groupings. So if it was a bar chart of island by body mass, we're going to have islands here. So those are lists of column names. And then we also have any of numeric values, anything that's calculated is going to be available to you. Maybe you want to, like I said, maybe you want to make a logical decision based on the value being above or below the threshold so that you can color a particular line red or make it bold, right. You're going to use values that we provide to you. We also provide something that allow you to go back to the data. In fact, to the data table and fetch data by yourself like the row index of the first row on the list of roles that your visual element discovering, that's available to you as well. And then the other even more data, like for example the where clause that corresponds to that local data filter that you're executing in the context of. And the drill depth, let's say, that allows you to keep track of how many times you have gone on that thumbnail and open a new visualization and so on. So for example, when we're talking about recursive visualizations, every recursion needs an exit condition, right. So here, for example, is how you calculate the exit condition of one of your presets. If I don't have anything more to to show, I return empty, means no visualization. Or if I don't have...if I only show you one value, right, or any of my drill depth is greater than one, meaning I was drilling until I got to a point where just only one value to show in some visualizations doesn't make sense. So I can return empty as well. 
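To make that exit test concrete, here is a sketch of how it might look near the top of a graphlet script. In a real graphlet the three values would come from the underscore-prefixed hover label execution context variables; here they are hard-coded placeholders so the logic can be read and run on its own, and the variable names are illustrative rather than the ones JMP actually defines.

nextColName = "species";         // next drill-down column (from Next in Hierarchy); "" means none left
groupValues = {"Biscoe"};        // the category values behind the hovered element
drillDepth  = 1;                 // how many levels down we already are
If( nextColName == "" | (N Items( groupValues ) <= 1 & drillDepth > 1),
    Empty(),                     // returning Empty() suppresses the graphlet at this level
    Current Data Table() << Graph Builder(
        Variables( X( Column( nextColName ) ) ),
        Elements( Bar( X, Legend( 1 ) ) )
    )
);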
That's just an example of the kinds of decisions that you can make your code using the hover label execution context. Now, I just wanted to kind of gives you a visual representation of how all those things come together again using the preset example. When you're selecting a preset, you're basically selecting the graph template, which is going to have roles that are going to be fulfilled from the graph roles that are in your hover label execution context. And so that's your data, your graph definition. And that date graph definition is going to be combined with the subset of observations resulting from the, the local data filter that was also created for you behind the scenes, based on the visual element you're hovering over. So when you put those things together, you have a hover label, we have a graphlet inside. And if you click on that graphlet, it launches that same definition in here and it makes the, the local data filter feasible as well. When, like Lisa was saying, this is a fully featured life visualization, not just an image, you can make changes to this guy to continue your exploration. So now we're talking, you should think in terms of, okay, now I have a feature that creates visualizations for me and allow me to create one visualization from another. I'm basically creating a visual workflow. And it's kind of like I have a Google Assistant or an Alexa in JMP, in the sense that I can...JMP is making me go faster by creating, doing visualizations on my behalf. And they might be, also they might be not, just an exploration, right. If you're happy with them, they just keep going. If you're not happy with them, you have two choices and maybe it's easier if I just show it to you. So like I was saying, I come here, I select a preset. Let's say I'm going to get a categoric one bar chart. So that gives me a breakdown on the next level. Right. And if I'm happy with that, that's great. Maybe I can launch this guy. Maybe I can learn to, whoops... Maybe I can launch another one for this feature. At the pie charts, they're more colorful. I think they look better in that particular case. But see, now I can even do things like comparing those two bar charts side by side. And let's...but let's say that if I keep doing that and it isn't a busy chart and I keep creating visualizations, I might end up with lots of windows, right. So that's why we created some modifiers to...(you're not supposed to do that, my friend.) You can just click. That's the default action, it will just open another window. If you alt-click, it launches on the previous last window. And if you control-click it launches in place. What do I mean by that? So, I open this window and I launched to this this graphlet and then I launched to this graphlet. So let's say this is Dream and Biscoe and Dream and Biscoe. Now I want to look at Torgersen as well. Right. And I want to open it. But if I just click it opens on its own window. If I alt-click, (Oh, because that's the last one. I hope. I'm sorry. So let me close this one.) Now if I go back here in I alt-click on this guy. See, it replaced the content of the last window I had open. So this way I can still compare with visualizations, which I think it's a very important scenario. It's a very important usage of this kind of visual workflow. Right. But I can kind of keep things under control. And I don't just have to keep opening window after window. And the maximum, the real top window management feature is if I do a control-click because it replaces the window. 
And then, then it's a really a real drill down. I'm just going on the same window down and down and now it's like okay, but what if I want to come back. Or if you want to come back and just undo. So you can explore with no fear, not going to lose anything. Even better though, even the windows you launch, they have the baseline graph built in on the bottom of the undo stack. So I can come here and do an undo and I go back to the visualizations that were here before. So I can drill down, come back, branch, you can do all kinds of stuff. And let's remember, that was just with one preset. Let's do something kind of crazy here. We've been talking, we've been looking at very simple visualizations. But this whole idea actually works for pretty much any platform in JMP. So let's say I want to do a fit of x by y. And I want to figure out how...now, I'm starting to do real analytics. How those guys fit within the selection of the species. Right. So I have this nice graph here. So I'm going to do that paste graphlet trick and save it to the clipboard. And I'm going to paste it to the graphlet now. So as you can see, we can use that same idea of creating a context and apply that to my, to my analysis as well. And again, I can click on those guys here and it's going to launch the platform. As long as the platform supports local data filters, (I should have given this ???), this approach works as well. So it's for visualizations but in...since in JMP, we have this spectrum where the analytics also have a visual component, so works with our analytics as well. And I also wanted to show here on that drill down. This is my ??? script. So I have the drill down with presets all the way, and I just wanted to go to the the bottom one where I had the one that I decorated with this little cute penguin. But what I wanted to show you is actually back on the hover label editor. Basically what I'm doing here, I'm reading a small JSL library that I created. I'm going to talk about that soon, right, and now I can use this logic to go and fetch visualizations. In this case I'm fetching it from Wikipedia using a web call. And that visualization comes in and is displayed on my visualization. It's a model dialogue. But also my click script is a little bit different. It's not just launching the guy; it's making a call to this web functionality after getting a URL, using that same library as well. So what exactly is it going to do? So when I click on the guy, it opens a web page with a URL derived from data from my visualization and this can be pretty much anything JSL can do. I just want to give us an example of how this also enables you integration with other systems, even outside of JMP. Maybe I want to start a new process. I don't know. All kinds of possibilities. That I apologize. So So there are two customized...advanced customization examples, I should say, that illustrate how you can use graphlets as a an extensible framework. They're both on the JMP Community, you can click here if you get the slides, but one is called the label viewer. I am sorry. And basically what it does is that when you hover over a particular aggregated graph, it finds all the images on the graph...on the data table associated with those rows and creates one image. And that's something customers have asked for a while. I don't want to see just one guy. I want to see if you have more of them, all of them. Or, if possible, right. So when you actually use this extension, and you click on...actually no, I don't have it installed so... 
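As a rough sketch of the "drill out" click script described a little earlier (the one that opens a web page whose URL is built from the hovered data), with a hard-coded placeholder standing in for the value that would really come from the hover label execution context:

speciesName = "Gentoo";                                   // placeholder for the hovered category value
Web( "https://en.wikipedia.org/wiki/" || speciesName );   // Web() opens the page in the default browser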
And the wiki reader, which was the other one, is the one I just showed to you. But what I was saying is that when you click and launch on this particular image, it launches a small application that allows you to page through the different images in your data table, and you have a filter that you can control and all that. This is one that was completely done in JSL on top of this framework. So just to close up, what did we learn today? I hope you found that it's now very easy to add visualizations; you can visualize your visualizations, if you will. It's very easy to add those data visualization extensions using the porcelain features. You have not just richer detail in your thumbnails, but a new exploratory visual workflow, which you can customize to meet your needs by using either paste graphlet, if you want something easy to do, or JSL using the hover label editor. We're both very curious to see how you are going to use this in the field. So if you come up with some interesting examples, please call us back. Send us a screenshot in the JMP Community and let us know. That's all we have today. Thank you very much. And when we give this presentation, we're going to be here for Q&A. So, thank you.
Jeremy Ash, JMP Analytics Software Tester, JMP   The Model Driven Multivariate Control Chart (MDMVCC) platform enables users to build control charts based on PCA or PLS models. These can be used for fault detection and diagnosis of high dimensional data sets. We demonstrate MDMVCC monitoring of a PLS model using the simulation of a real world industrial chemical process — the Tennessee Eastman Process. During the simulation, quality and process variables are measured as a chemical reactor produces liquid products from gaseous reactants. We demonstrate fault diagnosis in an offline setting. This often involves switching between multivariate control charts, univariate control charts, and diagnostic plots. MDMVCC provides a user-friendly way to move between these plots. Next, we demonstrate how MDMVCC can perform online monitoring by connecting JMP to an external database. Measuring product quality variables often involves a time delay before measurements are available, which can delay fault detection substantially. When MDMVCC monitors a PLS model, the variation of product quality variables is monitored as a function of process variables. Since process variables are often more readily available, this can aid in the early detection of faults. Example Files Download and extract streaming_example.zip.  There is a README file with some additional setup instructions that you will need to perform before following along with the example in the video.  There are also additional fault diagnosis examples provided. Message me on the community if you find any issues or have any questions.       Auto-generated transcript...   Speaker Transcript Jeremy Ash Hello, I'm Jeremy Ash. I'm a statistician in JMP R&D. My job primarily consists of testing the multivariate statistics platforms in JMP, but I also help research and evaluate methodology, and today I'm going to be analyzing the Tennessee Eastman process using some statistical process control methods in JMP. I'm going to be paying particular attention to the Model Driven Multivariate Control Chart platform, which is a new addition to JMP. I'm really excited about this platform, and these data provided a new opportunity to showcase some of its features. First, I'm assuming some knowledge of statistical process control in this talk. The main thing you need to know about is control charts. If you're not familiar with these, they are charts used to monitor complex industrial systems to determine when they deviate from normal operating conditions. I'm not going to have much time to go into the methodology behind Model Driven Multivariate Control Chart, so I'll refer you to these other great talks, which are freely available, for more details. I should also mention that Jim Finding was the primary developer of Model Driven Multivariate Control Chart, in collaboration with Chris Gotwalt, and Tanya Malden and I were testers. So the focus of this talk will be using multivariate control charts to monitor a real-world chemical process. Another novel aspect of this talk will be using control charts for online process monitoring; this means we'll be monitoring data continuously as it's added to a database and detecting faults in real time. So I'm going to start with the obligatory slide on the advantages of multivariate control charts. So why not use univariate control charts? There are a number of excellent options in JMP. Univariate control charts are excellent tools for analyzing a few variables at a time.
However, quality control data sets are often high dimensional, and the number of charts that you need to look at can quickly become overwhelming. So multivariate control charts summarize a high-dimensional process in just a few charts, and that's a key advantage. But that's not to say that univariate control charts aren't useful in this setting; you'll see throughout the talk that fault diagnosis often involves switching between multivariate and univariate control charts. Multivariate control charts give you a sense of the overall health of a process, while univariate control charts allow you to look at specific aspects. So the information is complementary, and one of the main goals of Model Driven Multivariate Control Chart was to provide some tools that make it easy to switch between those two types of charts. One disadvantage of the univariate control chart is that observations can appear to be in control when they're actually out of control in the multivariate sense. So I have two IR control charts, for oil and density, and these two observations in red are in control, but oil and density are highly correlated, and these observations are outliers in the multivariate sense; in particular, observation 51 severely violates the correlation structure. So multivariate control charts can pick up on these types of outliers when univariate control charts can't. Model Driven Multivariate Control Chart uses projection methods to construct its control charts. I'm going to start by explaining PCA, because it's easy to build up from there. PCA reduces the dimensionality of your process variables by projecting into a low-dimensional space. This is shown in the picture to the right: we have p process variables and n observations, and we want to reduce the dimensionality of the process to a, where a is much less than p. To do this we use the P loading matrix, which provides the coefficients for linear combinations of our X variables that give the score variables T, shown in the equations on the left. T times P transpose will give you predicted values for your process variables from the low-dimensional representation, and there's some prediction error. Your score variables are selected in a way that minimizes this squared prediction error; another way to think about it is, you're maximizing the amount of variance explained in X. PLS is more suitable when you have a set of process variables and a set of quality variables and you really want to ensure that the quality variables are kept in control, but those variables are often expensive or time consuming to collect. A plant can be making out-of-control quality for a long time before a fault is detected, so PLS models allow you to monitor your quality variables as a function of your process variables. And you can see here that PLS will find score variables that maximize the variance explained in the Y variables. The process variables are often cheaper and more readily available, so PLS models can allow you to detect quality faults early and can make process monitoring cheaper. So from here on out, I'm just going to focus on PLS models, because that's more appropriate for our example. PLS partitions your data into two components. The first component is your model component. This gives you the predicted values. Another way to think about this is that your data has been projected into a model plane defined by your score variables, and T² charts will monitor variation in this model plane.
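Written out for the PCA case the slide describes (PLS is analogous, with scores chosen to also explain variance in Y), the projection algebra with a factors is

$$\hat{X} = T P^{\top}, \qquad T = X P, \qquad E = X - \hat{X}$$

$$T^2_i = \sum_{k=1}^{a} \frac{t_{ik}^2}{s_k^2}, \qquad \mathrm{SPE}_i = \lVert e_i \rVert^2 = \sum_{j=1}^{p} \left( x_{ij} - \hat{x}_{ij} \right)^2$$

where s_k^2 is the variance of the k-th score in the historical data. The T² chart monitors the first quantity (variation within the model plane); the SPE chart described next monitors the second (the residual, off-plane variation).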
The second component is your error component. This is the distance between your original data and the predicted data, and squared prediction error charts, or SPE charts, will monitor variation in this component. We also provide an alternative, Distance to Model X Plane, which is just a normalized version of SPE. The last concept that's important to understand for the demo is the distinction between historical and current data. Historical data are typically collected when the process is known to be in control. These data are used to build the PLS model and define normal process variation, and this allows a control limit to be obtained. Current data are assigned scores based on the model, but are independent of the model. Another way to think about this is that we have a training and a test set, and the T² control limit is lower for the training data because we expect lower variability for observations used to train the model, whereas there's greater variability in T² when the model generalizes to a test set. Fortunately, there's some theory that's been worked out for the variance of T² that allows us to obtain control limits based on some distributional assumptions. In the demo we'll be monitoring the Tennessee Eastman process, so I'm going to present a short introduction to these data. This is a simulation of a chemical process developed by Downs and Vogel, two chemists at Eastman Chemical, and it was originally written in Fortran, but there are wrappers for it in MATLAB and Python now. The simulation was based on a real industrial process, but it was manipulated to protect proprietary information. The simulation covers the production of two liquids from gaseous reactants, and F is a byproduct that will need to be siphoned off from the desired product. The Tennessee Eastman process is pervasive in the literature on benchmarking multivariate process control methods. So this is the process diagram. It looks complicated, but it's really not that bad, so I'm going to walk you through it. The gaseous reactants A, D, and E are flowing into the reactor here; the reaction occurs and the product leaves as a gas. It's then cooled and condensed into a liquid in the condenser. Then we have a vapor-liquid separator that will remove any remaining vapor and recycle it back to the reactor through the compressor, and there's also a purge stream here that will vent byproduct and an inert chemical to prevent them from accumulating. Then the liquid product will be pumped through a stripper, where the remaining reactants are stripped off, and the final purified product leaves here in the exit stream. The first set of variables being monitored are the manipulated variables. These look like bow ties in the diagram; I think they're actually meant to be valves. The manipulated variables mostly control the flow rate through different streams of the process. These variables can be set to specific values within limits and have some Gaussian noise. The manipulated variables can be sampled at any rate; we're using the default three-minute sampling interval. Some examples of the manipulated variables are the flow rate of the reactants into the reactor, the flow rate of steam into the stripper, and the flow of coolant into the reactor. The next set of variables are measurement variables.
These are shown as circles in the diagram, and they're also sampled in three-minute intervals; the difference is that the measurement variables can't be manipulated in the simulation. Our quality variables will be the percent composition of two liquid products; you can see the analyzer measuring the composition here. These variables are collected with a considerable time delay, so we're looking at the product in this stream because these variables can be measured more readily than the product leaving in the exit stream. And we'll also be building a PLS model to monitor our quality variables by means of our process variables, which have substantially less delay and a faster sampling rate. Okay, so that's the background on the data. In total there are 33 process variables and two quality variables. The process of collecting the variables is simulated with a series of differential equations. So this is just a simulation, but you can see that a considerable amount of care went into modeling this as a real-world process. So here's an overview of the demo I'm about to show you. We'll collect data on our process and then store these data in a database. I wanted to have an example that was easy to share, so I'll be using a SQLite database, but this workflow is relevant to most types of databases. Most databases support ODBC connections; once JMP connects to the database, it can periodically check for new observations and update the JMP table as they come in. And then, if we have a Model Driven Multivariate Control Chart report open with automatic recalc turned on, we have a mechanism for updating the control charts as new data come in. The whole process of adding data to a database will likely be going on on a separate computer from the computer doing the monitoring. So I have two sessions of JMP open to emulate this; both sessions have their own journal, and the materials are provided on the Community. The first session will add simulated data to the database, and it's called the streaming session, and the next session will update reports as data come into the database, and I'm calling that the monitoring session. One thing I really liked about the Downs and Vogel paper was that they didn't provide a single metric to evaluate the control of the process. I have a quote from the paper here: "We felt that the trade-offs among possible control strategies and techniques involved much more than a mathematical expression." So here are some of the goals they listed in their paper which are relevant to our problem: maintain the process variables at desired values, minimize variability of the product quality during disturbances, and recover quickly and smoothly from disturbances. So we will assess how well our process achieves these goals using our monitoring methods. Okay. So to start off, I'm in the monitoring session journal, and I'll show you our first data set. The data table contains all the variables I introduced earlier; the first set are the measurement variables, the next set are the composition variables, and the last set are the manipulated variables. And the first script attached here will fit a PLS model; it excludes the last hundred rows as a test set. And just as a reminder, this model is predicting our two product composition variables as a function of our process variables, but the PLS model is not the focus of the talk, so I've already fit the model and output score columns here.
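For context, a minimal JSL sketch of that model-fitting step. It assumes JMP Pro's Partial Least Squares platform; the column names are placeholders rather than the ones in the Tennessee Eastman table, and only a few of the 33 process variables are listed.

dt = Current Data Table();
pls = dt << Partial Least Squares(
    Y( :comp_G, :comp_H ),           // the two product quality (composition) variables
    X( :xmeas_1, :xmeas_2, :xmv_1 )  // ...and so on for all 33 process variables
);
// The X score columns, carrying the MDMVCC historical statistics column property, are then
// saved from the platform's red triangle menu; the exact JSL message for that step is not shown here.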
And if we look at the column properties, you can see that there's an MDMVCC Historical Statistics property that contains all the information on your model that you need to construct the multivariate control charts. One of the reasons why Model Driven Multivariate Control Chart was designed this way: imagine you're a statistician and you want to share your model with an engineer so they can construct control charts. All you need to do is provide the data table with these formula columns; you don't need to share all the gory details of how you fit your model. So next I will use the score columns to create our control charts. On the left, I have two control charts, T² and SPE. There are 860 observations that were used to estimate the model, and these are labeled as historical, and then I have 100 observations that were held out as a test set. And you can see in the limit summaries down here that I performed a Bonferroni correction for multiple testing based on the historical data. I did this up here in the red triangle menu; you can set the alpha level to anything you want. I did this correction because the data are known to be at normal operating conditions, so we expect no observations to be out of control, and after this multiplicity adjustment there are zero false alarms. On the right are the contribution proportion heat maps. These indicate how much each variable contributes to the out-of-control signal; each observation is on the y axis, and the contributions are expressed as a proportion. And you can see in both of these plots that the contributions are spread pretty evenly across the variables. And at the bottom I have a score plot. Right now we're just plotting the first score dimension versus the second score dimension, but you can look at any combination of the score dimensions using these drop-down menus or this arrow. Okay, so now that we're oriented to the report, I'm going to switch over to the streaming session, which will stream data into the database. In order to do anything for this example, you'll need to have a SQLite ODBC driver installed. It's easy to do; you can just follow this link here. And I don't have time to talk about this, but I created the SQLite database I'll be using in JMP; I have instructions on how to do this, and how to connect JMP to the database, on my Community web page. This example might be helpful if you want to try this out on data of your own. I've already created a connection to this database, and I've shared the database on the Community. So I'm going to take a peek at the data tables in Query Builder; I can do that with a table snapshot. The first data set is the historical data; I've used this to construct the PLS model, and there are 960 observations that are in control. The next data table is the monitoring data table. It just contains the historical data at first, but I'll gradually add new data to it, and this is what our multivariate control chart will be monitoring. And then I've simulated the new data already and added it to this data table here; you can see it starts at timestamp 961, and there's another 960 observations, but I've introduced a fault at some time point. I wanted to have something easy to share, so I'm not going to run my simulation script and add to the database that way. I'm just going to take observations from this new data table and move them over to the monitoring data table using some JSL with SQL statements.
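As a rough sketch of the kind of streaming loop described next (the DSN, table name, and column names are placeholders, and only two columns are shown where the real script inserts all of them):

dtNew = Open( "newdata.jmp", invisible );                 // the simulated "new" observations
biteSize = 20;
dbc = Create Database Connection( "DSN=TEP_SQLite;" );
For( i = 1, i <= N Rows( dtNew ), i += biteSize,
    last = Min( i + biteSize - 1, N Rows( dtNew ) );
    vals = "";
    For( r = i, r <= last, r++,
        vals = vals || If( r > i, ", ", "" ) ||
            "(" || Char( dtNew:timestamp[r] ) || ", " || Char( dtNew:xmeas_1[r] ) || ")"
    );
    Execute SQL( dbc, "INSERT INTO monitoring VALUES " || vals || ";" );
    Wait( 2 );                                            // slow the stream so the chart updates are visible
);
Close Database Connection( dbc );
// The monitoring session does the reverse: on a short timer it SELECTs any rows newer than the
// ones it already has, appends them to the JMP table, and lets automatic recalc refresh the report.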
And this is just a simple example emulating the process of new data coming into a database somehow; you might not actually do this with JMP, but it's an opportunity to show how you can do it with JSL. Next, I'll show you the script we'll use to stream in the data. This is a simple script, so I'm just going to walk you through it real quick. The first set of commands will open the new data table from the SQLite database; it opens up in the background, so I don't have to deal with the window. Then I'm going to take pieces from this new data table and move them to the monitoring data table. I'm calling the pieces bites, and the bite size is 20. Then this will create a database connection, which will allow me to send the database SQL statements. And this last bit of code will iteratively construct SQL statements that insert new data into the monitoring data table. So I'm going to initialize, okay, and show you the first iteration of this loop. So this is just a simple INSERT INTO statement that inserts the first 20 observations. I'll comment that out so it runs faster. And there's a Wait statement down here; this will just slow down the stream so that we have enough time to see the progression of the data in the control charts. If I didn't have this, the streaming example would just be over too quickly. Okay, so I'm going to switch back to the monitoring session and show you some scripts that will update the report. I'll move this over to the right so you can see the report and the scripts at the same time. This "read from monitoring data" script is a simple script that checks the database every 0.2 seconds and adds new data to the JMP table. And since the report has automatic recalc turned on, the report will update whenever new data are added. And I should add that, realistically, you probably wouldn't use a script that just iterates like this; you'd probably use Task Scheduler on Windows or Automator on Mac to schedule the runs. And then the next script here will push the report to JMP Public whenever the report is updated. I was really excited that this is possible in JMP. It enables any computer with a web browser to view updates to the control chart; you can even view the report on your smartphone. So this makes it easy to share results across organizations. You could also use JMP Live if you wanted the reports to be on a restricted server. And then this script will recreate the historical data in the data table in case you want to run the example multiple times. Okay, so let's run the streaming script and look at how the report updates. You can see the data is in control at first, but then a fault is introduced and there's a large out-of-control signal, but there's a plant-wide control system that's been implemented in the simulation, which brings the system to a new equilibrium. I'll give this a second to finish. And now that I've updated the control chart, I'm going to push the results to JMP Public. On my JMP Public page I have, at first, the control chart with the data in control at the beginning, and this should be updated with the addition of the new data. So if we zoom in on when the process first went out of control, Jeremy Ash it looks like that was sample 1125. I'm going to color that and label it so that it shows up in other plots. And then in the SPE plot, it looks like this observation is still in control. Which chart will catch faults earlier depends on your model
and how many factors you've chosen. We can also zoom in on that time point in the contribution plot, and you can see that when the process first goes out of control, there's a large number of variables contributing to the out-of-control signal, but then, when the system reaches a new equilibrium, only two variables have large contributions. So I'm going to remove these heat maps so that I have more room in the diagnostics section, and I've made everything pretty large so that the text shows up on your screen. If I hover over the first point that's out of control, you can get a peek at the top 10 contributing variables. This is great for quickly identifying which variables are contributing the most to the out-of-control signal. I can also click on that plot and append it to the diagnostics section, and you can see that there's a large number of variables contributing to the out-of-control signal. I'll zoom in here a little bit. If one of the bars is red, that means the variable is out of control in a univariate control chart, and you can see this by hovering over the bars. I'm going to pin a couple of those. These graphlets are IR charts for the individual variables with three-sigma control limits. You can see that for the stripper pressure variable, the observation is out of control in the univariate control chart, but the variable is eventually brought back under control by our control system, and that's true for most of the large contributing variables. I'll also show you one of the variables where the observation is in control. So once the control system responds, many variables are brought back under control and the process reaches a new equilibrium. But there's obviously a shift in the process. To identify the variables that are contributing to the shift, one thing you can look at is a mean contribution plot. If I sort this and look at the variables that are contributing most, it looks like just two variables have large contributions, and both of these are measuring the flow rate of reactant A in stream 1, which is coming into the reactor. These are measuring essentially the same thing, except one is a measurement variable and one's a manipulated variable. And you can see in the univariate control chart that there's a large step change in the flow rate, in this one as well, and this is the step change that I programmed in the simulation. So these contributions allow us to quickly identify the root cause. I'm going to present a few other alternate methods to identify the same cause of the shift, and the reason is that in real data, process shifts are often more subtle, and some of the tools may be more useful in identifying them than others. We'll consistently arrive at the same conclusion with these alternate methods, so this will show some of the ways these methods are connected. Down here, I have a score plot, which can provide supplementary information about shifts beyond the T² plot. It's more limited in its ability to capture high-dimensional shifts, because only two dimensions of the model are visualized at a time; however, it can provide a more intuitive view of the process, as it visualizes it in a low-dimensional representation. In fact, one of the main reasons why multivariate control charts are split into T² and SPE in the first place is that this provides enough dimensionality reduction to easily visualize the process in a scatter plot.
So we want to identify the variables that are causing the shift. I'm going to color the points before and after the shift so that they show up in the score plot. Typically, we would look through all combinations of the six factors, but that's a lot of score plots to look through. So something that's very handy is the ability to cycle through all combinations quickly with this arrow down here, and we can look through the factor combinations and find one where there's large separation. And if we wanted to identify where the shift first occurred in the score plots, you can connect the dots and see that the shift occurred around 1125 again. Another useful tool, if you want to identify the score dimensions where an observation shows the largest separation from the historical data and you don't want to look through all the score plots, is the normalized score plot. So I'm going to select a point after the shift and look at the normalized score plot. I'm actually going to choose another one. Okay. Jeremy Ash Because I want to look at dimensions five and six. These plots show the magnitude of the score in each dimension, normalized so that the dimensions are on the same scale. And since the mean of the historical data is at zero for each score dimension, the dimensions with the largest magnitude will show the largest separation between the selected point and the historical data. So it looks like here dimensions five and six show the greatest separation, and I'm going to move to those. So there's large separation here between our shifted data and the historical data, and the score plot visualization can also be more interpretable, because you can use the variable loadings to assign meaning to the factors. Here we have too many variables to see all the labels for the loading vectors, but you can hover over and see them. And you can see, if I look in the direction of the shift, that the two variables that were the cause show up there as well. We can also explore differences between subgroups in the process with the group comparisons tool. To do that, I'll select all the points before the shift and call that the reference group, and everything after and call that the group I'm comparing to the reference. And this contribution plot will give me the variables that are contributing the most to the difference between these two groups, and you can see that this also identifies the variables that caused the shift. The group comparisons tool is particularly useful when there are multiple shifts in a score plot, or when you can see more than two distinct subgroups in your data. In our case, as we're comparing a group in our current data to the historical data, we could also just select the data after the shift and look at a mean contribution score plot. This will give us the average contributions of each variable to the scores in the orange group, and since large scores indicate a large difference from the historical data, these contribution plots can also identify the cause. These use the same formula as the contribution formula for T², but now we're just using the two factors from the score plot. Okay, I'm going to find my PowerPoint again. So real quick, I'm going to summarize the key features of Model Driven Multivariate Control Chart that were shown in the demo. The platform is capable of performing both online fault detection and offline fault diagnosis.
There are many methods provided in the platform for drilling down to the root cause of faults. I'm showing here some plots from the popular book Fault Detection and Diagnosis in Industrial Systems; throughout the book, the authors demonstrate how one needs to use multivariate and univariate control charts side by side to get a sense of what's going on in the process. One particularly useful feature of Model Driven Multivariate Control Chart is how interactive and user-friendly it is to switch between these types of charts. So that's my talk. Here's my email if you have any further questions, and thanks to everyone who tuned in to watch this.
Roland Jones, Senior Reliability Engineer, Amazon Lab126 Larry George, Engineer who does statistics, Independent Consultant Charles Chen SAE MBB, Quality Manager, Applied Materials Mason Chen, Student, Stanford University OHS Patrick Giuliano, Senior Quality Engineer, Abbott Structural Heart   The novel coronavirus pandemic is undoubtedly the most significant global health challenge of our time. Analysis of infection and mortality data from the pandemic provides an excellent example of working with real-world, imperfect data in a system with feedback that alters its own parameters as it progresses (as society changes its behavior to limit the outbreak). With a tool as powerful as JMP it is tempting to throw the data into the tool and let it do the work. However, using knowledge of what is physically happening during the outbreak allows us to see what features of the data come from its imperfections, and avoid the expense and complication of over-analyzing them. Also, understanding of the physical system allows us to select appropriate data representation, and results in a surprisingly simple way (OLS linear regression in the ‘Fit Y by X’ platform) to predict the spread of the disease with reasonable accuracy. In a similar way, we can split the data into phases to provide context for them by plotting Fitted Quantiles versus Time in Fit Y by X from Nonparametric density plots. More complex analysis is required to tease out other aspects beyond its spread, answering questions like "How long will I live if I get sick?" and "How long will I be sick if I don’t die?". For this analysis, actuarial rate estimates provide transition probabilities for Markov chain approximation to SIR models of Susceptible to Removed (quarantine, shelter etc.), Infected to Death, and Infected to Cured transitions. Survival Function models drive logistics, resource allocation, and age-related demographic changes. Predicting disease progression is surprisingly simple. Answering questions about the nature of the outbreak is considerably more complex. In both cases we make the analysis as simple as possible, but no simpler.     Auto-generated transcript...   Speaker Transcript Roland Jones Hi, my name is Roland Jones. I work for Amazon Lab 126 is a reliability engineer.   When myself and my team   put together our abstracts for the proposal at the beginning of May, we were concerned that COVID 19 would be old news by October.   At the time of recording on the 21st of August, this is far from the case. I really hope that by the time you watch this in October, there will...things will be under control and life will be returning to normal, but I suspect that it won't.   With all the power of JMP, it is tempting to throw the data into the tool and see what comes out. The COVID 19 pandemic is an excellent case study   of why this should not be done. The complications of incomplete and sometimes manipulated data, changing environments, changing behavior, and changing knowledge and information, these make it particularly dangerous to just throw the data into the tool and see what happens.   Get to know what's going on in the underlying system. Once the system's understood, the effects of the factors that I've listed can be taken into account.   Allowing the modeling and analysis to be appropriate for what is really happening in the system, avoiding analyzing or being distracted by the imperfections in the data.   It also makes the analysis simpler. 
The overriding theme of this presentation is to keep things as simple as possible, but no simpler. There are some areas towards the end of the presentation that are far from simple, but even here, we're still working to keep things as simple as possible. We started by looking at the outbreak in South Korea. It had a high early infection rate and was a trustworthy and transparent data source. Incidentally, all the data in the presentation comes from the Johns Hopkins database as it stood on the 21st of August when this presentation was recorded. This is a difficult data set to fit a trend line to. We know that disease naturally grows exponentially. So let's try something exponential. As you can see, this is not a good fit. And it's difficult to see how any function could fit the whole dataset. Something that looks like an exponential is visible here in the first 40 days. So let's just fit to that section. There is a good exponential fit. Roland Jones What we can do is partition the data into different phases and fit functions to each phase separately. 1, 2, 3, 4 and 5. Partitions were chosen where the curve seemed to transition to a different kind of behavior. Parameters in the fit function were optimized for us in JMP's nonlinear fit tool. Details of how to use this tool are in the appendix. Nonlinear also produced the root mean square error results, the sigma of the residuals. So for the first phase, we fitted an exponential; second phase was logarithmic; third phase was linear; fourth phase, another logarithmic; fifth phase, another linear. You can see that we have a good fit for each phase; the root mean square error is impressively low. However, as partition points were specifically chosen where the curve changed behavior, a low root mean square error is to be expected. The trend lines have negligible predictive ability because the partition points were chosen by looking at existing data. This can be seen in the data presented since the analysis, which was performed on the 19th of June. With the extra data available, we could choose different partition points and get a better fit, but this will not help us to predict beyond the new data. Partition points do show where the outbreak behavior changes, but this could be seen before the analysis was performed. Also, no indication is given as to why the different phases have a different fit function. This exercise does illustrate the difficulty of modeling the outbreak, but does not give us much useful information on what is happening or where the outbreak is heading. We need something simpler. We're dealing with a system that contains self learning. As we, as a society, learn more about the disease, we modify behavior to limit its spread, changing the outbreak trajectory. Let's look into the mechanics of what's driving the outbreak, starting with the numbers themselves and working backwards to see what is driving them. The news is full of COVID 19 numbers: the USA hits 5 million infections and 150,000 deaths; California has higher infections than New York; daily infections in the US could top 100,000 per day. Individual numbers are not that helpful. Graphs help to put the numbers into context. The right graphs help us to see what is happening in the system. Disease grows exponentially. One person infects two, who infect four, who infect eight. Human eyes differentiate poorly between different kinds of curves but they differentiate well between curves and straight lines.
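The per-phase exponential fit described above was done in JMP's Nonlinear platform; as a hedged aside, the same kind of fit can be sketched in a few lines of R. The data below are synthetic cumulative case counts invented for illustration, and the 40-day window simply mirrors the early phase mentioned in the talk.

```r
# Illustrative only: synthetic cumulative-case counts standing in for the
# Johns Hopkins series used in the talk.
set.seed(1)
day   <- 1:40
cases <- round(20 * exp(0.12 * day) * exp(rnorm(40, 0, 0.05)))

# Fit an exponential y = a * exp(b * day) to the early phase, analogous to
# the per-phase fits done with JMP's nonlinear fit tool.
fit <- nls(cases ~ a * exp(b * day),
           start = list(a = 20, b = 0.1))

summary(fit)$coefficients        # estimates of a and b
sqrt(mean(resid(fit)^2))         # RMSE, the "sigma of the residuals"
```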
Plotting on a log scale changes exponential growth and exponential decline into straight lines. Also, on the log scale, early data is now visible where it was not visible on the linear scale. Many countries show one, sometimes two plateaus, which were not visible in the linear graph. So you can see here for South Korea, there's one plateau, two plateaus and, more recently, it's beginning to grow for a third time. How can we model this kind of behavior? Let's keep digging. The slope on the log infections graph is the percentage growth. Plotting percentage growth gives us more useful information. Percentage growth helps to highlight where things changed. If you look at the decline in the US numbers, the orange line here, you can see that the decline started to slacken off sometime in mid April and can be seen to be reversing here in mid June. This is visible but it's not as clear in the infection graphs. It's much easier to see these changes in the percentage growth graph. Many countries show a linear decline in percentage growth when plotted on a log scale. Italy is a particularly fine example of this. But it can also be seen clearly in China, in South Korea, and in Russia, and also to a lesser extent in many other countries. Why is this happening? Intuitively, I would expect that when behavior changes, growth would drop down to a lower percent and stay there, not exponentially decline toward zero. I started plotting graphs on COVID 19 back in late February, not to predict the outbreak, but because I was frustrated by the graphs that were being published. After seeing this linear decline in percentage growth, I started taking an interest in prediction. Extrapolating that percentage growth line through linear regression actually works pretty well as a predictor, but it only works when the growth is declining. It does not work at all well when the growth is increasing. Again, going back to the US orange line, if we extrapolate from this small section here, where it's increasing, which is from the middle of June to the beginning of July, we can predict that we will see a 30% increase by around the 22nd of July, that it will go up to 100% weekly growth by the 26th of August, and that it will keep on growing from there, up and up and up and up. Clearly, this model does not match reality. I will come back to this exponential decline in percentage growth later. For now, let's keep looking at what is physically going on as the disease spreads. People progress from being susceptible to the disease, to being infected, to being contagious, to being symptomatic, to being noncontagious, to being recovered. This is the Markov SIR model. SIR stands for susceptible, infected, recovered. The three extra stages of contagious, symptomatic and noncontagious help us to model the disease spread and relate it to what we can actually measure. Note the difference between infected and contagious. Infected means you have the disease; contagious means that you can spread it to others. It's easy to confuse the two, but they are different and will be used in different ways further into this analysis. The timings shown are best estimates and can vary greatly. Infected to symptomatic can be from three to 14 days, and some infected people are never symptomatic. The only data that we have access to is confirmed infections, which usually come from test results, which usually follow from being symptomatic.
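As a small aside on the "slope of the log graph is the percentage growth" point above, the relationship can be checked directly. The sketch below uses made-up cumulative counts; the only claim it illustrates is that daily percentage growth and the first difference of the log series carry the same information, which is why exponential growth plots as a straight line on a log scale.

```r
# Illustrative sketch: daily percentage growth from a cumulative series.
# 'cases' is assumed to be a vector of cumulative confirmed infections.
cases <- c(100, 112, 125, 140, 157, 176, 197, 221)   # made-up numbers

pct_growth <- 100 * (cases[-1] / cases[-length(cases)] - 1)
pct_growth                   # percent growth per day

# The slope of log(cases) is log(1 + growth rate) per day, so constant
# percentage growth appears as a straight line on the log scale.
diff(log(cases))
```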
Even if testing is performed on non-symptomatic people, there's about a five-day delay from being infected to having a positive test result. So we're always looking at old data. We can never directly observe the true number of people infected. So the disease progresses through individuals from top to bottom in this diagram. We have a pool of people that are contagious; that pool is fed by people that are newly infected becoming contagious, and the pool is drained by people that are contagious becoming non-contagious. The disease spreads through the population from left to right. New infections are created when susceptible people come into contact with contagious people and become infected. The newly infected people join the queue waiting to become contagious and the cycle continues. This cycle is controlled by transmission, how likely a contagious person is to infect a susceptible person per day, and by reproduction, the number of people that a contagious person is likely to infect while they are contagious. This whole cycle revolves around the number of people contagious and the transmission or reproduction. The time individuals stay contagious should be relatively constant unless COVID 19 starts to mutate. The transmission can vary dramatically depending on social behavior and the size of the susceptible population. Our best estimate is that the days contagious averages out at about nine. So we can estimate people contagious as the number of people confirmed infected in the last nine days. In some respects, this is an underestimate because it doesn't include people that are infected but not yet symptomatic, or that are asymptomatic, or that don't yet have a positive test result. In other respects, it's an overestimate because it includes people who were infected a long time ago but are only now testing positive. It's an estimate. From the estimate of people contagious, we can derive the percentage growth in contagious. It doesn't matter if the people contagious is an overestimate or underestimate. As long as the percentage error in the estimate remains constant, the percentage growth in contagious will be accurate. Percentage growth in contagious is useful because we can use it to derive transmission. The derivation of the equation relating the two can be found in the appendix. Note that this equation allows you to derive transmission and then reproduction from the percentage growth in contagious, but it cannot tell you the percentage growth in contagious for a given transmission. This can only be found by solving numerically. I have outlined how to do this using JMP's fit model tool in the appendix. Reproduction and transmission are very closely linked, but reproduction has the advantage of ease of understanding. If it is greater than one, the outbreak is expanding, out of control. Infections will continue to grow and there will be no end in sight. If it is less than one, the outbreak is contracting, coming under control. There are still new infections, but their number will gradually decline until they hit zero. The end is in sight, though it may be a long way off. The number of people contagious is the underlying engine that drives the outbreak. People contagious grows and declines exponentially. We can predict the path of the outbreak by extrapolating this growth or decline in people contagious. Here we have done it for Russia and Italy and for China.
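The "people contagious" estimate described above (confirmed infections in the last nine days) is just a rolling difference of the cumulative series, and its percentage growth follows directly. A minimal R sketch is below; the cumulative counts are synthetic, and the step from percentage growth to transmission and reproduction relies on the equation in the talk's appendix, which is not reproduced here.

```r
# Illustrative sketch: estimate "people contagious" as those confirmed
# infected in the last 9 days, then compute its percentage growth.
set.seed(2)
days_contagious <- 9
cum_inf <- cumsum(rpois(60, lambda = 50 * 1.05^(1:60)))   # made-up cumulative infections

lagged     <- c(rep(0, days_contagious), head(cum_inf, -days_contagious))
contagious <- cum_inf - lagged                # infected in the last 9 days

# Percentage growth in contagious; a constant over- or under-estimate of the
# contagious pool cancels out of this ratio, as noted in the talk.
pct_growth_contagious <- 100 * (contagious[-1] / contagious[-length(contagious)] - 1)
tail(round(pct_growth_contagious, 1))
```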
Remember the interesting observation from earlier: the infections percentage growth declines exponentially, and here's why. If reproduction is less than one and constant, people contagious will decline exponentially towards zero. People contagious drives the outbreak. The percentage growth in infections is proportional to the number of people contagious. So if people contagious declines exponentially, the percentage growth in infections will also decline exponentially. Mystery solved. The slope of people contagious plotted on a log scale gives us the contagious percentage growth, which then gives us transmission and reproduction through the equations on the last slide. Notice that there's a weekly cycle in the data. This is particularly visible in Brazil, but it's also visible in other countries as well. This may be due to numbers getting reported differently at the weekends or by people being more likely to get infected at the weekend. Either way, we'll have to take this seasonality into account when using people contagious to predict the outbreak. Because social behavior is constantly changing, transmission and reproduction change as well. So we can't use the whole data history to estimate reproduction. We chose 17 days as the period over which to estimate reproduction. We found that one week was a little too short to filter out all of the noise, two weeks gave better results, and two and a half weeks was even better. Having the extra half week evened out the seasonality that we saw in the data. There is a time series forecast tool in JMP that will do all of this for us, including the seasonality, but because we're performing the regression on small sections of the data, we didn't find the tool helpful. Here are the derived transmission and reproduction numbers. You can see that they can change quickly. It is easy to get confused by these numbers. South Korea is showing a significant increase in reproduction, but it's doing well. The US, Brazil, India and South Africa are doing poorly, but seem to have a reproduction of around one or less. This is a little confusing. To help reduce the confusion around reproduction, here's a little bit of calculus. Driving a car, the gas pedal controls acceleration. To predict where the car is going to be, you need to know where you are, how fast you're traveling and how much you're accelerating or decelerating. In a similar way, to know where the pandemic is going to be, we need to know how many infections there are, which is the equivalent of distance traveled. We need to know how fast the infections are expanding, or how many people are contagious, both of which are the equivalent of speed. We need to know how fast the people contagious is growing, which is transmission or reproduction, the equivalent of acceleration. There is a slight difference. Distance grows linearly with speed and speed grows linearly with acceleration. Infections do grow linearly with people contagious, but people contagious grows exponentially with reproduction. There is a slight difference, but the principle's the same. The US, Brazil, India and South Africa have all traveled a long distance. They have high infections and they're traveling at high speed; they have high contagious. Even a little bit of acceleration has a very big effect on the number of infections. South Korea, on the other hand, is not going fast; it has low contagious.
So it has the headroom to respond to the blip in acceleration and get things back under control without covering much distance. Also, when the number of people contagious is low, adding a small number of new contagious people produces a significant acceleration. Countries that have things under control are prone to these blips in reproduction. You have to take all three factors into account (number of infections, people contagious and reproduction) to decide if a country is doing well or doing poorly. Within JMP there are a couple of ways to perform the regression to get the percentage growth of contagious. There's the Fit Y by X tool and there's the nonlinear tool. I have details on how to use both these tools in the appendix. But let's compare the results they produce. The graphs shown compare the results from both tools. The 17 data points used to make the prediction are shown in red. The prediction lines from both tools are just about identical, though there are some noticeable differences in the confidence lines. The confidence lines for the nonlinear tool are much better. The Fit Y by X tool transposes the data into linear space before finding the best fit straight line. This results in the lower confidence line pulling closer to the prediction line after transposing back into the original space. Confidence lines are not that useful when parameters that define the outbreak are constantly changing. Best case, they will help you to see when the parameters have definitely changed. In my scripts, I use linear regression calculated in column formulas, because it's easy to adjust with variables. This allows the analysis to be adjusted on the fly without having to pull up the tool in JMP. I don't currently use the confidence lines in my analysis, but I'm working on a way to integrate them into the column formulas. Linear regression is simpler and produces almost identical results. Once again, keep it simple. We have seen how fitting an exponential to the number of people contagious can be used to predict where people contagious will be in the future, and also to derive transmission. Now that we have a prediction line for people contagious, we need to convert that back into infections. Remember, new infections equals people contagious multiplied by transmission. Transmission is the probability that a contagious person will infect a susceptible person per day. The predicted graphs that result from this calculation are shown. Note that South Korea and Italy have low infections growth. However, they have a high reproduction extrapolated from the last 17 days' worth of data. So, South Korea here and Italy here: low growth, but you can see them taking off because of that high reproduction number. The infections growth becomes significant between two and eight weeks after the prediction is made. For South Korea, this is unlikely to happen because they're moving slowly and have the headroom to get things back under control. South Korea has had several of these blips as it opens up and always manages to get things back under control. In the predicted growth percent graph on the right, note how the increasing percentage growth in South Korea and Italy will not carry on increasing indefinitely, but plateaus out after a while. Percentage growth is still seen to decline exponentially, but it does not grow exponentially. It plateaus out. So to summarize, the number of people contagious is what drives the outbreak.
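The prediction step just described (extrapolate people contagious with a log-linear regression over the trailing 17 days, then multiply by transmission to get new infections) can be sketched as follows. This is a hedged illustration, not the speaker's script: the contagious series and the transmission value are placeholders, and only the 17-day window and the relation "new infections = contagious x transmission" come from the talk.

```r
# Illustrative sketch of the prediction step.
contagious <- 500 * 1.03^(1:60)        # stand-in for the estimated contagious pool
window     <- 17
n          <- length(contagious)
recent     <- data.frame(day = (n - window + 1):n,
                         contagious = contagious[(n - window + 1):n])

fit <- lm(log(contagious) ~ day, data = recent)   # exponential fit on the log scale

future          <- data.frame(day = (n + 1):(n + 28))
pred_contagious <- exp(predict(fit, newdata = future))

transmission        <- 0.10           # placeholder: probability per contagious person per day
pred_new_infections <- pred_contagious * transmission
round(head(pred_new_infections))
```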
This metric is not normally reported, but it's close to the number of new infections over a fixed period of time. New infections in the past week is the closest regularly reported proxy to the number of people contagious. This is what we should be focusing on, not the number of infections or the number of daily new infections. Exponential regression of people contagious will predict where the contagious numbers are likely to be in the future. Percentage growth in contagious gives us transmission and reproduction. The contagious number and transmission number can be combined to predict the number of new infections in the future. That prediction method assumes that transmission and reproduction are constant, which they aren't; they change as behavior changes. But the predictions are still useful to show what will happen if behavior does not change, or how much behavior has to change to avoid certain milestones. The only way to close this gap is to come up with a way to mathematically model human behavior. If any of you know how to do this, please get in touch. We can make a lot of money, though only for a short amount of time. This is the modeling. Let's check how accurate it is by looking at historical data from the US. As mentioned, the prediction works well when reproduction's constant but not when it's changing. If we take a prediction based on data from late April to early May, it's accurate as long as the reproduction number stays at around the same level of 1.0. After the reproduction number starts rising, you can see that the prediction underestimates the number of infections. The prediction based on data from late June to mid July, when reproduction was at its peak as states were beginning to close down again, overestimates the infections as reproduction comes down. The model is good at predicting what will happen if behavior stays the same but not when behavior is changing. How can we predict deaths? It should be possible to estimate the delay between infection and death, and the proportion of infections that result in deaths, and then use this to predict deaths. However, changes in behavior such as increasing testing and tracking skew the number of infections detected. So to avoid this skew also feeding into the predictions for deaths, we can use the exact same mathematics on deaths that we used on infections. As with infections, the deaths graph shows accurate predictions when the deaths reproduction is stable. Note that contagious and reproduction numbers for deaths don't represent anything real. This method works because deaths follow infections and so follow the same trends and the same mathematics. Once again, keep it simple. We have already seen that the model assumes constant reproduction. It also does not take into account herd immunity. We are fitting an exponential, but the outbreak really follows a binomial distribution. A binomial and a fitted exponential differ by less than 2% with up to 5% of the population infected. Graphs demonstrating this are in the appendix. When more than 5% of the population is no longer susceptible, due to previous infection or to vaccination, transmission and reproduction naturally decline. So predictions based on recent reproduction numbers will still be accurate; however, long-term predictions based on an old reproduction number, with significantly less herd immunity, will overestimate the number of infections.
On the 21st of August, the US had per capita infections of 1.7%. If only 34% of infected people have been diagnosed as infected, and there is data that indicates that this is likely, we are already at the 5% level where herd immunity begins to have a measurable effect. At 5%, it reduces reproduction by about 2%. What can the model show us? Reproduction tells us whether the outbreak is expanding (greater than 1, the equivalent of accelerating) or contracting (less than 1, the equivalent of decelerating). The estimated number of people contagious tells us how bad the outbreak is, how fast we're traveling. Per capita contagious is the right metric for choosing appropriate social restrictions. The recommendations for social restrictions listed on this slide are adapted from those published by the Harvard Global Health Institute. There's a reference in the appendix. What they recommend is: when there are less than 12 people contagious per million, test and trace is sufficient. When we get up to 125 contagious per million, rigorous test and trace is required. At 320 contagious per million, we need rigorous test and trace and some stay-at-home restrictions. Greater than 320 contagious per million, stay-at-home restrictions are necessary. At the time of writing, the US had 1,290 contagious per million, down from 1,860 at the peak in late July. It's instructive to look at the per capita contagious in various countries and states when they decided to reopen. China and South Korea had just a handful of people contagious per million. Europe was in the tens of people contagious per million, except for Italy. The US had hundreds of people contagious per million when they decided to reopen. We should not really have reopened in May. This was an emotional decision, not a data-driven decision. Some more specifics about the US reopening. As I said, the per capita contagious in the US at the time of writing was 1,290 per million. 1,290 per million, with a reproduction of .94. With this per capita contagious and reproduction, it will take until the ninth of December to get below 320 contagious per million. The lowest reproduction during the April lockdown was .86.
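The restriction tiers quoted above translate naturally into a small lookup. The helper below is hypothetical (the function name and the exact boundary handling are choices made for illustration); the thresholds themselves are the ones listed in the talk, adapted from the Harvard Global Health Institute.

```r
# Hypothetical helper mapping contagious-per-million to the restriction tiers
# listed in the talk (adapted from the Harvard Global Health Institute).
restriction_level <- function(contagious_per_million) {
  cut(contagious_per_million,
      breaks = c(-Inf, 12, 125, 320, Inf),
      labels = c("test and trace",
                 "rigorous test and trace",
                 "rigorous test and trace + some stay-at-home",
                 "stay-at-home restrictions"))
}

restriction_level(c(10, 200, 1290))   # 1,290 per million was the US figure on 21 August
```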
Shamgar McDowell, Senior Analytics and Reliability Engineer, GE Gas Power Engineering   Faced with the business need to reduce project cycle time and to standardize the process and outputs, the GE Gas Turbine Reliability Team turned to JMP for a solution. Using the JMP Scripting Language and JMP’s built-in Reliability and Survival platform, GE and a trusted third party created a tool to ingest previous model information and new empirical data which allows the user to interactively create updated reliability models and generate reports using standardized formats. The tool takes a task that would have previously taken days or weeks of manual data manipulation (in addition to tedious copying and pasting of images into PowerPoint) and allows a user to perform it in minutes. In addition to the time savings, the tool enables new team members to learn the modeling process faster and to focus less on data manipulation. The GE Gas Turbine Reliability Team continues to update and expand the capabilities of the tool based on business needs.       Auto-generated transcript...   Speaker Transcript Shamgar McDowell Maya Angelou famously said, "Do the best you can, until you know better. Then when you know better, do better." Good morning, good afternoon, good evening. I hope you're enjoying the JMP Discovery Summit, you're learning some better way ways of doing the things you need to do. I'm Shamgar McDowell, senior reliability and analytics engineer at GE Gas Power. I've been at GE for 15 years and have worked in sourcing, quality, manufacturing and engineering. Today I'm going to share a bit about our team's journey to automating reliability modeling using JMP. Perhaps your organization faces a similar challenge to the one I'm about to describe. As I walk you through how we approach this challenge, I hope our time together will provide you with some things to reflect upon as you look to improve the workflows in your own business context. So by way of background, I want to spend the next couple of slides, explain a little bit about GE Gas Power business. First off, our products. We make high tech, very large engines that have a variety of applications, but primarily they're used in the production of electricity. And from a technology standpoint, these machines are actually incredible feats of engineering with firing temperatures well above the melting point of the alloys used in the hot section. A single gas turbine can generate enough electricity to reliably power hundreds of thousands of homes. And just to give an idea of the size of these machines, this picture on the right you can see there's four adult human beings, which just kind of point to how big these machines really are. So I had to throw in a few gratuitous JMP graph building examples here. But the bubble plot and the tree map really underscore the global nature of our customer base. We are providing cleaner, accessible energy that people depend upon the world over, and that includes developing nations that historically might not have had access to power and the many life-changing effects that go with it. So as I've come to appreciate the impact that our work has on everyday lives of so many people worldwide, it's been both humbling and helpful in providing a purpose for what I do and the rest of our team does each day. So I'm part of the reliability analytics and data engineering team. 
Our team is responsible for providing our business with empirical risk and reliability models that are used in a number of different ways by internal teams. So in that context, we count on the analyst in our team to be able to focus on engineering tasks, such as understanding the physics that affect our components' quality and applicability of the data we use, and also trade offs in the modeling approaches and what's the best way to extract value from our data. These are, these are all value added tasks. Our process also entails that we go through a rigorous review with the chief engineers. So having a PowerPoint pitch containing the models is part of that process. And previously creating this presentation entailed significant copying and pasting and a variety of tools, and this was both time consuming and more prone to errors. So that's not value added. So we needed a solution that would provide our engineers greater time to focus on the value added tasks. It would also further standardize the process because those two things greater productivity and ability to focus on what matters, and further standardization. And so to that end, we use the mantra Automate the Boring Stuff. So I wanted to give you a feel for the scale of the data sets we used. Often the volume of the data that you're dealing with can dictate the direction you go in terms of solutions. And in our case, there's some variation but just as a general rule, we're dealing with thousands of gas turbines in the field, hundreds of track components in each unit, and then there's tens of inspections or reconditioning per component. So in in all, there's millions of records that we're dealing with. But typically, our models are targeted at specific configurations and thus, they're built on more limited data sets with 10,000 or fewer records, tens of thousands or fewer records. The other thing I was going to point out here is we often have over 100 columns in our data set. So there are challenges with this data size that made JMP a much better fit than something like an Excel based approach to doing this the same tasks. So, the first version of this tool, GE worked with a third party to develop using JMP scripting language. And the name of the tool is computer aided reliability modeling application or CARMA, with a c. And the amount of effort involved with building this out to what we have today is not trivial. This is a representation of that. You can see the number of scripts and code lines that testified to the scope and size of the tool as it's come to today. But it's also been proven to be a very useful tool for us. So as its time has gone on, we've seen the need to continue to develop and improve CARMA over time. And so in order to do this, we've had to grow and foster some in-house expertise in JSL coding and I oversee the work of developers that focus on this and some related tools. Message on this to you is that even after you create something like CARMA, there's going to be an ongoing investment required to maintain and keep the app relevant and evolve it as your business needs evolve. But it's both doable and the benefits are very real. A survey of our users this summer actually pointed to a net promoter score of 100% and at least 25% reduction in the cycle time to do a model update. So that's real time that's being saved. And then anecdotally, we also see where CARMA has surfaced issues in our process that we've been able to address that otherwise might have remained hidden and unable to address. 
And I have a quote, it's kind of long. But I wanted to just pass on this caveat on automation from Bill Gates, who knows a thing or two about software development. "The first rule of any technology used in business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency." So that's the end of the quote, but this is just a great reminder that automation is not a silver bullet that will fix a broken process; we still need people to do that today. Okay, so before we do a demonstration of the tool, I just wanted to give a high-level overview of the tool and the inputs and outputs in CARMA. The user has to point the tool to the input files. So over here on the left, you see we have an active models file, that's essentially the already approved models, and then we have empirical data. And then in the user interface, the user does some modeling activities. And then the outputs are running models, so updates to the active models, and a PowerPoint presentation. And we'll also look at that. As background for the data I'll be using in the demo, I just wanted to pass on that I started with the locomotive data set that JMP provides in its sample data. So that's the case there, and that gives one population. Then I also added in two additional populations of models. And the big message here I wanted to pass on is that what we're going to see is all made-up data. It's not real; it doesn't represent the functionality or the behavior of any of our parts in the field; it's just all contrived. So keep that in mind as we go through the results, but it should give us a way to look at the tool, nonetheless. So I'm going to switch over to JMP for a second and I'm using JMP 15.2 for the demo. And this data set is simplified compared to what we normally see. But like I said, it should exercise the core functionality in CARMA. So first, I'm just going to go to the Help menu, Sample Data, and you'll see the Reliability and Survival menu here. So that's where we're going. One of the nice things about JMP is that it has a lot of different disciplines and functionality and specialized tools for them. And so for my case with reliability, there's a lot here, which also lends to the value of using JMP as a home for CARMA. But I wanted to point you to the locomotive data set and just show you... this originally came out of a textbook, Applied Life Data Analysis. In that, there's a problem that asks you what the risk is at 80,000 exposures, and we're going to model that today in our data set, in what we've called an oxidation model; essentially CARMA will give us an answer. Again, a really simple answer, but I was just going to show you that you can get the same result by clicking in the analysis menu. So we go down to Analyze, Reliability and Survival, Life Distribution. Put the time and censor columns where they need to go. We're going to use Weibull, and just the two...so it creates a fit for that data. Two parameters I was going to point out: the beta, 2.3, and then what's called a Weibull alpha here. In our tool, it'll be called eta, but it's 183. Okay, so we see how to do that here. Now just to jump over, I want to look at a couple of the other files, the input files, so I will pull those up. Okay, this is the model file. I mentioned I made three models. And so these are the active models that we're going to be comparing the data against.
You'll see that oxidation is the first one, as I mentioned, and then also, in addition to having model parameters, it has some configuration information. These are just two simple things here (combustion system, fuel capability) I use for examples, but there are many, many more columns like them. But essentially what CARMA does, one of the things I like about it, is that when you have a large data set with a lot of different varied configurations, it can go through and find which of those rows of records applies to your model and do the sorting in real time, and do that for all the models that you need to build from the data set. And so that's what we're going to use it to demonstrate. Excuse me. Also, let's just jump over to the empirical data for a minute. And just to highlight, we have a censor column, we have exposures, we have the interval that we're going to evaluate those exposures at, modes, and then these are the last two columns I just talked about, combustion system and fuel capability. Okay, so let's start up CARMA as an add-in; I'll just get it going. And you'll see I already have it pointing to the location I want to use. In today's presentation, I'm not gonna have time to talk through the whole variety of features that are in here. But these are all things that can help you take a look at your data and decide the best way to model it, and to do some checks on it before you finalize your models. For the purposes of time, I'm not going to explain all that and demonstrate it, but I just wanted to take a minute to build the three models we talked about and create a presentation so you can see that portion of the functionality. Excuse me, my throat is getting dry all of a sudden so I have to keep drinking; I apologize for that. So we've got oxidation. We see the number of failures and suspensions. That's the same as what you'll see in the text. Add that. And let's just scroll down for a second. That's the first model added, oxidation. We see the old model had 30 failures, 50 suspensions. This one has 37 and 59. The beta is 2.33, like we saw externally, and the eta is 183. And the answer to the textbook question, the risk at 80,000 exposures, is about 13.5% using a Weibull model. So that's just kind of a high-level way to do that here. Let's look at also just adding the other two models. Okay, we've got cracking, and I'm adding in creep. And you'll see in here there are different boxes presented that represent things like the combustion system or the fuel capability, where for this given model, this is what the LDM file calls for. But if I wanted to change that, I could select other configurations here, and that would result in changing my rows of failures and suspensions as far as what gets included or doesn't. And then I can create new populations and segment it accordingly. Okay, so we've gotten all three models added and I think, you know, we're not going to spend more time on that, just playing with the models as far as options, but I'm gonna generate a report. And I have some options on what I want to include in the report. And I have a presentation, and this LDM file is going to be the active models, sorry, the running models that come out as a table. All right, so I just need to select the appropriate folder where I want my presentation to go. And now it's going to take a minute here to go through and generate this report. This does take a minute.
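The Weibull numbers quoted in the demo can be sanity-checked in a few lines of R, assuming the JMP Locomotive sample data are exported with a time and a censor column (the file name and column names below are assumptions, not the actual CARMA inputs). The survreg parameterization is converted to the usual shape/scale (beta/eta) form, and the risk at 80 (thousand exposures) is just the Weibull CDF at that point.

```r
library(survival)

# Assumed export of the JMP Locomotive sample data: 'exposure' in thousands
# of exposures and 'failed' = 1 for failures, 0 for suspensions.
# dat  <- read.csv("locomotive.csv")
# fit  <- survreg(Surv(exposure, failed) ~ 1, data = dat, dist = "weibull")
# beta <- 1 / fit$scale           # Weibull shape (beta)
# eta  <- exp(coef(fit)[1])       # Weibull characteristic life (eta)

# With the estimates quoted in the demo (beta ~ 2.33, eta ~ 183), the risk at
# 80 thousand exposures is the Weibull CDF evaluated at 80:
pweibull(80, shape = 2.33, scale = 183)   # ~0.135, i.e. about 13.5%
```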
But I think what I would just contrast it to is the hours that it would take normally to do this same task, potentially, if you were working outside of the tool. And so now we're ready to finalize the report. Save it. And save the folder and now it's done. It's, it's in there and we can review it. The other thing I'll point out, as I pull up, I'd already generated this previously, so I'll just pull up the file that I already generated and we can look through it. But there's, it's this is a template. It's meant for speed, but this can be further customized after you make it, or you can leave placeholders, you can modify the slides after you've generated them. It's doing more than just the life distribution modeling that I kind of highlighted initially. It's doing a lot of summary work, summarizing the data included in each model, which, of course, JMP is very good for. It, it does some work comparing the models, so you can do a variety of statistical tests. Use JMP. And again, JMP is great at that. So that, that adds that functionality. Some of the things our reviewers like to see and how the models have changed year over year, you have more data, include less. How does it affect the parameters? How does it change your risk numbers? Plots of course you get a lot of data out of scatter plots and things of that nature. There's a summary that includes some of the configuration information we talked about, as well as the final parameters. And it does this for each of the three models, as well as just a risk roll up at the end for for all these combined. So that was a quick walkthrough. The demo. I think we we've covered everything I wanted to do. Hopefully we'll get to talk a little more in Q&A if you have more questions. It's hard to anticipate everything. But I just wanted to talk to some of the benefits again. I've mentioned this previously, but we've seen productivity increases as a result of CARMA, so that's a benefit. Of course standardization our modeling process is increased and that also allows team members who are newer to focus more on the process and learning it versus working with tools, which, in the end, helps them come up to speed faster. And then there's also increased employee engagement by allowing engineers to use their minds where they can make the biggest impact. So I also wanted to be sure to thank Melissa Seely, Brad Foulkes, Preston Kemp and Waldemar Zero for their contributions to this presentation. I owe them a debt of gratitude for all they've done in supporting it. And I want to thank you for your time. I've enjoyed sharing our journey towards improvement with you all today. I hope we have a chance to connect in the Q&A time, but either way, enjoy the rest of the summit.  
Sam Edgemon, Analyst, SAS Institute Tony Cooper, Principal Analytical Consultant, SAS   The Department of Homeland Security asked the question, “how can we detect acts of biological terrorism?” After discussion and consideration, our answer was “If we can effectively detect an outbreak of a naturally occurring event such as influenza, then we can find an attack in which anthrax was used because both present with similar symptoms.” The tools that were developed became much more relevant to the detection of naturally occurring outbreaks, and JMP was used as the primary communication tool for almost five years of interactions with all levels of the U.S. Government. In this presentation, we will demonstrate how those tools developed then could have been used to defer the affects of the Coronavirus COVID-19. The data that will be used for demonstration will be from Emergency Management Systems, Emergency Departments and the Poison Centers of America.     Auto-generated transcript...   Speaker Transcript Sam Edgemon Hello. This is Sam Edgemon. I worked for the SAS Institute, you know, work for the SAS Institute, because I get to work on so many different projects.   And we're going to tell you about one of those projects that we worked on today. Almost on all these projects I work on I work with Tony Cooper, who's on the screen. We've worked together really since since we met at University of Tennessee a few years ago.   And the things we learned at the University of Tennessee we've we've applied throughout this project. Now this project was was done for the Department of Homeland Security.   The Department of Homeland Security was very concerned about biological terrorism and they came to SAS with the question of how will we detect acts of biological terrorism.   Well you know that's that's quite a discussion to have, you know, if you think about   the things we might come back with. You know, one of those things was well what do you, what are you most concerned with what does, what do the things look like   that you're concerned with? And they they talked about things like anthrax, and ricin and a number of other very dangerous elements that terrorists could use to hurt the American population.   Well, we took the question and and their, their immediate concerns and researched as best we could concerning anthrax and ricin, in particular.   You know, our research involved, you know, involved going to websites and studying what the CDC said were symptoms of anthrax, and the symptoms of   ricin and and how those, those things might present in a patient that walks into the emergency room or or or or takes a ride on an ambulance or calls a poison center or something like that happens. So what we realized in going through this process was   was that the symptoms look a lot like influenza if you've been exposed to anthrax. And if you've been exposed to ricin, that looks a lot like any type of gastrointestinal issue that you might might experience. So we concluded and what our response was to Homeland Security was that   was that if we can detect an outbreak of influenza or an outbreak of the, let's say the norovirus or some gastrointestinal issue,   then we think we can we can detect when when some of these these bad elements have been used out in the public. And so that's the path we took. 
So we we took data from EMS and and   emergency rooms, emergency departments and poison centers and we've actually used Google search engine data as well or social media data as well   to detect things that are you know before were thought as undetectable in a sense. But but we developed several, several tools along the way. And you can see from the slide I've got here some of the results of the questions   that that we that we put together, you know, these different methods that we've talked about over here. I'll touch on some of those methods in the brief time we've got to talk today, but let's let's dive into it. What I want to do is just show you the types of conversations we had   using JMP. We use JMP throughout this project to to communicate our ideas and communicate our concerns, communicate what we were seeing. An example of that communication could start just like this, we, we had taken data from from the EMS   system, medical system primarily based in North Carolina. You know, SAS is based in North Carolina, JMP is based in North Carolina in Cary and   and some of them, some of the best data medical data in the country is housed in North Carolina. The University of North Carolina's got a lot to do that.   In fact, we formed a collaboration between SAS and the University of North Carolina and North Carolina State University to work on this project for Homeland Security that went on for almost five years.   But what what I showed them initially was you know what data we could pull out of those databases that might tell us interesting things.   So let's just walk, walk through some of those types of situations. One of the things I initially wanted to talk about was, okay let's let's look at cases. you know,   can we see information in cases that occur every, every day? So you know this this was one of the first graphs I demonstrated. You know, it's hard to see anything in this   and I don't think you really can see anything in this. This is the, you know, how many cases   in the state of North Carolina, on any given day average averages, you know, 2,782 cases a day and and, you know, that's a lot of information to sort through.   So we can look at diagnosis codes, but some of the guys didn't like the idea that this this not as clear as we want want it to be so so we we had to find ways to get into that data and study   and study what what what ways we could surface information. One of those ways we felt like was to identify symptoms, specific symptoms related to something that we're interested in,   which goes back to this idea that, okay we've identified what anthrax looks like when someone walks in to the emergency room or takes a ride on an ambulance or what have you.   So we have those...if we identify those specific symptoms, then we can we can go and search for that in the data.   Now a way that we could do that, we could ask professionals. There was there's rooms full of of medical professionals on this, on this project and and lots of physicians. And kind of an odd thing that   I observed very quickly was when you asked a roomful of really, really smart people question like, what what is...what symptoms should I look for when I'm looking for influenza or the norovirus, you get lots and lots of different answers.   So I thought, well, I would really like to have a way to to get to this information, mathematically, rather than just use opinion. And what I did was I organized the data that I was working with   to consider symptoms on specific days and and the diagnosis. 
I was going to use those diagnosis diagnosis codes.   And what I ended up coming out with, and I set this up where I could run it over and over, was a set of mathematically valid symptoms   that we could go into data and look and look for specific things like influenza, like the norovirus or like anthrax or like ricin or like the symptoms of COVID 19.   This project surfaced again with with many asks about what we might...how we might go about finding the issues   of COVID 19 in this. This is exactly what I started showing again, these types of things. How can we identify the symptoms? Well, this is a way to do that.   Now, once we find these symptoms, one of the things that we do is we will write code that might look something similar to this code that will will look into a particular field in one of those databases and look for things that we found in those analyses that we've   that we've just demonstrated for you. So here we will look into the chief complaint field in one of those databases to look for specific words   that we might be interested in doing. Now that the complete programs would also look for terms that someone said, Well, someone does not have a fever or someone does not have nausea. So we'd have to identify   essentially the negatives, as well as the the pure quote unquote symptoms in the words. So once we did that, we could come back to   JMP and and think about, well, let's, let's look at, let's look at this information again. We've got we've got this this number of cases up here, but what if we took a look at it   where we've identified specific symptoms now   and see what that would look like.   So what I'm actually looking for is any information regarding   gastrointestinal issues. I could have been looking for the flu or anything like that, but this is this is what the data looks like. It's the same data. It's just essentially been sculpted to look like you know something I'm interested in. So in this case, there was an outbreak   of the norovirus that we told people about that they didn't know about that, you know, we started talking about this on January 15.   And and you know the world didn't know that there was a essentially an outbreak of the norovirus until we started talking about it here.   And that was, that was seen as kind of a big deal. You know, we'd taken data, we'd cleaned that data up and left the things that we're really interested in   But we kept going. You know that the strength of what we were doing was not simply just counting cases or counting diagnosis codes, we're looking at symptoms that that describe the person's visit to   the emergency room or what they called about the poison center for or they or they took a ride on the ambulance for.   chief complaint field, symptoms fields,   and free text fields. We looked into the into the fields that described the words that an EMS tech might use on the scene. We looked in fields that describe   the words that a nurse might use whenever someone first comes into the emergency room, and we looked at the words that a physician may may use. Maybe not what they clicked on the in in the boxes, but the actual words they used. And we we developed a metric around that as well.   This metric   was, you know, it let us know   you know, another month in advance that something was was odd in a particular area in North Carolina on a particular date. So I mentioned this was January 15 and this, this was December 6   and it was in the same area. 
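The chief-complaint keyword search described above, including the need to screen out negated mentions ("no fever", "denies nausea"), can be sketched in a few lines of R. The symptom list, negation terms, and example text below are illustrative; they are not the project's actual code or data.

```r
# Illustrative sketch: flag symptom mentions in a free-text field while
# screening out simple negations.
chief_complaint <- c("fever and cough for 3 days",
                     "denies fever, c/o abdominal pain",
                     "nausea and vomiting since last night",
                     "no nausea, no vomiting, ankle injury")

symptoms  <- c("fever", "nausea", "vomiting", "cough", "diarrhea")
negations <- c("no", "denies", "without")

flag_symptom <- function(text, symptom) {
  text      <- tolower(text)
  mentioned <- grepl(symptom, text, fixed = TRUE)
  # Treat the mention as negated if a negation word appears shortly before it.
  negated   <- grepl(paste0("\\b(", paste(negations, collapse = "|"),
                            ")\\b[a-z ,]{0,20}", symptom), text)
  mentioned & !negated
}

sapply(symptoms, function(s) flag_symptom(chief_complaint, s))
```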
And what is really registering is is the how much people are talking about a specific thing and if one person is talking about it,   it's not weighted very heavily, therefore, it wouldn't be a big deal. If two people are talking about it, if a nurse   and an EMS tech are talking about a specific set of symptoms, or mentioning a symptom several times, then, then we're measuring that and we're developing a metric from that information.   So if three people, you know, the, the doctor, the nurse and the EMS tech if that's what information we have is, if they're all talking about it,   then it's probably a pretty big deal. So that's what's happened here on December 6, a lot of people are talking about symptoms that would describe something like the norovirus.   This, this was related to an outbreak that the media started talking about in the middle of February. So, so this is seen as...as us telling the world about something that the media started talking about, you know, in a month later.   And   specific specifically you know, we were drawn to this Cape Fear region because a lot of the cases were we're in that area of North Carolina around Wilson,   Wilson County and that sort of thing. So, so that that was seen as something of interest that we could we could kind of drill in that far in advance of, you know, talk about something going on. Now   we carried on with that type of work concerning um, you know, using those tools for bio surveillance.   But what what we did later was, you know, after we set up systems that would that would, you know, was essentially running   every day, you know every hour, every day, that sort of thing. And then so whenever we would be able to say, well,   the system has predicted an outbreak, you know if this was noticed. The information was providing...was was really noise free in a sense. We we look back over time and we was   predicting let's say, between 20 and 30 alerts a year,   total alerts a year. So there was 20 or 30 situations where we had just given people, the, the, the notice that they might should look into something, you know, look, check something out. There might be you know a situation occurring. But in one of these instances,   the fellow that we worked with so much at Homeland Security came to us and said, okay, we believe your alert, so tell us something more about it. Tell us what   what it's made up of. That's that's that's how he put the question. So, so what we we did   was was develop a model, just right in front of him.   And the reason we were able to do that (and here's, here's the results of that model), the reason we were able to do that was by now, we realized the value of   of keeping data concerning symptoms relative to time and place and and all the different all the different pieces of data we could keep in relation to that, like age, like ethnicity.   So when we were asked, What's it made up of, then then we could... Let's put this right in the middle of the screen, close some of the other information around us here so you can just focus on that.   So when we're asked, okay, what's this outbreak made up of, you know, we, we built a model in front of them (Tony actually did that)   and that that seemed to have quite an impact when he did this, to say, Okay, you're right. Now we've told you today there there's there's an alert.   And you should pay attention to influenza cases in this particular area because it appears to be abnormal. 
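The weighting idea described at the start of this passage (a symptom counts for more when several roles, such as the EMS tech, the nurse, and the physician, are all talking about it) can be illustrated with a toy score. The scheme below is invented purely for illustration and is not the project's actual TAP metric.

```r
# Toy illustration only: score a case more heavily when several roles mention
# the same symptom. The squared weighting is an invented choice, not the
# project's actual metric.
mentions <- data.frame(
  case    = c(1, 1, 1, 2, 3, 3),
  role    = c("ems", "nurse", "physician", "nurse", "ems", "physician"),
  symptom = c("vomiting", "vomiting", "vomiting", "fever", "diarrhea", "diarrhea")
)

# Count distinct roles mentioning each symptom per case, then weight
# superlinearly so agreement across roles dominates single mentions.
agg <- aggregate(role ~ case + symptom, data = mentions,
                 FUN = function(r) length(unique(r)))
agg$score <- agg$role^2      # 1 role -> 1, 2 roles -> 4, 3 roles -> 9
agg
```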
But we could also tell them now that, okay   these cases are primarily made up of young people, people under the age of 16.   The symptoms, they're talking about when they go into emergency room or get on an ambulance is fever, coughing, respiratory issues. There's pain.   and there's gastrointestinal issues. The, the key piece of information we feel like is is the the interactions between age groups and the symptoms themselves.   While this one may, you know, it may not be seen as important is because it's down the list, we think it is,   and even these on down here. We talked about young people and dyspnea, and young people and gastro issues, and then older people.   So there was, you know, starting to see older people come into the data here as well. So we could talk about younger people, older people and and people in their   20s, 30s, 40s and 50s are not showing up in this outbreak at this time. So there's a couple of things here. When we could give people you know intel on the day of   of an alert happening and we could give them a symptom set to look for. You know when COVID 19 was was well into our country, you know you you still seem to turn on the news everyday and hear of a different symptom.   This is how we can deal with those types of things. You know, we can understand   you know, what what symptoms are surfacing such that people may may actually have, you know, have information to recognize when a problem is actually going to occur and exist.   So, so this is some of the things that you know we're talking about here, you'll think about how we can apply it now.   Using the the systems of alerting that I showed you earlier that, you know, I generally refer to as the TAP method as just using text analytics and proportional charting.   Well, you know, that's we're probably beyond that now, it's it's on us. So we didn't have the tool in place to to go looking then.   But these types of tools may still help us to be able to say, you know, this is these are the symptoms we're looking for. These are the   these are the age groups were interested in learning about as well. So, so let's let's keep walking through some ways that we could use what we learned back on that project to to help the situation with COVID 19.   One of the things that we did of course we've we've talked about building this this the symptoms database. The symptoms database is giving us information on a daily basis about symptoms that arise.   And and you know who's, who's sick and where they're sick at. So here's an extract from that database that we talked about, where it it has information on a date,   it has information about gender, ethnicity, in regions of North Carolina. We could you take this down to towns and and the zip codes or whatever was useful.   This I mentioned TAP in that text analytics information, well now we've got TAP information on symptoms. You know, so if people are talking about   this, say for example, nausea, then we we know how many people are talking about nausea on a day, and eventually in a place. And so this is just an extract of symptoms from   from this   this database. So, so let's take a look at how we could use this this. Let's say you wanted to come to me, an ER doctor, or some someone investigating COVID 19 might come to me and say,   well, where are people getting sick at. You know, that's where are people getting sick   now, or where might an outbreak be occurring in a particular area. Well, this is the type of thing we might do to demonstrate that.   
I use Principal Components Analysis a lot. Because we've got this data set up, I can use this tool to identify the things I'm interested in analyzing, in this case the regions. The question was where, and what. What are you interested in knowing about? I hear people talk about respiratory issues concerning COVID, and I hear people talking about having a fever, and these are elevated symptoms: issues that people are talking about even more than they're writing things down. That's the idea of TAP; we're getting into those text fields and understanding interesting things. Once we run this analysis, JMP creates this wonderful graph for us. It's great for communicating what's going on, and what's going on in this case is that Charlotte, North Carolina, is really maybe inundated with physicians and nurses and EMS techs talking about their patients having a fever and respiratory issues. If you want to get as far away from that as you can, you might spend time in Greensboro or Asheville, and if you're in Raleigh-Durham, you might want to be aware of what's on the way. So this is a way we can use this type of information for, essentially, intelligence: intelligence into what might be happening next in specific areas. We could also talk about severity in the same way; we could measure the severity of cases and where they are. So the keys here are getting the symptoms database organized and utilized. We use JMP to communicate these ideas. A graph like this may have been shown to Homeland Security, and we could easily talk about it for two hours, not just about validity (where did the data come from, and so forth) but also about the information they need to know: information that helps you understand where people are getting sick, such that warnings can be given and, essentially, lives saved. So that, in a sense, is the system we put together. The underlying key is the data. Again, the data we've used is EMS, ED, and poison center data. I don't have an example of the poison center data here, but I have a long talk about how we use poison center data to surface foodborne illness in ways similar to what we've shown here. And then there's the ability to be fairly dynamic in developing our story in front of people and talking to them, selling belief in what we do. JMP helps us do that; SAS code helps us do that. That's a good combination of tools, and that's all I have for this particular topic. I appreciate your attention, hope you find it useful, and hope we can help you with this type of work. Thank you.
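As a rough illustration of the kind of view described in this talk, here is a minimal JSL sketch (not the presenters' script). It assumes a hypothetical extract of the symptoms database with one row per region and one column per TAP symptom metric, and simply launches the Principal Components platform on the symptom columns.

```jsl
// Hypothetical sketch only; the table and column names are illustrative,
// not the presenters' symptoms database.
dt = Open( "symptoms_extract.jmp" );   // one row per region, TAP counts per symptom
dt << Principal Components(
    Y( :Fever, :Respiratory, :Cough, :Nausea, :Dyspnea )   // symptom "talk" metrics
);
// The loading and score plots then show which regions sit closest to which
// symptom directions, the kind of picture described for Charlotte above.
```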
Monday, October 12, 2020
Charles Whitman, Reliability Engineer, Qorvo   Simulated step stress data, where both temperature and power are varied, are analyzed in JMP and R. The simulation mimics actual life test methods used in stressing SAW and BAW filters. In an actual life test, the power delivered to a filter is stepped up over time until failure (or censoring) occurs at a fixed ambient temperature. The failure times are fitted to a combined Arrhenius/power law model similar to Black's equation. Although stepping power simultaneously increases the device temperature, the algorithm in R is able to separate these two effects. JMP is used to generate random lognormal failure times for different step stress patterns. R is called from within JMP to perform maximum likelihood estimation and find bootstrap confidence intervals on the model estimates. JMP is used live to plot the step patterns and demonstrate good agreement between the estimates and confidence bounds and the known true values. A safe operating area (SOA) is generated from the parameter estimates. The presentation will be given using a JMP journal.   The following are excerpts from the presentation.   Auto-generated transcript...   Speaker Transcript CWhitman All right. Well, thank you very much for attending my talk. My name is Charlie Whitman. I'm at Qorvo, and today I'm going to talk about step stress modeling in JMP using R. First, let me start off with an introduction. I'm going to talk a little bit about stress testing, what it is and why we do it. There are two basic kinds, constant stress and step stress, and I'll talk a little bit about each. What we get out of a step stress or constant stress test are estimates of the model parameters, and that's what we need to make predictions. In stress testing, we're stressing parts at very high stress and then taking that data and extrapolating to use conditions, and we need model parameters to do that. But model parameters are only half the story. We also have to acknowledge that there's some uncertainty in those estimates, and we're going to do that with confidence bounds; I'll talk about the bootstrapping method I used to get them. At the end of the day, armed with our maximum likelihood estimates and our bootstrap confidence bounds, we can create something called the safe operating area, or SOA, which is something of a reliability map (you can also think of it as a response surface). We'll find regions where it's safe to operate your part and regions where it's not safe. And then I'll reach some conclusions. So what is a stress test? In a stress test you stress parts until failure. Sometimes you don't get failure; sometimes you have to stop the test and do something else. In that case you have a censored data point, but the method of maximum likelihood, which I used in the simulations, takes censoring into account, so you don't have to have 100% failure; we can afford to have some parts not fail. What you do is stress these parts under various conditions, according to some designed experiment or matrix. The stress might be temperature or power or voltage; you run your parts at various stresses, fit that data to your model, and then extrapolate to use conditions. The simplest case is the Arrhenius model for the lognormal log-mean: mu = ln(A) + Ea/(kT).
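Written out (this is just a restatement of the spoken equation, with $k$ Boltzmann's constant and $E_a$ the activation energy):

$$ \mu = \ln A + \frac{E_a}{kT} $$

Here $\mu$ is the log-mean of the lognormal failure-time distribution, so $\mu$ falls as the temperature $T$ rises and hotter parts fail sooner.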
Mu is the log mean of your distribution; we commonly use the lognormal distribution. It's a constant term plus a temperature term, and you can see that mu is inversely related to temperature: as temperature goes up, mu goes down, and as temperature goes down, mu goes up. If we use the lognormal, we also have an additional parameter, the shape factor sigma. So after we run our test, several parts under various stress conditions, we fit the results to our model, and it's when you combine those two pieces that you can predict behavior at use conditions, which is really the name of the game. The most common method is a constant stress test, where the stress is fixed for the duration of the test. This is just an example of that: a plot of temperature versus time. At a very low temperature, the failure times can be very long; they're random, following some distribution like the lognormal. If we increase the temperature to some higher level, we again get a distribution of failure times, but on average the failure times are shorter. And if we increase the temperature even more, the failure times are shorter still. So if I run a bunch of parts at these different temperatures, I can fit the results to a probability plot like this one: probability versus time to failure. At the highest temperature, 330 degrees C in this example, I have my set of failure times, which I fit to a lognormal, and as I decrease the temperature the failure times get longer and longer. Then I take all this data over temperature, fit it to the Arrhenius model, extrapolate, and I get my predictions at use conditions. This is what we are after. I want to point out that in accelerated testing we have to run at very high stress because, for example, even though this test lasts about 1,000 hours, our prediction is that the part at use conditions would last a billion hours, and there's no way we could run a test for a billion hours. We have to get tests done in a reasonable amount of time, and that's why we do accelerated testing. So then, what is a step stress? As you might imagine, a step stress is where you increase the stress in steps, or as some sort of ramp. The advantage is that it's a real time saver. As I showed in the previous plot, a constant stress test could last a very long time, say 1,000 hours, which could be weeks or months. A step stress test can be much shorter; you might be able to get it done in hours or days. But the analysis is more difficult, and I'll show that in a minute. In the work we've done at Qorvo, we're doing reliability of acoustic filters, which are RF devices, so the stress is RF power, and we step up power until failure. To model this, we use the same thing as the Arrhenius equation but add another term, n log P. N is our power acceleration parameter; P is the power. For the lognormal distribution there is a fourth parameter, sigma, the shape factor. So you have four parameters in all. Let me give you a quick example of what this would look like.
Here is power versus time, with power in dBm. You start at some power, like 33.5 dBm, and you step and step and step until, hopefully, you get failure. I want to point out that as you vary the power, you're also changing the device temperature: as power is ramped, so is temperature, so power and temperature are confounded. You have to design your experiment in such a way that you can separate the effects of temperature and power. And note that there are two terms, temperature and power; it's not just that increasing the power makes the part hotter and the temperature drives failure. Power in and of itself also increases the failure rate. Now, in a little more detail, here again is power versus time. I run a part for, say, five hours at some power, then increase the stress and run another five hours, and keep increasing the stress until I get a failure. As I mentioned, as the power increases, so does the temperature, and I have to take that into account. I have to know the device temperature: T = T_ambient + R_th * P, where T_ambient is the ambient temperature, P is the power, and R_th is a constant called the thermal impedance. That means when I set the power, I know the power, and I can also estimate the temperature at each step. What we'd like to do is take the failure times we get from our step stress pattern and extrapolate them to use conditions. If I ran only for the time delta t shown here and wanted to extrapolate that to use conditions, I would get the equivalent amount of time by multiplying delta t by the acceleration factor. Here's the acceleration factor: it has an activation energy and temperature term, and a power term. Since I'm going from high stress down to low stress, AF is larger than one (for purposes of illustration it isn't drawn much bigger than one, but you get the idea). And as I increase the power, temperature and power both change, so the AF changes with each step. If I want the equivalent time at use conditions, I have to do a sum: each segment has its own acceleration factor and its own delta t, and the sum gives me the equivalent time. So this is the expression I would use to predict equivalent time; if I knew exactly what Ea was and exactly what n was, I could predict the equivalent time. That's the idea. Now, as I said, temperature and power are confounded, so in order to estimate them separately we have to run at two different ambient temperatures. If the ambient temperatures are separated enough, you can separate the effects of power and temperature. You also need at least two ramp rates, so at a minimum you would need a two-by-two matrix of ramp rate and ambient temperature. In the simulations I did, I chose three different ramp rates, as shown here: power in dBm versus stress time, with a fast, a medium, and a slow ramp. In practice you would let each ramp go on until failure, but here I've arbitrarily cut them off after a few hours.
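Collecting the pieces just described into symbols (a standard parameterization consistent with the talk's description; the exact form used in the presentation may differ in detail, and the sign of the power term is chosen so that higher power shortens life, in the spirit of Black's equation):

$$ \mu = \ln A + \frac{E_a}{kT} - n\,\ln P, \qquad T = T_{\mathrm{amb}} + R_{th}\,P $$

$$ AF_i = \exp\!\left[\frac{E_a}{k}\left(\frac{1}{T_{\mathrm{use}}} - \frac{1}{T_i}\right)\right]\left(\frac{P_i}{P_{\mathrm{use}}}\right)^{n}, \qquad t_{\mathrm{use}} = \sum_i AF_i\,\Delta t_i $$

Here $T_i$ and $P_i$ are the device temperature and power during step $i$ and $\Delta t_i$ is the time spent in that step; each step is converted to an equivalent time at use conditions and the contributions are summed.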
You can also see here that I have a ceiling. The ceiling is there because we have found that if we increase the stress or power arbitrarily, we can change the failure mechanism. You want to make sure that the failure mechanism under accelerated conditions is the same as it is under use conditions; if I change the failure mechanism, I can't do the extrapolation, and it wouldn't be valid. So we have the ceiling drawn here at 34.4 dBm, and we even give ourselves a little buffer to make sure we don't get close to it. Our ambient temperature is 45 degrees C and we're starting at a power of 33.5 dBm, and we also have another set of conditions at 135 degrees C. You can see the patterns are the same, with a ceiling and a buffer region and everything, except that we start at a lower power: here we're below 32 dBm, whereas before we were over 33. The reason we do that is that if we don't lower the power at the higher temperature, you'll get failures almost immediately if you're not careful, and then you can't use the data for your extrapolation. Alright, so what we need, again, is an expression for the equivalent time, as I showed before. Here's that expression. It's kind of nasty, and I would not know how to derive from first principles the distribution of the equivalent time at use conditions. When faced with something that difficult, what I chose to do was use the bootstrap. So what is bootstrapping? With bootstrapping, we resample the data set many times with replacement. That means that in a given bootstrap sample, an observation from the original data set can appear more than once, or maybe not appear at all. The approach I used is called nonparametric because we're not assuming a distribution; we don't have to know the underlying distribution of the data. When you generate many bootstrap samples, you get an approximate distribution of the parameter, and that allows you to do statistical inference. In particular, we're interested in putting confidence bounds on things. A simple example of bootstrapping is the percentile bootstrap. Suppose I wanted 90% confidence bounds on some estimate. I would form many, many bootstrap replicates, extract the parameter from each bootstrap sample, sort those values, and take the fifth and 95th percentiles of that vector; those would form my 90% confidence bounds. What I actually used in this work was an improvement on the percentile technique called BCa, for bias corrected and accelerated. Bias corrected because sometimes our estimates are biased, and this method takes that into account. Accelerated is unfortunately a confusing term here: it has nothing to do with accelerated testing; it's the part of the method that adjusts for the skewness of the distribution. Ultimately, what it does is pick different percentile values for you. For the percentile technique we had the fifth and 95th; the BCa bootstrap might give you something different, say the third percentile and the 96th, and those are the ones you would use for your 90% confidence bounds.
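To make the mechanics concrete, here is a minimal sketch of the JMP-calls-R workflow described in the talk, using JMP's R integration and the R boot package. It is not the author's script: the table, column name, and statistic are hypothetical placeholders (the real analysis bootstraps the maximum likelihood fit of the step stress model), and it assumes R and the boot package are installed.

```jsl
// Hypothetical sketch, not the presenter's code: BCa bootstrap bounds via R.
dt = Current Data Table();        // assumed to contain a column of (log) failure times
R Init();
R Send( dt );                     // the table arrives in R as a data frame named "dt"
R Submit( "\[
    library(boot)
    # Illustrative statistic only; the talk bootstraps the MLE of the model parameters.
    stat <- function(d, i) mean(d$LogFailTime[i])
    b    <- boot(dt, stat, R = 500)                         # 500 bootstrap replicates
    ci   <- boot.ci(b, conf = 0.90, type = "bca")$bca[4:5]  # lower/upper 90% BCa bounds
]\" );
ci = R Get( ci );                 // pull the two bounds back into JSL
R Term();
Show( ci );
```

The same pattern extends to any statistic: replace the illustrative mean with a function that refits the model to the resampled data and returns the parameter of interest.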
So I just want to run through a very quick example to make this clear. Suppose I have 10 observations and I draw four bootstrap samples from them, looking something like this. For example, the first observation, 24.93, occurs twice in the first sample and once in the second; 25.06 occurs twice; 25.89 does not occur at all. I can do this, in this case, 100 times, and for each bootstrap sample I compute the statistic of interest; say I'm interested in the distribution of the average. Here is my distribution of averages, and I can look to see what it looks like. It's pretty bell shaped, and I have a couple of points highlighted here that would be my 90% confidence bounds using the percentile technique: in the sorted vector, the fifth percentile is at 25.84 and the 95th percentile is at 27.68. If I used the BCa method instead, I might get somewhat different percentiles, in this case 25.78 and 27.73. So that, very quickly, is what the BCa method is. In our case, we bootstrap the stress patterns: multiple samples are simulated under the different stress patterns, and we bootstrap off those. So we get a distribution of our parameter estimates, logA, Ea, and sigma. So again, here's our equation. The version of JMP that I have does not do bootstrapping (JMP Pro does, but my version does not), but fortunately R does, and I can call R from within JMP. That's why I chose to do it this way: I can let R do all the hard work. So let me show an example. What I did was choose known true values for logA, Ea, and sigma randomly over some range, then choose the same values for these parameters a few times and generate samples each time. For example, I chose minus 28.7 three times for logA true and generated data from that. There were a total of five parts per test level and six test levels (remember, three ramps at two ambient temperatures), and six times five is 30, so 30 parts were run for each test. Looking at logA hat, the maximum likelihood estimates are around minus 28 or so, so that actually worked pretty well. Now for my next set I did three replicates at, for example, minus 5.7, and when I ran the method the maximum likelihood estimates came out right around minus 5.7 or so. So the method appears to be working pretty well, but let's look at this in a little more detail. Here I ran the simulation a total of 250 times, five times for each group: logA true and Ea true are repeated five times, and I get different estimates of logA hat, Ea hat, and so on each time. I'm also using the BCa method to form confidence bounds on each of these parameters, along with the median time to failure. Let's just plot this data to see how well it did. Here is logA hat versus logA true, and we see that the slope is right around 1 and the intercept is not significantly different from 0, so this is doing a pretty good job: if my logA true is at minus 15, I'm getting right around minus 15, plus or minus something, for my estimate. The same is true for the other parameters, Ea, n, and sigma, and even for the median time to failure at a particular T0/P0. So this is all behaving very well.
We also want to know how well the BCa method is working, and it turns out it worked pretty well. Here I have a distribution: every time I correctly bracketed the known true value I scored a 1, and if I missed it, a 0. For logA I'm correctly bracketing the known true value 91% of the time; I was aiming for 95%, so I'm pretty close, in the low 90s, and I'm getting about the same thing for activation energy and the rest. They're all in the mid to low 90s, so that's actually pretty good agreement. Now suppose I increase the number of bootstrap iterations, n boot, from 100 to 500. What does that look like? If I plot the MLE versus the true value, I'm getting about the same thing: the estimates are good, the slope is always around 1, and the intercept is always around 0, so that's well behaved. If I look at the confidence bound width, on average I'm getting something around 23 here, around 20 or so for mu, and something around eight for the value of n. So these confidence bounds are actually somewhat wide, and I want to see what happens if I change things. Suppose I increase my sample size to 50 instead of just five. Fifty is not a realistic sample size; we could never run that many, and it would be very difficult and time consuming, but this is simulation, so I can run as many parts as I want. Just to check: the maximum likelihood estimates again agree pretty well with the known true values, with a slope of 1 and an intercept around zero, and with BCa I'm bracketing right around 95% of the time, as expected. So that's well behaved too, and my confidence bound width is now much lower. By increasing the sample size, as you might expect, the confidence bounds get correspondingly narrower: this one was in the upper 20s originally and is now around seven; mu was also in the upper 20s and is now around five; n was around 10 initially and is now around 2.3. So we get better behavior by increasing the sample size. This is what the summary looks like: success rate versus the different groups, where n is the number of parts per test level and boot is the number of bootstrap samples, so 5_100, 5_500 and 50_500. You can see this is reasonably flat; we're not getting a big improvement in coverage. We're getting something in the low to mid 90s, which is about what you would expect: changing the number of bootstrap replicates or the sample size doesn't change the coverage very much. BCa is doing a pretty good job even with five parts per test level and 100 bootstrap iterations. What about the width? Here we do see a benefit: the width of the confidence bounds goes down as we increase the number of bootstrap iterations, and on top of that, if you increase the sample size you get a big decrease in the confidence bound width. All of this behavior is expected, but the point is that the simulation lets you know ahead of time how big your sample size should be. Can I get away with three parts per condition? Do I need to run five or ten parts per condition to get the confidence bound width I want?
Similarly, for the analysis: how many bootstrap iterations do I need? Can I get away with 100, or do I need 1,000? The simulation also gives you a heads up on what you'll need to do when you run the analysis. Alright, so finally, armed with our maximum likelihood estimates and our confidence bounds, we can summarize our results using the safe operating area. Again, what we're getting is something of a reliability map, or a response surface, of temperature versus power, so you have an idea of how reliable the part is under various conditions. This can be very helpful to designers and to customers. Designers want to know, when they create a part, whether it is going to last: are they designing it to run at too high a temperature or too high a power, so that the median time to failure would be too low? Customers want to know how long the part is going to last when they run it. The SOA gives you that information. The metric I'm using here is median time to failure. You could use other metrics, the FIT rate for example, but for purposes of illustration I'm just using median time to failure. An even better metric, as I'll show, is the lower confidence bound on the median time to failure, which gives you a more conservative estimate. Ultimately, the SOA lets you make trade-offs between temperature and power. So here is our contour plot showing the SOA. The contours are log base 10 of the median time to failure, plotted as power versus temperature. As temperature goes down and as power goes down, the contours get larger and larger: as you lower the stress, as you might expect, the median time to failure goes up. Suppose we have a corporate goal that the part should last, with a median time to failure greater than 10^6 hours. Looking at this map over the range of power and temperature we've chosen, it looks like we're golden; there are no problems here, and the median time to failure is easily 10^6 hours or higher. But we have to realize that the median time to failure is an average, and an average only tells half the story. We have to do something that acknowledges the uncertainty in this estimate, so in practice we use a lower confidence bound on the median time to failure. You can see the contours have changed; they're very much lower because we're using the lower confidence bound, and here 10^6 hours is given by this line. You can see it now covers only part of the region. Over here in green, that's good: you can operate safely here. But red is dangerous; it is not safe to run there. That's where the monsters are. You don't want to run your part that hot. This also allows you to make trade-offs. For example, suppose a designer wants their part to run at 80 degrees C. That's fine, as long as they keep the power level below about 29.5 dBm. Similarly, if they want to run the part at 90 degrees C, they can, as long as they keep the power low enough, say below 27.5 dBm. So this is where you're allowed to make trade-offs between temperature and power. Alright, so now to summarize. I showed the differences between constant and step stress testing, and I showed how we extract maximum likelihood estimates and BCa confidence bounds from the simulated step stress data.
And I demonstrated that we had pretty good agreement between the estimates and the known true values. In addition, the BCa method worked pretty well: even with an n boot of only 100 and five parts per test level, we had about 95% coverage, and that coverage didn't change very much as we increased the number of bootstrap iterations or the sample size. However, we did see a big change in the confidence bound width, and those results show that we can make a trade-off: from the simulation, we know how many bootstrap iterations and how many parts per test condition we need to run. Ultimately, we took the maximum likelihood estimates and the bootstrap confidence bounds and created the SOA, which provides guidance to customers and designers on how safe a particular T0/P0 combination is. From that reliability map we are able to trade off temperature against power. And lastly, I showed that using the lower confidence bound on the median time to failure provides a more conservative estimate for the SOA; in essence, using the lower confidence bound makes the safe operating area a little safer. So that ends my talk, and thank you very much for your time.
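For readers who want to sketch an SOA-style map like the one described above, here is a minimal JSL outline. It is hypothetical: it assumes a pre-computed grid table of temperature and power with a lower confidence bound on log10 median life in each row, and the table and column names are placeholders.

```jsl
// Hypothetical sketch: contour map of predicted reliability over a T/P grid.
// Assumes a pre-computed table; names are illustrative only.
dt = Open( "soa_grid.jmp" );          // grid of Temperature, Power, LowerLog10MTF
dt << Contour Plot(
    X( :Temperature, :Power ),        // the two stress axes
    Y( :LowerLog10MTF )               // lower bound on log10 median time to failure
);
// Regions where the contour exceeds 6 (that is, 10^6 hours) would form the "safe" area.
```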
Daniel Sutton, Statistician - Innovation, Samsung Austin Semiconductor   Structured Problem Solving (SPS) tools were made available to JMP users through a JSL script center as a menu add-in. The SPS script center allowed JMP users to find useful SPS resources from within a JMP session, instead of having to search for various tools and templates in other locations. The current JMP Cause and Effect diagram platform was enhanced with JSL to give JMP users the ability to transform tables between wide format, for brainstorming, and tall format, for visual representation. New branches and “parking lot” ideas are also captured in the wide format before returning to the tall format for visual representation. By using JSL, access to mind-mapping files made by open source software such as Freeplane was made available to JMP users, to go back and forth between JMP and mind mapping. This flexibility allowed users to work freeform in mind maps and then structure the ideas back in JMP. Users could assign labels such as Experiment, Constant and Noise to the causes and identify what should go into the DOE platforms for root cause analysis. Further proposed enhancements to the JMP Cause and Effect Diagram are discussed.     Auto-generated transcript...   Speaker Transcript Rene and Dan Welcome to structured problem solving using the JMP cause and effect diagram, open source mind mapping software, and JSL. My name is Dan Sutton; I am a statistician at Samsung Austin Semiconductor, where I teach statistics and statistical software such as JMP. For the outline of my talk today, I will first discuss what structured problem solving, or SPS, is. I will show you what we have done at Samsung Austin Semiconductor using JMP and JSL to create an SPS script center. Next, I'll go over the current JMP cause and effect diagram and show how we at Samsung Austin Semiconductor use JSL to work with it. I will then introduce you to mind mapping software such as Freeplane, a free open source program. After that, I will return to the cause and effect diagram and show how to use the optional third column of labels for marking experiment, controlled, and noise factors; show how to extend cause and effect diagrams for five whys and cause mapping; and finally give recommendations for the JMP cause and effect platform. Structured problem solving: everyone has been involved with problem solving at work, school or home, but what do we mean by structured problem solving? It means taking unstructured problem solving, such as a brainstorming session, and giving it structure and documentation, as in a diagram that can be saved, manipulated and reused. Why use structured problem solving? One important reason is to avoid jumping to conclusions on more difficult problems. In the JMP Ishikawa example, there might be an increase in defects in circuit boards. Your SME, or subject matter expert, is convinced it must be the temperature controller on the solder process again. But having a saved structure, as in the cause and effect diagram, allows everyone to see the big picture and look for more clues. Maybe it is temperature control on the solder process, but a team member remembers seeing on the diagram that there was a recent change in the component insertion process, and that the team should investigate. In the free online training from JMP called Statistical Thinking in Industrial Problem Solving, or STIPS for short, the first module is titled Statistical Thinking and Problem Solving.
Structured problem solving tools such as cause and effect diagrams and the five whys are introduced in this module. If you have not taken advantage of the free online training through STIPS, I strongly encourage you to check it out: go to www.JMP.com/statisticalthinking. This is the cause and effect diagram shown during the first module. In this example, the team decided to focus on an experiment involving three factors, but only after creating, discussing, revisiting, and using the cause and effect diagram for structured problem solving. Now let's look at the SPS script center that we developed at Samsung Austin Semiconductor. JMP users there wanted access to SPS tools and templates from within the JMP window, instead of searching through various folders, drives, saved links or other software. A floating script center was created to allow access to SPS tools throughout the workday. On the right side of the script center are links to other SPS templates in Excel; on the left side are JMP scripts. It is launched from a customization of the JMP menu: instead of putting the scripts under add-ins, we chose to modify the menu to launch a variety of helpful scripts. Now let's look at the JMP cause and effect diagram. If you have never used this platform, this is what the cause and effect diagram looks like in JMP. The user selects a parent column and a child column, and the result is the classic fishbone layout. Note that the branches alternate left and right, top and bottom, to make the diagram more compact for viewing on the user's screen. But the classic fishbone layout is not the only one available. If you hover over the diagram, you can select Change Type and then select Hierarchy. This produces a hierarchical layout that, in this example, is very wide in the x direction. To make it more compact, you have the option to rotate the text to the left or to the right, as shown here in the slides. Instead of rotating just the text, it might be nice to rotate the whole diagram left to right. In this example, the images from the previous slide were rotated in PowerPoint to illustrate what it might look like if the user had this option in JMP; JMP developers, please take note. As you will see later, this has more the appearance of mind mapping software. The third layout option is called Nested. This creates a nice compact diagram that may be preferred by some users. You can also rotate the text in the nested option, but maybe not as you would like. Did you know the JMP cause and effect diagram can include floating diagrams, for example parking lots that come up in a brainstorming session? If a second parent is encountered that is not used as a child, a new diagram is created. In this example, the team is brainstorming and someone mentions, "We should buy a new machine or used equipment." This idea is not part of the current discussion on causes, so the team facilitator decides to add it to the JMP table as a new floating node, called a parking lot, and the JMP cause and effect diagram will include it. Alright, so now let's look at some examples of using JSL to manipulate the cause and effect diagram. New scripts to manipulate the traditional JMP cause and effect diagram and its associated data table were added to the floating script center; you can see examples of these on the right of this PowerPoint slide.
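A floating window like the script center described above takes only a few display boxes in JSL. The sketch below is hypothetical; it is not the Samsung Austin Semiconductor add-in, and the script and template paths are placeholders, but it shows the general pattern of a small button launcher.

```jsl
// Hypothetical sketch of a floating SPS script center (paths are placeholders).
New Window( "SPS Script Center",
    H List Box(
        Panel Box( "JMP Scripts",
            Button Box( "Wide <-> Tall C&E Table", Include( "$DESKTOP/sps/stack_ce_table.jsl" ) ),
            Button Box( "Open/Close Diagram Nodes", Include( "$DESKTOP/sps/toggle_nodes.jsl" ) )
        ),
        Panel Box( "Other SPS Resources",
            Button Box( "STIPS Online Training", Web( "https://www.jmp.com/statisticalthinking" ) )
        )
    )
);
```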
JMP is column based, and the column dialog for the cause and effect platform requires one column for the parent and one column for the child. That table is in what is called the tall format. But a wide table format might be more desirable at times, such as in brainstorming sessions. With the click of a script button, our JMP users can change from the tall format to the wide format. In the tall format you would have to enter the parent each time you add a child; when done in the wide format, the user can then use a script button to stack the wide C&E table back to tall. Another useful script in brainstorming is taking a selected cell and creating a new category. Say the team realizes it needs to add more subcategories under "wrong part." A script was added to create a new column from a selected cell while in the wide table format: the facilitator selects the cell, like "wrong part," clicks the script button, a new column is created, and subcauses can be entered below it. In the diagram itself, you would instead hover over "wrong part," right-click, and select Insert Below, where you can enter up to 10 items. The new causes then appear in the diagram, and if you don't like the layout, JMP allows moving the text; for example, you can right-click and move a branch to the other side. The JMP cause and effect diagram compacts the window by alternating left and right, up and down. Some users may want the classic look of the fishbone diagram but with all bones in the same direction. By clicking the script button "current C&E all bones to the left side," they are all set to the left and below; likewise, another script button sets them all to the right and below. Now let's discuss mind mapping. In this section we're going to take the classic JMP cause and effect diagram and see how to turn it into something that looks more like mind mapping. This is the same fishbone diagram as a mind map in Freeplane, which is open source software. Note the free form of this layout, yet it still provides an overview of causes for the effect. One capability of most mind mapping software is the ability to open and close nodes, which is especially useful when there is a lot going on in the problem solving discussion. For example, a team might want to close nodes like components, raw card, and component insertion and focus just on the solder process and inspection branches. In Freeplane, closed nodes are represented by circles, which the user can click to open again. The JMP cause and effect diagram already has the ability to close a node; once closed, though, it is indicated only by an ellipsis (three dots), and in current versions of JMP there is no option to open it again. So what was our solution? We included a floating window that can open and close any parent-column category. On the right you can see alignment, component insertion, components, etc.; all the parent nodes are included. Clicking a checkbox closes a node, and clicking again opens it; in addition, the script highlights the text in red when a node is closed. One reason for using open source mind mapping software like Freeplane is that the source file can be accessed by anyone. It's not a proprietary format like other mind mapping software, and you can open it in any text editor. The entire map can be loaded by using JSL commands that access text strings, using JSL to look for XML attributes to get the name of each node.
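As a rough sketch of that idea (not the presenter's script): Freeplane stores each idea as a node element whose text is kept in a TEXT attribute, so JSL's Parse XML can walk the .mm file and collect the node names. The file path and list handling below are illustrative assumptions.

```jsl
// Hypothetical sketch: collect Freeplane node names from a .mm file.
mm = Load Text File( "$DESKTOP/causes.mm" );   // path is a placeholder
names = {};
Parse XML( mm,
    On Element( "node",
        Start Tag( Insert Into( names, XML Attr( "TEXT", "" ) ) )
    )
);
Show( names );   // e.g., {"Defects in circuit boards", "Solder process", ...}
```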
A discussion of XML is beyond the scope of this presentation, but see the JMP Community for additional help and examples. Users at Samsung Austin Semiconductor would simply click on "Make JMP table from a Freeplane .mm file." At this time we do not have a straight JMP-to-Freeplane script (it's a little more complicated), but Freeplane does allow users to import text from the clipboard, using spaces to nest the nodes. So by placing the text in a journal (the example here is on the left side of the slide), the user can copy and paste into Freeplane and get the Freeplane diagram shown on the right. Now let's look at adding labels of experiment, controlled, and noise to a cause and effect diagram. Another use of cause and effect diagrams is to categorize specific causes for investigation or improvement. These are often categorized as controlled or constant (C), noise (N), or experiment (X or E). For those who were taught SPC Excel by Air Academy Associates, you might have used, or still use, the CE/CNX template. To be able to do this in JMP and add these characters, we need to revisit the underlying script and the optional third label column. When a JMP user adds a label column in the script, it changes the text edit box to a vertical list box with two new horizontal center boxes containing two text edit boxes: one with the original child value, and one with the value from the label column. The label has a default font color of gray and is applied as illustrated here in this slide. Our solution, using JSL, was to add a floating window with all the child values listed. Whatever is checked can be updated to E, C or N and added to the table and the diagram; in fact, different colors can be specified by the script by changing the font color option, as shown in the slide. Next, the JMP cause and effect diagram for five whys and cause mapping. While exploring the cause and effect diagram, another use, as a five whys or cause map, was discovered. Although these SPS tools do not display well in the default fishbone layout, the hierarchy layout is ideal for this type of mapping. The parent and child become the why and because statements, and the label column can be used to add numbering for your whys; this is what it looks like on the right side. Sometimes there can be more than one reason for a why, and the JMP cause and effect diagram can handle it; this branching, or cause mapping, can be seen over here on the right. Even the nested layout can be used for a five whys. In this example, you can also set up a script to set the text wrap width, so users do not have to adjust each box one at a time. Or you can make your own interactive diagram using JSL. Here I'm just showing some example images of what that might look like: you might prompt the user in a window dialog for their whys and then fill in the table and a diagram for them, once again using the cause and effect diagram, as shown on the left side of the slide. Conclusions and recommendations. In conclusion, the JMP cause and effect diagram already has many excellent built-in features for structured problem solving. The current JMP cause and effect diagram was augmented using JSL scripts to add more options when being used for structured problem solving at Samsung Austin Semiconductor.
JSL scripts were also used to make the cause and effect diagram act more like mind mapping software. So, what would be my recommendations? The platform currently offers three layouts, fishbone, hierarchy, and nested, which use different types of display boxes in JSL. How about a fourth type of layout, a mind map, that would allow a more flexible mind-map-style arrangement? I'm going to add this to the wish list. And then finally, how about even a full mind mapping platform? That would be an even bigger wish. Thank you for your time, and thank you to Samsung Austin Semiconductor and JMP for this opportunity to participate in the JMP Discovery Summit 2020 online. Thank you.
Aurora Tiffany-Davis, Senior Software Developer, SAS   Get a peek behind the scenes to see how we develop JMP Live software. Developing software is a lot more than just sitting down at a keyboard and writing code. See what tools and processes we use before, during and after the "write code" part of our job. Find out how we: Understand what is wanted. Define what success looks like. Write the code. Find problems (so that our customers don't have to). Maintain the software.     Auto-generated transcript...   Speaker Transcript Aurora Tiffany-Davis Hi, I'm Aurora Tiffany-Davis. I'm a software developer on the JMP Live team, and I'd like to talk to you today about how we develop software on the JMP Live team. First, I'd like you to imagine what creating software looks like. If you're not a developer yourself, the image you have in your mind is probably colored by TV and movies you've seen. You might be imagining somebody who works alone, somebody who's really smart, maybe even a genius, and their process is really simple: they think about a problem, they write some code, and then they make sure that it works. And it probably does, because after all, they're a genius. If we were all geniuses on JMP Live, this simple process would work for us. But we're not, and we live in the real world, so we need a little bit more than that. First of all, we don't work alone; we work very collaboratively, and our process has steps in it to try to ensure that we produce not just some software, but quality software. I'll point out that there's no one correct way to develop software. It might differ across companies or even within companies, but I'm going to walk you through our process by telling you what I would do if I were going to develop a new feature in JMP Live. First of all, before I sit down to write code, I have to have some need to do so. There has to be some real user out there with a real problem: something they've identified in JMP Live that doesn't work the way they think it ought to (which is a nice way of saying they found a bug), or a feature that they've requested. We keep track of these requests in an internal system. So the first thing I would do is go into that system, find a high-priority issue that's a good match for my skill set, and start working on it. Next I need to understand the need a little bit more. I'll talk to people and try to figure out what kind of user wants this feature and how they are going to use it. Then I'll run JMP Live on my machine and take a look at where this feature might fit into JMP Live as it exists today, or I might open up the code and look at where the new code might fit into our existing code base. Once I think I have a pretty good understanding of what's needed, I'll start working on the design. Again, I'll talk to people. I'll talk to people on the team and say, "Have you worked on a feature similar to this? Have you worked in this part of the code before?" And I'll talk to the user experience experts we have in JMP. I'll try not to reinvent the wheel: if there's an aspect of the feature that is very common and not specific to JMP Live, for example something like logging information to a file, that's a solved problem; a thousand people have had that problem in the past, and there might be a good open source solution for it. If so, I might use that, after carefully vetting it to make sure that it's safe to use.
That still leaves a lot of ground to cover, a lot of JMP Live specific code that needs to be written, so I'll write articles and diagrams describing what I propose and make sure that everybody on the team is comfortable with the direction I'm taking the design. Then I'll actually sit down and write code. For that I use an integrated development environment, which is a tool for writing code with a lot of bells and whistles that help you be more efficient at your job. Now I've written some code. Before I check it in, I want to find the problems that exist in the code I just wrote. I'm only human, so the chances that I wrote 100% flawless code on my first try are pretty slim. I start by looking for very obvious problems, and for that I use static analysis. Static analysis looks at my code not while it's running, but as though it were just written down on a page. An analogy would be spellcheck in Microsoft Word: spellcheck can't tell you that you've written a compelling novel, but it can tell you if you missed a comma somewhere. Static analysis does that for code. Once I've found and fixed the really obvious stuff, I move on to finding less obvious problems, and for that we use an automated test suite. This differs from static analysis because it actually does run the code: it runs a piece of the code with a certain input and expects a certain output. We've written a broad range of tests for our code, and I'll sit down and write tests for the feature I'm working on. This is really useful because sitting down to write the tests forces me to clarify my thinking about how exactly the code is supposed to work. It also offers a safeguard against somebody else accidentally breaking the feature in the future. It's a great way to find problems early. Next I move on to manual tests. I'll run JMP Live on my machine, exercise the new feature, and make sure it's working the way I think it ought to. I might even poke around in the database that sits behind JMP Live and keeps track of the posts, groups, users, and comments, to make sure all those records are being written the way I think they should be. Now I'm cautiously optimistic that I've written some good code, and the next step is peer review. I'd like my peers to help me look at the code and find anything I might have missed, either because I have blinders on about something or because somebody else on my team has knowledge that I lack. This step is often really helpful for the reviewer as well, because they might learn about new techniques. After it's gone through peer review, we're ready to actually commit the code, that is, check it into a source code repository. We have a continuous build system on our servers that watches for us to check code in, and when we do, it does a bunch of things with that code: for example, making sure the right files are there, in the right place, named the right way, and so on. It also goes back and reruns our static analysis and reruns our entire automated test suite. This is useful because, after all, we're only human: someone might have forgotten to do this earlier in the process, or something might have changed since they last did it. Once the code makes it through the continuous build, it's available to people outside of the development group. The first people to pick it up are the test professionals within JMP. They'll go through it, and they might add to our automated test suite.
They might run some manual tests. They might look at how the software works on different operating systems, different browsers and so on. They'll think up crazy things a user might do and look at what happens and how the software responds. They are a really crucial part of our process; they think really creatively about trying to find problems in our product. Once they've signed off, the software is available to be picked up in our next software release. Let me zoom out now and show you the process as a whole. It's a lot of steps; that's a lot of stuff. Do we really do all of this for all of our code? Believe it or not, we do, but the process scales a great deal. For a really simple, obvious bug, we're going to step through this process pretty darn quickly. But for a very complex new feature, at every step of this process we're going to slow down, take our time, and take a lot of care to make sure that we really get it right. Anything that takes time costs money for a company, so why is JMP willing to invest this money in software quality? One reason is really simple: we have pride in our work and we want to produce a good product. The second reason is a little bit less idealistic: we know that if our product has problems and we don't find them, our customers will, and that's not good for business, so we'd like to minimize that. I should point out that while JMP does invest time and money in software quality, we don't invest the kind of money that, for example, an organization like NASA would, so we can't promise perfection. Like any piece of software out there in the marketplace, you might find something in JMP Live that doesn't work the way you think it ought to. If you do find something, or have a suggestion, we'd like to invite you to go to www.jmp.com/support and let us know if there's anything you think should work differently, or any features you would like in the future. That kind of real-world feedback from actual users is incredibly valuable to us, and we really welcome it. That's all I have for you today, but I really hope you enjoy the rest of Discovery. Thank you.