Corinne Bergès, Six Sigma Black Belt, NXP Semiconductors Kurt Neugebauer, Analog Design Engineer, NXP Semiconductors Da Dai, Design Automation Engineer, NXP Semiconductors Martin Kunstmann, R&D-SUP-Working Student, NXP Semiconductors Alain Beaudet, Product and Test Engineer, NXP Semiconductors   Structured Problem Solving (SPS) is one of the three pillars of the NXP Six Sigma system, alongside Quality Culture and Continuous Improvement, and is a further demonstration of the maturity of the NXP quality system. The key approaches in NXP SPS fit within the DMAIC/DMADV, 8D and 5-Why frameworks. They rely heavily on statistics (modeling, DOE, multivariate analysis, ...) to turn assumptions into evidence, which is necessary for true elimination of a defect's root cause. Two specific statistical analyses are described. The first, in automotive design, concerns the simulation of parametric, hard or soft defects: the goal is to implement the best algorithm for reducing the number of simulations without degrading test coverage or the precision of the failure-rate estimate, and JMP provides useful clustering options for this. The NXP experiments result in an algorithm and in recommendations for the new IEEE standard effort on defect-coverage accounting methods. The second, downstream in manufacturing, concerns capability index computation and normality testing: to get around the high sensitivity of normality tests to slight departures from normality, a methodology was designed in JMP to quantify the shift from normality using the SHASH (sinh-arcsinh) distribution and its kurtosis and skewness parameters. A script was implemented to automate this across the more than 3,000 tests for an automotive product.
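The abstract describes quantifying the shift from normality with the SHASH distribution's skewness and kurtosis parameters rather than relying on pass/fail normality tests across 3,000+ test columns. The JSL implementation is not shown here; as a rough illustration of the idea only, the Python sketch below screens hypothetical test columns by sample skewness and excess kurtosis (simple stand-ins for fitted SHASH shape parameters) alongside a Shapiro-Wilk p-value and a capability index. Column names and spec limits are invented.

```python
# Illustrative sketch only (not NXP's JSL script): screen many test columns by
# capability and by how far their shape departs from normality, using sample
# skewness/excess kurtosis as crude stand-ins for fitted SHASH shape parameters.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical parametric test data: one column per test, spec limits per test.
tests = pd.DataFrame({
    "test_001": rng.normal(5.0, 0.1, 500),
    "test_002": rng.gamma(shape=9.0, scale=0.05, size=500) + 4.5,  # mildly skewed
})
specs = {"test_001": (4.7, 5.3), "test_002": (4.6, 5.6)}  # (LSL, USL), assumed

rows = []
for name, x in tests.items():
    lsl, usl = specs[name]
    mu, sigma = x.mean(), x.std(ddof=1)
    ppk = min(usl - mu, mu - lsl) / (3 * sigma)
    rows.append({
        "test": name,
        "Ppk": round(ppk, 2),
        "skewness": round(stats.skew(x), 2),             # 0 for a normal distribution
        "excess_kurtosis": round(stats.kurtosis(x), 2),  # 0 for a normal distribution
        "shapiro_p": round(stats.shapiro(x).pvalue, 4),  # very sensitive at n = 500
    })

print(pd.DataFrame(rows))
```

The point of the sketch is the contrast the abstract draws: at large sample sizes the normality test rejects for tiny departures, while the shape parameters say how far from normal the distribution actually is.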
Laura Lancaster, JMP Principal Research Statistician Developer, SAS Jianfeng Ding, JMP Senior Research Statistician Developer, SAS Annie Zangi, JMP Senior Research Statistician Developer, SAS   JMP has several new quality platforms and features – modernized process capability in Distribution, CUSUM Control Chart and Model Driven Multivariate Control Chart – that make quality analysis easier and more effective than ever. The long-standing Distribution platform has been updated for JMP 15 with a more modern and feature-rich process capability report that now matches the capability reports in Process Capability and Control Chart Builder. We will demonstrate how the new process capability features in Distribution make capability analysis easier with an integrated process improvement approach. The CUSUM Control Chart platform was designed to help users detect small shifts in their process over time, such as gradual drift, where Shewhart charts can be less effective. We will demonstrate how to use the CUSUM Control Chart platform and use average run length to assess the chart performance. The Model Driven Multivariate Control Chart (MDMCC) platform, new in JMP 15, was designed for users who monitor large amounts of highly correlated process variables. We will demonstrate how MDMCC can be used in conjunction with the PCA and PLS platforms to monitor multivariate process variation over time, give advanced warnings of process shifts and suggest probable causes of process changes.
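The CUSUM chart described above accumulates deviations from target so that small sustained shifts eventually cross a decision limit. A minimal tabular CUSUM sketch in Python (not the JMP platform itself), with an assumed target, reference value k and decision interval h in sigma units:

```python
# Minimal tabular CUSUM sketch (illustrative; JMP's CUSUM Control Chart platform
# adds much more, e.g., average run length assessment and V-mask options).
import numpy as np

def tabular_cusum(x, target, sigma, k=0.5, h=5.0):
    """Return upper/lower CUSUM statistics and indices of out-of-control points.

    k and h are in sigma units: k is the reference (allowance) value,
    h is the decision interval.
    """
    c_plus, c_minus = 0.0, 0.0
    cp, cm, signals = [], [], []
    for i, xi in enumerate(x):
        z = (xi - target) / sigma
        c_plus = max(0.0, z - k + c_plus)
        c_minus = max(0.0, -z - k + c_minus)
        cp.append(c_plus)
        cm.append(c_minus)
        if c_plus > h or c_minus > h:
            signals.append(i)
    return np.array(cp), np.array(cm), signals

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 1, 50), rng.normal(10.7, 1, 50)])  # small shift
cp, cm, signals = tabular_cusum(data, target=10, sigma=1)
print("first signal at sample:", signals[0] if signals else None)
```

A Shewhart chart with 3-sigma limits would rarely flag a 0.7-sigma shift on any single point; the CUSUM statistic accumulates it until the decision interval is crossed.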
Monday, October 12, 2020
As we develop analytical tools in JMP, we inevitably must make decisions about how to prioritize:   Should we make the product more powerful by adding more muscle? Should we make it sexier, more exciting? Should we focus on pain relief, making it less frustrating, less burdensome? In the language of long A-words, should we go for anabolic, aphrodisiac or analgesic?   John Sall thinks the answer is analgesic. Pain relief should be the central motivating force for development. Of course, the three aren’t mutually exclusive. Adding an exciting power feature could also relieve pain. But pain relief is central, because pain is the condition that can really freeze us, demotivate us, make us stop at a less-than-full perspective of what our data can tell us.       Auto-generated transcript...   Speaker Transcript Jeff Perkinson When it comes to developing JMP features, John Sall, our next speaker, has an interesting way of thinking about how we prioritize. We do have to prioritize which features we...   which features we want to invest in. So how do we think about what's most important? Are features that make our product more powerful what we should work on?   How about features that make it more attractive or sexier or features that focus on pain relief, easing the burden of analysis?   Well John thinks that pain relief should be the main motivator for our R&D team. John is the co founder of SAS. He's the lead architect of JMP, which he founded more than 30 years ago.   For scientists and engineers who have data and need a way to explore it needed...need a way to explore it easily visually on the desktop.   He's going to talk about the driving forces behind JMP development and the ways that we come across features we want to work on. So with that, I'd like to welcome you, John. Thank you. John Sall Thank you, Jeff. And here we are live from the new digital conference center in Cary North Carolina, where we can fit a lot more people.   So let's dim the lights a little.   Well, I'm not really here in digital conference center. I'm really at home. So I got a home office. Whoops. Not that one.   Like all of us. I'm at home, delivering this conference. But let's switch to a corporate background and get started.   So what should I talk about?   Well, when we have JMP releases, on those years that the conference alliance along JMP releases, that's what we talk about, but on the alternate years I always just pick a topic. So I picked big statistics one year, or ghost data the next year or secret features   two years ago. Today my topic is big words that start with the letter A.   Well, these are words characterized themes for developing JMP. So let me share my screen.   And   minimize   here. Share.   So which theme is most important? And here are the three words start with...the big words that start with the letter A.   As Jeff just mentioned, first is anabolic. Well, anabolic is all about growing muscle, making JMP more powerful.   Anabolic processes build organs and tissues and these processes produce growth and differentiation of cells to increase body size.   So increasing the power of JMP would be the opportunity and that's of course very important, but the next   aphrodisiac, making JMP more sexy, more exciting, the thrill of discovery. So we want that to be an attribute too.   But we also care about your progress during the work stream, so analgesic is the third word, and that means pain relief.   pain relief.   
So I want people to express things like the following imaginary quotes: "Version 15 has just saved me hours or even days of work that I used to have to labor over." That's a lot of pain relief.   Or "I used to have to do each one individually and now it is all in one swoop." A lot of pain relief there. Or "I used to have to use several other tools to get everything done, but now I can do everything in JMP."   Or "I showed my colleagues how I did it so easily. And they were amazed." Or "I used to have to program all this manually, and now it's just a few mouse clicks and it's done."   Or "JMP keeps me in flow, keeping my attention on analyzing the data rather than diverting it too much to operational details." These are the expressions of pain relief that we want to hear.   Now why? Well, pain and frustration are real demotivators. Flow is important. You're undistracted when in flow. When in flow, you're more likely to learn something from the data and make discoveries.   Now power is important too, but new features tend to be used by only a few and old features by many. So we want to make the old features work better   to reduce the pain in using them. And of course, productivity is hugely dependent on how effective our tools are with respect to the use of time. So if we can make it easier and faster to get to the results, that is a huge win. So let's go through a lot of examples.   First, the scripts that come with the data tables. So here you see, side by side, an example from an earlier release of JMP, I think it's JMP 14 or 13, where   you used to have to hold down the mouse key on this button, and then it brought up a menu item, and then you found a run button to run that script.   But the new way just has a play button right next to the script. So you just click on that play button and it saves pulling down a menu.   Well, that's not a huge win, you think; just the difference between clicking a button and pulling down a menu is not very much. But it's also a big win for new users, because new users no longer have to understand   that a hot button will have a run underneath. There's the play button right there with the script, and so there's more understanding that these are scripts.   And the ability to store scripts with your data is a big deal in JMP and we want people to learn that right away   and not have to fish around or look at a tip to learn that that's the way it works. And it's a great convenience to be able to store all your scripts   within the data table itself and not find other places to store them. So for new users, the play button. That is some pain relief just in that simple change.   And this is a change that we regret not making much earlier in JMP.   Let's talk about preparing data. And one of the big things that you have to do when you prepare data is join data from multiple sources.   Now JMP, even from version one, had a join utility. But this still involves a complex dialogue and working with multiple data tables.   So here's the classic example where I'm renting movies and we have our movie rentals transaction table   that has the customer, the item number, the order date, the order year; and then we have our inventory of movies to rent, details about each movie; and then we have information about the customers and so on. And we want to ask questions involving all three tables, like   which genders are prevalent in which genres of movies. 
Well that involves   going across all three tables, and of course, the join command is the way to do it. I take the transactions data and I   merge it with customers. And of course, customer ID is the matching   column and I want to include the non matches from the transaction side. So if I don't have customer data, I still don't lose my transaction data. And that's called the left outer join, and so that makes the new data table. And now to that data table, I want to join that   to the inventory data. So I join that by the item number which I match then and that I also want a left outer join and now I have   another table. And so now, instead of having three tables, I have five tables, the results of that join, but now I can look at answering the question of gender by genre, which   gender tends to rent more of which type of movie.   And involved making all these new intermediate data tables as well.   It turns out that there's a lot easier way to go and the pain of just these extra steps and extra tables can be vastly reduced. So let me undo this table on this table.   And I want to point out under the tables command, there is a command called JMP query builder and this brings up the the sequel facility   for joining tables, that's much easier than all those join dialogues, but it still also makes new data table.   But a couple releases ago, we came out with something even better that reduce the pain even further and that's virtual join. So if I look at this transaction table,   I noticed that it's already joined.   So what's happened is that when I prepared the, the movie inventory table, it's uniquely identified by item number. And so I give item number then   a link ID saying it's uniquely identified by that and that's the way I can index that that table. And similarly, I have an item ID to identify the customer data. And when I prepare my transaction data, I have these two have, in a sense foreign keys,   a link reference and it's referencing different data table by the, the...the identifying ID across them. And so, it automatically links it and I already have everything I need to answer that question. So I can do   for each   gender, I can   find the genre of the movie I want to do. And so I can see here with action movies are liked more by the males and family movies by the females and   rom com more by the females and so on. So I've answered my question involving three tables without the need of going through all the joins, because it's virtually joined by matching up those ID columns across the tables.   So I've made   a lot of pain relief from that. So let's close that, in fact.   So data table joining has involved, I think, a lot of pain relief over the releases.   Recoding,   that's one of the main other things we do when we're preparing data sets.   We started with a very simple recode many releases ago where we just had a field for the old value and the new value and the new value was started out with the same as the old value and then I can cut and paste or edit those values in order to recode   into a new column or the existing column.   And recode every release has gone through dramatic improvements and we're very excited with the current state of that. And so let's look at this data set of hospitals, district hospitals. And now recode is right on the   right button or right click menu in the column header. And I can go through and just   select a bunch of things and group them to new value, like clinic.   But I think there are lots of shortcuts as well. 
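The left outer join walkthrough above has a direct analogue outside JMP. The hedged pandas sketch below, with made-up rentals, customers and inventory tables, shows the same two-step left join; virtual join, as described, gets the same answer by linking the ID columns without creating intermediate tables.

```python
# Illustrative pandas version of the two left outer joins described above
# (table and column names are made up for the sketch).
import pandas as pd

rentals = pd.DataFrame({"customer_id": [1, 1, 2, 3],
                        "item_number": [101, 102, 101, 103],
                        "order_year": [2019, 2019, 2020, 2020]})
customers = pd.DataFrame({"customer_id": [1, 2], "gender": ["F", "M"]})
inventory = pd.DataFrame({"item_number": [101, 102, 103],
                          "genre": ["Action", "Family", "Rom Com"]})

# Left outer joins keep every transaction even when customer/item data is missing.
joined = (rentals
          .merge(customers, on="customer_id", how="left")
          .merge(inventory, on="item_number", how="left"))

# "Which genders rent which genres?"
print(pd.crosstab(joined["genre"], joined["gender"]))
```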
And one of my favorite shortcuts is   to group similar values. And so I can give some threshold of the number of character differences and so on and it will automatically group all those things. So it'll group all the district hospitals with slightly different spellings into one.   And I can still...   looks like it missed one. But it's okay, I can now group that one with the others and   all the, all the rural hospitals have been automatically grouped by those matching and looks like I can match an extra one here.   And now all of a sudden I have something. I've saved a lot of effort in doing all that recoding.   And of course, in JMP 15 we added new features, which I won't show here, but I want to point it out. If you have another data set with all the   ...proved that...the best category names for those categories, I can go through a matching process and have it choose the closest match when it finds a match to all those things. So   so recode has gone through a lot of evolution over the releases to reduce the pain in that operation. And of course, think about it. How much time do we spend preparing the data versus analyzing it?   Often it's anywhere from 70 to 90% of our time is spent preparing the data. And so, reducing the pain of that and reducing the time of that become major wins. So pain relief on recoding.   Clicking on values. Here's   the cities data set.   And let's suppose when I'm looking at the details for some of these things, like might be an outlier for something, I want to look at it in greater detail. And so I can turn that into a link. So let's   let's turn it into a wiki page Wikimedia page thing. I'm going to copy that text there and go to column info and then turn that into a link, which is the expression handler.   And now I can take   this   table and convert it into a web   address. So   I'm going to go to wikipedia.org and then take the   the name of that city and change it to title case and I can test it on that row and brings up Albany.   And so now I'll say...   Oh, OK.   And now I can click on any city and then get the Wikipedia article on that city. Or it can change it to map coordinates. So I can...let's copy that text and go to the column info and now instead of that, I'll   do title case on that.   Let's see if that parses OK. And now I can click on Denver and it will query Denver and the Google Maps.   Okay, so I've made things into links to do...I can do searches, can do map requests and so on. I can even   paste this speak into it and click on that and have it speak the name of the city. And so this I think gives great power to be able to store links in our data tables   that link out to web pages or do other things, anything you can express with ACL, you can do with those event handlers. So that can reduce a lot of pain.   So one of my favorite pain reduction techniques is broadcasting, and this is where you hold down the command key or the control key to do multiples, where you have a lot of analyses that are similar. Okay, so   this works not just for menu commands, but for many buttons and for resizing graph frames and doing other things, pasting things into graphs and so on.   So let's do an example of that. And let's click on...or just do a distribution on all these columns.   And let's say I wanted to get rid of   this box plot.   And if I just held down   that menu item, I would eliminate the box plot for that item. But what I want to do is eliminate the box plot for all items. 
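The "group similar values" recode option described above clusters values whose spellings differ by only a few characters. A rough Python sketch of the same idea, using difflib similarity with a hypothetical threshold (JMP's matching options, such as character-difference thresholds and matching to a reference list, are richer than this):

```python
# Rough sketch of grouping near-duplicate category labels (cf. Recode's
# "Group Similar Values"); the threshold and data are made up.
from difflib import SequenceMatcher

values = ["District Hospital", "Distrct Hospital", "district hospital",
          "Rural Hospital", "Rural Hosptal", "Clinic"]

def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

groups = {}  # representative label -> members
for v in values:
    for rep in groups:
        if similar(v, rep):
            groups[rep].append(v)
            break
    else:
        groups[v] = [v]

# The recode map sends each raw value to its group's representative label.
recode_map = {member: rep for rep, members in groups.items() for member in members}
print(recode_map)
```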
So what I'm going to do is hold down the command key and uncheck the box plot and now it's unchecked for all those things. I can now uncheck the quantiles   and it will uncheck for everything, because it has taken that command and broadcast at all the active objects in that window, and for those active objects that understand that command, the quantiles command for example, it will then obey that that command.   Now that even works for some things that have to do with prompting dialogues. So let's look at summary statistics. Let me hold down the command key and customize summary statistics.   And let's say I don't want the confidence limits, but I do want the number of missing values. And I'll say, okay, and it's going to apply that to all...   all the things. In order to do that, it's had to   take the results of that dialogue, make a script out of it and then broadcast that script to all the other places in that window.   Now that doesn't happen all the time. Sometimes when you get a prompt, you will get a prompt for every analysis. For every release we try   to implement a few more details, where the dialogue is done before the broadcast so it'll broadcast the same dialogue to every place available. And of course, everything else I can broadcast a change and it will change, you know, the size of all these items, or I can broadcast   changing the background color, and I can change all the background colors to orange in all the plots   that that seem relevant for that that frame box. So this broadcasting becomes a very powerful tool.   So,   that has saved a lot of pain for most of us, but there's still cases where you may have 40,000 by groups, as came in earlier this year, and he wants to do an inverse prediction on all 40,000 and it prompts you 40,000 times.   That will be fixed for version 16. That is a lot of pain relief for that one user.   Saving formula columns.   I love formula columns. When you fit a model you can save a column of predictions. But in that prediction, there's a live formula so that you can examine that formula, you can modify it, you can apply it to new data that comes in. You can profile it to the save formula. You can   if you have   JMP Pro, you can go to the formula depot and do a lot of other things. You can generate code from it and so on. So the ability to save formulas is an important thing, but sometimes   if you have by groups, for example, saving formulas has a lot of extra subtlety.   For example, let's fit, this is the diabetes data, where we fit the response   against all these predictors and but we want to do it separately for each gender.   So let's do that.   And so we have two by groups for gender one and gender two. And now let's say we want to save   the predicted value, in fact, for both of these. Now   in the old days before we subjected to this to a lot of pain relief methods,   what would happen when you saved it, it would go to a local data table. If you look in the data table window under redo, it shows the by group table.   So really for gender=1, there's 235 in this virtual data table here, which you can show for gender=1. There's a table variable here that shows the the by group for it.   And it would save it to this temporary table instead of the real table that you have that is really saved.   But   some time ago, when you saved the prediction formula, we save it to the real data table. 
And if I look at this, I just saved it for gender=1,   and if I look at the formula for that, it shows that if gender=1, that was the by group clause, then it has this linear combination that forms a prediction for that variable.   And then if I then do gender=2 or if I held down the command key to broadcast prediction formula, now it has both of them and I have both clauses available for gender 1 and gender 2. So it's adding a clause each time to that output by group.   By the way, the same thing happens when you save other things, when I save a column of residuals, and let's hold down the command key this time.   If I save a column of residuals, it will save it for each by group, it will save the residual appropriate for that by group. It's as if you subtracted predicted formula from the original response, but this time without being a formula column.   In almost all places in JMP, we haven't done it everywhere, but we've done it in most places, saving into a by group   or in some cases through a where clause, it will then save it to the original data table with the condition in the if statement. Okay, so that has   saved a lot of effort. If that hadn't been available, tt was saved in a temporary table and then you'd have to cut and paste that formula and make if tables to the original table. So we've solved these problems. Now let's suppose I saved it again, did that whole process again,   and   and did another thing, say with different variables, some of them removed,   and now saved it again.   So,   I save prediction formula, this time holding down the broadcast key, it's actually making a new one instead of saving into the old one and appends a 2 after it. Well, how does it know not to save it into the old one?   Well, each fit has an ID to it. And if I look at the properties   by ID, it has the number to it, a unique number which is regenerated for each fit, and so as long as it has a different by ID,   the different by groups will have the same ID with one by partition. But as long as it has the same by ID, it will save it into the same thing. Otherwise it will make a new place to form to save it.   Also, whenever we say predictive values,   especially prediction formulas,   it will create a attribute the, the creator and in some cases other information. In this case it has the target variable y, that's what is predicting and the creator fits least squares.   And then when you do other platforms such as if you have JMP Pro model comparison, then that model comparison will understand which which predicted values referring to which creator, in which target. And so it can keep track of all that information for those added value platforms.   So,   formula columns work by adding new clauses and other properties are used, including the by ID.   And the prediction clauses, if you save them to categorical variables, it saves a whole range of variables to have the probabilities for each response level, and all that is contained with all the metadata and needs. And that should save a lot of pain.   So,   removing effects from models.   Well, let's look at an example   of fitting   just fitting height by weight.   Well, here's an example of a high degree polynomial. This is a seventh order polynomial   that fits better than   sixth order polynomials and so on. But would you trust that fit?   It turns out that if you give the model a lot of flexibility by introducing high polynomial terms or just by introducing more variables, it gives a lot more opportunities for to fit. 
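To make the by-group "save prediction formula" behavior concrete: the saved column is a single formula that branches on the by variable, with one fitted clause per group. Here is a hedged Python sketch of that idea, fitting a separate least-squares model per gender on an invented table and saving one prediction column that applies the matching clause row by row; it is an analogy, not the JMP formula itself.

```python
# Sketch of a by-group prediction "formula": one fitted clause per by-group,
# applied according to each row's group (data and column names are invented).
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"gender": rng.choice([1, 2], 200),
                   "age": rng.uniform(20, 70, 200),
                   "bmi": rng.uniform(18, 35, 200)})
df["y"] = 2.0 * df["bmi"] + np.where(df["gender"] == 1, 1.5, 0.5) * df["age"] \
          + rng.normal(0, 5, 200)

# Fit y ~ age + bmi separately for each by-group and keep the coefficients.
coefs = {}
for g, sub in df.groupby("gender"):
    X = np.column_stack([np.ones(len(sub)), sub["age"], sub["bmi"]])
    coefs[g], *_ = np.linalg.lstsq(X, sub["y"], rcond=None)

def pred_formula(row):
    # Analogous to: If(gender == 1, <clause for group 1>, <clause for group 2>)
    b = coefs[int(row["gender"])]
    return b[0] + b[1] * row["age"] + b[2] * row["bmi"]

df["pred_y"] = df.apply(pred_formula, axis=1)
df["residual"] = df["y"] - df["pred_y"]   # residual saved without a formula, as in the demo
print(df.head())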
And in this case it's   it's allowed the flexibility to make a deep dive between the the previous data and the last data point just to fit that last data point better.   And so it's overfitting. It's way...it's using that parameter to fit noise instead of fitting the   the data itself.   And so overfitting is is a problem anytime you have big models. So   you want to fit the signal, not the noise. Okay, large models tend to introduce more variation into the prediction, because the prediction, after all, is   a function of the y variable they're using, but also that y variable is is is a systematic part of that variable plus the error part of that variable if that variable part   is random, then your prediction involves that that randomness. And if you allow it too much flexibility, it's going to end up with   an overfit problem that you're going to predict much worse by including all the variables in that model.   So the cure for that is to shrink the model, to reduce the size of the model or reduce the size of the coefficients in the model,   so that less of that variation from the random term of the model is transmitted into the predicted value. And so   in small DOE problems, this is not an issue because the data is small and the data is is well-balanced to fit exactly what model you're going for. But for observational data in any large models, overfitting is a real problem.   So users often didn't appreciate that until we introduced cross validation in JMP Pro. So if you have JMP Pro, you can set up a validation column,   which will hold back some of the data in order to estimate the error on that hold back data set. And here's an example where I have...I'm trying to predict the concrete properties   depending on all these ingredients in the concrete, and I have a huge model for it. And if I just run that model, but hold back some of the data and look at the R square on that,   I've for SLUMP, I have a great fit. I have an R squared 79, but on the validation set I fit, I have an R square that's negative.   Any R square that's negative means it fits worse than just fitting the mean. So if I'd fit the mean, I'd have an R square of zero. If I fit this whole model, I have an R square that's much worse than that. So, with large models, you can go worse than just fitting the mean it's it's worse than   It's kind of anti informative because the model is so big and we had no effort and cutting down the size of the model. The model has been dominated by the noise and not the signal. So this is a problem that you should pay attention to.   So, the important part is to be able to reduce the size of the model.   And now we did introduce a model dialog command to do that.   Let's go into the diabetes with with model.   I can run this model and if, let's say I want to take out a lot of these things are not very significant. You know, age   has a totally non significant contribution to the model, and so I want to eliminate age. Well then I would, I could go back to fit model and   recall it or and then eliminate age. Or there's, I can just go back to the model dialogue directly here and fit age and remove it, but I may have a long list of models here and going back and forth to do these things is a fair amount of work.   And so rather than do that...   let's see, what am I doing here.   Several releases ago, we introduced a new report called the effects summary.   And with effacts summary, it makes it trivially easy to subset the model to make it predict better,   give it less flexibility.   
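To see how an over-flexible model can score a negative R-square on held-out data (worse than predicting the mean), here is a small self-contained Python sketch with simulated data rather than the diabetes or concrete tables used in the demo. The flexible fit's validation R-square is typically far below its training R-square and can easily go negative.

```python
# Overfitting sketch: a 7th-degree polynomial vs. a straight line on held-out data.
# R^2 < 0 on the validation set means the model predicts worse than the mean does.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 18)
y = 2 * x + rng.normal(0, 0.4, 18)          # the true signal is linear

train, valid = np.arange(12), np.arange(12, 18)   # simple holdout split

def r2(y_true, y_pred):
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - sse / sst

for degree in (1, 7):
    beta = np.polyfit(x[train], y[train], degree)
    print(f"degree {degree}: "
          f"train R^2 = {r2(y[train], np.polyval(beta, x[train])):.2f}, "
          f"validation R^2 = {r2(y[valid], np.polyval(beta, x[valid])):.2f}")
```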
So I can say, well, age, I can remove that.   Or, well, there's lots of variables. Let's remove three more.   LDL, well, that looks more significant. And so I can save that model.   Let's just save it to the data table.   And I come back later and say, well, let me remember that model that has this.   I can actually undo the previous thing. And so I can undo that and it brings back   three of the variables and I can undo that and it brings back age. So it actually stores, when when I use the effects summary to edit the model, it actually stores a memory of all those things. And if I look at that script,   I can see this history thing. So every time I did effect model, it storing a clause of history. And when I do the undo's, it's undoing it back to the that history.   So removing is easy. I can also add things. If I subtracted these two things I could add it back. I could add back, say, age.   Not a good thing to do, but I can do it. I can also edit the model by bringing back a small version of the model thing and   of the model dialogue and add compound effects and so on.   Now another thing that happens with large models and   (let me undo this).   With large models,   you're doing a lot of hypotheses tests.   So if you have a large number of hypotheses tests, there's some adjustments that you should consider and one of them is called the false discovery rate. And so instead of treating all these p values as if they were independent tests,   I want to apply an adjustment so that those p values are adjusted for the multiple tests bias, for the selection bias involved in subsetting the model. And so I can apply a false discovery rate   correction to it. So instead of the regular p value, I have the false discovery adjusted p value for that. And now I'm being more realistic. So this is going to help me with the overfitting problem and the hypothesis...the selection bias problem in doing a lot of multiple test   things.   So,   now   let's   do the next topic and that's transforms.   Suppose   you want to do a model,   but   instead of y, you want to fit the log of y.   Well, before what you would do is create a new column in the data set,   Log y,   and then for Log y, you   specify a...   do a formula.   And I can take   a log of it.   And now I can go back to my model specification and do that Log y   and then I got my fit.   Oh, it's missing, what did I do?   Forgot to enter the   objective, the argument.   And so now I can   fit the log of Y   And now   I've done it. But let's do the profiler.   My profiler is in terms of the log of Y. Let me save the predicted value. My prediction formula is in terms of the log of y,   and now I'm going to have to go and hand-edit that...   that formula and take the exponential of that   to bring it back on on the original scale. So that's that's a lot of pain doing transforms that way.   Well, several releases ago   we introduced transforms.   So I can take that transform and   transform it to the log   and now use it directly there as a transform and it's not part of the data table. It's a virtual variable with a formula, but not added to the data table.   And now I can fit my model to the Log y. Let's   remove age so it fits a little better.   Whoops.   Had that selected too.   And now I've fit the response Log y.   So if I profile that, what do I get?   Factor profiling, profiler. Instead of profiling log of y, it profiles y. The profiler looks at that as a transform and says, well, I can invert that transform and go back to the original scale. 
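The false discovery rate adjustment discussed above is commonly the Benjamini-Hochberg step-up procedure. A small self-contained sketch of BH-adjusted p-values in Python, with the LogWorth (negative log10 of the adjusted p-value) that JMP-style reports sort by; the example p-values are made up.

```python
# Benjamini-Hochberg FDR-adjusted p-values (standard step-up procedure).
import numpy as np

def bh_adjust(pvalues):
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)                       # ascending p-values
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest rank down, then cap at 1.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.minimum(adjusted, 1.0)
    out = np.empty(m)
    out[order] = adjusted
    return out

raw_p = [0.0002, 0.009, 0.012, 0.04, 0.21, 0.6, 0.85]
adj_p = bh_adjust(raw_p)
logworth = -np.log10(adj_p)                     # LogWorth of the adjusted p-value
for rp, ap, lw in zip(raw_p, adj_p, logworth):
    print(f"raw p = {rp:<7} FDR p = {ap:.4f}  LogWorth = {lw:.2f}")
```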
And that's what it does.   It   untransforms, back-transforms through the log of y to take the prediction and put it on the original scale.   And it will do that for most transforms that can unwind. If a transform involves multiple variables and so on, it can't do it and it will just do the transforms. So the original   Same thing when I save a column. When I save the prediction formula,   it saves a prediction formula on the scale of yrather than the log of y.   And so these things are an incredibly time saving and saves a lot of effort in using transforms.   So there's...   Let me   go to diabetes again   and do another...consider another transform.   When you're doing variable,   there's a Box-Cox transformation that you can get.   And   among the factor profiling, the Box-Cox option   tells you if you transformed a whole range of power functions,   what, what would be the best to do? Should it be just untransform? That would be a Box-Cox lambda value of one.   If you did zero, that would be equivalent to taking a log of it. If you did -1, that would be equivalent to taking the reciprocal of it.   The power to the -1. If you did around .5, that would be equivalent to taking the square root of that.   And it's telling you that this model would fit better on a transform scale adjusted for that transform, if it was more along the square root transformation, where lambda was .453. That's the optimal value in that the Box-Cox transformation.   And you can zoom in on this   with a magnifier to get   to get it more precisely.   So,   So now I can transform, and several releases ago   I can...I added several columns. One is refit with transform, which will make a new window with the transform response.   And another is replacing transform and I'm going to do that. And rather than .453, I'm going to just take square root transformation (.5) and now I fit the model with that transform.   And and now lambda best is around 1, which is where it should be, because it's already transformed at once by Box-Cox transformation. And now I can save the predicted value of that and   profile it and so on. And I can even undo it. So if I don't like that transform and I want to go back to the original, I can go back to the original and refits.   So,   we've done a lot of pain saving   in transforming responses.   So now   there's a special pain, a special place of pain when you have a lot of data.   And we've gone to a lot of effort to try to solve big problems with less pain.   Whether you have lots of rows, lots of columns, lots of groups, many models to try,   in today's world we live in a world of big data with big problems.   So if we have analyses that were originally designed to handle small problems, it may not be appropriate for large ones, and the central problem with big problems is that there's just too much output to sort through.   If by fitting the same model to 1,000 variables, I have 1,000 reports to sort through. If I'm doing looking for outliers among 1,000 columns, there's just outliers, you know, separate reports for each column and so on. I want to be able to   more efficiently get through a lot of large data sets. So we developed the screening menu,   which is meant to solve these large kinds of problems. And plus, there are lots of places throughout JMP that solve large problems better as well. For example, time series. The new time series forecast platform will forecast many times series, instead of just one.   
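The same transform-and-back-transform pattern can be written by hand. A hedged sketch on simulated data (not the diabetes table): fit on the log scale, report predictions on the original scale, and let scipy's Box-Cox routine estimate the lambda that the Box-Cox plot in the demo reads off graphically.

```python
# Fit on a transformed scale, report predictions on the original scale
# (illustrative; JMP's transform columns and Box-Cox report automate this).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 200)
y = np.exp(0.3 * x + rng.normal(0, 0.2, 200))   # multiplicative noise: log scale fits better

# Straight-line fit to log(y), then back-transform the prediction with exp().
b1, b0 = np.polyfit(x, np.log(y), 1)
pred_original_scale = np.exp(b0 + b1 * x)

# Box-Cox: lambda near 0 suggests a log transform, 0.5 a square root, 1 no transform.
_, lmbda = stats.boxcox(y)
print(f"slope on log scale: {b1:.3f}, estimated Box-Cox lambda: {lmbda:.3f}")
```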
So the items on the screening platform explore outliers, explorer missing values, explore patterns. These are for doing checks of data and then things looking for associations, response screening,   process screening, predictors screening. And of course, the time series forecast is a new item. It's not in the screening menu, but it's organized for handling large problems.   And all these things take advantage, not just of more compact ways to represent the results so you don't have to the thousands of reports, but they're also computationally efficient. They use multithreading so it takes advantage of your   the multiple   cores in your CPU to make it very fast.   So let's say you got a lot of process data; you have 568 variables. So, which of these variables looks healthy? Well process screening is designed to answer that. And so it can sort by the stability or sort by the capability (Ppk) or which ones are bad off or sort by   control chart measures, out of control accounts and so on.   But what's even...I even like like better, are some of the tools that show all the processes in one graph. And there's two of them that I love. One is the   the goal plot that shows   how each process behaves with respect to the spec limits and if it's a capable processes it's in this green zone here.   If it's marginal, it's in the yellow zone. If it's not, it's in the red zone, and if it's high up, it has too much variance. If it's to the one side of the other, then there's a problem.   With...   It's off center, it's off target.   But if it's in the green zone then it's a good process, and with version 15 we introduce the   graphlets, the hover help   so you can see each process as you hover over it. And then the other   plot that I love in in summarizing all these these things and reducing the pain of looking through all these reports is this process performance graph. So on the vertical axis, it tells you whether you're within spec limits by the capability Ppk.   So if you're above this line at 1.33, then you're looking fine as far as the distribution of values with respect to the spec limits. You're well within the spec limits.   If...then the stability index is on the x axis. So if it's a pretty unstable process even if might be capable but unstable. So if you look at that process, it might have some   stability thing that wanders around some. And if you're in the yellow zone, you're   capable but unstable and so on. The red zone is the bad zone where you're both incapable and unstable. And so looking through hundreds or thousands of processes is easy now, where it used to be a lot of pain.   The question is, what changed the most? Here's some survey data where   over many different years, you asked a question about their activities. Did you   go camping? Or gamble in a casino? Or did you have a cold? You know, all these activities and you want to know which among all these activities (and there's   I think, 95 different activities) which of these made the most difference. And so you're looking for, you know...one thing you could do is go through one at a time   and fit that activity by year of survey.   Instead of looking at one at a time. I can look at response screening   and just look at one chart to see what changed the most.   And so this is showing a chart for all 95 of those things. First, the significance in terms of the negative log of the P value,   which we call the log worth, which is adjusted for the false discovery rate. 
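As a rough outside-JMP illustration of what process screening summarizes: for each of many process columns, compute a capability index (Ppk, from the overall sigma) and a stability-style ratio of overall sigma to short-term sigma estimated from the average moving range. That ratio is one common convention; JMP's exact definitions may differ. Column names, spec limits and the 1.33 flag threshold below are hypothetical.

```python
# Screening many process columns by capability (Ppk) and a stability-style index.
# Illustrative only; spec limits, thresholds, and data are made up.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 300
processes = pd.DataFrame({
    "p1": rng.normal(10.0, 0.2, n),                            # capable, stable
    "p2": rng.normal(10.0, 0.2, n) + np.linspace(0, 1.0, n),   # drifting
    "p3": rng.normal(10.6, 0.4, n),                            # off-center
})
specs = {"p1": (9, 11), "p2": (9, 11), "p3": (9, 11)}  # (LSL, USL), assumed

summary = []
for name, x in processes.items():
    lsl, usl = specs[name]
    overall_sigma = x.std(ddof=1)
    within_sigma = x.diff().abs().mean() / 1.128   # moving-range estimate (d2 = 1.128)
    mu = x.mean()
    ppk = min(usl - mu, mu - lsl) / (3 * overall_sigma)
    summary.append({"process": name,
                    "Ppk": round(ppk, 2),
                    "stability_index": round(overall_sigma / within_sigma, 2),
                    "flag": "review" if ppk < 1.33 else "ok"})

print(pd.DataFrame(summary).sort_values("Ppk"))
```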
And so it takes care of some of that selection bias because you're sorting all those those p values.   and selecting the behind...the low P values. And I find that renting a videotape changed the most, video cassette tape, and of course they became obsolete. So, of course, it changed.   Another video cassette tape changed a lot, collected things to recycling changed a lot, entertained people in my home changed a lot.   And here's the one that's less significant, but big effect size, do you use the internet. And of course this survey was started before there was an internet. And so, it changed a lot. So the question on what changed the most was   is easy now, where it used to be hard.   Another question you asked about big data, where are the missing values. So here are   280 variables.   Where are the missing values? Do I need to worry about them? And I can look at the explore missing values report and see that, well, there are only five of these variables with missing values.   And some of them only have one variable, one missing value, but some of them have a lot. So when I do an analysis, a multivariate analysis, I probably don't want to include   376 out of 452 variables. Or if I want to include J, I can go through   and impute those missing values   by doing that, okay.   So,   Next question, does the data have outliers?   Well, I have 387 measurements, I do in this process. And I want to find out if there are outliers in there. So now we have a facility to do it in the screening menu. I can   make this more or less sensitive.   Let's make it more sensitive   so there's fewer outliers and rescan.   I can look   at   the nines. So often, a string of nines is used to represent a missing value and those nines may be real nines, but they may be just an indicator of a missing value. And so I can, for those, I can say well add those nines to missing value codes and now the memory has changed.   And now I can   go back up (there's a lot of variables here) and rescan and there's fewer missing values to worry about.   So exploring outliers used to be a pain.   It still can be a big job, but it's a lot easier than it used to be.   Now, in version 15, we added another screening platform. Does the data have suspect patterns? And so here's some lab data from clinical trial. This is nicardipine lab patterns data. There's 27 laboratory results that I may want to look at. And so I invoke   the new platform, explore missing values. And this is going to show me, do I have a run of values? So I have the value .03536 but there's seven in a row, starting in row.   2065. I can colorize those cells, and I can look at those those values, and maybe it's the last value carried forward, which may not be suspicious. In this case, it's the...   it's the same person. So maybe last value carried forward is a reasonable thing to do, but it is a rare, rare event if you've distributed these things...   if, if you assume random distribution for these things.   Also longest duplicated sequence. So for this variable and we're starting in row 2816 and also starting in row 3034, there are four in a row that had the same values. So if I colorize those things, there's four in a row there. And if I   go to the next sequence, there's four in a row that have the same values. So that might be a symptom of cutting and pasting the same values from one place to another.   So explore patterns is looking for those things. 
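Two of the checks described above, runs of identical consecutive values and sentinel codes such as strings of nines standing in for missing values, are easy to sketch outside JMP. A minimal Python illustration on an invented column; Explore Patterns and Explore Outliers do considerably more than this.

```python
# Minimal sketch of two data checks: longest run of a repeated value, and
# sentinel codes (e.g., 999) that probably mean "missing". Data is invented.
from itertools import groupby
import numpy as np
import pandas as pd

values = pd.Series([0.031, 0.034, 0.035, 0.035, 0.035, 0.035, 0.029,
                    999.0, 0.028, 999.0, 0.033])

# Longest run of identical consecutive values (a long run can flag
# last-value-carried-forward or copy/paste artifacts).
runs = [(val, sum(1 for _ in grp)) for val, grp in groupby(values)]
longest_val, longest_len = max(runs, key=lambda r: r[1])
print(f"longest run: value {longest_val} repeated {longest_len} times")

# Treat the sentinel code as a missing value code, as in "add those nines
# to missing value codes" in the demo.
cleaned = values.replace(999.0, np.nan)
print(f"missing after recoding sentinel: {int(cleaned.isna().sum())}")
```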
And there are many other things; you can look at the details on each of those 27 variables and look for symptoms of suspicious things or bad effects of the way you processed the data and so on.   So, explore patterns is part of solving big problems.   So, pain relief. Much of our development is focused on making the flow of analyses smoother, less burdensome, less time-consuming and less painful. Analgesics.   Now we don't always know what's painful, so we depend on feedback for what to focus on. When we get those emails that say "I had to respond to a prompt hundreds of times,"   we listen to that, and we feel the pain, and we fix it, so that now you can broadcast into a by group with thousands of things   and broadcast the results of one dialogue rather than dialoguing many times. So sending it in made it better for everyone else, because we didn't catch it the first time around. So please give us your feedback.   With all the improvement we've already made, we think the process of data preparation and analysis has already become much smoother, much less interrupted, more in flow.   So, instead of spending your time getting over obstacles, you spend your time learning from your data, understanding your data.   Analgesics and analytics. One would wonder if they came from the same root. Of course, we don't like to abbreviate those two words.   Analytics comes from the Greek analyein, and I don't know how to pronounce that.   But in Greek, it means to break up, to release, or set free. And it's taking something complicated and breaking it up into pieces so we can understand it.   And that's, of course, exactly what we do with analyzing data, data science. And analgesics comes from a different combination of words: "an-" meaning without, and "algesis," which is the sense of pain. So same prefix, a little bit different roots.   And don't abbreviate those.   Now, anabolic and aphrodisiac. We care about power.   Much of what we do is to give you more powerful tools for analyzing data, much of that in JMP Pro, as well as in JMP.   And we hope that data makes it exciting, you know, the thrill of discovery.   The thrill of learning how to use power features in JMP, and we think it's exciting. You know, it's an aphrodisiac.   So power and excitement are also of value to us. It's not just pain relief.   So what are we going to do next year?   Well, big words starting with the letter B.   Start with A, next is B, right?   Well, next year we have JMP 16 coming, and so that's what we're going to talk about. Who knows what we're going to talk about the year after. So thank you very much. And we hope you suffer very little pain in analyzing your data. Jeff Perkinson Thank you very much, John. We appreciate that. It was a fantastic talk, and having been around to witness a lot of the pain over the years,   everything you say is absolutely true. Pain relief is an important thing for us.   We did have one question that came in; actually, a number of questions. We've answered some of them in the Q&A, but what I wanted to throw to you is: what feature has both relieved pain and provided some attraction and made JMP more powerful? John Sall Well, I think everyone's big delight is Graph Builder.   It feels incredibly powerful to just drag those variables over and do a few other clicks and you have the graph that you want, and you can change it so easily.   So it's a thrill. It's a powerful feature and it's pain relief; it used to be harder to do. So I think that's everyone's favorite thing.   
But of course there's...JMP is a rich product and we're proud of everything in it. Design of experiments, all   the great power involved in there and we've tried to make that process easy as well. And so many things come to mind. Jeff Perkinson Very good. Thank you very much, John, I appreciate it. If you have enjoyed this talk, I have two suggestions for you. One, we will be posting this
The purpose of this poster presentation is to display COVID-19 morbidity and mortality data available online from Our World in Data, whose contributors ask the key question: "How many tests to find one COVID-19 case?" We use SAS JMP Analyze to help answer the question. Smoothing test data from Our World in Data yields seven-day moving average, or SMA(7), total tests per thousand in five countries for which coronavirus test data are reported: Belgium, Italy, South Korea, the United Kingdom and the United States. Similarly, seven-day moving average, or SMA(7), total cases per million were derived using the Time Series Smoothing option. Coronavirus tests per case were calculated by dividing smoothed total tests by smoothed total cases and multiplying by a factor of 1,000. These ratios of smoothed tests to smoothed cases were themselves smoothed. Additionally, Box-Jenkins ARIMA(1,1,1) time series models were fitted to smoothed total deaths per million to graphically compare smoothed case-fatality rates with smoothed tests-per-case ratios.   Auto-generated transcript...   Speaker Transcript Douglas Okamoto In our poster presentation we display COVID-19 data available from Our World in Data, whose database sponsors ask the question: why is data on testing important? We use JMP to help us answer the question. Seven-day moving averages are calculated from January 21 to July 21 for daily per capita COVID-19 tests and coronavirus cases in seven countries: the United States, Italy, Spain, Germany, Great Britain, Belgium and South Korea. Coronavirus tests per case were calculated by dividing smoothed tests by smoothed cases and multiplying by a factor of 1,000. Daily COVID-19 test data yields smoothed tests per thousand in Figure 1. Testing in the United States, in blue, trends upward, with two tests per thousand daily on July 21st, 10 times more than South Korea, in red, which trends downward. The x-axis in Figure 1 is normalized to days since moving averages reached one or more tests per thousand. In Figure 2, smoothed coronavirus cases per million in Europe and South Korea trend downward after peaking months earlier than the US, in blue, which averaged 2,200 cases per million on July 21st, with no end in sight. The x-axis is normalized to the number of days since moving averages reached 10 or more cases per million. Combining tabular results from Figure 1 and Figure 2, smoothed COVID-19 tests per case in Figure 3 show South Korean testing, in red, peaking at 685 tests per case in May, 38 times the US performance, in blue, of 22 tests per case in June. Since the x-axis is dated, Figure 3 represents a time series. The reciprocal of tests per case, cases per test, is a measure of test positivity: one in 22, or 4.5%, positivity in the US compares with 0.15% positivity in South Korea and 0.5 to 1.0% in Europe. At a March 30 WHO press briefing, Dr. Michael Ryan suggested a positivity rate of less than 10%, or even better, less than 3%, as a general benchmark of adequate testing. JMP Analyze was used to fit Box-Jenkins time series models to smoothed tests per case in the US from March 13 to April 25; predicted values from April 26 to May 9 were forecast from a fitted model, an autoregressive integrated moving average, or ARIMA(1,1,1), model. In Figure 4, a time series of smoothed tests per case from mid-March to April shows a rise in the number of US tests per case, not a decline as predicted during the 14-day forecast period. 
In summary, 10 or more tests were performed per case, providing adequate testing in the United States. COVID-19 testing in Europe and South Korea was more than adequate, with hundreds of tests per case. Equivalently, the positivity rate, or number of cases per test, was less than 10% in the US, whereas positivity in Europe and South Korea was well under 3%. When our poster was submitted, the US totaled 4 million coronavirus cases, more than the European countries and South Korea combined. The US continues to be plagued by state-by-state disease outbreaks. Thank you.  
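The smoothing and ratio steps in the poster are straightforward to reproduce outside JMP. A hedged pandas/statsmodels sketch with invented data and column names; the ARIMA(1,1,1) call mirrors the Box-Jenkins model mentioned in the abstract, not the exact JMP fit.

```python
# Sketch of the poster's pipeline: 7-day moving averages, tests-per-case ratio,
# and an ARIMA(1,1,1) forecast. Data and column names are assumed.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(6)
dates = pd.date_range("2020-01-21", periods=180, freq="D")
daily = pd.DataFrame({
    "new_tests_per_thousand": np.abs(rng.normal(1.5, 0.3, 180)),
    "new_cases_per_million": np.abs(rng.normal(60, 20, 180)),
}, index=dates)

sma = daily.rolling(7).mean()                      # SMA(7) smoothing
# tests per thousand divided by cases per million, times 1,000, gives tests per case.
tests_per_case = (1000 * sma["new_tests_per_thousand"]
                  / sma["new_cases_per_million"]).rolling(7).mean()
positivity = 1 / tests_per_case                    # cases per test

# Box-Jenkins ARIMA(1,1,1) fit and a 14-day forecast on the smoothed series.
series = tests_per_case.dropna()
fit = ARIMA(series, order=(1, 1, 1)).fit()
print(fit.forecast(steps=14).tail())
```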
Pranjal Taskar, Formulation Scientist II, Thermo Fisher Scientific Brian Greco, Formulation Scientist I, Thermo Fisher Scientific Sabrina Zojwala, Formulation Scientist I, Thermo Fisher Scientific Kat Brookhart, Manager, Formulation & Process Development, Thermo Fisher Scientific Sanjay Konagurthu, Sr. Director, Science and Innovation, Drug Product NA Division Support, Thermo Fisher Scientific   Pharmaceutical tableting is a process in which an active moiety is blended with inert excipients to achieve a compressible mixture. This mixture is consolidated into the final dosage form: a tablet. The process of tableting considers different composition-related and process variables impacting quality attributes of the final product. This work focuses on using JMP software to identify main effects. An I-optimal, 19-run custom design was outlined with the factors being type and ratio of filler used (microcrystalline cellulose, mannitol vs lactose, categorical), percentage active spray dried dispersion loading (continuous), order and amount of addition (intragranular vs. extragranular, continuous), and ribbon solid fraction (continuous). The responses were outlined as bulk density, Hausner ratio, percentage fines, blend compressibility and tablet disintegration. The model evaluated with the main effects and second degree interaction terms. The data was evaluated using Standard Least Squares in the Fit Model function. Results determined that lactose provided the blend with a higher initial bulk density, however mannitol maintained bulk density post compression. Microcrystalline cellulose improved flow properties of the blend and high percentage intragranular addition provided material with higher bulk density and improved material flow.     Auto-generated transcript...   Speaker Transcript Pranjal Taskar All right. Thank you, Peter. So I'm going to get started now. Hello everyone. Today I'm going to talk about my poster. This poster is regarding systematic analysis of targeting, which includes effect of formulation and process variables on final quality attributes of my product. So delving into all the statistical analysis before that, I wanted to give a background about what exactly we're talking about. What is tableting? Tableting is a pharmaceutical process. Looking over in the introduction, I'm going to talk about what tableting is a little bit. It's a pharmaceutical process in which your active ingredient or active moiety (API) is blended with other excipients to form a free flowing good flowing blend and this blend is compressed into our final dosage form, which is a tablet. So in a lot of situations, there are some active moieties or APIs, as we would call it, that have a low bioavailability and that could be due to their crystalline nature. They're just too stable, too rigid in their ways. So our site kind of specializes into making this crystalline API, a little bit more soluble, little bit more reactive amorphous form and it makes it into like a more bioavailable form. And when we do that, we fortified this API by a polymer. This this intermediate that we form is a tablet intermediate called a spray dried intermediate, SDI. And this is what we basically use in our tablets as our active intermediate. But when you look at it, it has poor flow ability and it's extremely fluffy. So when you have to incorporate this API into your tablet, you need to have other pharmaceutical processes involved to make it more streamlined, to make the blend more flowable. So this is what we're going to do. 
In this study, we are going to identify our critical quality attributes, the variables that matter, or our dependent variables and then we are going to identify variables that impact our critical quality attributes, which are the composition of that tablet of that blend and then different process processing parameters that we used in us in tableting. Which of these are main effects? Are there any interactions? And then we'll use JMP to identify all of these main effect and interaction variables and try to catch out the tableting process basically. So this was the introduction. Moving on to the methods and objectives. So how do we do this? For this study we looked at a placebo formulation. There is no active product or actor moiety and we used a commonly used spray-dried polymer which is hypromellose acetate succinate. We spray dried it and made it into the fluffy blend that it usually is. And Figure 2 talks about our usual granulation tabulating process. So, what, what we do is basically have our spray-dried intermediate (SDI) blended along with other excipients using this blender. We move on to roller compaction, which is densification of this blend using these there are rollers right here and these rollers move slowly to densify the blend which goes into this hopper and you get ribbons out of the roller compactor. Now what you have done is you have made that fluffy material into densified ribbons and you mill it down using a comil. And you get granules. These granules are more dense and they are a lot better flowing than your API or your SDI. So looking at this entire process, there are a lot of variables that go in there that you need to change and look out for. So what are those variables? This diagram over here will identify different kinds of variables, the independent variable variables that go into the formulation and process. so The first variable would be a bit more base formulation related than the...rather than the process related. So it would talk about different types of ??? excipients that are used. And the ratio of these excipients that I used the percent of SDI loading, or active loading, and in our case, the placebo loading. And then the order of addition and the point of addition at where the SDI, or other excipients are loaded into the formulation. And then sorting process related parameters such as ribbon solid fraction, which basically talks about this equipment, the roller compactor and the speed at which the rollers and the spools move. We have also identified independent variables of our critical quality attributes that we look out for, which is bulk density of our blend, Which we look at before and after granulation and you have labeled it bulk density 1 and 2. Hausner ratio, which is again a ratio that depicts the flow of your blend and we also identify that before and after granulation, labeled as Hausner ratio 1 and 2. And the percent of fines that collect...are collected in the roller compaction process. And this is usually monitored after granulation. So all of these points out to talk about basically our method and why we chose our variables. What we did was we had an I optimal, 19-run custom design looking at all of these independent variables impacting on the dependent variables. And the way we analyze this model or the way we constructed effects, was that we looked at the main effect and the second degree interactions and we analyzed the data using the standard least squares personality in the fit model function. 
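The model description above (main effects plus second-degree interactions, fit with standard least squares) maps directly onto a formula-based least squares fit. Here is a hedged statsmodels sketch with hypothetical factor and response names standing in for the design columns described; the design itself would come from an I-optimal custom design, not the random values used to make this snippet runnable.

```python
# Sketch of a standard least squares fit with main effects plus all two-factor
# interactions (column names are placeholders for the design factors described).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 19  # the study used a 19-run I-optimal custom design
design = pd.DataFrame({
    "filler": rng.choice(["mannitol", "lactose"], n),
    "sdi_pct": rng.uniform(20, 40, n),
    "intragranular_pct": rng.choice([75, 95], n).astype(float),
    "solid_fraction": rng.uniform(0.5, 0.7, n),
})
design["bulk_density_2"] = (0.5 + 0.002 * design["intragranular_pct"]
                            - 0.003 * design["sdi_pct"]
                            + 0.1 * design["solid_fraction"]
                            + rng.normal(0, 0.02, n))

# "(a + b + c + d)**2" expands to all main effects plus two-way interactions.
model = smf.ols(
    "bulk_density_2 ~ (C(filler) + sdi_pct + intragranular_pct + solid_fraction)**2",
    data=design,
).fit()
print(model.summary())
```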
So, Identifying the process and the objectives, we will move on to results, but before doing that really quickly, I wanted to look at the JMP window which I have pulled up right now. These different columns are my independent and dependent variables and I'm going to highlight right here, these are the different independent variables that we are going to be looking at. So type of filler, which is the type of inert excipient and we have looked at mannitol and lactose. percent SDI, which is the active or in our case placebo loading, looking at highs and lows away here; and amount intragranular, so the amount of our excipients that we add before the roller compaction versus after the roller compaction and outline here are 75 and 95; and mannitol and lactose, which is a filler to MCC, which is micro course design cellulose ratio. Mannitol lactose are, I would say a little bit more excipient and MCC is more ???, gives more strength to the blend. So we have looked at a ratio of this to see how it impacts our tableting blend overall. And on the right are our responses. Bulk density 1, Hausner ratio 1, which is before granulation. Bulk density 2 and Hausner ratio 2, which is after granulation, and percent fines. So I'm gonna go over here quickly into this window and look at how we created our model, our response variables y, that I just talked about. And then our model effects which are secondary interactions and main effects. Standard least squares. That's what we used and I run the model. This is my effect summary right here and based on this data that we're looking at and prior experience, I'm going to take off the last two effects. Just remove that extra noise and then over here, I have my responses and how the data kind of impacts these responses. It would be just easier if we go down and look at the prediction profiler over here. And how all of these dependent variables are impacted by this. So I think it might just be easier if I pull up my poster and... Alright, so looking at the results over here, what we found out from Figure 3 was that, look, the two fillers lactose had higher bulk density initially, but post ruler compaction, the bulk density two of these fillers dropped and you can see a corresponding increase in the fines. So what we think would have happened is that lactose is more brittle in comparison to mannitol. And this generated all of that attrition and that fines and that impacted the flow, making it less bulky, drop in the bulk density. And the Hausner ratio, a little bit higher with the lactose. So basically, what we're doing is targeting a higher bulk density and we want a lower Hausner because a lower Hausner indicates a better flowing blend. So looking at the data, mannitol had a slight edge over lactose as a filler. And the, the second point would be talking about the solid fraction and overall we saw that there was a slight plateauing effect at around .6 solid fraction. Overall, we see that .7 has the least number of fines, which is why we see a recommended .7 with a maximum and desirability, but the plateau effect in terms of your flow properties (bulk and Hausner) start bottoming out at around .6 and onwards. that having lower SDI in general in the formulation had overall better flow properties. Just because the SDI, it's fluffy and it causes the blend to flow a lot worse. So the design just suggested us to have lower SDI loading. 
The next observation was that a higher amount of that excipient added in an intragranular fashion rather than extragranular is better, just because it improves your bulk density and gives a lower Hausner ratio, which means that your blend is flowing more smoothly. We also observed that, for the mannitol/lactose-to-MCC ratio, having more of that component was more desirable, and I say that because overall the fines dropped in the presence of a little bit more of the mannitol/lactose component, and that could be the reason why we are seeing this. We also have, in Figure 4, a couple of surface plots of a few interesting trends that I saw. In Figure 4A, you can see that having a higher SDI loading and having more amount intragranularly resulted in this hotspot right here of a very high Hausner ratio. So what this says is, basically, when you have more of that fluffy material intragranularly, your flow is going to be bad; but the corresponding picture after granulation is that, when you again have more of your excipient intragranularly and you're targeting a solid fraction of about .6 and above, your bulk density improves. So basically, post granulation, your blend is getting denser, and this is what these two diagrams show. So all of the result points basically cover the things I discussed right now. Overall, we conclude from our study that, in order to optimize this process and maximize desirability for formulations: one, a higher ratio intragranularly and a lower SDI loading would be a preferable formulation; and targeting a solid fraction of around 0.6 would also be beneficial to the formulation. Thank you very much. I would welcome your questions.
Carlos Ortega, Project Leader, Avantium Daria Otyuskaya, Project Leader, Avantium Hendrik Dathe, Services Director, Avantium   Creativity is at the center of any research and development program. Whether it is a fundamental research topic or the development of new applications, the basis of solid research rests on robust data that you can trust. Within Avantium, we focus on executing tailored catalysis R&D projects, which vary from customer to customer. This requires a flexible solution to judge the large amount of data that is obtained in our up to 64-reactor high-throughput catalyst testing equipment. We use JMP and JSL scripts to improve the data workflow and its integration. In any given project, the data is generated in different sources, including our proprietary catalyst testing equipment — Flowrence® —, on-line and off-line analytical equipment (e.g., GC, S&N analyzers and SimDis) or manual data records (e.g., MS Excel files). The data from these sources are automatically checked by our JSL scripts, and with the statistical methods available in JMP we are able to calculate key performance parameters, elaborate key performance plots and generate automatic reports that can be shared directly with the clients. The use of scripts guarantees that the data handling process is consistent, as every data set in a given project is treated the same way. This provides seamless integration of results and reports, which are ready to share on a software platform known to our customers.     Auto-generated transcript...   Speaker Transcript Carlos Ortega Yeah. Hi, and welcome to our presentation at the JMP Discovery Summit. Of course, we would have liked to give this presentation in person, but under the current circumstances, this is the best way we can still share the way we are using JMP in our day-to-day work and how it helps us rework our data. However, the presentation in this way, with the video, also has an advantage for you as a viewer, because if you want to grab a coffee right now you can just hit pause and continue when the coffee is ready. But looking at the time, I guess the summit is right now well under way, and most likely you have heard already quite some exciting presentations on how JMP can help you make more sense out of your data, to apply statistical tools to gain deeper insight and dive into more parts of your data. However, what we want to talk about today (and this is also hidden under the title about data quality assurance) is the scripting engine — everything which has to do with JSL scripting — because this helps us a lot in our day-to-day work to prepare the data, which are then ready to be used for data analysis. And by "we" I mean Carlos Ortega, Daria Otyuskaya, and myself, whom I now want to introduce a bit, so that you get a better feeling for who's doing this. But of course, as usual, there are some rules to this, which are the disclaimer about the data we are using. And if you're a lawyer, for sure you're going to press pause to study this in detail; for all other people, let's dive right into the presentation. And of course nothing better than to start with a short introduction of the people. You see already the location we all have in common, which is Amsterdam in the Netherlands, and we all have in common that we work at Avantium, a provider of sustainable technologies. However, the locations we are coming from are all over the world.
We have, on the one hand, on the left side, Carlos Ortega, a chemical engineer from Venezuela, who has lived in Holland for about six years and has worked at Avantium for about two years as a project leader in services. Then we have on the right side Daria Otyuskaya from Russia, also working here for about two years and having spent the previous five years in the Benelux area, where she did her PhD in chemical engineering. And myself — I have the only advantage that I can travel home by car, as I originate from Germany. I have lived in Holland for about 10 years and joined Avantium about three years ago. But now, let's talk a bit more about Avantium. I just want to briefly lay out the things we are doing. Avantium, as I mentioned before, is a provider of sustainable technologies and has three business units. One is Avantium Renewable Polymers, where we actually develop a biodegradable polymer called PEF, which is one hundred percent plant based and recyclable. Second, we have a business unit called Avantium Renewable Chemistries, which offers renewable technologies to produce chemicals like MEG or industrial sugars from non-food biomass. And last but not least, a very exciting technology where we turn CO2 from the air into chemicals via electrochemistry. But I won't talk too much about these two business units, because Carlos, myself and Daria are all working in Avantium Catalysis, which was founded 20 years ago and is still the foundation of Avantium's technology innovations. We are a service provider, accelerating the research in your company — in catalyst research, to be more specific. And we offer there, as you can see on the right hand side, systems, services and a service called refinery catalyst testing, and we really help companies to develop their R&D, as you see at the bottom. But this is enough about Avantium. Let's talk a bit about how we are working in projects and how JMP can actually help us there to accelerate things and get better data out of it, which Carlos will later on show in a demo for us. As mentioned before, we are a service provider, and as a service provider we get a lot of requests from customers to develop a better catalyst or a better process. And now you might ask yourself, what's a catalyst? A catalyst is actually a material which participates in a reaction when you transform A to B, but doesn't get consumed in the reaction. The most common example, which you can see in your day-to-day life, is the exhaust gas catalyst which is installed in your car, which turns exhaust gases from your car into CO2 and water. And these are the things which we get as requests. People come to us and say, "Oh, I would like to develop a new material," or things like, "I have this process, and I want to accelerate my research and develop a new process for this." And when we have an experiment in our team, we are designing experiments; we are trying to optimize the testing for this, and for all of that we use JMP — but this is not what we want to talk about today. Because, as I said before, we are using JMP also to actually merge our data, process them and make them ready, which is the two parts you see at the bottom of the presentation.
We are executing research projects for customer in our proprietary tool called Flowrence, where the trick is that we don't experiment...don't execute tests, one after another, but we execute in parallel. Traditionally, I mean, I remember myself in my PhD, you execute a test one reactor after another, after another, after another. But we are applying up to 64 reactors in parallel, which makes the execute more challenging but allows a data-driven decision. It allows actually to make more reliable data and make them statistically significant. And then we are reporting this data to our customers, which then can either to continue in their tools with their further insights or completely actually rely on us for executing this data and extracting the knowledge. But yeah, enough about the company. And now let me hand over to Carlos, which will explain how JMP and JMP script actually helps us to make us our life significantly easier. Thank you, Hendrik,for the nice introduction. And thank you also for the organizers for this nice opportunity to participate in the JMP discovery summit. So as Hendrik was mentioning, we develop and execute research projects for third parties. And if we think about it, we need to go from design of experiments (and that's of course one very powerful feature from JMP), but also we need to manage information and in this case, as Hendrik was was mentioning, we want to focus on JSL script that allows us to easily handle information and create seamless integration of a process workflows. I'm a project leader in the R&D department and so a day...a regular day in my life here would look something like this. And so very simplistic view. You would have clients who are interested and have a research question and I design experiments and we execute these in our own proprietary technology called Flowrence. So in a simple view the data generated in the Flowrence unit will go through me after some checks and interpretation will goes back to the client. But the reality is somewhat more complex and on one hand, we also have internal customers. That is part of...for example our development team...business development team. And on the other side, we also have our own staff that actually interacts directly with the unit. So they control how the unit operates and monitor everything goes according to the plan. And the data, as you see here with broken lines, the data cannot be struck directly from the unit. The data is actually sent to a data warehouse and then we need a set of tools that allows us to first retrieve information, merge information that comes from different sources, execute a set of tasks that go from cleaning, processing, visualizing information, and eventually we export that data to the client so that the client can get the information that they actually need and that is most relevant for them. If you'll allow me to focus for one second on these different tasks, what we observed initially in the retrieve a merge is that data can actually come from various sources. So in the data warehouse, we actually collect data from the Florence unit, but we also collect data from the analyzer. So for those that they're performing tests in a laboratory, you might be familiar with the mass spectrometry or gas chromatography, for example, and we also collect data on the unit performance. So we also verify that the unit is is behaving as expected. In...as in any laboratory, we would also have manual inputs. 
And these could be, for example, information on the catalysts that we are testing or calibrations of the analytical equipment. Those manual inputs are, of course, always stored in a laboratory notebook, but we also include that information in an Excel file. And this is where JMP is actually helping us take the workflow of information to the next level. What we have developed is a combination of an easy-to-use, widely known Excel file with the powerful features of a JSL script. Not only do we include manual data that is available in laboratory notebooks, but we also include in this Excel file formulas that are then interpreted by the JSL script and executed. That allows us to calculate key performance parameters that are tailored, or specifically suited, to different clients. If we look in more detail into the JSL script — and in a moment I will go into a demo — you will observe that the JSL script has three main sections. One section prepares the local environment. So on one side we say we want to clear all the symbols and close tables, but probably the most important feature is when we define "Names Default To Here." That allows us to run parallel scripts without having any interference between variables that are named the same in different scripts. Then we have a section, collapsed in this case so that we can show it, that creates a graphical user interface. The user does not interact with the script itself, but actually works through a simple graphical user interface with buttons that have descriptive button names. And then we have a set of tasks that are already coded in the script, in this case in the form of expressions. That has two main advantages: one, it's easy to later hook them up to the graphical user interface; and second, when you have an expression, you can use that expression several times in your code. OK, so moving on to the demo. I mentioned earlier that we have different sources of data. On one side we have data that is in fact stored in our database, and this database will contain different sources of information, like the unit or different analyzers. In this case, you see an example Excel table, only for illustration. This data is actually taken from the data warehouse directly with our JSL script, so we don't look at this Excel table as such; we let the software collect the information from the data warehouse. And probably what is most important is that this data, as you see here, can come again from different analyzers, and it is structured so that the first column contains the variable names — in this case, we have made up some dummy names for reasons of confidentiality — and you will also see that all the observations are arranged in rows. So every single row is an observation. And depending on the type of test and the unit we are using, we can collect up to half a million data points in one single day. That depends of course on the analyzer, but you are immediately faced with the amount of data that you have to handle, and a JSL script that helps you process information can help you with this activity. Then we also use another Excel file, which is also very important, which is an input table file. And these files, together with the JSL script, are the ones creating the synergy that allows us to process data easily.
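For readers who have not scripted in JMP before, the three-section layout described here looks roughly like the sketch below. The window title, button labels and placeholder tasks are illustrative only, not Avantium's actual code.

Names Default To Here( 1 );                      // keep this script's variables separate from other scripts
Clear Symbols();
Close All( Data Tables, No Save );               // start from a clean environment

// Tasks are wrapped as expressions so they can be reused and attached to buttons
retrieveData = Expr(
	dt = Open( "$DESKTOP/example_raw_data.jmp" ) // placeholder for the real data-warehouse pull
);
exportData = Expr(
	Write( "Exporting the client-facing columns..." )  // placeholder task
);

// A small graphical user interface with descriptive button names
New Window( "Project workflow",
	Button Box( "Retrieve and merge data", retrieveData ),
	Button Box( "Export client table", exportData )
);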
What you see in this case, for example, is a reactor loading table and we see different reactors with different catalysts. And this information that seems... is not quantitative, but the qualitive the value is important. And then if we move to a second tab, and these steps are all predefined across our projects, we see the response factors for the analyzers. Different analyzers will have different response factors and it's important to log this information into use through the calculations to be able to get quantitative results. In this case, we observed that the condition that the response factors are targeted by condition instead. Then we have a formula tab. And this is probably a key tab for our script. You can input formulas in this Excel file. You make sure that the variable names are enclosed into square brackets. And the formula, you can use any formula in Excel. Anyone can use Excel; we're very much used to it. So if you type a formula here, that follows ??? syntax in Excel, it will be executed by our JSL script. Then we also included an additional feature we thought it was interesting to have conditionals. And for the JSL script to read this conditional, the only requirement is that the conditionals are enclosed in braces. There are two other tabs I would like to show you, which are highly relevant. One is a export tables tab and the reason that we have this table is because we generate many columns or many variables from my unit, probably 500 variables. But actually the client is only interested in 10, 20 or 50 of them. Those are the ones that really add value to their research. So we can input those variables here and send it to the client. And last but not least, I think many of us have been in that situation where we send an email to a wrong address and that can be actually something frightening when you're talking about confidential information. So we always double, triple check the email addresses and but does it...is it really necessary? So what we are doing here is that we have one Excel file that contains all manual inputs, including the email address of our clients. And these email addresses are fixed so there is no room for error. Whenever you have run the JSL script the right email addresses will be read and the email will be created and these we will see in one minute. So now going into the JSL script, I would like to highlight the following. So the JSL script is initially located in one single file in one single folder and the JSL script only needs one Excel file to write that contains different tabs that we just saw in the previous slide Once you open the JSL script, you can click on the run script button and that will open the graphical user interface that you see on the right. Here we have different options. In this case we want to highlight the option where we retrieve data from a project in that given period. We have selected here only one day this year, in particular, and then we see different buttons that allows us to create updates, for example. Once we have clicked on this button, you will see to the left on the folder that two directories were created. The fact that we create these directories automatically help us to have harmony or to standardize how is a folder structured also across our projects. If you look into the raw database data, you will see the two files were created. One contains the raw data that comes directly from the data warehouse. 
And the second, the data table contains all merge information from the Excel file and different tables that are available in the data warehouse. The exported files folder does not contain anything at this moment, because we have not evaluated and assessed the data that we created in our unit is actually relevant and valuable for the client. We do this, we are, we ??? and you see here that we have created a plot of reactor temperature versus the local time. And different reactors would be plotted so we have up to 64 in one of our units. And in this case we color the reactors, depending on the location on the unit. Another tab we have here, as an example, is about the pressure. And you see that you can also script maximum target and minimum values and define, for example, alerts to see if value is drifting away. The last table I want to show is a conversion and we see here different conversions collapsed by catalyst. So once we click the export button, we will see that our file is attached into an email and the email already contains the addresses...the email addresses we want to use. And again, I want to highlight how important it is to send the information to the right person. Now this data set is actually located into the exported files folder, which was not there before. And we always can keep track of what information has been exported and sent to the client. With this email then it's only a matter of filling in the information. So in this case, it's a very simple test. So this is your data set, but of course we would give some interpretation or gave maybe some advice to the client on how to continue the tests. And of course, once you have covered all these steps you will close the graphical user interface and that will also close all open tables and the JSL script. Something that I would like to highlight at this point is that these workflow using a JSL script is is rather fast. So what you saw at this moment, of course, it's a bit accelerated because it's only a demonstration, but you don't spend time looking for data and different sources, trying to merge them with the right columns. All these processes are integrated into a single script and that allows us to report to the client on a daily basis amounts of data that otherwise would be would...would not be possible. And the client can actually take data driven decisions with a very fast pace. That's probably the key message that I want to deliver with with this script that we see at this moment. Now, well, I would like to wrap up the presentation with with some concluding remarks and some closing remarks. And so on one side, we developed a distinctive approach for data handling and processing. And when we say distinctive it's because we have created a synergy between an Excel file that most people can use because you are very familiar with Microsoft Office and a JSL script which doesn't need any effort to run. So you click Run, you will get a graphical user interface and a few buttons to execute tasks. Then we have a standardized workflow. And that's also highly relevant when you work with multiple clients and also also from a practical point of view. For example, if one of my colleagues would go on holiday, it will be easy for another project leader for myself, for example, to take over the project and know that all the folder structures are the same, that all the scripts are the same and the buttons execute the same actions. 
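One piece of that standardized workflow is worth sketching in code: turning an Excel-style formula whose variable names are enclosed in square brackets into a JMP column formula. The snippet below is a minimal illustration under assumed table, column and KPI names, not the actual Avantium script.

// Illustrative only: convert "[Var]" references into JSL column references, then build a formula column
dt = Data Table( "merged project data" );                                  // hypothetical table name
fstr = "[Flow_out] / [Flow_in] * 100";                                     // as typed in the input Excel file
jslStr = Regex( fstr, "\[([^\]]+)\]", ":Name(\!"\1\!")", GLOBALREPLACE );  // [Var] -> :Name("Var")
Eval( Eval Expr(                                                           // embed the parsed expression as the column formula
	dt << New Column( "Conversion KPI", Numeric, "Continuous",
		Formula( Expr( Parse( jslStr ) ) ) )
) );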
Finally, we can guarantee seamless integration of data, and these fast updates of information — with thousands or even half a million data points per day — can be quickly sent to clients, which allows them to make almost online, data-driven decisions. In the end, our purpose is to maximize customer satisfaction through a consistent, reliable and robust process. Well, with this, I would like to thank again the organizers of this Discovery Summit and, of course, all our colleagues at Avantium who have made this possible, especially those who have worked intensively on the development of these scripts. If you are curious about our company or the work we do in catalysis, please visit one of the links you see here. And with this, I'd like to conclude. Thank you very much for your attention, and we look forward to your questions.
Martin Kane, Managing Scientist, Exponent   Analytical methods for pharmaceutical development often require the use of dose-response curves and the fitting of an appropriate statistical model. Common functions used are the Rodbard and Hill function: different parameterizations of four parameter logistic functions. This presentation will discuss using JMP and the JMP Scripting Language to fit these non-linear functions, even when then are ill-behaved. Real-world data will be used to demonstrate how to use JMP’s various non-linear fitting routines and possible methods of dealing with messy data.     Auto-generated transcript...   Speaker Transcript mkane Okay. Hi everybody, my name is Martin Kane. I'm with Exponent and I'm here to talk about dose-response curve fitting for ill-behaved data here at the 2020 JMP Discovery Summit. First of all, I'd like to thank the conference advisory committee for inviting me to give this talk. And I really appreciate the opportunity and a chance to share the learnings that I've done through JMP. I use JMP all the time every day. And as a consultant, it becomes my primary tool for performing analysis. So this is something that I have been working on recently and thought it would be a good thing to share. So let's get into this. First of all, there's a disclaimer. The ideas in these slides belong to me, Martin Kane, do not necessarily represent those of my company Exponent. So with that being said, what are we going to talk about? So what are dose-response models? What shape do they often follow? Typical statistical models for those. How do we access these models in JMP? Difference between curve fit and nonlinear. What are the benefits of each? And what are the drawbacks of each? That's an area I will spend some time on. And we'll talk about initial values as well and the importance of having good solid initial values for these nonlinear models. I have a demonstration and I will then use the data in that demonstration to look at it ill-behaved data. What does that mean, and what can we do about it using the curve fit and nonlinear platforms in JMP? Okay. Dose-response models. So they can come in both linear and nonlinear formats. Typical models, though, are based on what we call the 3, 4 or 5 parameter logistic models. Those are very typical and there are many of them; there aren't just three. So what are the shapes of some of these models? Obviously linear, linear, excuse me, is a straight line, just a standard regression where we've plotted some sort of response concentration often versus...our log concentrations on the x axis versus the y axis, which is some sort of a response. And I will talk about it in just the next slide, but oftentimes this is based on some sort of fluorescence. And those values can be quite large, in terms of their range. So it's not uncommon to take the log values of those as well. We can have the exponential type model. This might work in some portion of the dose-response curve, but often is not sufficient for the entire curve. But more common than not is some form of... some sort of parameter logistic, and this example shows the four parameter logistic, the Hill function. This can also be called in a slightly different orientation, the Rodbard function. There's several different versions that have this same general shape, where we have some sort of upper asymptote, some sort of lower asymptote. There is a center point along this curve, somewhere halfway between the upper and lower asymptote, which we call the EC50 or IC50 value. 
That point is on the x axis; that's how we use it. And then we also have a slope to this curve, and this slope is here in this linear section — it's not truly linear, but it's close to linear. The slope is the fourth parameter, and at the bottom, that's the a parameter in this equation. d is our upper asymptote, that's the top; c is the lower asymptote at the bottom; I mentioned a was the slope; and b is the IC50 or EC50 value, the halfway point, and the x-axis concentration for that. So this is the equation of a typical 4 parameter logistic function. Okay. So, in the world of biologics and pharmaceuticals, a lot of the standard test method format is based on what's called an assay, and assays themselves are nothing more than a test method for biologics, for pharmaceuticals. In this particular case, what we see here is a standard 96-well plate that's used for these types of assays. Each of these little circles represents a well — a little divot in a plastic plate where materials can be put. And so we can fill up all 96 wells, which will have some sort of binding agent on the bottom of the wells and some sort of fluorescent material in them as well. Once these are put under a certain wavelength of light, they will fluoresce, and depending on how much binding takes place, you'll get different fluorescence values. And like I said, those fluorescence values can go anywhere from 10 to 100,000 or maybe even a million. It just depends on the format; it can be quite large, though. Typically on a plate we will put seven or eight different concentrations of a curve, and the curve would be, as we showed in the previous slide, representative of one single material with various amounts for the concentrations. Typically we start at the top of the plate, where we put the highest concentration, and we might serially dilute that concentration down into the wells below it. So if the top starts out at, say, a value of 16, we might do a 4-to-1 serial dilution, and so we end up with four in the next row, then one, one-fourth, one-16th, and so on down the line. So we end up with serially diluted material going down the plate. Oftentimes there are duplicates, so columns five and six might have the same material, just in there twice, and that's so that we can get some form of variability in our curve. Oftentimes as well, when we're running an assay for a biologic or pharmaceutical, we will test multiple doses at the same time. The point there is that on the same plate, at the same time, we have various doses that we're trying to compare to one another. So for instance, columns 11 and 12 might have one dose, call it one milligram per kilogram, and columns 9 and 10 might have a different dose at, say, .1 milligrams per kilogram, and we might want to compare those different doses at the same time. The other thing to mention is JMP has the ability to test for what's called parallelism — I'll talk about that — and built in, there are functions for testing parallelism using either the F test or the chi-square method. Okay, so let's take a look at some data in JMP and get right into it. So here we go. Right over here on the left I have a JMP journal that I'm going to use for this demo, and this is for nonlinear bioassay materials and for ill-behaved data. The two platforms, as I mentioned earlier, that I will be discussing are the curve fit and the nonlinear platforms. So let's open up our sample data. Let's pull up some sample data.
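For reference, the curve just described can be written compactly. This is one common parameterization consistent with the parameter names used above (c = lower asymptote, d = upper asymptote, a = slope or Hill coefficient, b = EC50/IC50 on the concentration axis); the exact parameterization JMP uses may differ in sign or scale conventions.

f(x) = c + \frac{d - c}{1 + (x/b)^{-a}}

At x = b the response is exactly (c + d)/2, which is why b is read off as the EC50 or IC50. On a log-concentration axis the same curve can be written as f(x) = c + (d - c)/(1 + \exp(-a\,(\log x - \log b))).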
Now I initially had wanted to use an actual data from a client, they declined to let me do that. But the data that JMP has built in, in the bio assay sample data set, works just fine. It's very similar to what I would have used and we can use that. So first of all, let's take a look and see what we have. We have some sort of concentration, as I mentioned, the serial dilution. In this particular case, it looks like each row down is three-fourths of the row above it. There's some sort of log concentration. That's just log 10 of the concentration. Formulation looks like it has various formulations, or those could just as easily be doses. And toxicity, that may be the y value, our response could be fluorescence or log or fluorescence, something like that. So if we take a look at this data using Graph Builder, just because it's easy, we put toxicity in the y axis and maybe we put the log concentration on the x axis. We can see that there is a similar looking function to that 4 parameter logistic that I mentioned earlier, except it's reversed in terms of its direction. That's not a problem. The cubic spline that's used fits the data quite well. We can remove that and just look at the data. Now, obviously there's a lot of data here. Looks like there's four values per concentration. Oh, there was a formulation that we haven't talked about yet. And that, we could take that and we could do various things, right, in Graph Builder. We can put that in group y, and we can get four different curves out of this. Let's use cubic spline. standard, Test A, Test B, and Test C. Now you can change the colors. Those are harder to see because we have a single curve fit. We could also just take it and put it in the overlay area, which is the most common area that I typically use, and what this does is this actually fits an individual cubic spline for each of my various formulations. That's kind of nice. And we can see that three of them are quite similar, except for one is different. The green one. The green one is Test B. Okay, so let's remember that, Test B is different than the others. And standard is just one of the various formulations that are being looked at. So it looks like they're trying to compare three different tests against the standard, which is interesting. Okay, so I'm going to close this down. Now what sort of curve fit functions do we have for these nonlinears? So under analyze, specialized modeling, you have two that we can use. One is called curve fit and one is called nonlinear. Both of these will work for nonlinear data. And let's start with the curve fit function. So in the curve fit, we want to put some sort of y response toxicity in our y, and log concentration for our regressor. And initially, if I just say okay, what we get looks just like what we had in Graph Builder. There's one exception here and that exception is I can come up here under the red triangle linear, quadratic, cubic and so on. Sigmoidals logistics, probits, Gompertz. It doesn't tell you what the functions really look like or what their equations are. You have to know the right one. Well, I happen to know that I want the Hill function, and that's hidden here in the sigmoid curves, logistic curves, and here it is, fit logistic 4 parameter Hill. That's the function that I would like to use. So I can click on that and I get what looks a lot like what I had in Graph Builder, except now I actually have parameter estimates down at the bottom. 
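As a side note, the same launch can be scripted in JSL in one step. This is a rough sketch only: the fit message name and the sample table's column names below are from memory and may differ slightly by JMP version, so treat them as assumptions to check against the Scripting Index.

dt = Open( "$SAMPLE_DATA/Bioassay.jmp" );   // assumed name of the built-in bioassay table
dt << Fit Curve(
	Y( :Toxicity ),
	X( :Log Conc ),                         // assumed column names
	Group( :Formulation ),
	Fit Logistic 4P Hill()                  // the menu item chosen in the demo
);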
Remember, we had a lower asymptote; an upper asymptote; a growth rate, which is the slope; and the inflection point, that's the EC50 value, that's the point halfway between the top and the bottom on the x axis. And those are the estimates. This is nice, but I really want separate graphs for each of my different formulations, so I'm going to redo this. I'm going to relaunch this except I'm going to add formulation. Now I could put formulation in the by category or in the group category. If I put in the by category and I click OK, I get four separate curves. That's not bad. And if I hold down my control key and I click on the red triangle for any of them, and I go to sigmoid logistic, 4 parameter Hill, what it will do is it will actually fit a curve for each of the four separately and give me the estimates for each of the four. So there's the first, one standard; here's Test A, its estimates; there's Test B with its estimates. Notice the asympts are different. And Test C. This is nice, but not still not exactly what I'm looking for. So I'm going to actually close this. I'm going to relaunch the analysis, except in this particular time, I'm going to take the formulation and put in the group category. And once again, by doing that, now I see all four are kind of overlaid on top of each other. And I can come up here and click on sigmoid, logistic, 4 parameter hill, and now what it shows me as the four different curve fits overlaid on top of each other in the plot. I can also get the parameter estimates for those four right here. So these should be identical to what we saw in the last screen. But visually now, I can take a look at these plots for the four overlaid on top of each other and see how they look. Do they look similar to each other or not? So this is, this is pretty good. I mean, this is, this might be good enough for what you might need. And if you want to pull these estimates out of this particular parameter... parameter estimates, I can right click on it and I can say make into a data table, which then allows me to take this data table with the estimates in it, and I could do something with that, whatever I happened to want. So that's, that's good to know. I'm going to close that out. So let's take a quick look at the nonlinear platform. So, analyze, specialized modeling, nonlinear. This looks similar. And if I put toxicity in my y response, I'll say formulation in my group, and log concentration in my x, and I say, okay. I get, oh wait, this fits...this says fit curve. We just did a fit curve, didn't we? And yes we did actually. This is identical to if I come up here under analyze, screening, fit curve, I say recall and I say, okay, they are identical. If I don't do anything different in the nonlinear platform, I actually end up in the fit curve platform. So what can we do that's different than the specialized modeling nonlinear? I'm going to see recall pull everything back in notice there's a model library on the left. And also notice this x could be a predicter formula, not just x values. So if I click on model library, I have a lot of models that I could choose from. And once again, I don't really know what these are. But notice, if I click on one, I can get a function, I can even say show graph. And it gives me a picture of it. Oh, that looks like something like I'm looking for. But it's kind of flipped in the wrong direction. So, this, this might work for me. Um, but what I don't see here is one called the Hill function. I see the Rodbard function. 
That's the five parameter, Richard. Where was it? Rodbard models here, that's similar, but there's no Hill function in all of it. So it's not exactly what I want. But it does allow me to do some things. The one thing that the nonlinear platform lets me do is it allows me to actually specify parameters themselves and lock them in. So what do I mean by that? Well, let's just say that I go to model library and I say logistic, 4 parameter. And I say, is it make formula, I believe? Toxicity here. Log concentration here. Oops. Formulation log concentration here and I say, okay, and this is standard. Nice. Okay. It actually does fit 4 parameters using a function that's not quite the function that I'm looking for. And you know this is not bad. But here, notice I can actually, like, change these different parameters using the sliders. That's kind of interesting. So I can say make formula and what it did is it actually put a formula on my data table here. If I take a look at the formula, it's over here and it has parameters with initial values and it has this big long equation for all four fit in the formula. So instead I can actually come back here now and I can put that in my x predicter and I can say okay. And what it does is it comes up and shows me all these, you know, the four functions and what the initial values were. If you click Go, it'll actually try and fit these. And notice that actually it did fit them in a count of four iterations. Pretty quick, actually, where there's a limit of 60, it fit them and fit them well. But notice I have in here the ability to change and I can change via sliders down here. Okay. Or I could change up here using actual...I could type in actual values, but I can change the current values and I can lock them in. This can be rather helpful and I will demonstrate that next with my ill-behaved data. I'm going to close this out and I'm going to close, get rid of this particular column. Okay, so we have this data set, our initial data set still there. Let's take a look at our ill-behaved data now. What I'm gonna do is I'm going to get rid of every, every other row of data and all of the low end concentration data points. So I'm going to push this button, thin data and eliminate lowest points. Every other one is gone, and now it's going to get rid of all the lowest ones. So what exactly does that do to our data? If we take a look at it. Let's just go over to fit model really quick, specialized, fit curve, excuse me, fit curve. And so recall. We only have the highest five points on the curve now. If I come up here and I say fit curve, and I do sigmoid, logistic...sigmoid, logistic, Hill. And it actually fit curves to those five data points separately. But notice something, I bring this up and bring this way out. Notice my lower asymptotes here are just completely different from one another. There's a third one. And if I keep going, eventually I'll get to the fourth one, which is way down here at like minus 80. So four lower asymptotes that are just completely different. So it doesn't make any sense, right? It fits the top part of the curves well, but it really, it really doesn't fit the bottom part of the curves well at all. The tops look pretty good but the bottoms don't. So rather than extrapolating, oftentimes when we're running these assays, the client, user, what they'll do is they'll put blanks, so material on the plate that has no concentration on purpose. This is usually some sort of background material that's in the assay itself. 
And in this particular case, I happen to know that blanks were used and the average of those blanks, as I said down here, was .55. So really, somewhere over here, by the time we get down there, all of the lower asymptotes should actually come down to .55. But I can't change that here in the fit curve platform. Ah, but I can in the nonlinear platform. But I don't have the Hill function in the nonlinear platform, so this gets kind of confusing. But we can get around this problem. Using the fit curve platform, once I fit my model, the logistic 4P Hill, I can actually save a formula. I can save a prediction formula, or I can save a parametric prediction formula, and there's a difference here. The prediction formula saves these exact functions just exactly the way that they were, and the parametric prediction formula is a little bit different. In this particular case, this is the parametric prediction formula. If we take a look at it, what it shows is the four equations, just as I thought they would be — so, you know, if the formulation is standard, use this; if it's Test A, use this — and down here are all the parameters. And actually, if one takes a look at this, if I was to copy this and paste it into another document... I'll do that really quick and I come up here and I say File, New... will a journal work? A journal doesn't work. That's OK. So let's have a new script. Paste. There we go. Notice at the top are all of my parameters. At the bottom is the formula that we were seeing over here on the right. So everything comes over, but these are initial values with the function itself. I just wanted to show that because it's kind of hidden unless you understand the parameters. But with this, I can now come over (let me back up, sorry) to analyze, specialized modeling, nonlinear, and I can use this predictor in my x value. My y can be my toxicity; formulation is my group; I say okay, and now here are those same five data points per curve, and the strange looking plots. But notice I have the ability, again, to change things. I know that the c parameter is my lower asymptote, so I can change each of these to .55 and I can lock that in. So .55. By doing this, you're seeing the curves actually changing. It's not fitting them yet, but it's allowing us to force a value that we believe is the correct value. So I bring this up and bring it over, and what you'll see is that it made all of them .55 for the lower asymptotes. And now I can click on Go, which is actually going to do the fitting, and what you see is that in just seven iterations it fit the rest of the parameters to those four curves, such that the lower asymptotes are all .55. That's great. That's exactly what I want in this particular case. And so once again under the red triangle, I can save a formula. In this case, I can't save the parametric prediction; I can just save the prediction, but I can do that, so I can use that over here. And where might I use that? Remember, I said that if I come over here under Graph Builder and I put my toxicity on my y axis, log concentration on the x, and formulation, maybe, in my overlay, this is what I see. I can also bring over here the formula itself, and sometimes this can be useful. In this case they look really, really similar.
What I can do is actually take away the curve from my points, and the smoother can be on the formula itself. So this is the direction. And so actually, these are the curves that belong to the function themselves, not the smoother. So this is one way to actually show the correct curves for this data set, even when the data is ill-behaved. And I'm saying ill-behaved here because we just don't have a lower asymptote, but we have something that we can use in place of it. So that's what I wanted to show you, and I think this is really kind of cool. The thing is, you have to be able to go back and forth at times between the fit curve and the nonlinear platforms to get what you really want out of JMP, out of the functionality that you really need (the short script sketch below recaps this workaround), but I think this is really great. So you could clear the row states, for instance, and you could actually show all of the data (I guess I should have wrapped this up, but I didn't) — the various formulations in overlay, log concentration — and these are the curves, but you could actually use the prediction formula instead to get the actual formulas that you want. And this is really kind of nice to see. It's not something that is really talked about. There is a blog link in the JMP discussions that Mark Bailey and somebody else put out, I think, two years ago, that describes a lot of this methodology. I just found it yesterday, long after I had already figured it out myself, but I thought it was worth sharing with everybody how we can fit these nonlinear models, especially in the dose-response world for these different biologics or pharmaceuticals, especially with all the talk these days of COVID-19 and the amount of work going on in this area. So with that, I guess I want to say thank you. Last but not least, I have some contact information. I am Martin in the JMP discussion forums and I post out there somewhat frequently. My email address is also listed down here if you have any questions for me. So with that, thank you very much and I appreciate your time. Happy to take any questions. Thank you.
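To recap the workaround from this talk in script form: the "lock the lower asymptote, then refit" idea can be sketched by writing a parametric column formula with c hard-coded at the blank average and only a, b and d left as free parameters, then sending that column to the nonlinear platform. This is a sketch under assumed column names and arbitrary starting values, not the exact formula saved in the demo.

dt = Current Data Table();
dt << New Column( "Hill c fixed", Numeric, "Continuous",
	Formula(
		Parameter( {a = 2, b = 0.1, d = 25},                       // starting guesses: slope, log-scale inflection, upper asymptote
			0.55 + (d - 0.55) / (1 + Exp( -a * (:Log Conc - b) ))  // lower asymptote locked at the blank average
		)
	)
);
Nonlinear( Y( :Toxicity ), X( :Hill c fixed ), Finish );           // fit only the remaining parameters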
Kelci Miclaus, Senior Manager Advanced Analytics R&D, JMP Life Sciences, SAS   Reporting, tracking and analyzing adverse events occurring to patients is critical in the safety assessment of a clinical trial. More and more, pharmaceutical companies and the regulatory agencies to whom they submit new drug applications are using JMP Clinical to help in this assessment. Typical biometric analysis programming teams may create pages and pages of static tables, listings and figures for medical monitors and reviewers. This leads to inefficiencies when the doctors that understand medical impacts of the occurrence of certain events can not directly interact with adverse event summaries. Yet even simple count and frequency distributions of adverse events are not always so simple to create. In this presentation we focus on key reports in JMP Clinical to compute adverse event counts, frequencies, incidence, incidence rates and time to event occurrence. The out of the box reports in JMP Clinical allow fully dynamic adverse event analysis to look easy even while performing complex computations that rely heavily on JMP formulas, data filters, custom-scripted column switchers and virtually joined tables.      Auto-generated transcript...   Speaker Transcript Kelci J. Miclaus Hello and welcome to JMP Discovery Online. Today I'll be talking about summarizing adverse event summaries and clinical trial analysis. I am the Senior Manager in the advanced analytics group for the JMP Life Sciences division here at SAS, and we work heavily with customers using genomic and clinical data in their research. So before I go through the summarizing and the details around using JMP with adverse event analyses, I want to introduce the JMP Clinical software which our team creates. JMP Clinical is one of the family of products that includes now five official products as well as add ins, which can extend JMP to really allow you to have as many types of vertical applications or extensions of JMP as you want. My development team supports JMP Genomics and JMP Clinical. JMP Genomics and JMP Clinical are respectively vertical applications that are customized, built on top of JMP, that are used for genomic research and clinical trial research. And today I'll be talking about how we've created reviews and analyses in JMP Clinical for pharmaceutical industries that are doing clinical trials safety and early efficacy analysis. The original purpose of JMP Clinical and the instigation of this product actually came through assistance to the FDA, which is a heavy JMP user And their CDER group, the Center for Drug Evaluation and Research. Their medical reviewers were commonly using JMP to help review drugs submissions. And they love it. They're very accomplished with it. One of the things they found though is that certain repetitive actions, especially on very standard clinical data could be pretty painful. Example here is the idea of something called a shift plot which is for laboratory measurements where you compare the trial average of a laboratory of versus the baseline against treatment groups. In order to create this, it took at least eight to 10 steps within the JMP interface of opening up the data, normalizing the data, subsetting it out into baseline versus trial, doing statistics, respectively, for those groups merging it back in, then splitting that data by lab tests so you could make this type of plot for each lab. And that's not even to get to the number of steps within Graph Builder to build it. 
So JMP clearly can do it, but what we wanted to do is solve their pain with this very standard type of clinical data with one-click lab shift plots, for example. In fact, we wanted to create clinical reviews in our infrastructure, which we call the review builder, that are one-click, standardized, reproducible reviews for many of the highly common standard analyses and visualizations that are required or expected in clinical trial research to evaluate drug safety and efficacy. So JMP Clinical has evolved since that first instigation of creating a custom application for a shift plot into a full-service clinical trial analysis software that covers medical monitoring and clinical data science, medical writing teams, biometrics and biostatistics, as well as data management around the study data involved with clinical trial collection. This goes for both safety and efficacy, but also operational integrity — operational anomalies that might be found in the collection of clinical data as well. Some of the key features around JMP Clinical that we find to be especially useful for those using the JMP interface for any type of analysis are things like virtual joins. So we have the idea of a global review subject filter, which I'll show you during the demonstrations for adverse events, that really allows you to integrate and link the demography information — the demographics about our subjects on a clinical trial — to all of the clinical domain data that's collected. And this architecture, which is enabled by virtual joins within the JMP interface with row state synchronization, allows you to have nearly instantaneous, interactive reviews with very little to no data manipulation across all the types of analyses you might be doing in a clinical trial data analysis. Another new feature we've added to the software, which also leverages some of the power of the JMP data filter as well as the creation of JMP indicator columns, is the ability, while you're interactively reviewing clinical trial data, to find interesting signals — say, in this example, the screenshot shown is subjects that had a serious adverse event while on the clinical trial — and quite immediately create an indicator flag that is stored in metadata with your study in JMP Clinical and is available for all other types of analyses you might do. So you can say, I want to look now at my laboratory results for patients that had a serious adverse event versus those that didn't, to see if there are also anomalies that might be related to an adverse event severity occurrence. Another feature that I'll also be showing with JMP Clinical in the demonstration around adverse event analysis is the JMP Clinical API that we've built into the system. One of the most difficult things about providing, creating and developing a vertical application that has out-of-the-box, one-click reports is that you get 90% of the way there and then the customer might say, oh, well, I really wanted to tweak it, or I really wanted to look at it this way, or I need to change the way the data view shows up. So one of the things we've been working hard on in our development team is using the JMP Scripting Language (JSL) to surface an API into the clinical review, to have control over the objects and the displays and the dashboards and the analyses and even the data sets that go into our clinical reviews. So I'll also be showing some of that in the adverse event analysis.
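As a side note on the virtual join mechanism mentioned above: in plain JMP it comes down to two column properties. The sketch below uses placeholder file and column names, and the property syntax is from memory, so treat it as an assumption and verify against the current Scripting Index rather than as JMP Clinical's internal code.

dm = Open( "$DESKTOP/DM.jmp" );                // demography: one row per subject
dm:USUBJID << Set Property( "Link ID", 1 );    // mark the subject identifier as the key

ae = Open( "$DESKTOP/AE.jmp" );                // stacked adverse event records
ae:USUBJID << Set Property( "Link Reference",
	Reference Table( "DM.jmp" ) );             // demography columns (and row states) become usable in AE analyses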
So let's back up a little bit and go into the meat of adverse events in clinical trials, now that we have an overview of JMP Clinical. There are really two key ways of thinking about this. There's the safety review aspect of a clinical trial, where it's typically counts and percentages of the adverse events that might occur. A lot of the medical doctors, monitors, or reviewers often use this data to understand medical anomalies — you know, a certain adverse event starts showing up more commonly with one of the treatments, which could have medical implications. There's also statistical signal detection: the idea of statistically assessing whether adverse events are occurring at an unusual rate in one of the treatment groups versus the other. So here, for example, is a traditional static table that you see in many of the types of research or submissions or communications around a clinical trial adverse event analysis. Basically it's a static table with counts and percents, and if it is more statistically oriented, you'll see things like confidence intervals and p-values as well, around things like odds ratios, relative risks or rate differences. Another way of viewing this can be visually instead of in a tabular format, so signal detection — looking at, say, the odds ratio or the risk difference — might use Graph Builder, in this case, to show the results of a statistical analysis of the incidence of certain adverse events and how they differ between treatment groups, for example. So those are two examples. And in fact, from the work we've done and the customers we've worked with around how they view and have to analyze adverse events, the JMP Clinical system now offers several common adverse event analyses, from simple counts and percentages, to incidence rates or occurrences, to statistical metrics such as risk difference, relative risk and odds ratio, including some exposure-adjusted time-to-event analyses. We can also get a lot more complex with the types of models we fit and really go into mixed or Bayesian models as well in finding certain signals with our adverse event differences. And we also use this data heavily in reviewing the medical data in either a medical writing narrative or a patient profile. So now I'm going to jump right into JMP Clinical with a review that I've built around many of these common analyses. One of the things you'll notice about JMP Clinical is it doesn't exactly look like JMP, but it is — it's a combined, integrated solution that has a lot of custom JSL scripting to build our own types of interfaces. So our starter window here lays out studies, reviews, and settings, for example. And I already have a review built here that is using our example nicardipine data. This is data that's shipped with the product; it's also available in the JMP sample library. It's a real clinical trial looking at subarachnoid hemorrhage, with about 900 patients. And so what this first tab of our review is looking at is just the distribution of demographic features of those patients: how many were males versus females, their race breakdowns, what treatment group they were given, the sites the data was taken from, etc. So this is very common, just as the first step of understanding your clinical data for a clinical trial. You'll notice here we have a report navigator that shows the rest of the types of analyses that are available to us in this built review.
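The demography tab just described is, underneath, essentially a one-line Distribution call. A generic sketch with assumed CDISC-style column names (SEX, RACE, ARM), not the actual study table:

dm = Current Data Table();          // one-row-per-subject demography table
dm << Distribution(
	Nominal Distribution( Column( :SEX ) ),
	Nominal Distribution( Column( :RACE ) ),
	Nominal Distribution( Column( :ARM ) )
);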
I'm going to walk through each of these tabs quickly to show you all the different flavors of ways we can look at adverse events with the clinical trial data set. Now, the typical way data is collected in clinical trials is an international standard called CDISC format, which typically means that we have a very stacked data set format. Here we can see it, where we have multiple records for each subject indicating the different adverse events that might have occurred over time. This data is going to be paired with the demography data, which is one row per subject, as seen here in this demography table. So we have about 900 patients, and you'll see in this first report we have about 5,000 or 5,500 records of different adverse events that occurred. This is probably the most commonly used report by many of the medical monitors and medical reviewers that are assessing adverse event signals. What we have here is basically a dashboard that combines a Graph Builder counts plot with an accompanying table, since they are used to seeing these kinds of tables. Now the real value of JMP is its interactivity and that dynamic link directly to your data, so that you can select anywhere in the data and see it in both places. Or, more powerfully, you can control your views with column switchers. Here we can actually switch from looking at the distribution of treatments to sex versus race. You'll notice with race, if we remember, we had quite a few subjects that were white in this study, so this isn't a great plot when we look at it by counts, so we might normalize and show percents instead. And we can also just decide to look at the overall holistic counts of adverse events as well. Another part of using this column switcher is the ability to categorize what kind of events those were. Was it a serious adverse event? What was the severity of it? What was the outcome; did they recover from it or not? What was causing it? Was it related to study drug? All of these are questions that medical reviewers will often ask to find interesting or anomalous signals in adverse events and their occurrences. Now, one of the things you might have already noticed in this dashboard is that I have a column switcher here that's actually controlling both my graph and my table. So when I switch to severity, this table switches as well. This was done with a lot of custom JSL scripting specifically for our purposes, but I'll tell you a secret: in JMP 16, the column switcher is going to allow this type of flexibility, so you can tie multiple platform objects to the same column switcher to drive a complex analysis. I'm going to come back to this occurrence plot, even though it looks simple. Here's another instance of it that's actually looking at overall occurrence, where certain adverse events might have occurred multiple times for the same subject. I'm going to come back to these, but first quickly go through the rest of the analyses in these reviews before coming back to some of the complexities of the simple Graph Builder and Tabulate distribution reports. The next section in our review here is an adverse event incidence screen. So here we're making that progression from just looking at counts and frequencies, or possibly incidence rates, into a more statistical framework of testing for the difference in incidence of certain adverse events in one treatment group versus another. And here we are representing that with a volcano plot.
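A stripped-down sketch of that kind of dashboard piece in JSL might look like the following. This is not JMP Clinical's code; the CDISC-style column names (AEDECOD, ARM, AESEV, AESER) and the column switcher message form are assumptions on my part.

    // Stripped-down sketch: bar chart of adverse event counts with a column switcher.
    // Column names are CDISC-style stand-ins, not JMP Clinical's actual implementation.
    ae = Current Data Table();
    gb = ae << Graph Builder(
        Variables( X( :AEDECOD ), Overlay( :ARM ) ),
        Elements( Bar( X ) )
    );
    // Assumed message form: swap the overlay column among several classifiers.
    gb << Column Switcher( :ARM, {:ARM, :AESEV, :AESER} );

Tying one switcher to several platform objects at once is the piece that currently needs custom scripting, as described above.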
So we can see that phlebitis, hypotension, and isosthenuria occur much more often in our treatment group, those that were treated with nicardipine, versus those on placebo. So we can actually select those and drill into a very common view for adverse events, which is our relative risk forest plot, which is a lot of times still easier to read when you're only looking at those interesting signals that have possibly clinically or statistically significant differences. Sometimes clinical trials take a long time. Sometimes subjects are on them for a few weeks, like this study, which was only a few weeks, but sometimes they're on them for years. So sometimes it's interesting to think of adverse event incidence differences as the trial progresses. We have this capability as well within the incidence screen report, where you can chunk the study days into sections to see how the incidence of adverse events changes over time. And a good way to demonstrate that might be with an exploding volcano plot here that shows how those signals change across the progression of the study. Another powerful idea with this, especially as you have longer or more complex clinical trials, is that instead of looking at just direct incidence among subjects, you can consider their time to event, or the exposure-adjusted rate at which those adverse events are occurring. And that's what we offer within our time-to-event analyses, which are, once again, shown in a volcano plot, here using a Kaplan-Meier test to look at differences in the time to event of certain events that occur on a clinical trial. One of the nice things here is that you can select these events and drill down into the JMP survival platform to get the full details for each of the adverse events that had perhaps different time-to-event outcomes between the treatment groups. Another flavor of time to event is often called an incidence density ratio, which is the idea of exposure-adjusted incidence density. Basically the difference here is that instead of using some of the more traditional proportional hazards or Kaplan-Meier analyses, this is more like a Poisson-style model that's adjusted for how long subjects have actually been exposed to a drug. And once again, here we can look at those top signals and drill down to the analogous report within JMP, using a generalized linear model for that specific type of adverse event signal detection. And we actually even offer some really complex Bayesian analyses. One of the things with this type of data is that adverse events typically exist within certain body systems or organ classes, and so there is a lot of prior knowledge that we can build into these models. And so some of our customers' biometrics teams decide to use pretty sophisticated models when looking at their adverse events. So far we've walked from what I would consider pretty simple distribution views of the data, into distributions and count plots of adverse events, into very complex statistical analyses. I'm going to come back now to what is considered simple count and frequency information, and I want to spend some time here showing the power of the JMP interactivity that we have. As you recall, one of the differences here is that this table is a stacked table that has all of the occurrences of our adverse events for each subject, and our demography table, which we know has 900 subjects, is separate.
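The drill-down lands in the standard JMP Survival platform. As a rough stand-alone illustration (not the JMP Clinical drill-down itself), a Kaplan-Meier comparison could be launched like this, with hypothetical column names:

    // Rough sketch: Kaplan-Meier comparison of time to first occurrence of one
    // adverse event between arms. Table and column names are hypothetical.
    tte = Current Data Table();
    tte << Survival(
        Y( :Time to Event ),      // days to first occurrence or last follow-up
        Censor( :Censor ),        // 1 = censored (no event observed)
        Grouping( :ARM )          // treatment arm; gives survival curves and log-rank tests
    );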
So what we wanted was not a static graph, like we have here, or what we would have in a typical report in PDF form; we wanted to be able to interactively explore our data and look at subgroups of our data and see how those percentages would change. Now, the difficulty is that the percent calculation needs to come from the subject count in a different table. So we've actually done this by creating column formulas that dynamically control recalculation of percents upon selection, either within the event categorizations or, more powerfully, using our review subject filter tool. So here, for example, we're looking at all subjects by treatment, perhaps serious versus not serious adverse events, but we can use this global data filter, which affects each of the subject-level reports in our review, and instantaneously change our demography groups and change our percentages to be interactive to this type of subgroup exploration. So here we can actually subgroup down to white females and see what their adverse event percentages and counts are, or perhaps you want to go more granular and understand how the data changes for different sites. So what we really have here, instead of a submission package or a clinical analysis where the biometrics team hands 70 different plots and tables to the medical reviewer to sift through, is the power to create hundreds of different tables and different subsets and different graphics, all in one interface. In fact, you can really filter down into those interesting categories. So if they were looking, say, at serious adverse events and wanted to know which serious adverse events were related to drug treatment, very quickly we get down from our 900 patients to a very small subset of about nine patients that experienced serious adverse events considered related to the treatment. As a medical reviewer, this is a place where I then might want to understand all of the clinical details about these patients. And very quickly, I can use one of our action buttons from the report to drill down to what's called a complete patient profile. So here we see all of the information, now at an individual subject level instead of a summary level, of everything that occurred to this patient over time, including when serious adverse events occurred and the laboratory or vitals measurements that were taken alongside them. One of the other main users of our JMP Clinical system, along with medical review and medical monitoring, is medical writing teams. So another way of looking at this, instead of visually in a graphic or even in a table like these patient profile tables, is that you can actually go up here and generate an automated narrative. So here we're going to launch our adverse event narrative generation. Again, one of the benefits and values of JMP Clinical being a vertical application relying on standard data is that we get to know all the data and the way it is formatted up front, just by being pointed to the study. So what we can do here is run this narrative, which is going to write the actual story of each of those adverse events that occurred. And this is going to open up a Word doc that has all of the details for this subject, their demography, their medical history, and then each of the adverse events and the outcomes or other issues around those adverse events.
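To make the cross-table percent idea concrete, here is a rough JSL sketch of computing event counts per term with the subject denominator taken from a separate demography table. It is only the general pattern (it counts occurrences rather than distinct subjects), not the formula-driven machinery JMP Clinical actually scripts; table and column names are hypothetical.

    // Rough sketch: events per term from a stacked AE table, with the subject
    // denominator taken from a one-row-per-subject demography (DM) table.
    // Names are hypothetical; this counts occurrences, not distinct subjects.
    ae = Data Table( "AE" );
    dm = Data Table( "DM" );
    nSubjects = N Rows( dm );
    Summarize( ae, term = By( :AEDECOD ), cnt = Count );
    pctTable = New Table( "AE Percent",
        New Column( "AE Term", Character, "Nominal", Values( term ) ),
        New Column( "N Events", Numeric, "Continuous", Values( cnt ) )
    );
    pctTable << New Column( "Pct of Subjects", Numeric, "Continuous",
        Set Each Value( 100 * :N Events / nSubjects )   // denominator from the DM table
    );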
And we can do this for one patient at a time or we can actually even do this for all 900 patients at a time and include more complex details like laboratory measurements, vitals, either a baseline or before. And so, medical reviewers find this incredibly valuable be able to standardly take data sources and not make errors in a data transfer from a numeric table to an actual narrative. So I think just with that you can really see some of the power of these distribution views, these count plots that allow you to drill into very granular levels of the data. This ability to use subject filters to look either within the entire population of your patients on a clinical trial or within relevant subgroups that you may have found. Now one thing about the way our global filter works through our virtual joins is this is only information that's typically showing the information about the demography. One of the other custom tools that we've scripted into this system is that ability to say, select all subjects with a serious adverse event. And we can either derive a population flag and then use that in further analyses or we can even throw that subject's filter set to our global filter and now we're only looking at serious...at a subject who had a serious adverse event, which was about...almost 300 patients on the clinical trial had a serious adverse event. Now, even this report, you'll see is actually filtered. So the second report is a different type of aspect of a distribution of adverse events that was new in our latest version which is incidence rates. And here, the idea is instead of normalizing or dividing to calculate a percent by the number of subjects who had an event. If you are going with ongoing trials or long trials or study trials across different countries that have different timing startup times, you might want to actually look at the rate at which adverse events occur. And so that's what this is calculating. So in this case, we're actually subset down to any subjects that had a serious adverse event. And we can see the rate of occurrence in patient years. So for example, this very first one, see, has about a rate of 86 occurrences in every 10 patient years on placebo versus 71 occurrences In nicardipine. So this was actually one which this was to treat subarachnoid hemorrhage, intracranial pressure increasing likely would happen if you're not being treated with an active drug. These percents are also completely dynamic, these these incidence rates. So once again, these are all being done by JMP formulas that feed into the table automatically that respect different populations as they're selected by this global filter. So we can look just within say the USA and see the rates and how they change, including the normalized patient years based on the patients that are from just the USA, for example. So even though these reports look pretty simple, the complexity of JSL coding that goes beyond building this into a dashboard is basically what our team does all day. We try to do this so that you have a dashboard that helps you explore the data as you know, easily without all of these manipulations that could get very complex. Now the last thing I wanted to show is the idea of this custom report or customized report. So this is a great place to show it too, because we're looking here at adverse events incidence rates. And so we're looking by each event. And we have the count, or you can also change that to that incidence rate of how often it occurs by patient year. 
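The patient-year normalization mentioned here is simple arithmetic; a sketch with made-up numbers:

    // Sketch of the incidence-rate arithmetic: occurrences per 10 patient-years
    // for one arm. The numbers are made up for illustration.
    nEvents      = 120;               // occurrences of a given event in this arm
    exposureDays = 5100;              // summed days on study across subjects in this arm
    patientYears = exposureDays / 365.25;
    ratePer10PY  = 10 * nEvents / patientYears;
    Show( ratePer10PY );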
And then an alternative view might be really wanting to see these occurrences of adverse events across time. And so I want to show that really quick with our clinical API. So the data table here is fully available to you. One of the things I need to do first off is just create a numeric date variable, which we have a little widget for doing that in the data table, and I'm going to turn that into a numeric date. Now you'll notice now this has a new column at the end of the numeric start date time of the adverse event. You'll also notice here is where all that power comes from the formulas. These are all actually formulas that are dynamically regenerated based on populations for creating these views. So now that we have a numeric date with this data, now we might want to augment this analysis to include a new type of plot. And I have a script to do that. One of the things I'm going to do right off the bat is just create a couple extra columns in our data set for month and year. And then this next bit of JSL is our clinical API calls. And I'm not going to go into the details of this except for that it's a way of hooking ourselves into the clinical review and gaining access to the sections. So when I run this code, it's actually going to insert a new section into my clinical review. And here now, I have a new view of looking at the adverse events as they occurred across year by month for all of the subjects in my clinical trial. So one of the powers, again, even with this custom view is that this table by being still virtually joined to our main group can still fully respond to that virtual join global subject filter. And so just with a little bit of custom API JSL code, we can take these very standard out-of-the-box reports and customize them with our own types of analyses as well. So I know that was quite a lot of an overview of both JMP Clinical but, as well as the types of clinical adverse event analyses that the system can do and that are common for those working in the drug industry or pharma industry for clinical trials, but I hope you found this section valuable and interesting even if you don't work in the pharma area. One of the best examples of what JMP Clinical is is just an extreme extension and the power of JSL to create an incredibly custom applications. So maybe you aren't working with adverse events, but you see some things here that can inspire you to create custom dashboards or custom add ins for your own types of analyses within JMP. Thank you.  
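The review insertion itself goes through the JMP Clinical API, but the data table preparation described here is ordinary JSL. A sketch, assuming a CDISC-style character start date column named AESTDTC:

    // Sketch of the data table prep only (the review/API calls are JMP Clinical specific).
    // Assumes a character CDISC-style start date column AESTDTC such as "1998-07-21".
    ae = Current Data Table();
    ae << New Column( "AE Start Date", Numeric, Continuous,
        Format( "yyyy-mm-dd" ),
        Formula( Informat( :AESTDTC, "yyyy-mm-dd" ) )
    );
    ae << New Column( "AE Year",  Numeric, Ordinal, Formula( Year( :AE Start Date ) ) );
    ae << New Column( "AE Month", Numeric, Ordinal, Formula( Month( :AE Start Date ) ) );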
Bill Worley, JMP Senior Global Enablement Engineer, SAS   In the recent past, partial least squares has been used to build predictive models for spectral data. A newer approach using Functional Data Explorer and covariate design of experiments will be shown that allows fewer spectra to be used in the development of a good predictive model. This method uses one-fourth to one-third of the data that would otherwise be used to build a predictive model based on spectral data. Newer multivariate platforms like Model Driven Multivariate Control Chart (MDMCC) will also be shown as ways to enhance spectral data analysis.   (view in My Videos)   Auto-generated transcript...   Speaker Transcript Bill Worley Hello everyone, my name is Bill Worley and today we're going to be talking about analyzing spectral data. I'm going to talk about a few different ways to do it. One is using Functional Data Explorer and design of experiments to help build better predictive models for your spectral data. The data set I'm going to be using is actually out of a JMP book, Discovering Partial Least Squares. I will post this on our Discovery website or community page, so everything will be out there for you to use. First and foremost, I'll talk about the different things we're going to look at. Traditionally, when you're looking at spectral data, you're going to use partial least squares to analyze it, and that's fine; it really works very well. But there are some newer approaches to try out. One is using principal components and then a covariate design of experiments and partial least squares to analyze the data. And then an even newer, more novel approach uses Functional Data Explorer, then the covariate design of experiments and partial least squares, with an opportunity to use something like generalized regression or neural networks. Okay, so I'm going to go through a PowerPoint first to give you a little bit of background. Again, we're going to be talking about using Functional Data Explorer and design of experiments to build better predictive models for your spectral data. A little bit of history: the spectral data approach is based on a QSAR-like material selection approach that was developed previously by gentlemen named Silvio Michio and Cy Wegman. So I took it and looked for opportunities to apply this approach to other highly correlated data. The first thing that really came out was spectral data, which is truly highly correlated, almost autocorrelated, data where we can use this approach. The data that I've got is, again, continuous response data for octane rating, but I've since added mass spectral data and near IR data for categorical responses as well. This is where we're going to go: we're going to build these models and compare them. This is the traditional PLS approach, this is the newer approach using principal components, and then the final approach here is using Functional Data Explorer, and you can see that for the most part we really don't lose anything with these models as we build them. As a matter of fact, this slide is a little bit older; the models that I've built more recently are actually a little bit better. So we'll get there. We'll show you that when we get there. So again, a twist on analyzing your spectral data; partial least squares has been used in the past.
We're going to be applying several multivariate techniques to show how to build good predictive models with far fewer conditions. When I say far fewer conditions, in this case I mean fewer spectra; you'll see where that comes from. And why would you want to do this analysis differently? Well, first and foremost, there's a huge time savings. You get an as good or better predictive model with 25% of the data or less; it's your choice. And then you can use penalized regression to help determine the important factors, making for simpler models. When I say important factors, I mean important wavelengths, and again, you'll see that when we get there. This is looking at 60 gasoline near IR spectra overlaid, and as we all know, it would be pretty hard to build a predictive model from this to determine the difference between the different spectra for their octane rating. So what we're going to do is use JMP to help get us there. Most of what I'm going to be showing can be done in regular JMP, but what I'm going to be showing today is almost all JMP Pro, just so you know. So, how it's done. I'm not going to read you the different steps; I'll let you do that if you so choose. But there are two important ones. The first is number two, where you want to identify the prominent dimensions in the spectra, and that's where we're going to use functional principal components from the Functional Data Explorer. It's not used in the traditional sense, because we're not going to build models using these things; we're going to use these functional principal components to help us pick which spectra we're going to analyze, and then we're going to use those in a custom design to help us select those different spectra. And last but not least, number seven here is to use this sustainable model to determine the outcome for all future samples. One caveat: I'm a chemist by training and education, an analytical chemist at that, and I don't know how well instruments hold their calibration anymore. So the model that you build will hold true as long as the instrument is calibrated and good to go from that respect. Okay. Bill Worley So some important concepts. Again, I'm not going to read these; I just want you to know that we'll be looking at partial least squares, principal components, and functional data analysis. Functional data analysis is something newer in JMP that you won't see in a lot of other places. It helps you analyze data, providing information about curves, surfaces, or anything over a continuum; that's taken from Wikipedia. A newer platform that I'm going to show, which is in regular JMP, is Model Driven Multivariate Control Chart. This allows you to see where the differences in some of the spectra are, how you can pull those apart, and maybe dive a little deeper into what these differences in the spectra really are. So with that, let's go to the demo. I go to my home window. So, the data set. This is again the gasoline data set, where we're looking at octane rating: how do we determine the octane rating for different gasolines? Where does it come from, how do we determine it? You really don't need this value until you do the preprocessing or setup that I'm going to be showing you. So we'll get there; we know those numbers are there. We'll get there as we need to. But let's look at the data first.
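A minimal JSL sketch of this kind of first look, coloring rows by octane and plotting one wavelength against the rating; the wavelength column name is a hypothetical stand-in for one of the 400 columns, and the options shown in the demo (markers, the color theme) are set interactively.

    // Minimal sketch: color rows by octane and plot one wavelength against the rating.
    // The wavelength column name is a hypothetical stand-in for one of the 400 columns.
    dt = Current Data Table();
    dt << Color by Column( :Octane );
    dt << Graph Builder(
        Variables( X( :Octane ), Y( :NIR 1200 ) ),
        Elements( Points( X, Y ) )
    );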
And that's something you want to do anyway: whenever you get a new data set, you want to look at the data and see where things fall out. So let's go to Graph Builder. We're going to pull in octane and the wavelengths and drop those down on our X axis. Before I do that, let me close that out. I want to color and mark by the octane rating, and I'm going to change the color scale to green to black to red. Say OK, and we can see those colors in the data set. Let's go back to Graph Builder and pull this back in and drop those. A little more colorful there. Now that we've got these in here, it's really hard to tell anything at all. What we saw before with the overlay was bad enough, but now we're looking at an even more jumbled grouping of points, so let's turn on the parallel plots. Alright, that kind of pulls things in, and we can see it's still a jumbled mess, but we've got another tool that will help us investigate the data a little further, and that's the local data filter. So we're going to go there, pull in sample number and octane rating, and add those. I'll stretch this out a little bit. So now we can actually go into a single spectrum, see it over here in green, and dive into those separately. I'm going to take that back off. Alright, so that's grouping, and we could actually pull this in and start looking at the different octane ratings and see which spectra are associated with the higher octane ratings or the lower ones. It's your choice; it just gives you a tool to investigate the data. Do you need to do any more preprocessing to get the spectra in line with each other, or set things up so you can see the different groupings better? Okay, so that's looking at Graph Builder, and I'm going to clear out the row states here. From here, we want to better understand what's going on with the data. Like I said, we're looking at spectral data and it's very highly collinear, or multicollinear, and this is something you may want to prove to yourself. So let's go to Analyze, Multivariate Methods, Multivariate, and we're going to select all our wavelengths. Fairly quickly, we get back these Pearson correlation coefficients, and they're all close to one, for the most part, in these early wavelengths. That's just telling us that things are very highly correlated. That's the way it is, and we need to deal with it as we go forward. Okay, so we've looked at the data, we're set up, and now we can look at another piece of information. This is newer in JMP 15, and it's also in regular JMP. We're going to go to Analyze, Quality and Process, and Model Driven Multivariate Control Chart. Again we pull in all our wavelengths and say OK. Now we're looking at the data in a different way. This is basically, for every spectrum, all 400 wavelengths, and now we can see where some of these are a little bit outside of what would be considered control, across all 400 wavelengths. That's the red line. So if we highlight those, and I right-click in there and say contribution plots for selected samples, now I can see differences in the spectra as they're compared to the other spectra in the overall data set, and we can see which parts of the spectra are considered more or less out of control.
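Both of those launches are scriptable as well. A hedged sketch with a handful of hypothetical wavelength columns standing in for all 400 (the Model Driven Multivariate Control Chart role name below is my assumption of the scripted form):

    // Sketch: pairwise correlations and a model driven multivariate control chart
    // on a few wavelength columns, standing in for all 400. Column names hypothetical.
    dt = Current Data Table();
    dt << Multivariate( Y( :NIR 900, :NIR 902, :NIR 904, :NIR 906 ) );
    dt << Model Driven Multivariate Control Chart(
        Process( :NIR 900, :NIR 902, :NIR 904, :NIR 906 )   // assumed role name
    );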
And if I can get this to work, we can get there, and then for that particular wavelength we can see that those three samples are more or less out of spec, based on this control chart, compared to the other samples. Alright, so again, that allows you to dive deeper and tells you what group it is. This is all about learning more about your data: which ones are good, which ones are bad, or which ones may be different. So it's an added tool to help you better understand where you may be seeing differences. Okay. So with that, we've got things pretty much set up, and we want to go into the analysis part. As we go into Analyze, we have to set things up so we get what we want, when we want it, and how we want to analyze it. So we're going to go to Analyze, and this is where we're going to select the samples; this is where we're going to use Functional Data Explorer to help us select the samples. So go to Analyze, Functional Data Explorer. This is a JMP Pro 15 thing. Instead of stacked data, we're going to use rows as functions. Again, we're going to select all of our wavelengths, we're going to use octane as our supplementary variable, and the sample number is our ID function. So we've got it set up, ready to go. And now for looking at the data: remember how we had everything lined up when we were looking at it before; this is all the data overlaid again. If we needed to do some more preprocessing, we could do that over here in the transform area, where we could actually center the data and standardize it. For the most part, this data is fairly clean, so we don't have to do that, and we're going to go ahead and do the analysis from here. Okay, so b-splines, p-splines, and Fourier basis. These splines will give you a decent and fairly simple model, but spectral data is so highly correlated and the wavelengths are so close together that we want to understand where we're seeing differences on a much closer basis, as opposed to something like a b-spline, which would spread the knots out. We want to keep the knots as close together as possible to help better understand what's going on. So this takes a few seconds to run, but I'm going to click p-splines. It's going to take 15 or 20 seconds, but it's going to fit all those models, and it's almost done. Alright, so now we've fit those models. If I had run a b-spline, it probably would have ended up around 20 knots; here we're looking at 200 knots. So it's basically taking those 400 wavelengths and splitting them into virtual groups of two, so it's looking at individual groups of two. And this is the best overall model based on the AICc, BIC, and negative log-likelihood scores. We could go backwards; it's a linear function, so we could go backwards and use a simpler model if we want. We could also go forward and see how many more knots it would take to get an even better model. I can tell you from experience that around 203 to 204 knots is as good as it gets, and there's no reason to really go that far for the little bit of improvement that we would get. So we fit those, and you can see we've fit all the models for all the spectra. Let's go down to our functional principal components. This is the mean spectrum right here.
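FDE's own Transform tools handle the centering and standardizing mentioned here, but the same idea can be sketched with an ordinary column formula (hypothetical column name):

    // Sketch: center and scale one wavelength column with a plain column formula.
    // FDE's Transform tools do this inside the platform; this is just the idea.
    dt = Current Data Table();
    dt << New Column( "NIR 1200 standardized", Numeric, Continuous,
        Formula( ( :NIR 1200 - Col Mean( :NIR 1200 ) ) / Col Std Dev( :NIR 1200 ) )
    );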
These are the functional principal components from that data. Each one of these eigenvalues, or functional principal components, is explaining some portion of the variation that we're seeing. You can see the first functional principal component explains about 50% of the variation, and so on and so forth. It's additive: you can see it's at about 72% by the second row, and by number four we get to our rule of thumb or cutoff point. If we can explain 85% of the variation, that's our cutoff for the number of principal components we want to grab and build our DOE off of, and so we're going to go with four. Some other things you can look at are the score plots. It looks like spectrum number five is kind of out there, and if you really wanted to look at that one, you could; you can pop that out or pin it to the graph. You get an idea of which spectra are out there and what they might look like; in this case we see some differences for 15 and five, and remember 41 was kind of out there too. But we can see some other things. The functional principal component profiler is down here. Now, if you wanted to make changes or better understand things, you would ask: as I move my functional principal components around, how do I change my data? It's hard to really visualize that. So something that's newer in JMP Pro 15 is this functional DOE analysis, and that's why I added that octane rating as our supplementary variable. So I'm going to go down here and minimize some of these things a little bit. Down here in this FDOE profiler, we've actually done some generalized regression; it's built in. So we're looking at this model, and as we look through these different wavelengths, we can see what happens with the octane as we get to them; a particular wavelength may be different for the different octane ratings, and that's what you're looking for. You want to see differences. So where can we see the biggest differences? I don't know if you saw it happen out here, but right here on the end, at the higher wavelengths, we're seeing some significant changes. So I'm going to go out here, and as you can see, the curve is bowed a little bit there, and as I go back to the lower wavelengths, this curve starts to flatten out a little bit, or actually gets a little steeper; it's not as flat as it was at the higher octane ratings. OK. So again, this is all about investigating the data. But what we're going to do is go ahead and save those functional principal components, and we'll do that through our function summaries right here. We need to customize that: I'm going to deselect all the summaries, and I'm going to put four in there because that's the number I want to save. And just as a watch-out, make sure you say OK and save; just OK is fine, but it won't get you where you want to be. So we say OK and save, and we get a new table with our functional principal component scores in there, all four of those for the different samples and the different octane ratings. Now what we have to do is get that information back to our main data table. You could do this through a virtual join; what I have done is actually copy these over, and there's a way to do this fairly simply. So I need to go back over to my main data table.
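Dragging the score columns across, as described next, works fine; a scripted alternative is an Update join on the sample ID, sketched below with hypothetical table and column names.

    // Sketch: bring the saved FPC score columns into the main table by matching
    // on the sample ID. Table and column names are hypothetical.
    main   = Data Table( "Gasoline NIR" );
    scores = Data Table( "Function Summaries" );
    main << Update(
        With( scores ),
        Match Columns( :Sample = :Sample )
    );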
If this will work for me: I don't need to keep this table, I just want to get these scores over to the main table. So you just grab them and drag them over to your main data table and drop them. I've already done it, so I'm not going to drop them in there, but that's one way to get the data over there quickly. Let me minimize that for now. So this is the data; it's right there. I've copied it over, so we've got the scores. Now we're going to do what I consider the most important step: we're going to pick the samples that we're going to analyze. This gets you down to a much smaller number of samples to build your models on, and this is where we're going to use design of experiments to get us there. So we're going to select DOE, Custom Design. Don't worry about the response. Right now we're going to add factors, and we're going to add covariate factors; you'll see in a minute why we're doing this. So I'm going to add covariates. You have to select what your covariate factors are, and we're going to choose the functional principal components and say OK. We're going to look in this functional principal component space to figure out which samples we're going to analyze to build our model. So I select Continue, and right now it's saying to select all 60 and build the model from there. Well, we want to take that down to a much smaller number; we're going to use 15. That allows us to select a smaller number: we don't have to have as many spectra, and we don't have to run as many, but you have them all and you can select from them. So I'm going to say Make Design, and while this is building... alright, we don't need this, I'm going to get rid of it; that's just some information. What we've seen now is that in our data table, 15 rows have been selected and highlighted in blue. I'm going to right-click on that blue area and put some markers on them, put a star on, and I'm actually going to color those as well, so let's make those blue. And before I forget, what you do now is take these and do a table subset. So, Tables, Subset: we've got selected rows, all columns, say OK, and this is where we're going to be doing our modeling. But before I go there, let's go back to our main data table and go to Analyze, Multivariate Methods, Multivariate. Instead of using the wavelengths, I'm going to use the functional principal components; put those in the Y role and say OK. Now look at what we saw before: we had almost complete correlation for a lot of the wavelengths, and we've taken that out of play. If you look at the space now, at the markers, the stars, we're pushing things out to the corners of our four-dimensional space, but we're also looking through the center of the space as well. So this is more or less a space-filling design, but it's spreading the points out to where we're hopefully going to get a decent model out of it. Okay. So we've got that, and I need to pull up my data table again. These are the samples that we're interested in and are going to build our model on, and I'm going to slide back over here to the beginning.
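The subset step can also be scripted; a sketch (the exact option names of the Subset message are an assumption on my part):

    // Sketch: the covariate design leaves its chosen 15 runs selected in the source
    // table; pull just those rows into a new table for modeling. Option names assumed.
    main = Current Data Table();
    doeSub = main << Subset( Selected Rows( 1 ), Output Table( "DOE Subset" ) );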
So these are the rows that were selected. And now we're going to go to Analyze. Fit model. And we want...octane is what we're looking for, right. So that's our rating that we're looking for. And we're going to use all of our wavelengths And this is also the next thing I'm going to show you is a JMP pro feature. Where you select partial least squares and the fit model platform. This you can do partial least squares or you can do the same analysis and regular JMP, just so you know that, but we're setting it up here in case if you wanted to, you could actually look for interactions. We're not going to worry about that on this model. And select run And got to make a few modifications. Here you can choose which method you want, the NIPALS or the SIMPLS. The SIMPLS is probably a little more statistically rigorous but NIPALS works for our purposes. The validation method, we do want to do that. But we don't have very many samples. So we're going to use the leave one out. Okay, so each row will be it's... pulled out and used as the validation. Okay. So we're going to start and just say go. As you can see up here on the percent variation explain for the X and the Y, we're doing very well. The model is explaining quite a bit of the variation for both the X and the Y, 90 almost 100%. That's great but they're using nine latent factors. Remember, we only had four functional principal components. So let's see what happens when we go to that. Change to four. And select go and we do lose something in our model, but it's not bad, right, so we're still getting a decent overall fit. And that's where we're going to go. Alright, we're going to use that model instead of the more complicated model with the nine latent factors. So I'm actually going to remove this fit. OK, and then we're going to look at this four factor partial least squares fit. What we're looking for down here is that the data isn't spread out in some wild fashion. They are, you know, for the score plots, the data is somewhere close to around this, the fit line, and we're okay with that. And if we're looking at other parts of this, we've got, look again, we're looking at what's...how much of the data is being explained, what are the variations being explained and we're looking at 97% there, almost 99% here for Y, and that's good. Let's look at a couple of other things while we're here, And look at the percent variation plots, which gives us an idea of, you know, how are these things different or how are these spectra are different and we can see that latent factor one is explaining a fair amount of the differences but latent factor two is explaining the better, more important part of that. Alright, so that's where we're kind of dialing into; three and four are still part of the model but they're not as important. So something else we can look at are this variable importance plot. There is a value here. It's a value of .08, right here that dotted red line. If you wanted to do variable reduction, you could do that here. Alright, so you could actually lower the number of Wavelengths you're looking at here, but we're going to leave that as is, right. And the way to do that to actually make that change, to actually do the variable reduction would be through this variable importance table coefficients and centered and scale data, you could actually make a model based on the variables, the important factors. Right. And you can see this again, that dotted line is the cutoff line and a fair number of those wavelengths would be cut out of the model. 
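Fit Model with the Partial Least Squares personality is what's shown here; the standalone platform can also be launched from JSL. A rough sketch with a few hypothetical wavelength columns standing in for all 400; the method (NIPALS) and leave-one-out validation choices are made in the launch or report, and the exact option names may differ from what is written below.

    // Rough sketch: launch Partial Least Squares on the subset table, with a few
    // hypothetical wavelength columns standing in for all 400. Method and validation
    // options (NIPALS, leave-one-out) are then chosen in the dialog/report.
    doeSub = Data Table( "DOE Subset" );
    doeSub << Partial Least Squares(
        Y( :Octane ),
        X( :NIR 900, :NIR 902, :NIR 904, :NIR 906 )
    );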
Right. But again, we're going to keep that, and we're going to go up here to the red hotspot, go to Save Columns, and save the prediction formula. Alright, so that's saved back to the data table. I'm going to minimize that, and we've got this formula out here; that's our new formula. If I go to Analyze, Fit Y by X, go to octane, grab that formula, and say OK, great, we fit the model and our R square is around .99, and that's really great. But the question is, how does that work for the rest of the data? I'll show you that in a minute. Before we get there, I want to show you a separate method, another opportunity, and I'll show you the setup and the model. You would do Analyze, Fit Model; we're going to do Recall, and this time, instead of using partial least squares, we're going to use generalized regression. Select that, we've got a normal distribution, and we can go ahead and select Run. We're going to change a few things here. Instead of using lasso, we're going to use elastic net. Then under the advanced controls we'll make a change in a second. But this validation method, remember we used leave-one-out, so we'll change that. We're going to select the early stopping rule, and we're also going to make this change here under the advanced controls. This is what kind of drives why I even use generalized regression at all: it helps make a simpler model. If you blank out that elastic net alpha and then run your model, when I click Go it steps through lasso, elastic net, and ridge regression, all those steps; it fits, or tries to fit, all those models and then it gives you the best elastic net alpha. Doing that takes a little bit of time because you're building all those different models, so I'll show you the outcome of that in this fit right here, which I had done earlier. This would be the actual output that I got from that model, again with leave-one-out, and it gave me 41 nonzero parameters. If I show you the other model, that partial least squares model has 400 wavelengths. So we've basically reduced the number of active factors by a factor of 10 with this elastic net model. We can look at the solution path and change things, and we can actually reduce the number of factors we want or add more, but for the most part we'll just leave the model as is. We would save this model back to our data table; I've already done that. And now let's compare those. That's this model right here, the information right over here on the left. I passed it too fast. This highlighted column: if I right-click there and go to the formula, I can look at that, and again, these are the important wavelengths for that model for predicting octane. If I look at the partial least squares model and go to its formula, this is the partial least squares model, and again, it uses all 400 wavelengths. So it's a much more complicated model. You're more than welcome to use it; it's actually a very good model, so there's no reason not to use it, but if you can build a simpler model, it's always a good thing. Alright, so we've got these formulas in our new subset table and we want to transfer those back to the original data table.
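Once the prediction formula is saved back, the actual-versus-predicted check is just Fit Y by X. A sketch; the saved column name follows JMP's usual "Pred Formula" naming, but treat it as an assumption and verify against your table.

    // Sketch: compare actual octane to the saved PLS prediction formula column.
    // "Pred Formula Octane" follows JMP's usual naming; verify the actual column name.
    dt = Current Data Table();
    dt << Bivariate(
        Y( :Pred Formula Octane ),
        X( :Octane ),
        Fit Line()
    );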
So again, right click formula, copy this formula, right, and then you would go back to your data table, make a new formula column over here. Right click Go to formula and paste that formula in your data set. All right. Well, I've already done that to save some time. Okay. And we've got, I've got both models there. I've got the partial least squares model in there, and really, what we're going to come down to, is we're going to go to Analyze fit Y by X. And we're going to go to octane rating, right, and I've previously done the PLS analysis. Now this model was built with a...48 samples were the training set and 12 for the validation side. Alright, so that's there. I've got my generalized regression formula, and I've got my octane prediction formula. Actually, this is the other PLS approach right here. And this one. And we're going to add those, and we're gonna say okay and compare those. And now you can see in here where we're doing very well overall. The models are doing well. We're still doing about 97% for our generalized regression model, in the end, which is still good. The PLS model beats it out a little bit, but then, remember, that's a much more complicated model. And overall, you know, we've built this nice predictive model that we can share with others. And as you get new spectra entries, analyze new spectra, all you have to do is drop those wavelengths into your data table and see what the octane rating is. All right. So you've made that analysis, you've made that comparison. And if nothing changes, in the day or over a period, of course, of a couple of days with your calibration, this should be a good model. It should be a sustainable model for you. So with that, I believe, I'm going to go back to, well, let me show you one more thing. I'm going to go to another data set that I wanted to share with you. This is the... as I'm trying to find it...I go to my home window... And this is a mass spec study for, actually it's a prostate cancer study and there's some unusual data with that. Right. And I'll want to show you... there's a couple of different ways, but what I want to show you is, pull this in here. Right. So instead of...this is abnormal versus normal status and... Showing you the power of the tools for...let's go to Analyze and then all the process, model driven multivariate...before I go there, let me color on status. Alright, so we'll do that, we'll give them some markers here. Okay. We're gonna go to Analyze... quality and process, model driven multivariate control chart. All of our wavelengths. Right. It takes a second to output, but I've got all the, right now, looks like I've got all the normal data selected, right, so that's what you're seeing there, if I click there. The red circles are the abnormal data, and for the most part, we see that there's a lot more of those out of control, compared to the normal data, right. The nice thing about this is if I could pull up one of those, I can start seeing which portion of the data is different than what we're seeing with the so called in control or normal data, right, and... Oh, Back to that. There. Gonna show you something else. Go to...we want to monitor the process again or look at the process a little deeper. So let's go to the right hotspot, monitor the process and then we're going to go to the score plot, right. So now we can compare these two groups. Well, we have to do a little bit of selection here. So let's go back to the data table. Right click, say select matching cells. 
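Copying the formula by hand works; a scripted version of the same move uses Get Formula together with the Eval / Eval Expr pattern, a common JSL idiom. Table and column names below are hypothetical.

    // Sketch: copy a prediction formula column from the subset table to the full
    // table. Table and column names are hypothetical.
    doeSub = Data Table( "DOE Subset" );
    main   = Data Table( "Gasoline NIR" );
    f = Column( doeSub, "Pred Formula Octane" ) << Get Formula;
    Eval( Eval Expr(
        main << New Column( "Pred Formula Octane", Numeric, Continuous,
            Formula( Expr( f ) ) )
    ) );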
And we're gonna go back over here and that's all selected, so that we're going to make that abnormal group, our group A, right. Go back to the data table. And scroll down. Select normal. Select matching cells and now that's going to be our Group B so now we can compare those. And now we can see where there's differences in the spectra, like, so this is maybe on the more normal side that you won't see in the abnormal side, right. But you're gonna...there are a lot more differences that you're going to see in the abnormal side that you would not see in the normal side, right. So this allows you to, again, dig deeper and better understand that. And finally, if I do this analysis for the functional data explorer with this grouping... Again rows as functions. Right. Y output. Status is our supplementary variable. Sample IDs, ID function. Say okay. And we'll fit this again with a P-spline model. This will take a second. While we're waiting for this to happen, I'm just going to show you, at the end, the generalized regression portion of this will be done, but I just want to show you what it's like looking at a categorical data set with the functional data explorer. Using that functional DOE capability. It ends up being, could be very valuable. And when you're looking for differences in spectra. And again, this is mass spec data. This isn't your IR data. This is mass spec data. We fit it, we've looked at our different spetra, how it's fit and we're happy with that. We can look at the functional principal components. Can look at the score plots. Let's look at the functional DOE. And again, where do we see differences? If I go over here and we're looking at abnormal spectra. It doesn't have this peak that the normal does, right, so now we can look at that and see, you know, again, help us better understand what differences we might see. All right. And in closing, let's go to back to the PowerPoint. Alright so what this process allows you to do is compress the available information with respect to wavelengths or mass or whatever it happens to be. Use this covariate DOE to help you select the so called corners of the box for getting a good representative sample of data to analyze. Model that data with a partial least squares, generalized regression. You can also use more sophisticated techniques like neural nets. And as new spectra comes in, you put the data into the data table and you see where it falls out. So this is highly efficient or helps you be more highly efficient with your experimentation and your analysis. And again, build that sustainable empirical model. Looking forward, the data that I've used is fairly clean and we're looking at working with the our developers and looking at how we can preprocess the spectral data and get even better analysis and better predictive models.  
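The group A / group B comparison starts from a row selection; that selection can also be scripted, for example (the column name and value are hypothetical):

    // Sketch: select all rows with a given status before assigning them to a
    // comparison group in the score plot (the group assignment itself is done in
    // the report). The column and value are hypothetical.
    dt = Current Data Table();
    dt << Select Where( :Status == "Abnormal" );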
Christian Stopp, JMP Systems Engineer, SAS Don McCormack, Principal Systems Engineer, SAS   Generations of fans have argued as to who the best Major League Baseball (MLB) players have been and why, oft citing whichever performance measures best supported their case. Whether the measures were statistics of a particular season (e.g., most home runs) or cumulative of their career (e.g., lifetime batting average), such statistics do not fully relate a player’s performance trajectory. As the arguments progress, it would be beneficial to capture the inherent growth and decay of player performance over one’s career and distill that information with minimal loss. JMP’s Functional Data Explorer (FDE) has opened doors to new ways of analyzing series data and capturing ‘traces’ associated with these functional data for analysis. We will explore FDE’S application in examining player career performance based on historical MLB data. With the derived scores we will see how well we can predict whether a player deserves their plaque in the Hall of Fame…or is deserving and has been overlooked, as well as compare these predictions with those based solely on the statistics of yore. We’ll confirm Ted Williams really was the greatest MLB hitter of all time. What, you disagree?! Must be a Yankees fan…     Auto-generated transcript...   Speaker Transcript Christian So thank you, folks, for joining us here today at the JMP Discovery Summit, the virtual version. My name is Christian Stopp. I am a JMP systems engineer. And I'm joined today by my colleague Don McCormack, who's a principal systems engineer for JMP as well. And you probably got here because you saw the title of the talk. And you saw this was...you're a baseball fan about Major League Baseball players and wanted in or you saw it was about functional data explorer and you wanted to learn a little bit more about how to employ functional data explorer in different environments. So we're going to marry those two topics today. Don and I and I'm going to gear my conversation a little more for the baseball fans first. Just as we're having kind of common conversations among baseball players and baseball fans, you might think about how your favorite player does relative to other players and you might have with your friends, these conversations and hopefully they're kept, you know, polite about about who your favorite player is and why. And so that's kind of how I imagined this infamous conversation between Alex Rodriguez and Varitek going was just about who...comparing notes about who their favorite player was. And so for me, my origin started off, and like Don's, with respect to just be having a love for baseball and being interested in the baseball statistics that you'll find in the back of the bubble gum cards we used to collect. And so as you have these conversations about who your favorite player is, you might note that players differ with respect to how good they are, but also different things like when they age... as they age, where they peak, like where the performance starts to go off over time. And so as you're thinking about maybe like me the career trajectories of these players, you might want to question, Well, how do I capture or model that performance over time? Now, if you're oddly like me, you decide that you want to pursue statistics so that you can do exactly that. 
But I would encourage you to skip that route and be smarter than me and just use a tool like functional data explorer to help you turn those statistics...statistical curves into numbers to use for your endeavors. So for those of you who are a little less familiar with baseball, but what we'll be seeing is data reflecting things that are measures of baseball performance. So I'm going to be speaking about position players and position players bat. And so one of the metrics of their batting prowess is on-base percentage plus slugging percentage or OPS. And so on the Y axis, I've got that that measure for a couple of different players as they age. And the blue is Babe Ruth and the red is Ted Williams. And as you can see, you get a sense from these trajectories that they both appear to have about the same quality of performance over most of their careers. But you might know that where they peak might seem to be a little at an older age for Ted Williams, as opposed to maybe Babe Ruth. And Babe Ruth, it looks like he maybe needed just a little bit of time to just get up to speed to get to that measure if you're just looking at this plot without any other knowledge. So there's a lot of...this is just two players in the thousands of players or tens of thousands that you might be considering and just look at comparing, you can imagine there's a lot of variability about these characteristics of their career trajectories. So there's also clearly variability within a player's trajectory, too. So I might use the smoothing function of the Graph Builder here and just smooth out the noise associated with those curves a little bit, to get a better sense of the signal about that player's trajectory. And it turns out that that smoothing is is very similar to what's going on in that process that functional data explorer employs. So here I've got functional data explorer and again I'm...my metric here is on-base percentage plus slugging percentage, OPS. And I'm looking just to see...like we're comparing these these player trajectories, now, in FDE is, functional data explorer, is smoothing out those player curves, as you can see, and then extracting information about what's common across those curves. And so for every player now, what you get in return for doing that is, are scores that are associated with that player's performance. And so these scores describe the player's career trajectory in a nice little quantitative way for us to take away and use another analyses like we'll be doing. So it's just, you can see that a little bit, these are Hank Aaron scores. And in the profiler that you'll...that you can access in the functional data explorer, you can actually change...you can look at that trajectory here for that player's OPS over age and then change those values to reflect what that player's scores are and get a better...replicate their their career trajectory with those scores. Right, so that's a little bit about FDE and and how to employ it here. So you'll see Don and I talking about these statistics that we're now equipped with, these player scores that we get out from the functional data explorer, that gets it from those curves that we started off with. And so we're going to use that...some what we're doing is predicting like maybe Hall of Fame status. 
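The smoothing described here is the ordinary Graph Builder smoother; a sketch with hypothetical column names for the prepared batting data:

    // Sketch: overlay smoothed OPS-by-age trajectories for a few players.
    // Column names are hypothetical stand-ins for the prepared batting data.
    dt = Current Data Table();
    dt << Graph Builder(
        Variables( X( :Age ), Y( :OPS ), Overlay( :Player ) ),
        Elements( Points( X, Y ), Smoother( X, Y ) )
    );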
So you'll see Don and me talking about these statistics we're now equipped with, the player scores we get out of Functional Data Explorer from the curves we started with, and what we'll be doing with them is predicting things like Hall of Fame status. And not only whether the players who are in the Hall of Fame belong there, but, more interestingly, who are the players in the Hall of Fame who maybe shouldn't be, because the stats don't support it, and which players the Hall of Fame committee seems to have snubbed. So we'll talk a little bit about the different metrics we used and how we revised them, and then about taking those career trajectories through FDE, getting the scores out, and doing the prediction the way we normally would with anything else. If you haven't followed baseball, the Hall of Fame eligibility requirements are that a player has to have played at least 10 seasons, then wait five years before becoming eligible, and then has a 10-year window during which voters can elect him. So there are a couple of players we'll see who are still waiting for the call. The Hall's selection criteria are primarily about how well the player performed, but they also take into account things that the data source we're using, the Lahman Database, doesn't include and that are hard to measure, so we'll stick with analyses that reflect players' statistical prowess on the field. And of course, after 150 years of baseball, you have to recognize that players played in different eras, so we want to make sure we're comparing players to their peers. We'll take the years they played into account, and the position they played, since different expectations typically go with different positions; and different leagues have different rules, so we'll weigh that in too. That's where I'm going to stop. Don's going to take over with pitching, and then I'll come back and talk about position players. donmccormack So like Christian said, I'm going to talk a little bit about pitching, but before I do that, I'd like to illustrate some of those initial points Christian mentioned: the good data-analytic practices that really need to be done regardless of what modeling technique you use, and that turn out to be good things to do before you model your data with FDE. I'm going to talk specifically about cleaning the data, about normalizing the data so you can compare people on equal footing, and finally about modeling the data. As an illustration, what you see on the screen right now is three very different pitchers who are all in the Hall of Fame. The red line is Nolan Ryan, a very long career, about 27 years. The green line, the middle one, is Hoyt Wilhelm. Some of you younger folks might not know who Hoyt Wilhelm is; he pitched from the early 50s through 1972, I believe. A fairly long career that spanned multiple eras. He was mostly a reliever, but not like the relievers you know today; when he came in to relieve, he might pitch six innings. Very atypical compared with today's relievers. The blue line is Trevor Hoffman, the great closer for the San Diego Padres, and again a very different pitcher. So the question is: how do we get this data ready and set up in such a way that we can compare all three of these pitchers equally?
The first thing I mentioned is cleaning up the data. By the way, I'm going to use four different metrics: WHIP (walks and hits per inning pitched), strikeouts per nine, home runs per nine, and a metric I created called percent batters faced over the minimum, where I take the number of batters a pitcher faced, divide by the total outs they recorded, and subtract one. The idea is that if every batter a pitcher faced made an out, that ratio would be a perfect one and the metric would be zero. So I'm going to look at those four metrics. I've got criteria for how I define my normalization and how I screen outliers, and I'll include a PowerPoint deck with the details, but I won't go through them here for the sake of time. So the first thing I'm going to do is clean up the data. You'll notice, for example, that in his very first year Nolan Ryan pitched only three innings, with a very, very high WHIP, and there are a couple of seasons in here where, I think, Trevor Hoffman pitched very little as well. So I'm going to start by excluding that data. That's nice: it shrinks the range, and it's always good to get the outliers out of the data before you do the analysis. One other step I want to mention is that when I used FDE on this data, the platform allows you to do some additional outlier screening where, even if you're using multiple columns, you're not screening out the entire row, only the values for that given input, which is a very nice feature. I used that as well, because even after my initial screening there were still a few anomalies I needed to get rid of. So that's cleaning the data; normalizing it is the second step. By normalization I mean that I normalized on the x axis and on the y axis. What we're looking at here is the number of seasons, each taken as a separate, whole entity, but we all know that in some seasons a pitcher throws more innings than in others. So rather than using seasons as my entities, I'm going to use the cumulative percent of career outs: I know that at the end of each season a pitcher has recorded so many cumulative career outs, and that's a certain proportion of his total career outs, so I'll use that to scale the data. The great thing about that is that now all three pitchers are on the same x scale; everything runs from zero to one, which is a really nice thing to have from the standpoint of an FDE analysis. And then finally I want to scale on the y axis as well. All I've done is divide the WHIP by the average WHIP for that pitcher type and for the era he pitched in, so I have a relative WHIP. The other nice thing about using these relative values is that I know where my line in the sand is: a pitcher with a relative WHIP of one is an average pitcher, so in this case I'm going to be looking for the guys with WHIPs under one. And you'll notice that all three of these pitchers, for most of their careers, were under that line at one.
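As a concrete illustration of those cleaning and normalization steps, here is a minimal pandas sketch. It assumes one row per pitcher-season with illustrative column names (player, season, outs, bf, whip, pitcher_type, era), and the workload cutoff is a placeholder rather than the actual screening rule from Don's slide deck.

```python
# Sketch of the cleaning/normalization steps: a derived metric, outlier screening,
# an x axis rescaled to cumulative percent of career outs, and a relative WHIP.
import pandas as pd

def normalize_pitching(seasons: pd.DataFrame, min_outs: int = 30) -> pd.DataFrame:
    df = seasons.copy()
    # "Percent batters faced over the minimum": batters faced per out, minus one
    # (zero if every batter faced made an out).
    df["pct_bf_over_min"] = df["bf"] / df["outs"] - 1
    # Clean: drop tiny-workload seasons that produce extreme values.
    df = df[df["outs"] >= min_outs].sort_values(["player", "season"])
    # Normalize the x axis: cumulative percent of career outs, running from 0 to 1.
    career_outs = df.groupby("player")["outs"].transform("sum")
    df["cum_pct_outs"] = df.groupby("player")["outs"].cumsum() / career_outs
    # Normalize the y axis: divide by the average WHIP for that pitcher type and era.
    era_mean = df.groupby(["pitcher_type", "era"])["whip"].transform("mean")
    df["relative_whip"] = df["whip"] / era_mean
    return df
```

A relative WHIP of one is then the "line in the sand" for an average pitcher, which is what makes the career-level comparisons later in the talk possible.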
Now the final thing I want to do is use FDE to model that trajectory. There are two problems with using the data as is. One is that it's pretty bumpy, and it would be really hard to estimate a career trajectory with all of these ups and downs. The second is that eventually I want to use the trajectory I get from FDE to come up with some overall career estimate, so rather than treating seasons, or cumulative percents, as discrete entities, I want to be able to model performance over the entire continuous career. We'll see that a little later on. So I'm going to replace my relative WHIP with this conditional FDE estimate. Now, seeing me flip back and forth between the two, you might say, boy, that's a huge difference; is it really doing a good job? It's hard to tell from that graph, so let me show you what it looks like. Here I've pulled up the discrete measurements for Nolan Ryan along with the curve for his conditional FDE estimate. You'll see that it doesn't follow the same jagged, bumpy path, but it does a good job of estimating his career trajectory: his WHIP was high at first because he walked a lot of people, a very wild pitcher, much more wild in the beginning part of his career, believe it or not, and as his career went on that got better. You'll see this in any of the pitchers I picked. For example, let's go to Hoyt Wilhelm. Here's Wilhelm: again, it doesn't capture the absolute highs and lows, but it does a good job of modeling the general direction his career took. Okay, so let's use that to ask some questions. I only have a limited amount of time; I wish I had more, because there are some neat things I could show you, but I'm going to start with what I call the snubbed. I used FDE on those four metrics I mentioned, used the scores as inputs along with the pitcher type, and tried a whole bunch of predictive modeling techniques. The two that worked best for me were naive Bayes and discriminant analysis, and I used those two techniques to tell me who should be in and who shouldn't. What we're looking at here are the pitchers where both the naive Bayes and the discriminant analysis said yes, but the Hall of Fame said no. These are my snubbed. Let me switch over to the relative WHIP, the conditional WHIP, and put that reference line back in at one, and you'll see that, for the most part, these are pitchers who spent the bulk of their careers under that line at one.
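That "both models say yes, the Hall says no" screen can be mocked up generically outside JMP. The sketch below uses scikit-learn stand-ins for the two techniques Don names; the FPC-score feature columns and the 0/1 in_hall column are assumed names, and in practice you would cross-validate rather than score the training data.

```python
# Flag "snubbed" pitchers: both a naive Bayes and a (linear) discriminant model
# predict Hall of Fame membership, but the player is not actually in the Hall.
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def flag_snubbed(scores, feature_cols, target_col="in_hall"):
    X = scores[feature_cols]
    y = scores[target_col]                      # assumed 0/1 coding
    pred_nb = GaussianNB().fit(X, y).predict(X)
    pred_lda = LinearDiscriminantAnalysis().fit(X, y).predict(X)
    mask = (pred_nb == 1) & (pred_lda == 1) & (y.to_numpy() == 0)
    return scores[mask]
```

Flipping the last condition (both models say no, but the player is in) gives the "gifted" list that comes up later in the talk.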
Now the other thing you might think, looking at this data, is: wow, it would be really hard to tell these players apart. How would I compare them if I were to add, say, a few pitchers who are in the Hall to this list? It would be hard to separate them just by eyeballing the curves, because in some parts of their careers some would be better than others, and in other parts they would switch. How do I deal with this at a career level? As I mentioned earlier, one of the nice things about Functional Data Explorer is that I can take that data, create a career trajectory, and estimate a whole bunch of data points along that trajectory. And I did that: I broke each career into 100 units and summed over all hundred units for each of my curves, so basically I got something like an area under the curve. If a point was above that line at one, I would add; below the line, I would subtract. And if we look at total career trajectories, this is actually a larger list, approximately 1,300 or 1,400 pitchers, absolutely everyone who was Hall eligible with 10 years or more. So let's quickly go through a couple of things we can do with this. Let me start with the players who were snubbed; these are my snubbed players. These sums are built from 100 values, so the line in the sand here is 100, because I've got 100 different values I've measured, and you'll notice that for the most part these players were above 100. Here's the list of the players who didn't make the Hall. If you look at them, there are a couple of guys who are obvious: people like Curt Schilling and Roger Clemens are not in there for non-career reasons, some of the other criteria Christian mentioned. And there are some guys, Clayton Kershaw for example, who are still not done with their careers. But there certainly are other people you might consider who are Hall eligible. So let's look at that too, at the folks who are Hall eligible and have not been voted in: BJ Ryan; again, Curt Schilling is in there; Johan Santana, and I'm not sure why he didn't make it into the Hall; Smokey Joe Wood, a pitcher from the early 1900s; and so on. So the ability of FDE to let me extract values from anywhere along a career trajectory gives me an extra tool for building additional criteria around who belongs in the Hall and who doesn't. Enough said about the pitchers; I'm going to turn it back to Christian so we can talk a little more about the position players. Christian Excellent, thank you, Don. Okay, so Don was talking about the pitchers, and I'll be looking at the position players, and there are two different components that go into that: you have your batting prowess as well as your fielding prowess. I took a little different take than Don did with respect to the statistics and then building the models. I started off with just four of the more common batting statistics, the first four on the list here, the kind of thing you'd find on the backs of baseball cards. Then, as I progressed, as we'll see, I needed something to capture stolen bases, because the first four don't do that at all, so I created a metric I call the base unit average, which brings in other base-runner movements to give credit to the batter for those things.
And then fielding, of course, is a factor as well, as we'll see, so I included a couple of metrics for fielding. Like Don mentioned earlier, I wanted to make sure I was comparing apples to apples, so I'm looking at those statistics with reference to position, league, and year. And, like Don, I wanted to make sure I weighted the smaller sample sizes appropriately so they weren't gumming up the system. So I ended up weighting each player's performance by his number of plate appearances relative to the average for that league and year at a particular lineup slot, that is, how many plate appearances that slot would typically get over the course of a season. That's how they're weighted. Right, so let's see what that looks like. We're going to go back and visit Ted Williams again. We've got Ted Williams' career on the left here, and these are the raw scores, and it looks like he had a really poor season here. But once you take the relative view of it, you can see it was actually an average season; as with Don's metrics, it's still above that average line of one. It was just a poor season by Ted Williams' own standards. And we saw earlier that these two peaks for Ted Williams might have looked like his peak performance, but it turns out those are seasons where he had a smaller number of plate appearances because he went off to the Korean War. That affected his scores, and I weighted them back toward the average because of the smaller sample sizes. So those are the types of data. I'm going to focus on the relative statistics in my part of the talk and just highlight some of the things that caught my eye. There we go. We need the table of numbers to feed in from FDE, so here are the scores we'll be looking at, the relative FPC scores from FDE. The first thing I saw: I included four variables in my model, those first four batting statistics, and I wanted to make sure I had the right components in my analysis. On the y axis here is the model-driven probability of being in the Hall of Fame, and on the x axis is whether or not the person actually is in the Hall of Fame, so my misclassification areas are these two sections here. I noticed there were more players down here than I was expecting, so I started exploring variables I hadn't yet included, like stolen bases, and I popped those in for color and size. As you can see, it seemed pretty clear to me that stolen bases is definitely a factor the Hall of Fame voters were taking into account; the color and size here reflect the number of stolen bases over a career. This is what drove me to create that base unit average statistic that I then used. So, as I was exploring those models, I started off with four statistics and then added in that BUA statistic; this is my x axis now. Then I added in the fielding statistics, and what we have here is a parallel plot, where the y axis is again the probability from the model that the player should be in the Hall of Fame, and each line is now a player. The color represents Hall of Fame status: red is yes, already admitted, and blue is no.
I like this plot because it lets me see who's moving, so I can see the impact of those additional variables in the model. Of course the first thing that caught my eye was this guy here, who pops up from a "not really" once you add the stolen base component, and we can see that he has a high probability of being elected to the Hall of Fame, or of belonging there, depending on how you look at it. It's Rickey Henderson, who happens to be the career leader in stolen bases. Another player, looking at the defensive side of things, is Kirby Puckett. His initial statistics, based on the initial model, suggest he makes it, qualifying just across the line. But if you add in the stolen base component, he actually doesn't seem to qualify any longer. And then finally, when we put in the fact that he was a really good fielder, he won a number of Gold Gloves playing center field for the Twins, we see that he's back in the good graces of the Hall of Fame committee, and rightfully voted in. This is kind of a messy model, or not a messy model, there's just a lot going on here. So I ended up adding a local data filter so I could look at each position individually, and here, for first base, it's a lot easier to see the folks in red and the folks in blue. Now we've got somebody here, Todd Helton, who, at least in all the models we were looking at, should be admitted to the Hall of Fame, and he's still eligible, so he's still waiting for the call. And someone like Dick Allen, also in blue, not in: his numbers, at least based on the summary stats, the FDE statistics we're using, and the models, suggest he belongs in the Hall. And there are other folks in red down at the bottom, like Jim Thome, whom the models suggest doesn't really belong but who was voted in. So, those are different ways of exploring the relationships as we add in the predictors. Now, like Don, I wanted to get a sense of who was snubbed and who might have been gifted, or at least had non-statistically-oriented components to his consideration. So, like Don, I ran a number of models and settled on four that I liked and that did the best predictive job, and, like Don, rather than just using age as the x axis in my FDE, I also used cumulative percent of plate appearances. Having those two variants gave me a number of models to look at. So I drilled down to just the folks who, across all eight models, are in the Hall of Fame but whom none of the models suggest should be; that's this line here, and there are 31 of them. On the reverse side, in green, are the folks whom the majority of models in either bucket, age versus cumulative percent of plate appearances, suggest do belong in the Hall of Fame but who are not in. So I pulled all these folks out, and, just like Don, wanted to compare what their trajectories look like: are they at least close, or is there something else going on here? What you can see here is on-base percentage plus slugging percentage, OPS, again.
It certainly looks like the folks who were snubbed, in red with the plus signs, performed a lot better on this metric, and, as it turns out, on every other offensive metric, than the gifted folks, the ones who are in but whom the models suggest shouldn't be. That made me think: is it just the offensive stats, and maybe fielding is where the folks who are already in shine? Based on fielding percentage, at least, the same pattern still holds: the gifted folks still don't look like they belong as much as the snubbed folks do. It was only on the range factor component that the tide reversed, and there you see the gifted folks come out ahead of the snubbed folks. So that's another take, much like Don's, that you can use to evaluate the components included in your model; there are a lot of different ways we can look at the data here. So, just wrapping up, because I'm sure some of you are burning to know who was snubbed and who was gifted: these are some of the folks who were snubbed, at least among the position players, and, as Don mentioned for some of his pitchers, a few of these folks are banned from baseball, so they're not exactly snubbed. You probably recognize some of these names. And these are some of the players who were gifted, or at least whose statistics alone may not have been what got them into the Hall of Fame. So, just wrapping up where we've been: we've been able to take player career trajectories of performance on a chosen metric, put them into Functional Data Explorer, and get out numerical summaries that capture the essence of those curves, and then, in turn, use those scores in the traditional statistical techniques we're familiar with. So now we can change the question from how to model or quantify a career trajectory to what we want to explore with the FPC scores we've got. We hope you enjoyed talking about baseball and the intersection of baseball, JMP, and FDE, and we hope you feel empowered to take the FDE tool that's available in JMP Pro and address questions with data, like who your favorite player is and why, and have the means of backing it up. Thanks for joining us. Take care. donmccormack Okay, so how do we deal with these cases where we need to look at somebody's career trajectory? Are there other metrics we can use to make these comparisons, so that we can tell these really fine gradations apart? As I alluded to earlier, we can certainly look at absolutely any point along a person's career trajectory, with any amount of gradation we want. And I did that: I took 100 data points, 100 values between zero and one, from the start of the career to the end, and I summed over all those values. The nice thing about this technique is that I can do it for multiple metrics, so what we're looking at now is a plot of all four metrics; we can plot them all on one graph. And we're going to go back again to that group of folks who were snubbed, these folks here.
So if we take a look at these folks, we see that they had a low... by the way, the reference is 100 in this case because there were 100 observations... home runs per nine, you want that low; percent batters faced over the minimum, low; and strikeouts per nine innings, you want on the high side. You'll notice that's roughly the pattern these folks follow. Now, the interesting thing at this point is that I can use any criteria I want. For example, let's say I'm going to consider all my players and only keep the people whose WHIP total was below, in this case, 100, so better than average; actually, let's make it stricter than that, say 90 or below. Then let's look at the folks who have at least the average number of strikeouts per nine innings, and whose percent batters faced over the minimum is at 100 or below, and I'll disregard home runs per nine here. You could also standardize and normalize by the number of seasons, and I've done exactly that, so I want to look at players with, say, at least 10 season equivalents, where a season equivalent is based on what an average player season looks like. And finally, what kind of workload they had over their entire career; let's say we want somebody who had at least 80 percent... let's make it a bit stricter, say about the same workload. Again, we can use different criteria to weed out the folks we don't think we should consider from the folks we do, and then, using those criteria, let's also look only at the folks who are not in the Hall of Fame. So here we go: now we have a list of people worth considering, and you'll notice there are quite a few folks on it who probably shouldn't surprise you. These are folks who are either not in the Hall yet because they're still playing or who have just been passed over. Chris Sale, for example, is still pitching. Curt Schilling, for obvious reasons, is not in the Hall. Johan Santana, why isn't he in the Hall? He was actually part of that group that was snubbed. So the nice thing about using these FDE results is that you can turn them into career trajectories and then use an additional metric to help determine Hall worthiness and non-Hall worthiness.
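Don's career-level summary and the criteria screen he demonstrates can also be sketched briefly. The curve_fn callable below stands in for a fitted relative-metric curve saved out of FDE, and the column names and cutoffs only mirror the numbers mentioned in the talk; they are assumptions, not the actual saved formulas or data.

```python
# Career-level summary: evaluate a fitted relative-metric curve at 100 evenly spaced
# points over the career (0 to 1) and sum them. For a metric whose average value is 1,
# the "line in the sand" for the sum is 100.
import numpy as np
import pandas as pd

def career_total(curve_fn, n_points: int = 100) -> float:
    grid = np.linspace(0, 1, n_points)
    return float(np.sum(curve_fn(grid)))

def candidate_screen(totals: pd.DataFrame) -> pd.DataFrame:
    """totals: one row per pitcher with assumed columns rel_whip_total, so9_total,
    pct_bf_total, season_equivalents, and a boolean in_hall."""
    return totals[
        (totals["rel_whip_total"] <= 90)         # clearly better than average WHIP
        & (totals["so9_total"] >= 100)           # at least average strikeouts per nine
        & (totals["pct_bf_total"] <= 100)        # no worse than average batters faced
        & (totals["season_equivalents"] >= 10)   # roughly a Hall-eligible career length
        & (~totals["in_hall"])                   # not already in the Hall
    ]
```

Summing evaluated points rather than raw season values is the design choice that lets every metric, and every pitcher, be compared on one common 0-to-1 career scale.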
Monday, October 12, 2020
Melissa Reed, MS Business Analytics and Data Science, Oklahoma State University   This project is about the early presidential primaries and how the results of those primaries affect who wins the Presidency. The research focuses on presidential primaries where a new President was elected: the elections of 1992, 2000, 2008, and 2016. The elections of 2000, 2008, and 2016 are examined because no incumbent was running; in the 1992 election, however, Bill Clinton defeated the sitting President, George H. W. Bush, to win the Presidency. The 1992 election is included because George H. W. Bush is the only President who has not been re-elected since the Cold War ended in 1991. The specific primaries examined are the Iowa Caucus, the New Hampshire Primary, and Super Tuesday, because they come early in the election cycle and help predict the rest of the country's primaries. The hypothesis for this research is that the candidate who wins most of the early presidential primaries wins the candidacy and the Presidency. JMP software is used to test the hypothesis. The research concluded that the person who wins the most primaries will most likely become the party's candidate but will not always win the Presidency.     Auto-generated transcript...   Speaker Transcript melissareed Hello, my name is Melissa Reed and I will be presenting my poster about the early presidential primaries. I am from Oklahoma State University. A little bit of background: a lot of people aspire to be President of the United States, but not many actually run for it. Campaigns usually start about two years before the November election, but a lot of campaigns do not make it to the Republican and Democratic National Conventions, either because they didn't get enough votes or, much of the time, because they ran out of money beforehand. The early presidential primaries this poster focuses on are the Iowa Caucus, the New Hampshire Primary, and Super Tuesday. These were chosen because they are three early primaries and, typically, the way these go, the rest of the country will follow; they are just really important. So the hypothesis for this project is that the person who wins the most votes during the early presidential primaries will more than likely win the Democratic or Republican candidacy for President of the United States. The elections of 1992, 2000, 2008, and 2016 were chosen because a new president won the Office of the President of the United States. 1992 is included because President Bill Clinton defeated the sitting president, George H. W. Bush, and George H. W. Bush was the first president not to be re-elected since the Cold War ended. 2000, 2008, and 2016 are included because there were no incumbents running. You can see on the poster that in 1992, 2000, and 2016, the candidates who won the Democratic and Republican candidacies for President of the United States were the two people who had the most votes across those three early presidential primaries. However, in 2008 Barack Obama and Hillary Clinton were the two candidates who got the most votes, but because they are both Democrats they could not both get the candidacy, and the Republicans nominated John McCain.
So to do the analysis, I used JMP to run a correlation analysis and a logistic regression. I ran the correlation analysis between the year and how many votes were cast, to see if there was a connection between them, and it showed that there is a connection between the year and the number of votes cast. I ran the logistic regression on the candidate, the year, and the state primaries to see which candidate was most likely to beat the other candidates. The results of that regression are down below in the results section. In 1992, the New Hampshire Primary was the one I focused on, because George H. W. Bush did not have anyone running against him in the Iowa Caucus, so I chose New Hampshire for that year. Across the elections of 1992, 2000, 2008, and 2016, the logistic regression showed that the person most likely to win is not always the candidate who actually got the most votes. In 2008, Barack Obama was shown to win some of the contests between him and Hillary Clinton, but not against everyone else. In conclusion, the person who wins the most votes in the Iowa Caucus, the New Hampshire Primary, and Super Tuesday will most likely win the Democratic or Republican candidacy and will run for President of the United States. In 2008 there was a difference, because the two people who won the most votes in those primaries were both Democrats. Thank you so much.
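As a rough illustration of the kind of logistic regression described here, the sketch below fits a nomination outcome against early-primary vote shares. It is not the JMP analysis itself: the table and its columns (year, vote shares in the three early contests, and a 0/1 won_nomination indicator) are assumed for illustration, and with only a handful of election cycles you should expect wide intervals or separation warnings.

```python
# Minimal logistic-regression sketch: does doing well in the early contests
# predict winning the party nomination?
import pandas as pd
import statsmodels.formula.api as smf

def fit_primary_model(primaries: pd.DataFrame):
    """primaries: one row per candidate-year with assumed columns
    won_nomination (0/1), iowa_share, nh_share, super_tuesday_share, year."""
    model = smf.logit(
        "won_nomination ~ iowa_share + nh_share + super_tuesday_share + C(year)",
        data=primaries,
    ).fit()
    return model

# Usage sketch:
# model = fit_primary_model(primaries)
# print(model.summary())
```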