Wayne Levin, President, Predictum
Farhan Mansoor, Software Engineer, Predictum

Innovation in industry requires the contributions of analytical knowledge – more specifically, formal and informal experimental data and predictive models – to product and process design. However, analytical knowledge is often stored, unmanaged, in isolated sources for use only by its creators. By our estimation, analytical knowledge is typically regenerated on average about 40 percent of the time, simply because prior, relevant knowledge was not made accessible to the people who could make effective use of it. Regenerating analytical knowledge carries higher risks, incurs unnecessary costs and delays the achievement of business goals. The future success of technical problem solving and process innovation requires a modern knowledge management strategy. Companies that adopt such a strategy will dramatically reduce the time and effort in the daily work of engineers and scientists, not only preventing the needless duplication of experiments but also extending the use of past experiments and improvement initiatives. Wayne and Farhan will present a use case to demonstrate the power of managing analytical knowledge through CoBase, an enterprise-level knowledge management system that enables engineers and scientists to share access to, and collaborate on, past analyses and relevant supporting data.

Auto-generated transcript...

Wayne Levin Well, thanks very much for joining us. My name is Wayne Levin, and joining me in our presentation about knowledge management for faster problem solving and reduced time to market in engineering and science is my friend and colleague, Farhan Mansoor. Farhan is going to help me with the demonstration part of this. Why don't we start with a little agenda here. I'm just going to give a quick introduction to us (we're Predictum, as a company), just so you know a little bit about us. And the real focus is going to be how to improve productivity in science and engineering. That's what we're here to talk about today. So, a little bit about us. Our goal is to accelerate problem solving, improvement, and research and development. We do that through analytical training and consulting, and we build integrated analytical systems. We're going to look at an example of that today, with CoBase. Just so you know, we're also a JMP partner, and have been for a long time now; we've been associated with JMP for close to 25 years. Predictum as a company started in March 1992, so almost 29 years. So that's a little bit about us; now let's get to the matter at hand. I want to talk about how managing knowledge, not just data, as an asset will dramatically improve productivity among your researchers, engineers, and business analysts. And so I'm going to start by asking you a question: how often does someone regenerate what was already known in your company? If you had to put that on a scale between zero and 100, what would you say? I'd just like you to plant a number in your head, okay? There's no need to confess. So it's some problem that was solved, but it was solved before, and others have dealt with that problem; or some insight, some relationship or association between variables, and chances are others have probably made that discovery already. That's what I mean by regenerating knowledge.
Do you have a number in mind? Well, we've been asking companies, dozens of companies now, since the fall of last year, and this is what we've seen: it's on average about 40%, with a pretty big range as far as that goes. So nobody really has a solid number for that, and I think that actually speaks to the problem, that we don't have a solid answer. But when thinking about it, we have some people who just roll their eyes and go, all the time. We had one company a few weeks ago where three people were on the call. One said 80%, one said 60%, one said, oh, at least 50. And we do get some who are lower; the lowest I've ever heard is about 20%. Why do we tolerate that? In what other aspect of business would we tolerate something like that? So, regenerating what was already known imposes higher risks, it costs money, it delays objectives, and it's really a lost opportunity to accelerate problem solving, improvement and R&D. I want to make sure we're clear: this is the problem we're trying to tackle here. Now, you probably know that expression. I'm not going to fill in the blank; I'll leave it to you. Blank happens. So you've probably experienced these situations: the phone rings, an email comes in, there's some customer issue or some production problem, or some obstacle has surfaced and it's causing delays in a new product introduction or in process development and characterization. And you're frantic, right? You've got to take care of this, and you're thinking, okay, who knows how to solve this? Who's got that knowledge? Basically you're trying to identify the right people, and that can be difficult; sometimes it's hard to do that. Or you can identify some good candidates and they're not available, or they've retired, or they're away on vacation, or they've been reassigned and you're not allowed to talk to them anymore. They're simply gone; they're just no longer with the organization. How do you handle that? This is a problem we want to avoid, and this is why we say companies typically manage their materials and spare parts better than they do their knowledge. I hope that doesn't sound too harsh, but I'd like you to think about that for a moment. I was asking you earlier about how much knowledge is regenerated, and you probably don't have a hard number for that, but I'll bet you probably have a pretty good number on what your work-in-progress inventory is, or spare parts inventory, or raw materials inventory. You can probably find somebody and... So why is it that we manage those assets very well, but we don't manage knowledge very well? And that's because knowledge is... well, this is where it's at, right? The brain. We like to say the brain is a great knowledge creator, but it's a lousy knowledge container. We can't access it. We can't index it. It walks out the door at the end of the day. So knowledge cannot accumulate if the brain is the primary storage device for knowledge. And we believe that companies should always accumulate knowledge. So let's look at what's involved in knowledge creation. First, what is knowledge? What do we mean by knowledge? Knowledge is what allows us to predict. At its very core, we're talking about a prediction formula, because we use a prediction formula to predict.
And so we can associate height to weight, or height, age, and sex to weight, and all of us who use JMP are familiar with these things. So that's the third component, if you will: a predictive model, a prediction formula. But when we have that, we would like to know something about the analytical method or the process that was used to generate that prediction formula, and of course, those of us with JMP have those; we save them as scripts so that we can regenerate that formula. The formula itself would be saved as a column in the data table, and that's one of the things we do with the analysis. So that's what's happening with number two, and of course the primary thing, the first point, is just data. But as I said at the beginning, data is not knowledge. Data is data. It's the raw material that we get from instrumentation; it helps us understand what's going on with products or with processes. So we really need these three components. Knowledge is what allows us to predict, but we would like to go back a couple of steps as well, to really have the full context of that knowledge. So let's think about knowledge creation. We have engineers, scientists, analysts; I'm going to think of it from a scientist's point of view. They are using their instruments, doing their work, collecting their data. Now, we find all too often it's kept on spreadsheets, but of course we're talking JMP data tables as well, and that's a terrific thing. They collect the data, they do their analyses, and this is why we call it personal computing. That picture of the computer there is just meant to remind us that personal computing has been with us a long time, and it's relevant, it's important, it's still the case. Analysis does happen in a brain. It's a personal endeavor that really can't be split up, at its core. So we've got a bunch of people who are doing this, and as a result, this work is typically siloed. Where do these files go? Some people will save them on SharePoint or shared network drives. That's great, but primarily they end up on a laptop, and so the work is inherently siloed, just because that's the nature of it. And researchers can't easily access the experience of others. If we've got a problem we want to solve and we're thinking about it as an experiment, we might call together a group of, let's say, five people in a room for an hour, or ten people for an hour, and brainstorm what factors to include, what levels to go to, and all that. It's a good and worthy activity, but wouldn't it be better not to start from zero and not to occupy those people? That's a good amount of time being used. So what we'd like to do is be able to take the experience saved in these JMP data tables and related files (and Farhan's going to show us this) and make it easy to put it in a database. This is what we call CoBase, and that way, new initiatives don't start from zero. Anybody can go in and look up who else has looked at a particular problem or a particular area. And this way, they'll never unknowingly pay for the same insights more than once. This is an important point as well.
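As a way to picture the three components Wayne describes (the data, the method that produced the model, and the prediction formula itself) bundled into one identifiable package, here is a minimal, hypothetical sketch in Python. The field names, file names and tag values are illustrative assumptions only, not CoBase's actual data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KnowledgeBundle:
    """One packaged unit of analytical knowledge: data, method, and model."""
    data_table: str                  # path to the JMP data table (or an export of it)
    analysis_scripts: List[str]      # saved scripts that regenerate the analysis
    prediction_formulas: List[str]   # formula columns produced by the analysis
    factors: List[str] = field(default_factory=list)
    responses: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)

# Hypothetical example: a deposition experiment packaged so it can be indexed
# and found later (response name and script name are made up for illustration).
bundle = KnowledgeBundle(
    data_table="Exp 18-11-01.jmp",
    analysis_scripts=["Fit Least Squares"],
    prediction_formulas=["Pred Formula Film Uniformity"],
    factors=["Deposition Rate", "Temperature"],
    responses=["Film Uniformity"],
    tags=["CVD Improvement", "Series One"],
)
print(bundle.factors)
```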
I find too often in experimentation, in the work that folks do, because it's siloed, they don't know that what they may be seeing is inconsistent with what others have seen in the past. This way they can look and see whether maybe they're dealing with Type I or Type II errors, and it gives them another dimension that really should be considered, that historical dimension that's often not brought forward, because typically it can't be. So when we talk about capturing, preserving and reusing knowledge, obviously there's data, like I said. Many companies will have databases, of course, so they'll have LIMS systems, and this is terrific; this is good for preserving data. Some will have a formal document management system, and if not, they'll have some way of organizing and filing reports: PowerPoints, standard operating procedures, the results of the analyses that engineers, scientists and other analysts are doing. But that's where we want to focus. We want to talk about preserving that knowledge creation work and making it identifiable, making it so others can find it. If we want to manage knowledge like an asset, it requires that you bundle it, first of all, with data and reports (that would be a good idea), as much of that as is relevant, and package it. Because when it's packaged, it's identifiable. I keep something on my desk here. I bought a USB cable from Amazon. It came in this box; I hope this shows up all right. It's got a barcode there. It's an asset. It was to Amazon, and it is to me now, and it's identifiable, so we can search for it. And if you can search for it, then you can retrieve it, and if you can retrieve it, then you can reuse it. You can reference it, you can challenge that knowledge (we think all knowledge is open to be challenged, of course), and we can improve on that knowledge. And finally, managing it as an asset means that we keep it secure, that the knowledge is under your control. Okay, so we have a couple of products in this area. We're going to switch over to the demonstration part of this talk. First is SashLab. We're not going to demonstrate that, but if you want to read more about it, you can download these slides and read up on it, and of course you can contact us. What it is, is a virtual lab or a digital twin. It allows you to experiment or check things out virtually before making changes or experimenting physically. What CoBase is designed to do is capture everyday research, experimentation, analyses and improvement initiatives, whether they're formal or informal. It puts them into a database and tags them. Farhan is going to show us the tagging, and remember, I said the knowledge needs to be identifiable, so tagging is one way; indexing by factors, by responses and by domains is another. We'll talk a bit about that, and this way anybody can go and look up the knowledge that was generated by others. And both of these applications, as it says at the bottom here, capture explicit knowledge as an asset. You want to keep it explicit. We don't want to just have opinions or notions about what's happening; when we ask someone, we'd like to say, hey, show me the data, the method by which it was modeled, and the resulting model. It's hard that way.
It's explicit. So we've got the model, we've got the development method, we have the underlying data, and we make them available for reuse by others. This way it avoids the delays and costs, and I'm going to say the anxiety, associated with searching for and regenerating knowledge. So if we could, Farhan, why don't you take over the screen, and we'll begin a demonstration here. While Farhan's bringing that up, let me just add one of the things we hear from people when we talk to them: it's hard to search for things to begin with, but it's really hard to search for things that you don't know exist, because that kind of search can go on forever. This is kind of what I mean by the anxiety. It's tiring to have to deal with that, and when we are dealing with re-search, we would like that research not to have to involve searching for past knowledge. We want to make that easy. So, Farhan, you've got the CoBase interface, the primary GUI, open? Farhan Mansoor Yes, the homepage, yep. Wayne Levin So Farhan, let's go and do a search, because I've got a problem here. Just before we do this, let me describe a little of what's going on. We're not going to go into all the detail here; we just don't have the time for it. The example we're using is a manufacturing process. Down in the bottom left, we have the various steps in semiconductor manufacturing, and it's just a way of grouping factors. So what I'm interested in is, let's say, the deposition step here. And I'm curious: if we look at deposition, let's look at the deposition rate, which is a factor here. I just want to know, has anybody looked at this in the past, deposition rate, let's say between 700 and 3,000 angstroms? There could be a hundred reasons why I want to know this, and I just want to know, well, what did they look at? What were the other factors they were looking at? What were the responses? I just want to see what's there, because I want to understand this factor. So go ahead, Farhan, you fill this in... great. You clicked on search, and there we are. We've got a bunch of files; we've got about seven files there, and just looking at them, one goes back as far as 2016. So this is work that people have done previously; they've uploaded it to CoBase. Now, it can be hard to look at all those files at once. We can download them, and we will do that momentarily, but on the right, we're looking at them at a glance. Farhan, let's look at the parameter distributions first of all. We see that three of these JMP data tables also included argon flow, which was looked at over the same levels in each, and backside flow was involved in a couple. There's deposition rate, so there are seven across there, and we see the different levels there, but they're all between 700 and 3,000, and there are some others as well. We can also look at just some statistical summaries. We're going to be adding more here, but the idea is just to be able to look at some things at a glance, so we get an idea of what's going on. So on the left, we have R squares; on the right, we have root mean square errors.
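To make the kind of lookup Wayne describes concrete, here is a small, hypothetical Python sketch of searching a parameter index by factor name and level range. The table layout, the file names other than Exp 18-11-01, and all the values are made up for illustration; the talk does not show CoBase's actual schema or query logic.

```python
import pandas as pd

# Hypothetical parameter index: one row per (file, parameter) pair, recording
# the range of levels that parameter took in that study.
index = pd.DataFrame([
    {"file": "Exp 16-03-12.jmp", "parameter": "Deposition Rate", "low": 800, "high": 2500},
    {"file": "Exp 18-11-01.jmp", "parameter": "Deposition Rate", "low": 700, "high": 3000},
    {"file": "Exp 18-11-01.jmp", "parameter": "Temperature",     "low": 350, "high": 450},
    {"file": "Exp 19-02-07.jmp", "parameter": "Argon Flow",      "low": 20,  "high": 60},
])

def find_studies(param, low, high):
    """Return files whose levels for `param` fall inside [low, high]."""
    hits = index[(index["parameter"] == param)
                 & (index["low"] >= low)
                 & (index["high"] <= high)]
    return hits["file"].tolist()

print(find_studies("Deposition Rate", 700, 3000))
# ['Exp 16-03-12.jmp', 'Exp 18-11-01.jmp']
```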
Each of the vertical arrangements of dots relates to a JMP file, so I'm just looking at Exp 18-11-01. There you are, Farhan, yeah. So those are three models that have been produced, three prediction formulas that have been saved, and we see the R squares vary quite a bit there. Farhan, why don't we download that file and have a look at what's there. There we go. Awesome. By the way, the modeling type also happens to be shown: bivariate, or fit least squares if they're blue. So we get an idea of what it is. So there we go. Why don't we just run those three scripts there, because the prediction formula is there, and how it was generated is there. Farhan's just rerunning it. And of course the data is there, so we really have the full context of just what was being done. We can see there what the work was. Thanks, Farhan, you're arranging it on the screen. I think you ran one twice by the looks of it; they look identical. Farhan Mansoor Oh yes. Wayne Levin Yeah, but that's all right. Even if we just look at these two, that's fine. Notice that the one that's significant involves temperature. In the one that isn't, over on the left of it, temperature's not involved, and that may explain why it's not a significant model. We could go any number of ways with this, but I hope you get the idea: we want to be able to search for something, quickly find out what's available, get at a glance what was going on, and then look at it as much as we want. We may want to now take this data; maybe it solves a problem, maybe the problem I'm dealing with is solved right here. So boom, I don't need to do anything further; I've got my problem solved. Or maybe I want to augment this design and add some other runs or some other factors. Or maybe it's just, hey, I see that I need to involve temperature when I go forward, and I may not vary it, but I know I'm going to keep it at a particular level as I go forward, to improve the power of subsequent studies that I may do. So you see what I'm saying: I'm drawing knowledge from the past, so I'm not beginning from zero. That's the idea. Why don't we do another quick lookup, Farhan, to show another way of looking things up. There are various other ways, but I think a common way would be by tags. The tags are completely customizable, so we've got, what do we have there, analyst, project, and so on. You've probably had this; I know I've had this: somebody new comes in, we assign some work to them, and we say, hey, why don't you look it up. Farhan, let's do it by project, and let's say, hey, why don't you look up the CVD improvement project, and there's another one that's kind of like it, YieldPlus. Go and look at the data and the analyses that were done with those, and then go ahead and click search, and bang: there are the files that were associated with it. What you're seeing here, by the way, are all JMP files, but we're going to show you that on the upload, you can include non-JMP files as well. So you may want to include some pictures, pictures of defects, pictures of equipment, instructions, other documents, anything you want, and they would be listed there as well. I'm sorry, I'm pointing to my screen.
They'd be listed there as well. So there are other ways to do searches, more comprehensive searches if you will, but I think you get the idea. And I'd like to ask, Farhan, would you mind taking us through the upload? Yeah, why don't you talk us through that, okay? Farhan Mansoor Let's go to the upload interface. What I'm going to do is add some sample files to demo the upload process. You can upload both JMP files and any other kind of non-JMP file, so I'll show you the difference in the process for both. What I'm doing is picking some PDFs, docx files, spreadsheets and one JMP file. Now, if your file format is JMP, then CoBase will parse some information out of the file; for example, it will list the column names, as well as the units, models, data set, that kind of information. It will also try to guess a standard name. I will show at the end how to set this up, but admin users can set up standard names for various parameters, and users can choose the standard parameters from here. If they already exist in the system, CoBase will try to guess. So, for example, in this case temperature has been associated with the temperature parameter that exists in CoBase right now. Argon flow doesn't have an assigned parameter right now, because it's new to the system, so the user can go and pick a standard name. If I know that Argon FLW is the same as argon flow in the deposition step, I can select that one as my standard parameter. Now, this is optional. Users don't have to do it, but if they standardize their parameters, their columns, then it just makes it easier to search for various things. They could also come back and do it later, but right now it's an optional process. Wayne Levin One of the keys around this, just to make clear: the column names on the left are from the JMP data table that Farhan identified. And in order to search for something, we have to have a standard; we have to agree to a standard. So basically what CoBase is allowing us to do, which is what the "Co" stands for, is collaborate asynchronously with colleagues. There are a lot of "co" words here; we're cooperating. So we have to agree on the names, and what we're trying to do is make it really easy, when you go to upload, to assign the proper names. Farhan just said that it's optional. That's true, because we want to make sure that people upload. We know if we make it difficult, they're not going to do it. They'll say, oh, I'll do it later today, and then they won't. Then it's, I'll get to it tomorrow, and then they don't. That type of thing. The other thing we can do is identify tables that have been uploaded without standard names in them; that would be an admin function, and they can be corrected later. So the nomenclature, the standard names, are essential. If we're going to cooperate, we have to agree to a standard, so that's a big part of it. And something we just added recently, Farhan, you just mentioned: the units of measurement.
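Farhan describes CoBase guessing which standard parameter an uploaded column corresponds to. Purely as an illustration of that idea (the talk does not reveal how CoBase actually performs the matching), here is a hypothetical Python sketch that normalizes column names and suggests the closest standard name; the standard-name list is assumed.

```python
import difflib

# Hypothetical standard parameter names an administrator might have defined
# for the deposition domain.
STANDARD_NAMES = ["Argon Flow", "Backside Flow", "Deposition Rate", "Temperature"]

def normalize(name: str) -> str:
    """Lowercase and strip everything but letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def guess_standard(column: str):
    """Suggest a standard parameter name for an uploaded column, if any."""
    normalized_standards = [normalize(s) for s in STANDARD_NAMES]
    matches = difflib.get_close_matches(
        normalize(column), normalized_standards, n=1, cutoff=0.6
    )
    if not matches:
        return None  # new to the system; the user picks a standard name manually
    return STANDARD_NAMES[normalized_standards.index(matches[0])]

print(guess_standard("Temperature"))  # -> 'Temperature'
print(guess_standard("Argon FLW"))    # -> 'Argon Flow'
print(guess_standard("Thickness"))    # -> None
```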
So that's part of this as well. If you're uploading, you'll be reminded of what the unit of measurement is, or what it should be, and we facilitate changing that if, indeed, somebody measured in millimeters but the standard is centimeters or nanometers or whatever. So anyway, we want to facilitate that standardization. What else should we mention here? There's the tagging you can do, comments... go ahead, Farhan. Farhan Mansoor Yeah, so you can add a comment at the file level, or you can add a comment at the batch level and have general notes about the entire upload. The tags you can also add here; these are the preexisting tags. So if I want to add a tag, I can add a Series One technology tag, things like that. Wayne Levin Okay, why don't we upload it. Farhan Mansoor Yep, let's upload this. Wayne Levin So just so you know what this consists of: on the back end, it's a SQL Server database, and on the front end, it's a JMP add-in, so we're running this in JMP, obviously, and it's installed like any other add-in. There's some configuration that needs to happen, but CoBase can be installed and up and running literally in minutes. It really just depends on you. You have a script to install the database, you double-click to install the add-in, you do your configuration, and boom, you're up and running. So it's pretty easy to do that. Oh, we're up there. Why don't we do a quick search, yeah. Farhan Mansoor So once it finishes uploading, it will give you a batch ID for reference, so you can also look it up by that. So if I do a search for that... Wayne Levin Of course you can look it up based on the factor names and so on as well, but we're just going to put the batch ID in there, so we're focused on this. Farhan Mansoor So these are the files I just uploaded. You can see the similar plots, the parameter distribution plots; if there are any models, they will show up here, and since it has only one JMP file, there's only so much to show. We can download the files as well, so if I download a JMP file, it will open within JMP, but if it's a non-JMP file, for example a doc file, it will open with your default doc viewer, so in my case it would be Microsoft Word. Wayne Levin Right, so this flows back into the original search. If we search, hey, who's looked at argon flow, or what have you, we not only get the JMP file like what we see here, but we'd see these other files associated with it as well; they would come back at you as well. So that's a little bit about uploading. Again, we're trying to facilitate the standardization, and we're trying to make it easy, really easy, to do. Now, of course, you'd have a bunch of CoBase users out there, and you'd also have a few people who have administrative privileges. Why don't we have a quick look at that, Farhan? We won't get too deep into this; if you want to see or talk more about it, we can do that during the questions, or you can contact us afterward at any point. So, are you... there we go. Farhan Mansoor On it. Yeah. Wayne Levin I want to say just briefly how setting up the parameters is done here. Farhan Mansoor Yeah, so on the left you see all the steps, or, well, we call them domains. These could be your production steps or product components or subcomponents.
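Wayne mentions the architecture (a SQL Server database behind a JMP add-in) and Farhan mentions the batch ID that each upload receives. Purely to illustrate that bookkeeping idea, here is a toy Python sketch that uses sqlite3 as a stand-in for the database; the table names and columns are hypothetical and are not CoBase's actual schema.

```python
import sqlite3

# In-memory toy database standing in for the SQL Server backend.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE batches (batch_id TEXT PRIMARY KEY, uploaded_by TEXT, note TEXT);
CREATE TABLE files   (file_id  INTEGER PRIMARY KEY,
                      batch_id TEXT,
                      name     TEXT,
                      is_jmp   INTEGER,
                      FOREIGN KEY (batch_id) REFERENCES batches (batch_id));
""")

# Record one upload batch and its files (hypothetical names).
con.execute("INSERT INTO batches VALUES (?, ?, ?)", ("B-0042", "farhan", "demo upload"))
con.executemany(
    "INSERT INTO files (batch_id, name, is_jmp) VALUES (?, ?, ?)",
    [("B-0042", "deposition_runs.jmp", 1),
     ("B-0042", "defect_photos.docx", 0)],
)

# Look an upload back up by its batch ID, as shown in the demo.
for name, is_jmp in con.execute(
        "SELECT name, is_jmp FROM files WHERE batch_id = ?", ("B-0042",)):
    print(name, "JMP file" if is_jmp else "non-JMP file")
```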
And if I click on one of the steps here, it will show you all the parameters that currently exist in this particular domain and also the subdomains. Admin users can come and add new parameters or edit existing ones, so this creates those standard names that users can then select during upload. The admin users can also assign a standard unit, so on the right side you see all the standard units associated with the standard parameters. Wayne Levin Right, and then there are the tags there. I'm just going to go back to the parameters for a moment. You can change the names of these parameters. I'm sorry, we missed a little something there; we could show you that. Remember, we uploaded a table and we changed the name of argon flow. Well, when we download that table, it will have the correct name. And if we ever decide, for whatever reason, that we want to change some of these standard names over time (you may decide something else is a little more descriptive, or you may just want to change it), you can do that. You can change them here in the admin panel, and that will make the changes within the system, so that now you can search based on those new names and the history will still be brought forward. So we've added that flexibility. It was one of the most difficult things, maybe the most difficult thing, in terms of building CoBase to begin with. I'm sorry, Farhan, I was taking you away, but let's look at the tags, just so they get a sense of that. Farhan Mansoor You have a set of tags that admin users can create. Here you can add new tag types or new tags inside tag types; right now we can see a few examples here, for example technology tags, study type tags, things like that. Wayne Levin Yeah, for technology, we have one company who said, look, we have different eras, different technologies, and we don't want to throw away stuff that was done for a prior technology. So they wanted to be able to name that, and indeed they are able to do that. It's obvious you'd probably want to tag by analyst, so you can go by somebody's name, or by some project ID. Those are pretty obvious tags, but you can create any tags that you want, and you can add tags anytime you want, as they occur to you. So that's the demo side of this that we wanted to show you. I hope that gives you a flavor for it, and again, we welcome any questions or comments that you may have. I'm going to switch it back over to my screen. Thank you, Farhan. And see if I can get this. Okay, so we're happy to entertain any questions or thoughts that you may have. Oh goodness, I'm sorry, we're going to have to edit this out; this is the wrong slide, so I'm going to back up. If you have any questions or comments, the slides will have our contact information when you download them. Feel free to reach out. We'd be happy to do a more extensive demonstration, or talk about some challenges or problems you may have and how we might be able to solve them with CoBase. And I really appreciate your interest. Thank you.
Sunday, March 7, 2021
Laura Castro-Schilo, JMP Senior Research Statistician Developer, SAS
James R. Koepfler, JMP Research Statistician Tester, SAS

This presentation provides a detailed introduction to Structural Equation Modeling (SEM) by covering key foundational concepts that enable analysts from all backgrounds to use this statistical technique. We start with comparisons to regression analysis to facilitate understanding of the SEM framework. We show how to leverage observed variables to estimate latent variables, account for measurement error, improve future measurement and improve estimates of linear models. Moreover, we emphasize key questions analysts can tackle with SEM and show how to answer those questions with examples using real data. Attendees will learn how to perform path analysis and confirmatory factor analysis, assess model fit, compare alternative models and interpret all the results provided in the SEM platform of JMP Pro.

Auto-generated transcript...

Laura Castro-Schilo Hello, I'm Laura Castro-Schilo, and welcome to this session, where we're going to learn the ABCs of structural equation modeling. Our goal today is to make sure that you have the tools needed for specifying and interpreting models using the Structural Equation Models platform in JMP Pro 16. We're going to do that first by giving a brief introduction to SEM, telling you what it is and, particularly, drawing on the connections it has with factor analysis and regression analysis. Along the way, we're going to learn how path diagrams are essential tools for SEM. We're going to try to keep that introduction fairly brief, so we can really focus on some hands-on examples. Prior to those examples, I'm going to introduce the data that we're going to use for the demo; these data are about perceptions of COVID-19 threats. Looking at those data, we're going to start learning how we specify and interpret our models, and we're going to do that by answering specific questions. Those questions are going to lead us to talk about very popular models in SEM, one being confirmatory factor analysis and another multivariate regression analysis. And to wrap it up, we're going to show you a model where we bring both of those analyses together, so you really can see how SEM is a very flexible framework where you can fit your own models. Okay, so SEM is a framework where factor analysis and regression analysis come together. On the factor analysis side, we gain the ability to measure even those things that we cannot observe directly, also known as latent variables. And from the regression side, we're able to examine relations across variables, whether those are observed or unobserved. So when you bring those two together, you get SEM, which, you can imagine, is a very flexible framework where all sorts of different models can be fit. Path diagrams are really useful tools in SEM, and the reason is that the systems of equations, which can be fairly complicated, can actually be represented through these diagrams. As long as we know how to draw the diagrams and how to interpret them, we're able to use them to our advantage. So here we have rectangles, and those are used exclusively for representing observed variables.
Circles are used to represent latent variables, double-headed arrows are used for representing both variances and covariances, and one-headed arrows are for regression or loading effects. There's another symbol that's often used in path diagrams, and it's outside the scope of what we're going to talk about today, but if you come across it, I want to make sure you know it's there, and that is a triangle. Triangles are used to represent means and intercepts, and so there are all sorts of interesting models we can fit where we model the mean structure of the data, but again, we're not going to have time to talk about those today. Now, when it comes to path diagrams, I think it's useful to think of the building blocks of SEM models, so that we can use those to build complex models. One of those would be a simple linear regression. So here you see we have a linear regression where Y is being regressed on X, and notice both X and Y are in these rectangles, in these boxes, because they are observed variables. We're using the one-headed arrow to represent the regression effect, and the two-headed arrows that start and end on the same variable represent, in this case, the variance of X, and in the case of Y, the residual variance of Y. If a double-headed arrow were to start at one variable and end at the other, then that would be a covariance. Now, in SEM any variable can be both an outcome and a predictor. So in this case, Y could also take on the role of a predictor if we had a third variable Z, where Y is predicting Z. We can build sequential effects, these types of sequential regressions, as many as you need, depending on your data. Another building block would be that of a confirmatory factor model, which is basically the way we specify latent variables in SEM. This particular example is a very simple one-factor, one-latent-variable confirmatory factor model where the circle represents the unobserved latent variable. Notice that the latent variable has one-headed arrows pointing to the variables that we do observe, in this case W, X and Y. The reason that variable points to those squares is because in factor analysis, the idea is that the latent variable causes the common variability we observe across W, X and Y. This is really important to understand, because it's often confused with principal components from a principal components analysis perspective. So I think this is a good opportunity to draw the distinctions between latent variables from a factor analytic perspective and components from a PCA perspective, so I'm going to take a little bit of a tangent to explain those differences. In this image, the squares represent the variables that we measured, those observed variables. Notice I'm using different amounts of blue shading on those variables to represent the proportion of variance that is due to what we intended to measure, sort of the signal, the things that we wanted to measure with our instruments. The gray shaded areas are the proportion of variance that is due to any other sources of variance. That can include measurement error, but it can also include systematic variance that is unique to each of those measurement instruments.
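For readers who prefer equations to diagrams, the two building blocks Laura describes can be written out in standard SEM notation. This is a generic textbook sketch, not output from the platform; the symbols (β, λ, ζ, δ, φ, ψ) follow common conventions and are not named in the talk.

```latex
% Simple linear regression as a path model: a one-headed arrow from x to y,
% plus double-headed arrows for Var(x) and the residual variance of y.
\[
  y = \beta x + \zeta, \qquad \operatorname{Var}(x) = \phi, \qquad \operatorname{Var}(\zeta) = \psi
\]
% One-factor confirmatory model: the latent variable xi "causes" the common
% variability in its indicators; the deltas are the uniquenesses.
\[
  w = \lambda_w \xi + \delta_w, \qquad
  x = \lambda_x \xi + \delta_x, \qquad
  y = \lambda_y \xi + \delta_y
\]
```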
And so, in the case of factor analysis, the latent variable captures all of that common variability across the observed variables, and that's why we're using this solid blue to represent the latent variable. That's in contrast to what happens in principal component analysis, where the goal is dimension reduction. In PCA, the component is going to explain the maximal amount of variance from the dimensions of our data, and so that means the principal component is often going to be a combination of the variance that's due to what we wanted to measure and some other sources of variance. All right, and so again, the diagram also illustrates the causal assumption, the fact that latent variables are hypothesized to cause the variability in their indicators, in the observed variables, and that's why those one-headed arrows are pointing toward the observed variables; that's not the case in PCA. Alright, so I think this is a useful distinction to make: when we're talking about latent variables in SEM, very often what we're talking about is latent variables from a factor analysis perspective. Okay, so here I've chosen to show you a path diagram that belongs to a model that's already been estimated. We have all of the values here on these arrows because those are all estimates from the model, and I think this diagram does a good job of illustrating why one might use SEM. First, we see that we have unobserved variables. Right here, Conflict is an abstract construct that we can't necessarily observe directly, so we're defining it as a latent variable by leveraging the things that we do observe; in this case, we have three survey questions that represent that unobserved Conflict variable. We are also able to account for measurement error. The way latent variables are defined in SEM assures us that we are, in fact, accounting for measurement error, because those latent variables are only going to capture the common variance across all of these observed variables. Also notice that we are able to examine sequential relations in SEM. We have this unobserved Conflict variable, but we're also able to see how this Support variable influences Work, how this Work variable in turn influences the latent variable, and ultimately how Conflict, the unobserved variable, can predict all sorts of other outcomes. These sequential relations are very useful and very easy to estimate in SEM. Another good reason to use SEM is that in JMP Pro, our platform uses cutting-edge techniques for handling missing data. So even if you have a simple linear regression and that's really all you need, if you have missing data, SEM makes sure that everything that's present is used for estimation, and that can be very helpful as well. If what I've said so far piques your interest and you plan on learning more about SEM, without a doubt you're going to find a lot of terminology that is unique to the field. Like anything else, there's jargon we need to become familiar with, and this diagram is also useful for introducing some of that jargon. First, we've been talking about observed variables or measured variables; in SEM those are often called manifest variables. We have latent variables, which we discussed already.
But we also have this idea of exogenous variables, and those are the ones that only predict other variables. In our model here, we have only two of those, and they are in contrast to endogenous variables. Every other variable here is an endogenous variable, because they have other variables predicting them. We also have latent variable indicators, and these are the variables that are caused by the latent variables. The residual variance that is not explained by the latent variable is called the uniqueness; these are often also called unique factor variances. Remember that this is the combination of systematic variance that is unique to that variable and measurement error. I find it useful when people are learning SEM to shift their focus on what the model is really doing. In other words, realizing that we're doing multivariate analysis of a covariance structure (and also means, but remember that we're not talking about means today), realizing that what we're actually analyzing is the structure of the covariances of the data, helps us wrap our heads around SEM a lot more easily, because it has implications for what we think the data are. So, for example, we can have our data tables, where each row represents a different observation and each column is a different variable, and we can definitely use those data to launch the SEM platform in JMP. But in the background, behind the curtain, what the platform is doing is looking at the covariance matrix of those variables, and that is, in fact, the data being analyzed. This also has implications for the residuals; oftentimes when we think about residuals in SEM, those are with respect to the covariance matrix we're analyzing. And this is also true for the degrees of freedom, which are going to be with respect to this covariance matrix. Right, and so I want to make sure I give you a little taste of how SEM works in terms of its estimation. The way we start is by specifying a model, and thankfully, in JMP Pro, we have a really friendly user interface where we can specify our models directly with path diagrams, rather than having to list complex systems of equations. You simply draw the path diagrams that imply a specific covariance structure. So the diagrams imply a covariance structure for the data, and then during estimation, we try to obtain model estimates that match the sample covariance matrix as closely as possible, given the model-implied constraints. Once we have those estimates, we can plug them into the model-implied covariance matrix and compare those values against the sample covariance matrix, and the difference between them allows us to quantify the fit of the model. So if we have large residuals, by looking at the difference between these two covariance matrices, we know that we have not done a very good job of fitting our model. Alright, so in a nutshell, that's how SEM works, and I'd like to take the next part of the presentation to introduce the data that we're going to use for our demo. I do think it's easier to learn new concepts by getting our hands on some real data, real-world examples.
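To make the "analyzing a covariance structure" idea concrete, here is the textbook maximum-likelihood setup that most SEM software uses. The talk does not spell out JMP Pro's estimator or options, so treat this as a generic sketch of the logic Laura describes, not a statement of the platform's internals.

```latex
% The factor model implies a covariance structure for the p observed variables:
\[
  \Sigma(\theta) = \Lambda\,\Phi\,\Lambda^{\top} + \Theta
\]
% Estimation picks theta so that Sigma(theta) reproduces the sample covariance
% matrix S as closely as possible, e.g. by minimizing the ML discrepancy:
\[
  F_{\mathrm{ML}}(\theta) = \ln\lvert\Sigma(\theta)\rvert
    + \operatorname{tr}\!\bigl(S\,\Sigma(\theta)^{-1}\bigr) - \ln\lvert S\rvert - p
\]
% The minimized discrepancy gives the chi-square test of exact fit, with
% degrees of freedom counted against the covariance matrix being analyzed:
\[
  \chi^{2} = (N-1)\,F_{\mathrm{ML}}(\hat{\theta}), \qquad
  df = \tfrac{p(p+1)}{2} - \text{(number of free parameters)}
\]
```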
So the data that we're going to use actually come from a recently published article in the journal Social Psychological and Personality Science; it was published in the summer of 2020. The authors wanted to answer a very simple question: how do perceived threats of COVID-19 impact well-being and public health behaviors? It's a simple question, except for the fact that perceived threat of COVID-19 is a completely new construct. It's a very abstract idea. What is perceived threat of COVID-19, and how do you measure it? Because this is something that has never been measured before, the authors had to engage in a very careful study where they developed a survey to measure those threats. Developing a survey is not easy; we need to make sure that the questions in the survey are reliable and valid. So they had to go through that process, and we're going to see how they did that in a minute. Now, in their study they found that there are two types of threats they could measure. One they called realistic threats, and those are things that threaten our financial and physical safety. The other type of threat they called symbolic; those are things that threaten our social and cultural identity. It's also important to say this sample was from the United States population. They sampled over 1,000 individuals, and their questions pertain exclusively to the United States population. What we see here is the integrated COVID-19 threat scale, the questionnaire they developed after going through three different studies. They found that those two threats could be measured with a handful of items. They asked their participants to answer how much of a threat, if any, the coronavirus outbreak is for your personal health, the health of the US population as a whole, your personal financial safety, and so on. And for symbolic threat, the questions were how much of a threat the virus is for what it means to be an American, American values and traditions, and the rights and freedoms of the United States population as a whole, and so on. So you can see the differences in what these threats represent. We had access to these data, and we're going to use them to answer very specific questions. First, how do we measure these perceptions of COVID-19 threat? We're going to focus on the two threats they identified. This is going to lead us to talk about confirmatory factor analysis and assessing a measurement model, to make sure we can figure out whether the questions in the survey are, in fact, reliable and valid. Notice we're going to skip over a very important first step, which is exploratory factor analysis; that's something one would do before using SEM. You would run an exploratory factor analysis and then come to SEM to confirm the structure of the previous results. The authors of this article definitely did that, but we're going to focus on the steps we would follow using SEM. The second question is, do perceptions of COVID-19 threat predict well-being markers and public health behaviors? This question is going to lead us to talk about multiple regression and path analysis within SEM. And the last question is: are the effects of each type of threat on the outcomes equal?
And this actually allows us to show a very cool feature of SEM, which involves setting equality constraints in our models and conducting systematic model comparisons to answer these types of questions. Alright, so it's time for the demo, and I already have... let's see... Oops, how do I get out of here? It's not time for questions yet. I just want to exit the screen and I can't seem to do it. Okay, here we go. I already have the data table open in JMP right here. In these data, you can see there are 238 columns; that's because the authors asked a number of different questions of 550 participants in this case; this is one of their three studies. The first ten columns in the data correspond to those 10 questions we saw in their threat scale, and those are the ones we'll use first to do a confirmatory factor analysis. So we're going to click Analyze, go to Multivariate Methods, Structural Equation Models, use those 10 variables, click Model Variables, and then click OK to launch the platform. Notice that on the right-hand side we immediately see a path diagram. That diagram already has all of the features that we discussed earlier: each of the variables is in a rectangle, indicating that they're observed variables, and each of them has a double-headed arrow that starts and ends on itself, representing the variance of each of those variables. Now, if I right-click on the canvas, there's a Show menu, and notice that the means and intercepts are hidden by default. I'm going to click on this just to show you that we do, in fact, have means estimated by default for all of these variables. We're not going to talk about those, so we're going to keep them hidden, but I do think it's important to know that the default model we start with when we launch the platform is one where all of the variables have variances and means estimated. Now, we have a List tab, and if we click on that, we see the exact same information that we have in the diagram, but in list form. All of the parameters are split by type, so we have all of the variances here and all of the means over there. Right, we have a Status tab, which basically tells us about the specific model we have specified right now; it gives us a bunch of useful information about that model. We have details about the data, our sample size, the degrees of freedom, and we also have these identification rules. You can click on them if you want to learn a little bit more about them; it gives you a little description to the right. But what's really helpful to know is that this icon for the Status tab is constantly changing, depending on the changes we make and the specification of the model we have. Oftentimes, if we have an advanced application of SEM, this icon might be yellow, and when we have a bad error, some important mistake, that icon is going to be orange with an X, basically indicating that there's an error. So it can be very useful for identifying mistakes as we are specifying our models. Now, on the left side of the user interface, we see that we can specify the name of our model, which is very helpful for keeping track of our workflow.
And we also have these From and To lists. These lists provide a very useful way to link variables, using either a one-headed arrow or a two-headed arrow. So here, for example, if I want to link these, I can click that button, and very quickly I've drawn a path diagram; it's a very efficient way to specify models. I'm going to click Reset here just to go back to the model we had upon launching the platform, but know that the From and To lists are basically ways in which we can draw the diagrams. Okay, in this case we have all of the observed variables listed here, but I know that we want to use those variables to specify latent variables. Now, the first five variables here are the ones that correspond to the items in that survey for the realistic threat. So I'm going to add a latent variable to the model by going down to this box here, where it says Latent1, and I'm going to change the name to Realistic, because I want these five variables to be the indicators of a realistic threat latent variable. By clicking on this button, I immediately get that latent variable specified. And notice the first loading for this realistic threat latent variable has a 1 on this arrow, which represents the fact that the parameter is fixed to the value of 1. We do this because we need to set the scale of the latent variable. Without this constraint, we would not be able to identify the model, and so by default we place that constraint on the first loading of the latent variable. We could also achieve the same purpose by fixing the variance of the latent variable to 1. Which one we do is really a matter of choice, but as a default, the platform will fix the first loading to 1. Okay, so we have a realistic threat latent variable, and the other five variables here are the ones that correspond to the symbolic threat questions. So I'm going to select those, click here, type Symbolic, and click the plus button to add that symbolic threat. Okay, so we're almost done, but notice that this model here is implying that realistic and symbolic threats are perfectly uncorrelated with each other, and that's a very strong assumption, so we don't want to do that. For the most part, confirmatory factor models allow the latent variables to covary with each other, so I'm going to select them here, and I can click this double-headed arrow to link those two nodes. But I can also do it directly from the path diagram: if I right-click on the latent variable, I can click on Add Covariances, and right there I can add that covariance. So it's a pretty cool way; you can do it with the list, or you can do it directly on the diagram, whatever your choice is. So our model is ready to be estimated. I'm going to change the name to 2-Factor CFA, and we can go ahead and run it. And you can see, very quickly, we obtain our estimates and they're all mapped onto the diagram, which is pretty cool. But before we interpret those results, I want to make sure we focus on this model comparison table. The reason is that this table provides a lot of information about the fit of the model, and we want to make sure the model fits well before we interpret the results. The first thing to notice here is that we have three models in this table, and we only fit one of them.
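Written as equations, the two-factor CFA Laura just specified looks like the sketch below. The indicator numbering and symbols are assumed for illustration (x1 through x5 for the realistic items, x6 through x10 for the symbolic items); the fixed 1s are the scale-setting constraints she describes, and the φ terms are the freely estimated factor variances and covariance.

```latex
% Realistic threat (indicators x1..x5) and symbolic threat (x6..x10), with the
% first loading of each factor fixed to 1 to set the latent variable's scale:
\[
  x_{1} = 1\cdot\xi_{R} + \delta_{1}, \qquad
  x_{j} = \lambda_{j}\,\xi_{R} + \delta_{j} \;\; (j = 2,\dots,5)
\]
\[
  x_{6} = 1\cdot\xi_{S} + \delta_{6}, \qquad
  x_{k} = \lambda_{k}\,\xi_{S} + \delta_{k} \;\; (k = 7,\dots,10)
\]
% The added two-headed arrow frees the covariance between the two factors:
\[
  \operatorname{Var}(\xi_{R}) = \phi_{RR}, \qquad
  \operatorname{Var}(\xi_{S}) = \phi_{SS}, \qquad
  \operatorname{Cov}(\xi_{R},\xi_{S}) = \phi_{RS}
\]
```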
The reason we have three is because the first two models, the unrestricted and independence models, are fit by default upon launching the platform. We fit these models on purpose to provide a baseline for what a really good-fitting model and a really bad-fitting model look like, and we use those as a frame of comparison for our own specified models. Let me be a little more specific. The unrestricted model (I'm going to show you with the path diagram) is one where every variable is allowed to covary with every other variable. Notice that the Chi-square statistic, which is a measure of misfit, is exactly zero, and the reason is that this model fits the data perfectly. Remember, our data here are really the covariance matrix of the data, and we have zero degrees of freedom because we are estimating every possible variance and covariance in the data. So this is the best possible scenario: we have no misfit, but we're also estimating every possible estimate from the data. The other end of the spectrum is a model that fits really badly, and that's what the independence model is. If I show you with the path diagram, our default model, where we only have variances and means for the data, is exactly what the independence model is. That is essentially a model where nothing is covarying with anything else, and you can see the Chi-square statistic for that model is in fact pretty large, because there's a lot of misfit; it's almost 2,000 units, but we do have 45 degrees of freedom because we're estimating very few things from the data. So again, these two models basically provide the two ends of the spectrum: on one end, a really good-fitting model, and on the other, a really poor-fitting model, and we're going to use that information to compare our own model against those. So, if we look at our model, notice the Chi-square statistic is not zero, but it is only 147 units, which is a lot less than 2,000. We have 34 degrees of freedom, so we do have some misfit. When we look at the test for that Chi-square, it is a significant Chi-square statistic, so it suggests that we have statistically significant misfit in the data. However, the Chi-square statistic is influenced by sample size, and in this case we have 550 observations. Usually, when you have 300 or more observations, it's very important to look not only at the Chi-square statistic, but also at some special fit indices that are unique to SEM and allow us to quantify the fit of the model; those are the values over to the right here. The first fit index is the comparative fit index, and that index ranges from zero to one. You can see the unrestricted model has a one; that's the best-fitting model. The independence model has zero, because it's the worst-fitting model. And our model actually has a CFI of .93, about .94. That represents the proportion of improvement over the independence model. Another way to say that is that our model fits about 94% better than the independence model does, so that's actually pretty good. Usually we want CFI values of .9 or higher; the closer to one, the better.
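The comparative fit index has a simple closed form; plugging in the approximate values quoted in the talk reproduces the roughly .94 Laura reports, so this is just the arithmetic behind the table. The exact formula variant JMP Pro reports is not stated in the talk; this is the common definition.

```latex
% Comparative fit index, with the independence model as the baseline B:
\[
  \mathrm{CFI} = 1 -
    \frac{\max(\chi^{2}_{M} - df_{M},\, 0)}
         {\max(\chi^{2}_{M} - df_{M},\; \chi^{2}_{B} - df_{B},\, 0)}
\]
% Using the approximate values quoted in the talk
% (model: chi-square ~147 on 34 df; independence: ~2000 on 45 df):
\[
  \mathrm{CFI} \approx 1 - \frac{147 - 34}{2000 - 45}
              = 1 - \frac{113}{1955} \approx 0.94
\]
```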
Now the root mean square error of approximation is another fit index, but for that one, although it also ranges from zero to one, we want very low values. Notice the unrestricted model has a value of zero and the independence model has .27. We usually want values of .1 or lower for acceptable models, and ours is about .08, which is actually pretty good. We also have confidence intervals for this particular estimate, and you can see those values are also below .1, so this is a good-fitting model. Once we know the model fits our data well, we can go ahead and interpret it. By default, the estimates shown are the unstandardized parameter estimates, but for factor analysis it's much more useful to look at the standardized solution, so I'm going to right-click on the canvas and show the standardized estimates. Now the values here are in a correlational metric, so we want them to be as close to one as possible, because they represent the correlation of the observed variable with the latent variable, and notice that for both realistic and symbolic threat the values are pretty good. We don't want them any lower than about .4, so these values are good. Another thing that is really useful, and unique to JMP Pro, is that any variable that is endogenous, that has predictors pointing at it, is shaded. Notice there's a little bit of gray inside these squares; that shading is proportional to the amount of variance explained by the predictors, so it allows us to see very quickly which variables are having their variance explained well. In this case, these three variables are filled the most with that darker gray, suggesting that the symbolic threat latent variable is doing a pretty good job of explaining the variance of those three observed variables. We also see that the two latent variables are correlated at about .4, which is an interesting finding. There's all sorts of output we could look at in the red triangle menu, but I'm going to focus on one option called Assess Measurement Model. This is where we find a lot of statistics that quantify the reliability and the validity of our constructs. If we click there, we get this nice little dashboard. The first information is indicator reliability, which quantifies the reliability of each of the questions in the survey, and we provide a plot showing all of these values. Notice we have a red line here for a threshold of what we hope to have; we want at least that much reliability in each of our items. Now, these types of thresholds need to be interpreted with our own critical thinking, because this particular item, for example, is below the threshold but still pretty close to it, so we're not going to throw it out. We can still consider it relatively reliable, and it's still a good indicator of this latent variable. Again, just interpret the thresholds here with caution. But one thing that is apparent from this plot is that the symbolic threat latent variable appears to have more reliable indicators than the realistic threat one.
They're both pretty good, though; we're simply doing a better job of measuring the symbolic one. The values to the right are reliability coefficients for the composites. In other words, they quantify the reliability of the latent variable as a whole, and there are two types of reliability. I'm not going to get into the details of their differences, but notice these values range from zero to one and we want them to be as close to one as possible. We also provide plots with a threshold indicating the minimum desired amount of reliability, and in this case both realistic and symbolic threat have good reliabilities. The other visualization we have here is a construct validity matrix. Keep in mind that when you're trying to measure something you don't observe directly, it's very hard to figure out whether it really is what you intended to measure. Are you really measuring what you wanted? That's what this information allows us to determine. The visualization portrays the upper triangle of this matrix, and let me briefly explain what the values represent. Below the diagonal we have the correlation between the latent variables; that's about .4. The diagonal entries represent the average amount of variance that the latent variables extract from their indicators, and you want those values to be as high as possible. Above the diagonal we have the squared correlation, in other words the amount of overlapping variance between the latent variables. The key to interpreting this matrix is that we want the values on the diagonal to be higher than the values above and to the right of the diagonal, and the visualization makes it very easy to see that we do, in fact, have larger values on the diagonal. That is good evidence of construct validity. So everything here suggests that realistic and symbolic threats are, in fact, latent variables that are valid and reliable, and the survey seems to do a good job of measuring both of them. A next step might be to grab the five questions that represent realistic threat and create an average across them, so that we have one measure representing realistic threat, and to do the same for the other five variables that represent symbolic threat. Just for illustration, I have already created those variables, so let's go to Analyze, Multivariate Methods, Structural Equation Models, and I'm going to look for those averaged variables. Realistic and symbolic threat here are the averages across the columns for each of those sets of items. I'm going to model those in addition to a measure of anxiety, a measure of negative affect (negative emotions), and, lastly, a measure of adherence to public health behaviors, and we're going to click OK to launch the platform. From the diagram buttons here we can go into a customize menu and change all sorts of aspects of the diagram, which is really great. Right now I'm just going to increase the width of these nodes so that we can read what's inside them.
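For readers who want to see the kind of check the construct validity matrix above is doing, here is a minimal sketch with made-up standardized loadings; these numbers are placeholders, not the survey's actual estimates:

```python
import numpy as np

# Hypothetical standardized loadings for the two factors (placeholders only).
loadings = {
    "Realistic": np.array([0.65, 0.70, 0.72, 0.60, 0.68]),
    "Symbolic":  np.array([0.75, 0.80, 0.78, 0.70, 0.74]),
}
r_latent = 0.40                 # correlation between the two latent variables
shared = r_latent ** 2          # squared correlation: overlapping variance

# Average variance extracted (AVE): mean squared standardized loading per factor.
for name, lam in loadings.items():
    ave = float(np.mean(lam ** 2))
    verdict = "supports validity" if ave > shared else "warrants a closer look"
    print(f"{name}: AVE = {ave:.2f} vs shared variance {shared:.2f} -> {verdict}")
```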
And what I'm going to do is fit a model where both realistic and symbolic threats are predictors of these interesting outcomes; they're markers for anxiety, negative affect, and also the public health behavior. So we're going to link these with one-headed arrows to specify the predictions, and we're going to investigate whether these effects are, in fact, significant. Now notice I'm not fully done specifying this model yet, because in this particular model there's no connection between realistic and symbolic threats, and it would be a very strong constraint to say that those two things aren't covarying at all. So we always want to make sure we include covariances between our predictors and also between the residual variances of our outcomes. We could specify those directly from the From list; in this case I'm going to use Add Covariances from this menu to link realistic and symbolic threats, and I'm going to use the lists to add covariances between the residuals of these outcomes. Now we have a fully and correctly specified model. This is often called path analysis, but it's basically a simultaneous collection of regression models, and so we're going to run it. Notice from the model comparison table that our model has zero degrees of freedom and the Chi-square is zero. Having zero degrees of freedom means our model is not testable, because we've extracted all the information we could have extracted from our data. That's essentially what we do when we fit a regression model, so there's no problem with that; just know that you can't interpret this Chi-square and say, oh, my model fits so well. It fits perfectly because you've extracted everything you could have extracted from the data. So anyhow, it's just like a regression model. Alright, if we look at the results, there's all sorts of important information we could interpret, but I'm going to focus on a couple of things. First, notice our diagrams are fully interactive, which is really cool, and I'm just moving things around to focus on a couple of effects. I'm going to hide the variances and covariances in this model so we can focus on the results from the regression models, from the path analysis. Notice here that realistic and symbolic threats both have a positive effect on anxiety. That's really interesting, and the arrows are solid because the effects, as you can see in the table of parameter estimates, are statistically significant; if they were not significant, the arrows would show up as dashed. So the path diagram conveys a lot of information. We have positive, significant effects on anxiety, and that's interesting, of course, but so far all we've done is fit regression models in a simultaneous way. In fact, if we go back to the data table, I have a script here from Fit Model where I use that same anxiety outcome and the same two predictors, realistic and symbolic threats, and simply estimate a multiple regression model. The reason I wanted to show you this is that the parameter estimates here are exactly the same values as we obtained from SEM, and that's no surprise, because, in fact, we are doing a simultaneous regression.
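To illustrate that last point, the separate-regression side of the comparison can be sketched with ordinary least squares on simulated stand-in data; the variable names mirror the demo but the numbers are invented, and fitting the saturated path model to the same data would return the same slopes:

```python
import numpy as np

# Simulated stand-in data; not the survey data from the demo.
rng = np.random.default_rng(1)
n = 550
realistic = rng.normal(size=n)
symbolic = 0.4 * realistic + rng.normal(scale=0.9, size=n)
anxiety = 0.5 * realistic + 0.2 * symbolic + rng.normal(size=n)

# Multiple regression of anxiety on the two threat scores, as Fit Model would do.
X = np.column_stack([np.ones(n), realistic, symbolic])
beta, *_ = np.linalg.lstsq(X, anxiety, rcond=None)
print(beta)   # intercept and the two regression effects
```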
So up until this point you might wonder what SEM buys you, because technically you could run three separate Fit Model analyses with these three outcomes and obtain the same information we've obtained so far. However, if you have missing data, you still want to use SEM, because then all of the data are used rather than dropping rows. And if you use SEM, you're also able to answer additional questions that are pretty interesting. In this case, we might wonder whether the effect that realistic threat has on anxiety is statistically greater than the effect that symbolic threat has on anxiety. So far we know that they're both significantly different from zero, but are they significantly different from each other? That is a question we can answer with the SEM platform. Going back to our model specification panel, we can select both of those effects, and up here in the action buttons we have the Set Equal button. If we press that, notice we get a little label implying that both of these effects are set to be equal; they're going to be estimated as one. If we change the name to Equal Effects and run this model, we obtain the fit statistics for that model with the equality constraint, and notice we've now gained one degree of freedom, so all of a sudden we have a testable model. We can use the model comparison table to select the two models we want to compare against each other and then click Compare Selected Models. Now we obtain a Chi-square difference test, so we're able to compare the models statistically and see how much misfit our equality constraint induces. Here we can see it's about 8.6 units in the Chi-square metric, and the p-value for that is, in fact, significant, so this suggests that setting the equality constraint induces a significant amount of misfit in the model. And because we know the Chi-square is influenced by sample size, we also have the differences in the fit indices we discussed. For the CFI, we usually don't want this to increase by more than .01; in fact, .01 or higher is not so good. And for the RMSEA, you don't want this to be any higher than .1. So all of the evidence here suggests that setting the equality constraint leads to a significantly worse-fitting model. In other words, going back to the model that fit best, we're now able to say, based on that Chi-square difference test, that the effect realistic threat has on anxiety is significantly higher than the effect symbolic threat has on anxiety. Those types of questions could be addressed with other parts of this model as well; SEM affords a lot of flexibility by allowing us to test the equality of different effects within the model. Okay, in the interest of time I'm going to close this out, but I do want to show you one more thing. So far we've seen a confirmatory factor model and a path analysis, where we're doing a multivariate regression analysis, but we can actually use both of those concepts in one model. I have a script that I've already saved in my data table, and you can see that what I'm doing in this model is actually estimating latent variables.
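The p-value for the chi-square difference test described above can be reproduced directly from the reported difference of about 8.6 on 1 degree of freedom; a minimal sketch:

```python
from scipy import stats

delta_chisq = 8.6   # increase in misfit from the equality constraint (approximate)
delta_df = 1        # one fewer free parameter

p_value = stats.chi2.sf(delta_chisq, delta_df)
print(round(p_value, 4))   # roughly 0.003, so the constraint significantly worsens fit
```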
I'm modeling latent variables for both symbolic and realistic threats, using the original items from the survey, from the questionnaire. By doing this, instead of creating averages across the columns, I'm actually modeling the latent variables, and that allows me to obtain regression effects amongst latent variables that are unbiased and unattenuated by measurement error, because I'm obtaining a more valid, purer measure of symbolic and realistic threats. So here we are estimating sequential relations, and my model is a lot more complex. I'm not going to get into the details of the model, but just know that by modeling latent variables and looking at the relations between them, we're really able to get the best functionality from SEM, because the associations between those latent variables are going to be better estimated. I've already run this model, and you can see the results down here. Notice there are a few edges with dashed arrows, indicating that those effects are not significant. We also see how powerful the shading visualization is: we're able to explain some proportion of the variance of adhering to public health behaviors, and it seems like we're doing a better job of explaining variance in positive affect than in any of the other outcomes here. So again, it's the best of both worlds, being able to specify our latent variables and also model them directly in the platform. With that I'm going to stop the demo, but I will point out that on the JMP Community website we have supplementary materials that James Koepfler has created. They are really great materials with a lot of tips on how to interpret our models and how to use the model comparison table, basically all the notes you would have wanted to take during this presentation; you can get them in the supplementary materials. And with that I am ready to open it up for questions.
Laura Lancaster, JMP Principal Research Statistician Developer, SAS Jeremy Ash, JMP Analytics Software Tester, SAS Chris Gotwalt, JMP Director of Statistical Research and Development, SAS   Uncontrolled model extrapolation leads to two serious kinds of errors: (1) the model may be completely invalid far from the data, and (2) the combinations of variable values may not be physically realizable. Using the Profiler to optimize models that are fit to observational data can lead to extrapolated solutions that are of no practical use, without any warning. JMP Pro 16 introduces extrapolation control into many predictive modeling platforms and the Profiler platform itself. This new feature in the Prediction Profiler alerts the user to possible extrapolation or completely avoids drawing extrapolated points where the model may not be valid. Additionally, the user can perform optimization over a constrained region that avoids extrapolation. In this presentation we discuss the motivation and usefulness of extrapolation control, demonstrate how it can be easily used in JMP, and describe details of our methods.     Auto-generated transcript...   Speaker Transcript Hi, I'm Chris Gotwalt. My co-presenters, Laura Lancaster and Jeremy Ash, and I are presenting a useful new JMP Pro capability called Extrapolation Control. Almost any model that you would ever want to predict with has a range of applicability, a region of the input space where the predictions are considered to be reliable enough. Outside that region, we begin to extrapolate the model to points far from the data used to fit it, and using the predictions from the model at those points could lead to completely unreliable results. There are two primary sources of extrapolation: statistical extrapolation and domain-based extrapolation. Both types are covered by the new feature. Statistical extrapolation occurs when one is attempting to predict using a model at an x that isn't close to the values used to train that model. Domain-based extrapolation happens when you try to evaluate at an x that is impossible due to scientific or engineering constraints. The example here illustrates both kinds of extrapolation at once. Here we see a profiler from a model of a metallurgy production process. The prediction readout says -2.96 with no indication that we're evaluating at a combination of temperature and pressure that is impossible, in a domain sense, for this machine to attain. We also have statistical extrapolation, as the point is far from the data used to fit the model, as seen in the scatterplot of the training data on the right. In JMP Pro 16, Jeremy, Laura and I have collaborated to add a new capability that can give a warning when the profiler thinks you might be extrapolating; or, if you turn extrapolation control on, it will restrict the set of points you see to only those that it doesn't think are extrapolating. We have two types of extrapolation control. One is based on the concept of leverage and uses a least squares model; this first type is only available in the Pro version of Fit Model least squares. The other type we call general machine learning extrapolation control, and it is available in the Profiler platform and several of the most common machine learning platforms in JMP Pro. Upon request, we could even add it to more. Least squares extrapolation control uses the concept of leverage, which is like a scaled version of the prediction variance.
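In the usual least squares notation (added here for reference, not quoted from the slides), the leverage of a new point $x_0$ with training model matrix $X$ is

$$h(x_0) = x_0^\top \left(X^\top X\right)^{-1} x_0,$$

which is the prediction variance at $x_0$ divided by the error variance $\sigma^2$.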
Least squares extrapolation control is model-based, and so it uses information about the main effects, interactions, and higher-order terms to determine extrapolation. For the general machine learning extrapolation control case, we had to come up with our own approach. We wanted a method that would be robust to missing values and linear dependencies, fast to compute, and able to handle mixtures of continuous and categorical input variables, and we also explicitly wanted to separate the extrapolation model from the model used to fit the data. So when we have general extrapolation control turned on, there's only one supervised model, the one that fits the input variables to the responses we see in the profiler traces. The profiler comes up with a quick-and-dirty unsupervised model to describe the training set Xs, and this unsupervised model is used behind the scenes by the profiler to determine the extrapolation control constraint. (I'm having to switch because PowerPoint and my camera aren't getting along right now for some reason.) We know that risky extrapolations are being made every day by people working in data science, and we are confident that the use of extrapolations leads to poor predictions and ultimately to poor business outcomes. Extrapolation control places guardrails on model predictions and will lead to quantifiably better decisions by JMP Pro users. When users see an extrapolation occurring, they must decide whether the prediction should be used or not, based on their domain knowledge and familiarity with the problem at hand. If you start seeing extrapolation control warnings quite often, it is likely the end of the life cycle for that model and time to refit it to new data, because the distribution of the inputs has shifted away from that of the training data. We are honestly quite surprised and alarmed that the need for identifying extrapolation isn't better appreciated by the data science community, and we have made controlling extrapolation as easy and automatic as possible. Laura, who developed it in JMP Pro, will be demonstrating the option next. Then Jeremy, who did a lot of the research on our team, will go into the math details and the statistical motivation for the approach. Hello, my name is Laura Lancaster and I'm here to do a demo of the extrapolation control that was added to JMP Pro 16. I want to start with a fairly simple example using the Fit Model least squares platform. I'm going to use some data that may be familiar, the Fitness data in the sample data, with Oxygen Uptake as my response and Run Time, Run Pulse and Max Pulse as my predictors. And I want to reiterate that in Fit Model, fit least squares, the extrapolation metric used is leverage. So let's go ahead and switch to JMP. Now I have the Fitness data open in JMP, and I have a script saved to the data table to automatically launch my fit least squares model. I'm going to go ahead and run that script; it launches the least squares platform, and I have the profiler automatically open. We can see that the profiler looks like it always has in the past, where the factor boundaries are defined by the range of each factor individually, giving us rectangular bound constraints. When I change the factor settings, because of these bound constraints, it can be really hard to tell if you're moving far outside the correlation structure of the data. And this is why we wanted to add extrapolation control.
So this has been added to several of the platforms in JMP Pro 16, including fit least squares, and to get to it you go to the Profiler menu. If I look here, I see there's a new option called Extrapolation Control. It's set to Off by default, but I can turn it to either On or Warning On. If I turn it to On, notice that it restricts my profile traces to only go to values where I'm not extrapolating. If I turn it to Warning On, I see the full profile traces, but I get a warning when I go to a region that would be considered extrapolation. I can also turn on extrapolation details, which I find really helpful, and that gives me a lot more information. First of all, it tells me that the metric I'm using to define extrapolation is leverage, which is the case in the fit least squares platform. The threshold used by default is maximum leverage, but this is something I can change, and I'll show you that in a minute. I can also see the extrapolation metric for my current settings; it's this number right here, which will change as I change my factor settings. Anytime this number is greater than the threshold, I get the warning that I might be extrapolating; if it goes below, I no longer get that warning. The threshold itself does not change unless I adjust it in the menu, so let me go ahead and do that right now. I'm going to go to the menu and choose Set Threshold Criterion. In fit least squares you have two options for the threshold: initially it's set to maximum leverage, which keeps you within the convex hull of the data, or you can switch to a multiplier times the average leverage, which is the number of model terms over the number of observations. I want to switch to that threshold. The multiplier is set to 3 by default, so this is 3 times the average leverage, and I click OK. Notice that my threshold changes; it actually got smaller, so this is a more conservative definition of extrapolation. I'm going to turn extrapolation control back to On to restrict my profile traces, and now I can only go to regions where I'm within 3 times the average leverage. We have also implemented optimization that obeys the extrapolation constraints, so if I turn on Set Desirabilities and do the optimization, I get an optimal value that satisfies the extrapolation constraint; notice that this metric is less than or equal to the threshold. Now let's go to my next slide, which compares, in a scatterplot matrix, the optimal value with extrapolation control turned on and with it turned off. This is a scatterplot matrix I created with JMP, and it shows the original predictor variable data as well as the predictor values for the optimal solution using no extrapolation control, in blue, and the optimal solution using extrapolation control, in red. Notice how the unconstrained solution in blue violates the correlation structure of the original data for Run Pulse and Max Pulse, thus increasing the uncertainty of the prediction, whereas the optimal solution that used extrapolation control is much more in line with the original data. Now let's look at an example using the more general extrapolation control method, which we refer to as the regularized T squared method.
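Before moving on to the regularized T squared, here is a minimal sketch of the two leverage thresholds just demoed, assuming a training model matrix X with n rows and p model terms is already in hand; this is illustrative code, not the JMP implementation:

```python
import numpy as np

def leverage(X: np.ndarray, x_new: np.ndarray) -> float:
    """Leverage of a new prediction point given the training model matrix X."""
    xtx_inv = np.linalg.inv(X.T @ X)
    return float(x_new @ xtx_inv @ x_new)

def leverage_thresholds(X: np.ndarray, multiplier: float = 3.0):
    """Return (max leverage, multiplier times the average leverage)."""
    n, p = X.shape
    hat_diag = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    max_leverage = float(hat_diag.max())   # convex-hull style default
    avg_rule = multiplier * p / n          # average leverage equals p / n
    return max_leverage, avg_rule
```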
As Chris mentioned earlier, we developed the regularized T squared method for models other than least squares models. So we're going to look at a neural model for the Diabetes data, which is also in the sample data. The response is a measure of disease progression, and the predictors are the baseline variables. Once again, the extrapolation metric used for this example is the regularized T squared that Jeremy will describe in more detail in a few minutes. I have the Diabetes data open in JMP and a script saved of my neural model fit, so I'm going to go ahead and run that script. It launches the neural platform, and notice that I am using the random holdback validation method. I just want to note that anytime you use a validation method, the extrapolation control is based only on the training data, not your validation or test data. I have the profiler open, and you can see that it's showing the full traces; extrapolation control is not turned on. Let's go ahead and turn it on, and I'm also going to turn on the details. You can see that the traces have been restricted and that the metric is the regularized T squared. The threshold is 3 times the standard deviation of the sample regularized T squared; Jeremy is going to talk more about what all of that means in a few minutes. I just want to mention that when we're using the regularized T squared method there's only one choice of threshold, but you can adjust the multiplier: if you go to Extrapolation Control, Set Threshold, you can adjust this multiplier, but I'm going to leave it at 3. Now I want to run optimization using extrapolation control, so I'm just going to maximize and remember, and now I have an optimal solution with extrapolation control turned on. So let's look at our scatterplot matrix, just like before, with the original data as well as the optimal values with and without extrapolation control. This is a scatterplot matrix of the Diabetes data that I created in JMP. It has the original predictor values, the optimal solution using extrapolation control in red, and the optimal solution without extrapolation control in blue. You can see that the red dots appear to be much more within the correlation structure of the original data than the blue, and that's particularly true when you look at LDL versus total cholesterol. Now let's look at an example using the profiler under the Graph menu, which I'll call the graph profiler. It also uses the regularized T squared method, and it allows us to use extrapolation control on any type of model that can be created and saved as a JSL formula. It also allows us to have extrapolation control on more than one model at a time. So let's look at an example for a company that uses powder metallurgy technology to produce steel drive shafts for the automotive industry. They want to find optimal settings for their production that minimize shrinkage and also minimize failures due to bad surface conditions. We have two responses: shrinkage, which is continuous and for which we fit a least squares model, and surface condition, which is pass/fail and for which we fit a nominal logistic model. Our predictor variables are some key process variables in production, and once again the extrapolation metric is the regularized T squared.
So I have the powder metallurgy data open in JMP, and I've already fit a least squares model for my shrinkage response and a nominal logistic model for the surface condition pass/fail response, and I've saved the prediction formulas to the data table so that they're ready to be used in the graph profiler. If I go to the Graph menu, Profiler, I can load up the prediction formula for shrinkage and the prediction formulas for the surface condition, and click OK. Now I have both of my models launched in the graph profiler, and before I turn on extrapolation control you can see that I have the full profile traces. Once I turn on extrapolation control you can see that the traces shrink a bit, and I'm also going to turn on the details, just to show that indeed I am using the regularized T squared here. What I really want to do is find the optimal conditions that minimize shrinkage and minimize failures with extrapolation control on; I want to make sure I'm not extrapolating, so that I find a useful solution. Before I can do the optimization, I need to set my desirabilities. They're already correct for shrinkage, but I need to set them for the surface condition: I'm going to maximize passes and minimize failures. OK. Now I should be able to do the optimization with extrapolation control on, so I'll do maximize and remember, and now I have my optimal solution with extrapolation control on. Let's look once again at a scatterplot matrix of the original data, along with the solution with extrapolation control on and the solution with extrapolation control off. This is a scatterplot matrix of the powder metallurgy data that I created in JMP. It also has the optimal solution with extrapolation control as a red dot and the optimal solution with no extrapolation control as a blue dot, and once again you can see that when we don't use extrapolation control, the optimal solution is pretty far outside the correlation structure of the data. We can especially see that with ratio versus compaction pressure. So now I want to hand the presentation over to Jeremy to go into a lot more detail about our methods. Hi, so here are a number of goals for extrapolation control that we laid out at the beginning of the project. We needed an extrapolation metric that could be computed quickly with a large number of observations and variables, and we needed a quick way to assess whether the metric indicated extrapolation or not; this was to maintain the interactivity of the profiler traces, and we needed it to perform optimization. We wanted to support the various variable types available in the profiler, which are essentially continuous, categorical, and ordinal. We wanted to utilize observations with missing cells, because some modeling methods include these observations in the model fit. We wanted a method that was robust to linear dependencies in the data, which occur when the number of variables is larger than the number of observations, for example. And we wanted something that was easy to automate without the need for a lot of user input. For least squares models, we landed on leverage, which is often used to identify outliers in linear models. The leverage for a new prediction point is computed according to this formula. There are many interpretations of leverage.
One interpretation is that it's the multivariate distance of a prediction point from the center of the training data. Another interpretation is that it's a scaled prediction variance, so as a prediction point moves further away from the center of the data, the uncertainty of the prediction increases. We use two common thresholds from the statistical literature for determining whether this distance is too large. The first is maximum leverage; prediction points beyond this threshold are outside the convex hull of the training data. The second is 3 times the average of the leverages, which can be shown to be equivalent to three times the number of model terms divided by the number of observations. And as Laura described earlier, you can change the multiplier of these thresholds. Finally, when desirabilities are being optimized, the extrapolation constraint is a nonlinear constraint, and previously the profiler allowed constrained optimization only with linear constraints. This type of optimization is more challenging, so Laura implemented a genetic algorithm. If you aren't familiar with these, genetic algorithms use the principles of evolution to optimize complicated cost functions. Next, I'll talk about the approach we used to generalize extrapolation control to models other than linear models. When you're constructing a predictive model in JMP, you start with a set of predictor variables and a set of response variables; some supervised model is trained, and then a profiler can be used to visualize the model surface. There are numerous variations of the profiler in JMP: you can use the profiler internally in modeling platforms, you can output prediction formulas and build a profiler for multiple models, and, as Laura demonstrated, you can construct profilers for ensemble models. We wanted an extrapolation control method that would generalize to all these scenarios, so instead of tying our method to a specific model, we use an unsupervised approach, and we only flag a prediction point as extrapolation if it's far outside where the data are concentrated in the predictor space. This allows us to be consistent across profilers, so that our extrapolation control method will plug into any profiler. The multivariate distance interpretation of leverage suggested Hotelling's T squared as a distance for general extrapolation control; in fact, some algebraic manipulation will show that Hotelling's T squared is just leverage shifted and scaled. This figure shows how Hotelling's T squared measures which ellipse an observation lies on, where the ellipses are centered at the mean of the data and the shape is defined by the covariance matrix. Since we're no longer in linear models, this metric doesn't have the same connection to prediction variance, so instead of relying on the thresholds used for linear models, we make some distributional assumptions to determine whether the T squared for a prediction point should be considered extrapolation. Here I'm showing the formula for Hotelling's T squared; the mean and covariance matrix are estimated using the training data for the model. If P is less than N, where P is the number of predictors and N is the number of observations, and if the predictors are multivariate normal, then T squared for a prediction point has an F distribution. However, we wanted a method that generalizes to data sets with complicated data types, like a mix of continuous and categorical variables, data sets where P is larger than N, and data sets with missing values.
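The T squared distance referred to above, in notation added here, with training mean $\bar{x}$ and covariance estimate $S$:

$$T^2(x_0) = (x_0 - \bar{x})^\top S^{-1} (x_0 - \bar{x}).$$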
So instead of working out the distributions analytically in each case, we used a simple, conservative control limit that we found works well in practice: a three-sigma control limit using the empirical distribution of T squared from the training data. And, as Laura mentioned, you can tune this multiplier. One complication is that when P is larger than N, Hotelling's T squared is undefined; there are too many parameters in the covariance matrix to estimate with the available data, and this often occurs in typical use cases for extrapolation control, like partial least squares. So we decided on a novel approach to computing Hotelling's T squared that deals with these cases, and we're calling it a regularized T squared. To compute the covariance matrix we use a regularized estimator originally developed by Schafer and Strimmer for high-dimensional genomics data. It's just a weighted combination of the full sample covariance matrix, which is U here, and a constrained target matrix, which is D. For the lambda weight parameter, Schafer and Strimmer derived an analytical expression that minimizes the MSE of the estimator asymptotically. Schafer and Strimmer proposed several possible target matrices; the one we chose was a diagonal matrix with the sample variances of the predictor variables on the diagonal. This target matrix has a number of advantages for extrapolation control. First, we don't assume any correlation structure between the variables before seeing the data, which works well as a general prior. Also, when there's little data to estimate the covariance matrix, either due to small N or a large fraction missing, the elliptical constraint is expanded by a large weight on the diagonal matrix, and this results in a more conservative test for extrapolation control. We found this was necessary to obtain reasonable control of the false positive rate. To put this more simply, when there's limited training data, the regularized T squared is less likely to label predictions as extrapolation, which is what you want, because you're more likely to observe covariances by chance. We have some simulation results demonstrating these details, but I don't have time to go into all of that; instead, on the Community web page we put a link to a paper on arXiv, and we plan to submit it to the Journal of Computational and Graphical Statistics. This next slide shows some other important details we needed to consider. We needed to figure out how to deal with categorical variables; we simply convert them into indicator-coded dummy variables, which is comparable to a multiple correspondence analysis. Another complication is how to compute Hotelling's T squared when there's missing data. Several JMP predictive modeling platforms use observations with missing data to train their models, including Naive Bayes and Bootstrap Forest. These formulas show the pairwise deletion method we used to estimate the covariance matrix. It's more common to use row-wise deletion, which means all observations with missing values are deleted before computing the covariance matrix; that's simplest, but it can result in throwing out useful data if the sample size of the training data is small. With pairwise deletion, observations are deleted only if there are missing values in the pair of variables used to compute the corresponding entry, and that's what these formulas show. It seems like a simple thing to do.
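A minimal sketch of the ingredients just described: pairwise-deletion covariance, shrinkage toward a diagonal target, and a three-sigma style limit on the training T squared values. The lambda weight below is a fixed placeholder rather than the Schafer-Strimmer analytical estimate, and this is illustrative code, not the JMP Pro implementation:

```python
import numpy as np
import pandas as pd

def regularized_t2(train: pd.DataFrame, x_new: np.ndarray, lam: float = 0.3) -> float:
    """Regularized T squared of a new point relative to the training predictors."""
    U = train.cov().to_numpy()          # pandas uses pairwise deletion for missing cells
    D = np.diag(np.diag(U))             # diagonal target: sample variances only
    S = lam * D + (1.0 - lam) * U       # shrink the full covariance toward the target
    diff = x_new - train.mean().to_numpy()
    return float(diff @ np.linalg.inv(S) @ diff)

def control_limit(t2_train: np.ndarray, k: float = 3.0) -> float:
    """One reading of the three-sigma limit on the empirical training T squared values."""
    return float(t2_train.mean() + k * t2_train.std())
```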
Pairwise deletion just uses all the data that's available, but it can actually lead to a host of problems, because different observations are used to compute each entry. This can cause weird things to happen, like covariance matrices with negative eigenvalues, which is something we had to deal with. Here are a few advantages of the regularized T squared that we found when comparing it to other methods in our evaluations. One is that the regularization works the way regularization normally works: it strikes a balance between overfitting the training data and over-biasing the estimator, which makes the estimator more robust to noise and model misspecification. Next, Schafer and Strimmer showed in their paper that regularization results in a more accurate estimator in high-dimensional settings. This helps with the curse of dimensionality, which plagues most distance-based methods for extrapolation control. Also, the fields that have developed methodology for extrapolation control often have both high-dimensional data and highly correlated predictors; for example, in cheminformatics and chemometrics the chemical features are often highly correlated. Extrapolation control is often used in combination with PCA and PLS models, where T squared and DModX are used to detect violations of correlation structure; this is similar to what we do in the model-driven multivariate control chart. Since this is a common use case, we wanted an option that didn't deviate too far from these methods. Our regularized T squared provides the same type of extrapolation control, but it doesn't require a projection step, which has some advantages: we found that this allows us to better generalize to other types of predictive models. Also, in our evaluations we observed that if a linear projection doesn't work well for your data, for example when you have nonlinear relationships between predictors, the errors can inflate the control limits of projection-based methods, which leads to poor protection against extrapolation; our approach is more robust to this. Another important point is that we found a single extrapolation metric was much simpler to use and interpret. And here is a quick summary of the features of extrapolation control. The method provides better visualization of feasible regions in high-dimensional models in the profiler. A new genetic algorithm has been implemented for flexible constrained optimization. Our regularized T squared handles messy observational data, cases like P larger than N, and continuous and categorical variables. The method is available in most of the predictive modeling platforms in JMP Pro 16 and supports many of their idiosyncrasies. It's also available in the profiler under Graph, which really opens up its utility, because you can operate on any prediction formula. As a future direction, we're considering implementing a K-nearest-neighbor-based constraint that would go beyond the current correlation structure constraint. Often predictors are generated by multiple distributions, resulting in clustering in the predictor space, and a K-nearest-neighbors-based approach would enable us to control extrapolation between clusters. So thanks to everyone who tuned in to watch this, and here are our emails if you have any further questions.
Ron Kenett, Chairman, KPA Group and Samuel Neaman Institute, Technion Christopher Gotwalt, JMP Director, Statistical R&D, SAS   Data analysis – from designed experiments and generalized regression to machine learning – is being deployed at an accelerating rate. At the same time, concerns with reproducibility of findings and flawed p-value interpretation indicate that well-intentioned statistical analyses can lead to mistaken conclusions and bad decisions. For a data analysis project to properly fulfill its goals, one must assess the scope and strength of the conclusions derived from the data and tools available. This focus on statistical strategy requires a framework that isolates the components of the project: the goals, data collection procedure, data properties, the analysis provided, etc. The InfoQ Framework provides structured procedures for making this assessment. Moreover, it is easy to operationalize InfoQ in JMP. In this presentation we provide an overview of InfoQ along with use case studies, drawing from consumer research and pharmaceutical manufacturing, to illustrate how JMP can be used to make an InfoQ assessment, highlighting situations of both high and low InfoQ. We also give tips showing how JMP can be used to increase information quality by enhanced study design, without necessarily acquiring more data. This talk is aimed at statisticians, machine learning experts and data scientists whose job it is to turn numbers into information.     Auto-generated transcript...   Speaker Transcript   Hello. My name is Ron Kenett. This is a joint talk with Chris Gotwalt, and we basically have two messages that should come out of the talk. One is that we should really be concerned about information and information quality. People tend to talk about data and data quality, but data is really not the issue. We are statisticians, we are data scientists; we turn numbers, data, into information, so our goal should be to make sure that we generate high-quality information. The other message is that JMP can help you achieve that, and in surprisingly effective ways. So by combining Chris's expertise with an introduction to information quality, we hope these two messages will come across clearly. If I had to summarize what it is that we want to talk about, after all, it is all about information quality. I gave a talk at the Summit in Prague four years ago, and that talk was generic; it covered the journey from quality by design to information quality. In this talk we focus on how this can be done with JMP, so it is more detailed and technical than the general talk I gave in Prague. You can watch that talk; there's a link listed here, and you can find it on the JMP Community. So we're going to talk about information quality, which is the potential of a specific data set to achieve a specific goal with a given empirical method. In that definition we have three components: a certain data set, the data; the goal of the analysis, what it is that we want to achieve; and how we will do it, that is, with what methods we're going to generate information. That potential is assessed with a utility function.
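Written out, with $g$ the goal, $X$ the data, $f$ the empirical analysis method, and $U$ the utility, the definition just described is

$$\mathrm{InfoQ}(f, X, g) = U\!\left(f(X \mid g)\right).$$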
And I will begin with an introduction to information quality, and then Chris will take over, discuss the case study, and show you how to conduct an information quality assessment. Eventually this should answer the question of how JMP supports InfoQ; those are the take-away points from the talk. The setup for this is that we encourage what I call a lifecycle view of statistics. In other words, not just data analysis: we should be part of the problem elicitation phase and the goal formulation phase, which deserves a discussion. We should obviously be involved in the data collection scheme, whether it's through experiments, surveys, or observational data. We should also take time for the formulation of findings, and not just pull out printed reports of regression coefficient estimates and their significance; we should discuss what the findings are. Operationalization of findings means asking, OK, what can we do with these findings, and what are their implications? This needs to be communicated to the right people in the right way, and eventually we should do an impact assessment to figure out: we did all this; what has been the added value of our work? I talked about this lifecycle view of statistics a few years ago; it is the prerequisite, the perspective, for what I'm going to talk about. So, as I mentioned, information quality is the potential of a particular data set to achieve a particular goal using given empirical analysis methods. This is identified through four components: the goal, the data, the analysis method, and the utility measure. In a mathematical expression, the utility of applying f to X, conditioned on the goal, is how we define InfoQ, information quality. This was published in the Royal Statistical Society Series A in 2013 with eight discussants, so it was amply discussed; some people thought it was fantastic and some had a lot of critique of the idea, so it is a wider-scope consideration of what statistics is about. In 2016, Galit Shmueli and I also wrote a book called Information Quality, and in that context we did what is called deconstruction. David Hand has a paper called Deconstruction of Statistics; this is the deconstruction of information quality into eight dimensions. I will cover these eight dimensions; that's my part of the talk, and then Chris will show how this is implemented in a specific case study. Another aspect that relates to this is another, more recent book of mine, from a year ago, titled The Real Work of Data Science, where we talk about the role of data scientists in organizations; in that context we emphasized the need for the data scientist to be involved in generating information quality that meets the goals of the organization. So let me cover the eight dimensions; that's my intro. The first one is data resolution. We have a goal: we would like to know the level of flu in the country or area where we live, because that will impact our decision on whether to go to the park, where we could meet people, or to a jazz concert. And that concert is tomorrow.
If we look up the CDC data on the level of flu, that data is updated weekly, so we could get the red line in the graph in front of you; data from a few days ago, maybe good enough for our goal, maybe not. Google Flu Trends, which is based on searches related to flu, is updated continuously, online, so it will probably give us better information. For our goal, the blue line, the Google Flu Trends indicator, is probably more appropriate. The second dimension is data structure. To meet our goal we're going to look at data, and we should identify the data sources and the structure of those sources. Some data could be text, some could be video, and some could be, for example, a network of recommendations; this is an Amazon picture showing how, if you look at a book, you get some other books recommended, and if you go to those other books, you get more recommendations. So the data structure can come in all sorts of shapes and forms: text, functional data, images. We are not confined to what we used to call data, which is what you find in an Excel spreadsheet. The data could also be corrupted, could have missing values, could have unusual patterns that would be something to look into, patterns where things are repeated; maybe some of the data is just copy and paste, and we would like to be warned about such possibilities. The third dimension is data integration. When we consider data from these different sources, we're going to integrate them so we can do some analysis, with linkage through an ID, for example. But in doing that we might create some issues, for example disclosing data that normally should be anonymized. Data integration will allow us to do fantastic things, but if the data is perceived to have privacy exposure issues, then the quality of the information from the analysis we're going to do is going to be affected. So data integration should be looked into very carefully. This is what people used to call ETL: extract, transform and load. We now have much better methods for doing that; the Join option in JMP, for example, offers options for doing it. Temporal relevance is pretty clear: we have data, it is time-stamped somehow, and if we're going to do the analysis later, after the data collection, and if the deployment we consider is even later, then the data might not be temporally relevant. In a common situation, if we want to compare what is going on now, we would like to make that comparison against recent data or data from just before the pandemic started, not 10 years earlier. Official health statistics used to be two or three years behind in terms of timing, which made it very difficult to use official statistics in assessing what is going on with the pandemic. Chronology of data and goal is related to the decision we make as a result of our goal. If, for example, our goal is to forecast air quality, we're going to use some predictive models on the Air Quality Index reported on a daily basis, which gives us a one-to-six scale from hazardous to good.
There are values representing levels of health concern: zero to 50 is good; 300 to 500 is hazardous. Chronology of data and goal means that we should be able to make a forecast on a daily basis, so the methods we use should be updated on a daily basis. If, on the other hand, our goal is to figure out how this AQI index is computed, then we are not really bound by the timeliness of the analysis; we could take our time, because there's no urgency in getting the analysis done daily. Generalizability, the sixth dimension, is about taking our findings and considering where they could apply in more general terms, to other populations, other situations. This can be done intuitively: marketing managers who have seen a study on one market, call it A, might already understand the implications for market B without data, and physicists can make predictions based on mechanics, on first principles, without data. So some generalizability is done with data; this is the basis of statistical generalization, where we go from the sample to the population. Statistical inference is about generalization: we generalize from the sample to the population. And some generalization is domain based, in other words using expert knowledge, domain expertise, not necessarily with data. We have to recognize that generalizability is not achieved with statistics alone. The seventh dimension is construct operationalization, which is really about what it is that we measure. If we want to assess behavior or emotions, what can we measure that will give us data reflecting behavior or emotions? The example I typically give is pain. We know what pain is; how do we measure it? If you go to a hospital and ask the nurse how they assess pain, they will tell you, we have a scale, 1 to 10. It's very qualitative, not very scientific, I would say. If we want to measure the level of alcohol in drivers on the road, it will be difficult to measure directly, so we might measure speed as a surrogate. Another part of operationalization is the other end of the story. The construct is what we measure, which reflects our goal; the end result is that we have findings and we want to do something with them, to operationalize them. That is what action operationalization is about: what you do with the findings. When presenting on a podium, we used to ask three questions, and these are very important questions to ask. Once you have done some analysis, you have someone in front of you who says, oh, thank you very much, you're done, you, the statistician or the data scientist. This takes you one extra step, getting you to ask your customer these simple questions: What do you want to accomplish? How will you do that? And how will you know if you have accomplished it? We can help answer, or at least support, some of these questions. The eighth dimension is communication. I'm giving you an example from a very famous old map from the 19th century, which shows the march of Napoleon's army from France to Moscow, to Russia.
You see the width of the path indicates the size of the army, and then in black you see what happened to them on their way back. So basically this was a disastrous march. We can relate this old map to existing maps, and there is a JMP add-in, which you can find on the JMP Community, to show you with dynamic bubble plots what this looks like. So I've covered very quickly the eight information quality dimensions. My last slide puts what I've talked about into a historical perspective, to give some proportion to what I'm saying. I think we are really in the era of information quality. We used to be concerned with product quality in the 17th and 18th centuries. We then moved to process quality and service quality. This is the short memo proposing a control chart, from 1924, I think. Then we moved to management quality. This is the Juran trilogy of design, improvement and control. Six Sigma (define, measure, analyze, improve, control) is the improvement process of Juran, and Juran was the grandfather of Six Sigma in that sense. Then in the '80s, Taguchi came in. He talks about robust design. How can we handle variability in inputs by proper design decisions? And now we are in the age of information quality. We have sensors. We have flexible systems. We are depending on AI and machine learning and data mining, and we are gathering very big numbers, which we call big data. Information quality should be a prime interest. I'm going to try to convince you, with the help of Chris, that we are here, and that JMP can help us achieve that in a really unusual way. What you will see at the end of the case study that Chris will show is also how to do an information quality assessment on a specific study, and basically generate an information quality score. So if we go top down, I can tell you this study, this work, this analysis scores maybe 80%, or maybe 30%, or maybe 95%. And through the example you will see how to do that. There is a JMP add-in to provide this assessment. It's actually quite easy. There's nothing really sophisticated about that. So I'm done and Chris, after you. Thanks, Ron. So now I'm going to go through the analysis of a data set in a way that explicitly calls out the various aspects of information quality and show how JMP can be used to assess and improve InfoQ. So first off, I'm going to go through the InfoQ components. The first InfoQ component is the goal, so in this case the problem statement was that a chemical company wanted a formulation that maximized product yield while minimizing a nuisance impurity that resulted from the reaction. So that was the high level goal. In statistical terms, we wanted to find a model that accurately predicted a response on a data set so that we could find a combination of ingredients and processing steps that would lead to a better product. The data are set up in 100 experimental formulations with one primary ingredient, X1, and 10 additives. There's also a processing factor and 13 responses. The data are completely fabricated but were simulated to illustrate the same strengths and weaknesses of the original data. The date each formulation was made was also recorded.
We will be looking at this data closely, so I won't elaborate beyond pointing out that they were collected in an ad hoc way, changing one or two additives at a time rather than as a designed or randomized experiment. There are a lot of ways to analyze this data, the most typical being least squares modeling with forward selection on selected responses. That was my original intention for this talk, but when I showed the data to Ron, he immediately recognized the response columns as time series from analytical chemistry. Even though the data were simulated, he could see the structure. He could see things in the data that I didn't see. I found this to be strikingly impressive. It's beyond the scope of this talk, but there is an even better approach based on ensemble modeling using fractionally weighted bootstrapping. Phil Ramsey, Wayne Levin and I have another talk about this methodology at the European Discovery Conference this year. The approach is promising because it can fit models to data with more active interactions than there are runs. The fourth and final component of information quality is utility, which is how well we are able to achieve our goals, or rather, how do we measure how well we've achieved our goals? There's a domain aspect, which in this case is that we want a formulation that leads to maximized yield and minimized waste in post processing of the material. The statistical analysis utility refers to the model that we fit. What we're going for there is least squares accuracy of our model in terms of how well we're able to predict what would result from candidate combinations of mixture factors. Now I'm going to go through a set of questions that make up a detailed InfoQ assessment as organized into the eight dimensions of information quality. I want to point out that not all questions will be equally relevant to different data science and statistical projects, and that this is not intended to be rigid dogma but rather a set of things that are a good idea to ask oneself. These questions represent a kind of data analytic wisdom that looks more broadly than just the application of a particular statistical technology. A copy of a spreadsheet with these questions, along with pointers to the JMP features that are the most useful for answering a particular one, will be uploaded to the JMP Community along with this talk for you to use. As I proceed through the questions, I'll be demoing an analysis of the data in JMP. So Question 1 is: is the data scale used aligned with the stated goal? The Xs that we have consist of a single categorical variable, processing, and the 11 continuous inputs. These are measured as percentages and are also recorded to half a percent. We don't have the total amounts of the ingredients, only the percentages. The totals are information that was either lost or never recorded. There are other potentially important pieces of information that are missing here. The time between formulating the batches and taking the measurements is gone, and there could have been other covariate level information that is missing here that would have described the conditions under which the reaction occurred.
Without more information than I   have, I cannot say how important   this kind of covariate information   would have been. We do have   information on the day of the   batch, so that could be used as   a surrogate possibly. Overall we   have what are, hopefully, the most   important inputs, as well as   measurements of the responses we   wish to optimize. We could have   had more information, but this   looks promising enough to try   and analysis with. The second   question related to data   resolution is how reliable and   precise are the measuring devices   and data sources. And the fact   is, we don't have a lot of   specific information here. The   statistician internal the   company would have had more   information. In this case we   have no choice but to trust that   the chemists formulated and   recorded the mixtures well. The   third question relative to data resolution is is the data   analysis suitable for the data   aggregation level? And the   answer here is yes, assuming   that their measurement system is   accurate and that the data are   clean enough. What we're going   to end up doing actually is   we're going to use the   Functional Data Explorer to   extract functional principal   components, which are a data   derived kind of data   aggregation. And then we're   going to be modeling those   functional principal components   using the input variables. So   now we move on to the data   structure dimension and the   first question we ask is, is the   data used aligned with the   stated goal? And I think the   answer is a clear yes here. We're   trying to maximize   yield. We've got measurements for   that, and the inputs are   recorded as Xs. The second data   structure question is where   things really start to get   interesting for me. So this is   are the integrity details   (outliers, missing values, data   corruption) issues described and   handled appropriately? So from   here we can use JMP to be able   to understand where the outliers   are, figure out strategies for   what to do about missing values,   observe their patterns and so   on. So this is this is where   things are going to get a little   bit more interesting. The first   thing we're going to do is we're   going to determine if there are   any outliers in the data that we   need to be concerned about. So   to do that, we're going to go   into the explore outliers   platform off of the screening   menu. We're going to load up the   response variables, and because   this is a multivariate setting,   we're going to use a new feature   in JMP Pro 16 called Robust   PCA Outliers. So we see where   the large residuals are in those   kind of Pareto type plots.   There's a snapshot showing where   there's some potentially   unusually large observations. I   don't really think this looks   too unusual or worrisome to me.   We can save the large outliers   to a data table and then look at   them in the distribution   platform and what we see kind of   looks like a normal distribution   with the middle taken out. So I   think this is data that are   coming from sort of the same   population and there's nothing   really to worry about here,   outliers-wise. So once we've   taken care of the outlier   situation we go in and explore   missing values. So what we're   going to do first is we're going   to load up the Ys as...into the   platform, and then we're going   to use the missing value   snapshot to see what patterns   they are amongst our missing   values. 
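Outside of JMP, a rough analogue of the missing value snapshot can be produced with pandas; the file name and the Y1 to Y13 column naming below are hypothetical stand-ins, not the actual study data.

```python
import pandas as pd

df = pd.read_csv("formulations.csv")                         # hypothetical file name
responses = [c for c in df.columns if c.startswith("Y")]     # assumed Y1..Y13 naming

miss = df[responses].isna()
print(miss.sum().sort_values(ascending=False))               # missing count per response

# Row-wise missingness patterns: a '1' marks a missing cell. Counting how often
# each pattern repeats is a crude version of the snapshot's horizontal clusters.
patterns = miss.apply(lambda row: "".join("1" if v else "0" for v in row), axis=1)
print(patterns.value_counts().head())
```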
It looks like the missing values tend to occur in horizontal clusters, and there are also the same missing values across rows. So you can see that with the black splotches here. And then we'll go apply an automated data imputation, which goes ahead and saves formula columns that impute missing values in the new columns using a regression type algorithm that was developed by a PhD student of mine named Milo Page at NC State. So we can play around a little bit and get a sense of how the ADI algorithm is working. So it's created these formula columns that are peeling off elements of the ADI impute column, which is a vector formula column, and the scoring impute function is calculating the expected value of the missing cells given the non missing cells, whenever it's got a missing value. And it's just carrying through a non missing value otherwise. So you can see 207 in Y6 there. It's initially 207, but then I change it to missing and it's now being imputed to be 234. So I'll do this a couple of times so you can kind of see how it's working. So here I'll put in a big value for Y7, and that's now been replaced. And if we go down and we add a row, then all missing values are there initially and the column means are used for the imputations. If I were to go ahead and add values for some of those missing cells, it would start doing the conditional expectation of the still missing cells using the information that's in the non missing ones. So our next question on data structure is: are the analysis methods suitable for the data structure? So we've got 11 mixture inputs and a processing variable that's categorical. Those are going to be inputs into a least squares type model. We have 13 continuous responses and we can model them individually using least squares. Or we can model functional principal components. Now there are problems. The input variables have not been randomized at all. It's very clear that they would muck around with one or more of the compounds and then move on to another one. So the order in which the input variables were varied was kind of haphazard. It's a clear lack of randomization, and that's going to negatively impact the generalizability and strength of our conclusions. Data integration is the third InfoQ dimension. These data are manually entered lab notes consisting mostly of mixture percentages and equipment readouts. We can only assume that the data were entered correctly and that the Xs are aligned properly with the responses. If that isn't the case, then the model will have serious bias problems and have problems with generalizability. Integration is more of an issue with observational data science problems and machine learning exercises than with lab experiments like this. Although it doesn't apply here, I'll point out that privacy and confidentiality concerns can be identified by modeling the sensitive part of the data using the to-be-published components of the data. If the resulting model is predictive, then one should be concerned that privacy requirements are not being met. Temporal relevance refers to the operational time sequencing of data collection, analysis and deployment, and whether gaps between those stages lead to a decrease in the usefulness of the information in the study.
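As a side note on the imputation step above: the conditional expectation that the scoring function computes can be written, under a multivariate normal working model with the mean and covariance partitioned into observed (o) and missing (m) blocks, as the standard conditional-mean formula

\[
\hat{y}_m = \mu_m + \Sigma_{mo}\,\Sigma_{oo}^{-1}\,\bigl(y_o - \mu_o\bigr),
\]

which is a sketch of the idea rather than the exact ADI algorithm; ADI involves additional estimation details not shown here.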
In this case, we can only simply   hope that the material supplies   are reasonably consistent and   that the test equipment is   reasonably accurate, which is an   unverifiable assumption at this   point. The time resolution we   have when the data collection is   at the day level, which means   that there isn't much way that   we can verify if there is time   variation within each day.   Chronology of data and goal is   about the availability of   relevant variables both in terms   of whether the variable is   present at all in the data or   whether the right information   will be available when the model   is deployed. For predictive   models, this relates to models   being fit to data similar to   what will be present at the time   the model will be evaluated on   new data. In this way, our data   set is certainly fine. For   establishing causality, however,   we aren't in nearly as good a   shape because the lack of   randomization implies that time   effects and factor effects may   be confounded, leading to bias   in our estimates. Endogeneity,   or reverse causation, could   clearly be an issue here, as   variables like temperature and   reaction time could clearly be   impacting the responses, but have   been left unrecorded. Overall,   there is a lot we don't know   about this dimension in an   information quality sense.   The rest of the InfoQ   assessment is going to be   dependent upon the type of   analysis that we do. So at this   point I'm going to go ahead and   conduct an analysis of this data   using the Functional Data   Explorer platform in JMP Pro   that allows me to model across   all the columns simultaneously   in a way that's based on   functional principal components,   which contain the maximum amount   of information across all those   columns as represented in the   most efficient format possible.   I'm going to be working on the   imputed versions of the columns   that I calculated earlier in the   presentation. And I'm going to   point out that I'm going to be   working to find combinations of   the mixture factors that achieve   as closely as possible in a   least square sense, an ideal   curve that was created by the   practitioner that maximizes the   amount of potential product that   could be in a batch while   minimizing the amount of the   impurities that they   realistically thought a batch   could contain. So I begin the   analysis by going to the analyze   menu, bring up the Functional   Data Explorer. This has rows as   functions. I'm going to load up my   imputed rows, and then I'm going   to put in my formulation   components and my processing   column as a supplementary   variable. We've got an ID   function, that's batch ID. Here I   get in. I can see the functions,   both the overlay altogether, and   I can see the individual functions.   Then I can load up the target   function, which is the ideal.   And that will change the   analysis that results once I   start going into the modeling   steps. So these are pretty   simple functions, so I'm just   going to model them with   B splines.   And then I'm going to go into my   functional DOE analysis.   This is going to fit the model   that connects the inputs into   the functional principal   components and then connect all   the way through the   eigenfunctions to make it so   we're able to recover the   overall functions as they   changed, as we are varying the   mixture factors. 
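Schematically, the functional DOE setup just described expands each batch's response curve in functional principal components and then regresses the scores on the mixture and processing factors; in rough notation (a sketch of the idea, not JMP's exact parameterization):

\[
y_i(t) \;\approx\; \hat{\mu}(t) + \sum_{k=1}^{K} s_{ik}\,\hat{\phi}_k(t),
\qquad s_{ik} = f_k(x_i) + \varepsilon_{ik},
\]

where \(\hat{\mu}(t)\) is the mean curve, \(\hat{\phi}_k(t)\) are the eigenfunctions, \(s_{ik}\) are the FPC scores for batch \(i\), and each \(f_k\) is the regression model fit to the DOE factors \(x_i\).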
The functional principal component analysis has indicated that there are four dimensions of variation in these response functions. To understand what they mean, let's go ahead and explore with the FPC profiler. So watch this pane right here as I adjust FPC 1, and we can see that this FPC is associated with peak height. FPC 2 looks like it's kind of a peak narrowness; it's almost like a resolution principal component. The third one is related to kind of a knee on the left of the dominant peak. And FPC 4 looks like it's primarily related to the impurity, so that's what the underlying meaning is of these four functional principal components. So we've characterized our goal as maximizing the product and minimizing the impurity, and we've communicated that into the analysis through this ideal or golden curve that we supplied at the beginning of the FDE exercise we're doing. To get as close as possible to that ideal curve, we turn on desirability functions. And then we can go out and maximize desirability. And we find that the optimal combination of inputs is about 4.5% of Ingredient 4, 2% of Ingredient 6, 2.2% of Ingredient 8 and 1.24% of Ingredient 9, using processing method two. Let's review how we've gotten here. We first imputed the missing response columns. Then we found B-spline models that fit those functions well in the FDE platform. A functional principal components analysis determined that there were four eigenfunctions characterizing the variation in this data. These four eigenfunctions were determined via the FPC profiler to each have a reasonable subject matter meaning. The functional DOE analysis consisted of applying pruned forward selection to each of the individual FPC scores using the DOE factors as input variables. And we see here that these have found combinations of interactions and main effects that were most predictive for each of the functional principal component scores individually. The Functional DOE Profiler has elegantly captured all aspects into one representation that allows us to find the formulation and processing step that is predicted to have desirable properties, as measured by high yield and low impurity. So now we can do an InfoQ assessment of the generalizability of the data and the analysis. So in this case, we're more interested in scientific generalizability, as the experimenter is a deeply knowledgeable chemist working with this compound. So we're going to be relying more on their subject matter expertise than on statistical principles and tools like hypothesis tests and so forth. The goal is primarily predictive, but the generalizability is kind of problematic because the experiment wasn't designed. Our ability to estimate interactions is weakened for techniques like forward selection and impossible via least squares analysis of the full model. Because the study wasn't randomized, there could be unrecorded time-order effects. We don't have potentially important covariate information like temperature and reaction time. This creates another big question mark regarding generalizability. Repeatability and reproducibility of the study is also an unknown here, as we have no information about the variability due to the measurement system.
Fortunately,   we do have tools like JMP's   evaluate design to understand   the existing design as well as   augment design that can greatly   enhance the generalization   performance of the analysis.   Augment can improve information   about main effects and   interactions, and a second round   of experimentation could be   randomized to also enhance   generalizability. So now I'm   going to go through a couple of   simple steps to show how to   improve the generalization   performance of our study using   design tools in JMP. Before I   do that, I want to point out   that I had to take the data and   convert it so that it was   proportions rather than in   percents. Otherwise the design   tools were not really agreeing   with the data very well. So we   go into the evaluate designer   and then we load up our Ys and   our Xs. I requested the ability to   handle second order interactions.   Then...yeah, I got this alert   saying, hey, I can't do that   because we're not able to   estimate all the interactions   given the one factor at a time   data that we have. So I backed   up. We go to the augment   designer, load everything up,   set augment. We'll choose and I-   optimal design because we're   really concerned with   predicted performance here.   And I   set the number of runs to 148.   The custom designer requested   141 as a minimum, but I went to   148 just to kind of make sure   that we've got all ability to   estimate all of our interactions   pretty well. After that, it   takes about 20 seconds to   construct the design. So now   that we have the design, I'm   going to show the two most   important diagnostic tools in   the augment designer for   evaluating a design. On the   left, we have the fraction of   design space plot. This is   showing that 50% of the volume   of the design has   a prediction variance that is   less than 1. So 1 would be   equivalent to the residual   error. So we're able to get   better than measurement error   quality predictions over the   majority of the design. On the   right we have the color map on   correlations. This is showing   that we're able to estimate   everything pretty well. There's   some...because of the mixture   constraint, we're getting some   strong correlations between   interactions and main effects.   Overall, the effects are fairly   clean. And the interactions are   pretty well separated from one   another, and the main effects   are pretty well separated from   one another as well. After   looking at the design   diagnostics, we can make the   table. Here, I have shown the   first 13 of the augmented runs   and we see that we've got...we   have more randomization. We don't   have use of the same main effect   over and over again streaks.   That's evidence of better   randomization and overall the   design is going to be able to   much better estimate the main   effects and interactions having   received better, higher quality   information in this second stage   of experimentation. So the input   variables, the Xs, are accurate   representations of the mixture   proportion, so that's a clear   objective interest. The   responses are close surrogate   for the amount of the product   and amount of impurity that's in   the batches. We're pretty good on   7.1. there. The justifications   are clear. After the study, we   can of course go prepare a   batch that is the formulation   that was recommended by the FDOE   profiler. Try it out and see if   we're getting the kind of   performance that we were looking   for. 
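To make the fraction of design space diagnostic concrete, the quantity it summarizes is the relative prediction variance x0'(X'X)^{-1}x0 of a least squares prediction at a point x0. A small numpy sketch with a placeholder design matrix (not the actual augmented design) is below.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(size=(148, 12))   # placeholder model matrix for the augmented design
x0 = rng.uniform(size=12)         # placeholder model expansion of one candidate point

# Relative prediction variance Var(yhat(x0)) / sigma^2 for least squares.
rel_var = x0 @ np.linalg.inv(X.T @ X) @ x0

# The fraction of design space plot shows the distribution of this quantity over
# points sampled from the design region; values below 1 mean the prediction is
# more precise than a single run's residual error.
print(rel_var)
```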
It's very clear that trying out a confirmation batch like that would be the way to assess how well we've achieved our study goals. So now we come to the last InfoQ dimension, communication. By describing the ideal curve as a target function, the Functional DOE Profiler makes the goal and the results of the analysis crystal clear. And this can be expressed at a level that is easily interpreted by the chemists and managers of the R&D facility. And as we have done our detailed information quality assessment, we've been upfront about the strengths and weaknesses of the study design and data collection. If the results do not generalize, we certainly know where to look for the problems. Once you become familiar with the concepts, there is a nice add-in written by Ian Cox that you can use to do a quick quantitative InfoQ assessment. The add-in has sliders for the upper and lower bounds of each InfoQ dimension. These dimensions are combined using a desirability function approach into an overall interval for the InfoQ over on the left. Here is an assessment for the data and analysis I covered in this presentation. The add-in is also a useful thinking tool that will make you consider each of the InfoQ dimensions. It's also a practical way to communicate InfoQ assessments to your clients or to your management, as it provides a high level view of information quality without using a lot of technical concepts and jargon, and it does so with an easy to use interface. The add-in is also useful as the basis for an InfoQ comparison. My original hope for this presentation was to be a little bit more ambitious. I had hoped to cover the analysis I had just gone through, as well as another simpler one, one where I skip imputing the responses and just do a simple multivariate linear regression model of the response columns. Today, I'm only able to offer a final assessment of that approach. As you can see, several of the InfoQ dimensions suffer substantially without the more sophisticated analysis. It is very clear that the simple analysis leads to a much lower InfoQ score. The upper limit of the simple analysis isn't that much higher than the lower limit of the more sophisticated one. With experience, you will gain intuition about what a good InfoQ score is for data science projects in your industry, and you will pick up better habits, as you will no longer be blind to the information bottlenecks in your data collection, analysis and model deployment. This was my first formal information quality assessment. Speaking for myself, the information quality framework has given words and structure to a lot of things I already knew instinctively. It has already changed how I approach new data analysis projects. I encourage you to go through this process yourself on your own data, even if that data and analysis is already very familiar to you. I guarantee that you will be a wiser and more efficient data scientist because of it. Thank you.
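The exact arithmetic inside the add-in isn't shown in the talk, but the desirability-style combination it describes can be sketched as a geometric mean of the eight dimension ratings; the scores below are hypothetical and for illustration only.

```python
import numpy as np

# Hypothetical 0-1 ratings for the eight InfoQ dimensions.
scores = {
    "data resolution": 0.7, "data structure": 0.8, "data integration": 0.6,
    "temporal relevance": 0.7, "chronology of data and goal": 0.5,
    "generalizability": 0.4, "operationalization": 0.7, "communication": 0.9,
}

# Geometric mean: one very weak dimension drags the overall score down,
# which is the point of a desirability-type combination.
overall = np.prod(list(scores.values())) ** (1 / len(scores))
print(f"Overall InfoQ score: {overall:.0%}")
```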
Christel Kronig, Senior Analytical Scientist, Dr Reddy's Laboratories EU Ltd Andrea Sekulovic, Scientist Formulation Process Development, Dr. Reddy's Laboratories Ltd.   A key aspect of the development of generic drugs is that sameness to the innovator product must be demonstrated. In this study, the objective was optimisation of a milling process to generate a drug product with particle size attributes all in the same range as that of the innovator product. Two different modelling techniques were evaluated to model the particle size attributes over time: Functional Data Explorer versus Fit Curve. The curve parameters or Functional Principal Components (FPC) were then modeled as a function of the process parameters. Finally, the models obtained were used to predict the particle size attributes over time and identify combinations of process parameters likely to generate a drug product of the desired quality. A verification experiment was performed which resulted in a product with particle size attributes matching the requirements.     Auto-generated transcript...   Speaker Transcript Christel Kronig Well hi everyone, and thank you for joining this talk. My name is Christel Kronig and I'm a scientist at Dr. Reddy's in Cambridge in the UK. I helped with the data analysis on this project. My colleague, who also worked on this study, is Andrea Sekulovic; she's based at Dr Reddy's in the Netherlands and she's a formulation scientist. And so today I'm going to talk to you about the optimization of a milling process to match the drug product quality attributes of the innovator. And so the first part of the presentation will be talking to you about the process development, the objective of the project, what the study involved and what modeling options we considered, and then in the second part, I will look at the workflow that we developed in JMP for this study. Okay, so the objective of this study was really to understand the relationship between the process parameters for the milling process and the quality attributes for the drug product that we're making, so our responses. So we wanted to obtain a predictive model that we could use for scale up and also to optimize the conditions that we would need for this process. So there were several responses that we had to examine as part of the study, and they are particle size attributes. So we looked at micron and span, and by studying the innovator product, we also knew what the range needed to be to make sure we had a product that was within the specification and similar to the innovator product. So the profile of the responses would vary based on the milling time and the milling process parameters. To find the optimum conditions, we needed to optimize these parameters to make sure that we would have product in the design range that met the specification requirement. The process parameters were the milling speed, the flow, the size; we also had a loading parameter, excipient percent, API concentration and, of course, the time for milling. So the process development was... we started with making some initial batches. We didn't start directly with doing a design of experiments. We looked at the data that needed to be collected with those first few batches that were made and we looked at modeling options.
And then the team in the Netherlands decided to do an I-optimal design, so they looked at three parameters and time and performed some initial modeling, and after those first data sets, decided to add three additional parameters. So we augmented this design, adding 10 additional experiments. So the final data set that we looked at for the optimization had 38 batches that included six parameters and time, and this is what we used for the optimization for this study. So after that we then made some confirmation batches to check if the new settings would generate products that meet the requirements. Okay, so what modeling options did we consider for this project? So the default option for the team was really to model the response at selected time points. And it's easy to do that in the standard software, and the disadvantage of course is it's not possible to predict the outcome at other time points, and the optimum may be in between specific time points. So modeling of the profile over time enables greater understanding of how the process parameters affect the profile of the response over time, so you're more likely to reach an optimum. But for this, of course, you need more advanced modeling capabilities. And so we looked first at Fit Curve, which is available in JMP. And for our initial data set that worked quite well, so this is one of the functions, this Biexponential 4P, which appeared to be a good fit for most of the batches that we'd made initially when modeling some of the responses, and there's an example on the right of how this type of curve fitted our data quite well. And one of the issues we encountered is that it didn't work for all the batches, so, for example, in some cases we didn't have enough time points. On the left there are not enough time points to fit that model; you would need a minimum of five, for example, for this particular model. On the right we have one where we have enough time points, but that particular type of curve doesn't fit this particular batch very well. So it was difficult using this for the larger data set that we had, and so, for that reason, we didn't continue with that approach. And we also looked at Functional Data Explorer. So you can see here, looking at this platform for 10 different batches, you can see those on the left and how the profile over time was fitted, so there was no issue with lack of data points here. This gave a good view of differences between batches. So, for example, in green on the graph on the right hand side, you can see the fast batch is in this part of the space and the slow batch appears in a different part of the space, and that highlights the difference between those profiles, which is perhaps not so obvious when you look at the graph on the left hand side. So what you get with the Functional Data Explorer is it breaks down the profile over time into different principal components, so the FPC values that we see here, and this is what you then use for the next part of the workflow. So this is available in JMP Pro; I forgot to say this. So, first, before I show you what it looks like in JMP, I just wanted to take you through what the workflow looks like, what we are trying to do.
So we're starting with...apologies for that...we're starting with a time table which has the critical quality attributes, our responses; we have our time points and, for each of those time points, the critical quality attributes, and also the process parameters that were used to generate the batches. So you then take that data and you use FDE to get a model, and this will mean that you can express your CQAs, your responses, as a function of time and those functional principal components, FPCs. So the output of that will be that you then get a summary table. For each batch you have the CPPs and the FPC, the functional principal component, and then you can apply standard modeling to get the predictive model to express those FPCs as a function of your process parameters. And then the final step is to import that model back into the original table, so you can then express your responses as a function of time and your process parameters, which allows you to use this to find the optimum conditions and to make confirmation batches using those models. And what I've got to say is that for the modeling, and I'll show you this in JMP, we use this model validation strategy for designed experiments; that's something I presented three years ago at Discovery Summit. I won't go into the details of that, but that's there for reference if you want to look that up. So okay, so I'll now take you through what this workflow looks like in JMP and I'll switch over to JMP, so...just find my JMP journal. Okay, going to move this here. Okay, so we'll first start with the original data table, so we have a number of batches that were made, so 38 batches. Each batch has a number of time points. For example, the first batch here, we have 10 different time points. You have your six columns which are the process parameters. And then we have two responses for each of those data points. So the first thing to show you is to look at the data table and visualize that data set, what that looks like. So I have a script here using Graph Builder, which gives very quickly a good overview of what the data looks like. So, for example, you have one of the responses with the milling time here at the bottom, and you can straightaway see that the profiles are quite different depending on the batches; some are steeper than the others, some are very shallow, and also some were collected over a longer or shorter period of time. So we'll now look at the profile in a bit more detail and look at the two modeling approaches that we talked about previously in the slides. So the first one is using Fit Curve, so that's under the Analyze platform, under Specialized Modeling. So if I select Fit Curve, I'm going to pick my milling time as my X and one of my responses, and then I'm going to select batch, so I'm going to do that for each batch. Click OK. I then have, for each batch, a profile, the response over time. And then I'm going to use one of the models that is stored, you know, that JMP has already, and this is this Biexponential 4P, which I talked about before, so I won't go through the differences, you know, the different models, but I know this is one of the ones that we looked at previously for our data. So for example, for this batch it fits okay, but not brilliantly for some of the data points.
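For readers following along outside JMP, a biexponential decay of this general shape can be fit with scipy; the parameterization and the simulated data below are illustrative assumptions, not JMP's exact Biexponential 4P model or the study's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def biexp4(t, a, b, c, d):
    # One common four-parameter biexponential form.
    return a * np.exp(-b * t) + c * np.exp(-d * t)

# Simulated particle-size-style decay over milling time, for illustration only.
t = np.linspace(0.5, 48.0, 12)
rng = np.random.default_rng(0)
y = biexp4(t, 6.0, 0.8, 3.0, 0.05) + rng.normal(0.0, 0.05, t.size)

params, _ = curve_fit(biexp4, t, y, p0=[5.0, 0.5, 2.0, 0.1])
print(dict(zip("abcd", np.round(params, 3))))
```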
For this one, there's not enough time points so you don't,   you know, you can't really use that, but for some of the batches,   so this one, for example, that fitted really well, so you had the four coefficient, you know, estimates and they were statistically significant.   So what you would do, then, is just export all that data and get a summary table in the same way that we're going to do for the Functional Data Explorer but... So I won't do any more using the fit curve in this demo, but the same approach could be used.   So let's go to the Functional Data Explorer, so it's also in specialized modeling in JMP Pro.   And I'm going to select my response and my milling time and also my ID is my batch number.   So I'm going to not explain again in a lot of detail, bearing in...bearing in mind the time we have,   what modeling to use for this type of data. I know B-Spline works quite well with my data set, so this is what I'm going to use.   And as you can see JMP fitted a model for each of the batches that we have in our data table, seems to fit quite well.   So if I look further down, you can see that it's broken the profile into different components.   And you can see on the graph here where the batches are in spec. Now for this particular response, FPC1 here the top actually explains 96% of the variation in the data, which is pretty good so we wouldn't need, in this case, to   look at FPC2 and FPC3. In this instance, we probably only need to keep the first one. It wouldn't be the case for, for example, for the other response that we have in this data set, but here I'm going to restrict the number of FPCs to one.   And then I'm going to   export my summary data, so when I click on save summary, I have a new table appear   in JMP. This is so, you can see, 38 rows, so I have one row per batch, I have my batch number, I have my FPC value, and I have some prediction formula.   So I'm going to close this table now and use one I've prepared earlier, which has got the   FPC for all the responses that I want to look at in my data set.   Let me do that and switch back to my journal. Okay so we're now at step two, where I have a summary table that I've prepared and I have the columns for each of my responses for my FPCs.   So the first thing I want to do here is to use this   validation   technique for DOE, where I'm going to create extra rows, which I'm going to use for the validation report. So I gave you the reference in the slides if you want to understand more about that technique, you can do that. So we're then going to   fit a model for one of my FPCs.   We shall want to examine as a function of some of my process parameters.   And click on run.   And I'm going to use a stopping rule which is minimum AICc. Click on Go.   So JMP has found several process parameters and also interactions that it's found important. You can see the R squared and R squared adjusted   look good and you get also by using these...adding this extra R square validation, which also indicates that it's looking good and the model hasn't over fitted, for example. So I'm going to click on make model   and   then   I have a model where I can see that milling speed and size was important and some of the interaction terms also important, so what I need to do next is save the prediction formula.   And I can also save the script for when I want to do that again later on and save that to my data table.   So I'm going to close this window. 
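A rough outside-JMP sketch of this step, regressing a saved FPC score on the process parameters with forward selection and a minimum-AICc stopping rule, might look like the following; the file name, column names and the exact AICc bookkeeping are assumptions rather than the platform's implementation, which also used the DOE-based validation column mentioned above.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("fpc_summary.csv")                     # hypothetical summary table
candidates = ["milling_speed", "flow", "size",          # hypothetical column names
              "loading", "excipient_pct", "api_conc"]
y = df["FPC1"]

def aicc(fit):
    k = fit.df_model + 2              # slope terms + intercept + error variance
    n = fit.nobs
    return fit.aic + 2 * k * (k + 1) / (n - k - 1)

selected, best = [], None
while True:
    remaining = [c for c in candidates if c not in selected]
    if not remaining:
        break
    trials = [(aicc(sm.OLS(y, sm.add_constant(df[selected + [c]])).fit()), c)
              for c in remaining]
    score, term = min(trials)
    if best is not None and score >= best:
        break                          # stop once AICc no longer improves
    best, selected = score, selected + [term]

print("Selected terms:", selected, "AICc:", round(best, 1))
```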
So you would need to do this exercise of fitting the model for each of the FPC values that you have in your data table. And the last thing that we need to do is to use our prediction formula, so this is the formula I have just saved for FPC1. And this is the prediction formula that came from FDE earlier on. So if I right click, this is what it looks like. So I have the FPC1 for each batch, and then I have those extra columns which are functions of time. So what I need to do now is, instead of FPC1, the actual value, I'm going to use my prediction formula, which now is a function of my process parameters. Just literally replace that in the formula here. I'm going to click apply. OK, so again, you will need to do that for any of the models that you generate and then save those in your data table. So I'm going to close this and we'll go on to the next step in the journal. Okay, so what you need to do, then, is import back these formulas that we've just saved into the original table, so we can see how the responses vary as a function of time and the process parameters. So I'm going to open my time data table and I'm going to open my summary table with my model. So I'm going to select those columns with the models and I want to copy those columns into my original table, which I've now lost. There it is. So I'm going to click paste columns now. And I just want to double check that it's copied my formulas across. So, for example, if I go there, yes, my formula has been copied. So what I now have is the model which predicts the response as a function of the time and the process parameters. So I'm again not going to save this but use the final table, which has all the models that I need to then do the optimization. So this is the last step. So I now have again my process parameters, my milling time, the two responses and then the prediction formulas, which would have come across from the summary table. So I'm going to use these two columns now with the profiler. And what I need to do also is look at the factor grid, so the team wanted to set the API concentration, excipient and size to specific settings that they wanted to find the optimal conditions for. So if you lock those settings, click OK. And then you can use the desirability functions to set the specification limits, which I have done already. And then maximize desirability, and that gets JMP to find the best conditions to provide product that would meet the requirements that were set, that we wanted. So this is the technique that we used. So I'm going to come out of JMP now and go back to the slides just to show you what the outcome of this workflow was. So I'm going to switch back to the screen. Okay. So out of the 38 batches that we made, that the team in Holland made, there were only four where we had at least one time point where both of those results were within the range that was set. But you can see in the table here on the right that for the span response, in all four cases this was very close to the upper specification limit, and the team were really interested in finding conditions that would generate product where that particular response was, you know, well within the range whilst maintaining the other response also within the target range. And yeah, we had a great result. The model with the conditions that were selected predicted a span of 1.63.
The actual result for this batch was 1.78 and that was the lowest   span that was achieved of all the batches made. So the team were really happy with that result. So despite the slight underestimation of the model, this was still a pretty good result. And you can see in the screen here where   this batch appears in green and it's completely to the left of all the other batches.   And this is why, you know, we were able to achieve a good result. I guess it was, you know, using a slightly different combination of parameter that enabled this result to be achieved.   So just a conclusion really that the Functional Data Explorer in JMP Pro worked really well for this application. It yielded a good predictive model and the best result to date.   We couldn't make use of the fit curve approach so well, despite the initial promising   results that we'd seen at the beginning of the study, and we couldn't use that for the whole data set, but   nevertheless, the team was convinced of the value of looking at profiles over time and the value of this approach. And of course you can apply this to other types of data, for example in formulation, you know, in vitro release or in API development reaction conversion, for example.   And so, this is the end of the presentation, thank you to colleagues in the Netherlands that were involved with milling lots of batches, and   to Andrea who's the co author on this presentation for great teamwork. And we had good interactions between both sides so that led to some great results. So thank you for listening and hope you enjoyed the presentation and enjoy the rest of Discovery.
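The profiler step in this workflow, maximizing desirability over the process parameters subject to specification limits, can also be sketched in code. Everything below (the prediction formulas, specification ranges and coded factor bounds) is a made-up placeholder standing in for the models exported from the workflow, shown only to illustrate the desirability idea.

```python
import numpy as np
from scipy.optimize import differential_evolution

# Placeholder prediction formulas for the two particle-size responses as
# functions of three coded process parameters (speed, flow, size).
def predict_micron(x):
    speed, flow, size = x
    return 3.0 - 0.8 * speed + 0.3 * flow + 0.5 * size

def predict_span(x):
    speed, flow, size = x
    return 2.4 - 0.5 * speed - 0.2 * flow + 0.4 * speed * size

def desirability(value, low, high):
    # 1 at the middle of the specification range, falling to 0 at the limits.
    mid, half = (low + high) / 2.0, (high - low) / 2.0
    return max(0.0, 1.0 - abs(value - mid) / half)

def negative_overall_desirability(x):
    d_micron = desirability(predict_micron(x), 1.0, 3.0)   # hypothetical spec range
    d_span = desirability(predict_span(x), 1.2, 2.0)       # hypothetical spec range
    return -np.sqrt(d_micron * d_span)                     # geometric mean of the two

bounds = [(0.0, 1.0)] * 3                                  # coded factor ranges
result = differential_evolution(negative_overall_desirability, bounds, seed=1)
print("Best settings:", np.round(result.x, 2), "desirability:", round(-result.fun, 2))
```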
Clay Barker, JMP Principal Research Statistician Developer, SAS Paris Faison, JMP Statistical Tester, SAS Ernest Pasour, JMP Principal Software Developer, SAS   The Text Explorer platform in JMP is a powerful tool for gaining insight into text data. New in JMP Pro 16, the Term Selection feature brings the power of the Generalized Regression platform to Text Explorer. Term Selection makes it easy to build and refine regression models using text data as predictors, whether the goal is to predict new observations or to gain understanding about a process. In this talk, we will provide an overview of this new feature and look at some fun examples.     Auto-generated transcript...   Speaker Transcript Clay Barker Thank you, my name is Clay Barker. I'm a statistical developer in the JMP group and today I'm going to be talking about a new feature in JMP pro for 16. It's called term selection and I've worked on it with my colleagues Paris Faison and Ernest Pasour. So text data are becoming more and more common in practice. We may have customer reviews of a product we make, we may have descriptions of some events or maintenance logs for some of our equipment. And in this example here on my on my slides, this is from the aircraft incidents...incident data set that we have in sample data. And every row of data is a description of some airline incident, a crash or other other kind of malfunction. So if you've never used the Text Explorer platform, it was introduced in JMP 13 and we primarily use it to help summarize and analyze text data. So it makes it easy to do common tasks like looking at the most frequent or most commonly occurring words and phrases, and it makes it easy to do things like cluster the documents or look for themes, like topic analysis. And everything is driven by what's called the document term matrix. So what is the document term matrix? It's easiest to think of it just as a matrix of indicator variables for whether or not each document includes a particular word or phrase. And we may weight that by how often the word occurs in each document. So each document is a row in our document term matrix and each column is a word. So for a really simple example here, that first line is "the pizza packaging was frustrating." So we have a 1 for packaging, a 1 for pizza and a 1 for frustrating and 0 for all the other words. And, likewise, for the second line, we have a 1 for smell, great, taste, and pizza and 0s elsewhere. It's just a simple summary of what words occur in each document. I've also mentioned there's a there's a couple variations of the document term matrix. The easiest is the binary; it's just ones and zeros. But we may want to include information about how often each word occurs, so here's another slightly longer sentence where pizza appears multiple times. In the binary document term matrix, that pizza column is still a 1 because we don't care how often it occurs. For ternary it's 2, because the ternary is zeros, ones and twos. It's 0 if the word doesn't occur, it's 1 if it occurs once, and it's 2 if it occurs more than once. Regardless of if it occurs four times here, we still coded as a 2. And then frequency is really simple. It's just the number of times each word occurs in that document. So those are three really simple ones, we also offer TF-IDF weighting, which is kind of like a relative frequency, but you can learn more about these weighting schemes in Text Explorer platform. So what's the next step? What is the next thing you might want to use next? 
You might want to use the text to help with some outcome. So here I've made up some furniture reviews, and there's a star rating that's associated with every review, right. So we might think that those reviews give us some clues about why someone would rate a product higher or lower. So in this very first line, "the instructions were frustrating" and the user only rated us a 3. If that pattern happens a lot, and a lot of the lower ratings are associated with instructions, that might tell us that we need to improve the instructions that we ship with our furniture. So the simple idea is to use those in a regression model. We're going to take our document term matrix, we're going to combine it with our response information, and we're just going to do a regression. And that will help us understand why customers like or don't like our furniture, or whatever product we're making. It can help us classify objects based on their descriptions, we'll see some of that later, or maybe we want to understand why some machines fail or not based on some of their maintenance records. And the way we do this is really easy. We're really just making a regression model where each of our Xs, or our predictor variables, is one of the columns in the term matrix. So if we're modeling product ratings, that's a simple linear regression, and if we're modeling a binary outcome, like whether or not a customer would recommend our product, we model that outcome instead. Either way, the document term matrix possibly has hundreds of words, and it would make sense to not use them all. Not all of the data are going to be useful, so we're going to apply a variable selection technique to our regression model and we'll get a simpler model that's easy to interpret and that fits well. And here at the bottom it's easy just to visualize combining those indicator variables with our response rating, our star rating. So our solution that is going to be in JMP 16 is to bring regression models into the Text Explorer platform. If you've ever used the generalized regression platform to do variable selection, we're essentially embedding that platform inside of Text Explorer. So if you have JMP Pro, you'll see a Term Selection item in the red triangle menu for Text Explorer. This is what we're really excited about. It makes it easy and quick to build and refine these kinds of regression models. So when you launch this platform, and we'll go through some demos in just a minute, but just quickly, at the very beginning of the launch, it's just asking for information about our response and the kind of model that we want to fit. So it can handle both continuous responses, like a star rating, and when you specify a continuous response, it provides a filter so that you can filter out some of the rows based on the response. And when we have a nominal response column, we select the target level. So in this case, would you recommend our product, yes or no. We're going to be modeling the recommendation equal to no. And if we have a multiple level response like blue, green, yellow, we'll be picking the level that we want to model, and we'll see an example of this in just a minute. Then, after you specify the response, we're going to give the platform some information about the kind of model we want to fit. So when we do variable selection, are we going to use the elastic net or the lasso? Those are both variable selection techniques built into generalized regression.
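To ground the idea, here is a tiny stand-in using scikit-learn: a binary document term matrix plus an L1-penalized logistic regression that zeroes out most word coefficients. The reviews, labels and the choice of scikit-learn are illustrative assumptions; this is not how Term Selection or Generalized Regression is implemented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up furniture reviews and recommend (1) / not recommend (0) labels.
reviews = [
    "the instructions were frustrating and the paint was scratched",
    "great table, easy assembly, clear instructions",
    "paint smell everywhere, frustrating assembly",
    "sturdy chair, great value",
    "missing screws, frustrating experience",
    "easy to assemble and looks great",
]
recommend = [0, 1, 0, 1, 0, 1]

# Binary document term matrix: 1 if the word appears in the review at all.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(reviews)

# The L1 penalty shrinks most word coefficients exactly to zero, a rough
# stand-in for lasso-style term selection.
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
model.fit(X, recommend)

for word, coef in zip(vectorizer.get_feature_names_out(), model.coef_[0]):
    if coef != 0:
        print(f"{word}: {coef:+.2f}")
```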
And how are we going to do validation? Do we want to use the AIC or BIC? Additionally, we can specify details about our document term matrix. We can select the weighting and also select the maximum number of terms that we want to consider. So if we want to consider the top 200 most frequently occurring words, that's that's what it's set to now at 200. So what happens when you launch it and you hit that run button? Basically it does everything you need to do. It sets up the data properly behind the scenes so that you don't have to worry about it. It does variable selection, so now we have a small subset of words that we think are useful for predicting our response, and it presents the results in an easy to interpret report, and it's it's quite interactive, as we'll see in a moment. So you used to be able to do this by saving the document term matrix to your data set, launching a generalized regression platform yourself, but you don't have to do that anymore. It's it's all it's all in terms selection now. So that model specification, how do we know what, how do we know what to select? The estimation methods available are the elastic net and the lasso. The easy way to remember the difference between those is that the elastic net tends to select groups of correlated predictors. So in this setting, when our predictors are all words, that means that it will tend to select groups of words that tend to occur together. So if you think of instruction and manual, those words occur together frequently because of instruction manuals. So those two predictors would be highly correlated and the elastic net would probably select both of them, whereas the lasso would probably select instruction or manual but not necessarily both. Validation methods, we have the AIC and the BIC. So sort of the rule of thumb is that the AIC tends to overselect models and the BIC tends to underselect. So in our specifications, it's just that the AIC will tend to select a bigger set of words, while the BIC will select a smaller set of words. And personally I tend to use the AIC a lot in the setting because I would rather have more words than necessary than fewer but really that's a that's a matter of preference. And the document term weighting... the document term matrix weighting, I mean. It really depends on your problem. So in this example, the the word paint occurs in the review multiple times. That could mean that the reviewer was very... took...the paint was very important to that reviewer and that may be meaningful in our regression model, so we would want to use a weighting like frequency instead of binary. So then, what once you launch term selection, you'll end up in a place like this, where we have a summary of all the documents on the left and all of the words that we've selected on the right. And I'm just going to skip over this for now; it's easier to see when we start doing a demo. Another thing that we think is very useful is that we have a summary, so you can you can use term selection to fit a sequence of models and then they're all summarized at the top, and you can switch back and forth between them. And again we'll see that in just a moment. So let's just take a look at the platform. So first we'll take a look at this aircraft incident data set that's in JMP sample data folder. So every row in our table is an incident. And we know how much damage there was to the aircraft, what kind of injuries there were, and we have a description. 
So this last column is a description of exactly what happened in the incident. So we'll launch Text Explorer, and if you've never used Text Explorer before, on the left we have the most frequently occurring terms, and on the right we have them most frequently occurring phrases, so groups of words. So we want to use these words to maybe understand which...what causes a crash to cause more, you know, sustain more damage or maybe more serious injuries. So we're going to go to the red triangle menu and ask for term selection. So now we see this launch that we were looking at just a moment ago, and I want to learn more about damage, right. And in particular I want to...I might want to discriminate between incidents where the aircraft was destroyed versus less damage, so we'll select the target level to be destroyed. I'll mention that this early stopping, that kind of is a time saving feature. So when we do variable selection, it can be quite time consuming for much bigger problems. And if you leave this checkbox checked, it'll sort of say we think we have a model that's good enough; we're going to go ahead and stop early. I tend to uncheck that unless I know I have a much bigger problem. So I'll leave it with elastic net, AIC and we'll hit run. So now what's happening in the background is it created the data that I needed to fit this regression model, and it's fitting the regression model behind the scenes, and now I have my summary. So every every document is summarized on this left panel, so this is the first aircraft incident, this is the second, which is blank. So this first one is, the airplane contacted a snowbank on the side of the runway. And we have we have all the important words highlighted, so the blue words have negative coefficients, the orange words have positive coefficient. And if we're interested in the words, we can look at this panel on the right. I tend to like to sort by the coefficient so you can just click at the top. So, in this instance, words with negative coefficients are less likely to have been destroyed. So if the document contains the word landing, if the plane has made it to the point where it's landing, it's probably not a very serious incident. So these are less likely to be destroyed. On the flip side, we can look at the most positive terms, and these are the words that are most associated with the aircraft being destroyed. And those are words like fire, witnesses. If they're interviewing witnesses to describe the incident that probably means whoever was in the plane wasn't able to do an interview. So fire, radar, spinning, night, these are all words that are highly associated with with being destroyed. Now, maybe we're also interested in the injury level, and specifically, we're going to look at the worst injury levels. And we'll do term selection again. And I'll just accept that for to keep it quick. So now we can do the same sorts of things, but now looking at injury level instead. So if the if the interview includes the captain or landing, these are probably not very serious in terms of the injury level. But if if again if it mentions the radar, night, mountains, these are much more serious words. So we can kind of quickly go back and forth between these two...these two responses and see which words are predictive of those two injury level and damage level. So let's take a look at another example. So this is one that I really enjoy. Earlier, I was talking about pizza in my examples. Now we're going to talk about beer. 
So these are obviously things that are near and dear to my heart. So what I've got here is...I downloaded a description of every beer style, according to some beer review body. So every single beer has a description, and it says where the beer came from. So we have...the United States, Germany, Ireland. Use term selection to see...
Peng Liu, JMP Principal Research Statistician Developer, SAS Jian Cao, JMP Principal Systems Engineer, SAS   This talk will provide a comprehensive review of major updates in two time series-related platforms. More specifically, the updates include a forecasting performance-based model selection method, enhanced functions for studying the recently added state space smoothing models, and analysis capabilities using Box-Cox transformed time series. We will explain the motivations behind development efforts to help identify interesting use cases of the new features. We will present a few examples to illustrate some of the many possibilities for how these new features can be used. JMP 16 represents a major upgrade for time series platforms. Equipped with the new features, JMP opens the door to many intriguing new discoveries in time series analysis.     Auto-generated transcript...   Speaker Transcript This talk is to highlight some ... time series platforms. Three are from time series platforms and ... do we need Box-Cox transformed time series? Let's take a look at the data ... also known as the airline passenger data set. The original series is ... model from. Why? Let's take a look at a plot ... getting larger. And this series cannot be handled by the ... in the second picture. So the variation does not change with the ... time series ... So in the literature people will say, well, we will transform ... of the transformed scale, in this case here, it's the log scale. Sending it to the inverse ... transform. In the past ... streamline the whole process. What you need to do is to put ... need to do the models, make forecasts, then the software ... will put log passengers into Y, but now we don't have to. We ... to enter the Box-Cox transformation parameter Lambda. Zero means it's a log ... the red triangle menu and click either ARIMA or Seasonal ARIMA ... 12 for the seasonal part. Without intercept. Click ... forecast taking care of the inverse transformation. The ... will show in this plot, and the forecast had ... models is a workhorse in the Time Series Forecast platform. They can fit and forecast a lot ... performance is somehow comparable to the forecasting ... study why this type of model works and why some ... type of model into the Time Series platform, which is designed to ... a function of the unknown, unobserved state. Here at ... variables and the error term by either additive operations ... state is the level state time series. The trend state forms a ... state, and also one of the previous seasonal states. And ... the previous trend state will trend to the next trend state, and the level state ... point to another time point. And there are more state transitions than is ... series into Y and click OK. To fit this type of model, we ... set, I'm going to enter 12 for period, and I'm going to click the Select Recommended button.
From the additive error models and ... this particular set, I'm going to click Constraint Parameters ... recommended models to fit these time series and ... model with smaller AIC, and my eyes are on the first two ... models. And let me overlay the forecasts ... from the original time series more nicely. So my preference would be ... difference? Let me open the first one, MAM. Let's go down below. This ... this one, component states. This is special for this ... the first letter. And the trend is additive by ... the second part of this report are the state component ... part is the prediction of this specific state. The period of the time series ... has an increasing pattern in the past. It keeps increasing ... series and the pattern continues toward the future, and this ... observed, but the forecast is flat. This bothered me. Now let's look at the second ... state component graph. The level is increasing in the past, had ... future. This is a more reasonable plot that I can accept. So is it ... on to the second slide. This slide and then the next ... on interpreting the forecasts from this type of model. Here I would like ... up. I listed half of them here. Oh, nearly half. So let's focus ... some increasing trend that will taper off towards the end. And on the other hand, we can ... see from the forecast using this type of model. If seasonality is not involved, when I ... the first one, this is a flat forecast. If the seasonality ... have a linear increase pattern, and so on and so forth, similar to the others. Now ... it's merely increasing. After applying the ... the multiplicative seasonality on top of our increasing ... this type ... different types of shapes, flat ... we get those different shapes. So I re-enter ... what we eventually see in the forecast. You have the flat patterns or ... parameters. So I separated these parts and also I ... trend will usually look flat, we will get an increasing pattern in the level state, when it's linear and when it's curved. It all depends on how this ... increasing or decreasing in the level exponentially. So this ... think of it as a compound interest rate: if the level state increases ... they make forecasts, they try not to overshoot or undershoot the forecast ... how to interpret the forecast from state ... second one, none of these models are stationary. They are ... So if you are considering these time series, things ... third one, if you just see that time series, not ... a result in the next slide that will fit ... compare across types of models, be careful. This slide is to show how ... is the forecast. And similarly, I plot my ... apply these types of state space smoothing models to stationary time series?
Here I simulate a ... models to this time series, the best model turns out to be an ... rather different because it is a random walk model and the ... feature in this presentation, forecast on holdback. This feature allows you ... one is from another model. And then you can compare these ... to activate this feature. Then I need to specify the length of the holdback ... click Select Recommended, and check Constraint ... portion of the series, we listed the holdback ... by default, but you can always change the metrics you ... reports are similar to those from the analysis results without activating this ... let me summarize what we have learned from ... performance over the holdback data. But those criteria are ... process. We see it is rather different from how we use ... part of the model fitting process, so this is something ... holdback to evaluate different models based on their forecasting performance. So we ... column is the time series indicator, Y is the time series ... summarize the data set, either by time or by time series ... specification or change the model selection strategy, we ... change the selection in the first combo box to forecasting performance. Then we can choose forecasting performance ... we want to forecast. But you can change to any ... using the training time series, select the best ... series platform. First, analyze Box-Cox transformed time series. The second one is fit state ... as well, and using it as a model selection method. Thank you very much.
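The steps described in this talk can also be sketched by hand outside of JMP. The snippet below, which assumes a hypothetical CSV of the airline passenger series, applies a Box-Cox transformation with lambda = 0 (a log), fits the classic seasonal ARIMA(0,1,1)(0,1,1)12 without intercept, back-transforms the forecasts, and scores them on a one-year holdback; JMP 16 automates the transformation bookkeeping and the holdback comparison inside the platform.

# Hand-rolled sketch: log transform (Box-Cox, lambda = 0), seasonal ARIMA fit
# on a training window, inverse-transformed forecasts, accuracy on a holdback.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = pd.read_csv("air_passengers.csv")["passengers"]    # hypothetical file and column
log_y = np.log(y)                                      # Box-Cox with lambda = 0

holdback = 12                                          # hold back the last year
train, test = log_y[:-holdback], log_y[-holdback:]

# "Airline" model: ARIMA(0,1,1)x(0,1,1)12, no intercept
fit = SARIMAX(train, order=(0, 1, 1), seasonal_order=(0, 1, 1, 12), trend="n").fit(disp=False)

fc = np.exp(fit.forecast(steps=holdback))              # forecasts back on the original scale
actual = np.exp(test)
mape = np.mean(np.abs(fc - actual) / actual) * 100
print(f"holdback MAPE: {mape:.1f}%")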
Hadley Myers, JMP Systems Engineer, SAS Chris Gotwalt, JMP Director of Statistical Research and Development, SAS   The calculation of confidence intervals of parameter estimates should be an essential part of any statistical analysis. Failure to understand and consider “worst-case” situations necessarily leads to a failure to budget or plan for these situations, resulting in potentially catastrophic consequences. This is true for any industry but particularly for pharmaceutical and life sciences. Previous work has explored various methods for generating these intervals: Satterthwaite, Parametric Bootstrap and Bias-Corrected (Myers and Gotwalt, 2020 Munich), and Bias-Corrected and Accelerated (Myers and Gotwalt, 2020 Cary), which were all seen to have error rates that were too high for the small samples typical in DOE situations. Therefore, we make use of the new “Save Simulation Formula” feature in JMP Pro 16 in an add-in that improves upon these by allowing users to perform a “Bootstrap Calibration” on the Satterthwaite estimates. The add-in also includes the ability to do this for linear combinations of random components, taking advantage of another addition to JMP Pro 16. Further, we investigate a new version of the fractionally weighted bootstrap that respects the randomization restrictions of variance component models, as an alternative to the parametric bootstrap, using the new “MSA Designer” debuted at this conference.     Auto-generated transcript...   Transcript Hello, my name is Chris Gotwalt. My co-presenter Hadley Myers and I are presenting an add-in for obtaining improved confidence intervals on sums or linear combinations of variance components. This is part of a series of talks we have given as we work on improving and evaluating several approaches. Obtaining confidence intervals on sums of variance components is important in quality because it provides an uncertainty assessment on the repeatability plus the reproducibility of our measurement system. The problem is that when we ask for a 95% confidence interval, there are approximations involved and the actual interval coverage can be as low as 80%. In our previous studies, we found that two methods have improved coverage rates, parametric bootstrapping and Satterthwaite intervals, but it was still less than 95% in small samples. The earlier version of the add-in implemented the parametric bootstrap as a stopgap and Elizabeth Claassen implemented the Satterthwaite intervals in the Fit Mixed platform natively in JMP Pro 16. I want to stop here to give Elizabeth Claassen credit for making interval estimation of linear combinations of variance components so much easier and JMP Pro 16. She also greatly extended the Mixed Model output, which has made this presentation vastly easier. I'm also hoping that this presentation will serve as an inspiration to others to check these new Save commands out so they can get more from JMP Pro’s Mixed Model capabilities. Now we're going to combine the two approaches using a technique called Bootstrap Interval Calibration that was introduced by Loh in a 1991 Statistica Sinica article. Bootstrap Calibration is a very general procedure for improving the coverage of confidence intervals that can be applied to almost any parametric statistical model. I'm going to introduce the basic idea of Bootstrap Interval Calibration in the simplest terms that I can, and hand the mic over to Hadley, who's going to demo the add-in and discuss our simulation results. To make this simple, let's make it specific. 
Consider a very small nested Gauge R&R-type study where we want to estimate the total variation. We collect the data and run a nested variance components model with an Operator effect, a Part within Operator effect, and a residual effect. The software reports a Satterthwaite-based interval on the total. It's well known that this is an approximation that assumes a “large” amount of data is present in order for the actual coverage of the interval to be close to 95%. In small samples, the actual coverage, the probability that the interval procedure generates intervals that actually contain the true value of the estimated quantity, will tend to be less than 95%. Thing is the actual interval coverage is a complicated function of the design, true values of the functions, and a long list of other assumptions that are hard or impossible to verify. What we can do though is used the fitted model and their parameters to do a parametric bootstrap. When we do this, we know the true value of the quantity we are estimating because we were simulating using that value. We can do the simulation thousands of times. We apply the same model fitting process to all the simulated samples. We can collect the intervals from JMP and calculate how often they contain the generating value of the quantity that you were interested in. In this case we were interested in the sum of all the variance components, so the true value is 4.515. Suppose we took our original data set, took the estimates, use the Save Simulation Formula that is comes from Fit Mixed, and generated a large number of new data sets, and applied the same model fitting process that we applied here to each of them, and we collected up all of the confidence intervals that were reported around the total. After having done this, suppose that that...the estimated coverage, the estimated number of times that these intervals actually contained the truth, turned out to be 88%. So we wanted that 95% interval, but the Bootstrap procedure is telling us that the actual coverage is closer to 88%. Now we can play a little game and we can repeat the Parametric Bootstrap using a 99% interval this time. So we go through that process, we redo all the bootstrap intervals and when we did the 99% interval we get an actual coverage of approximately 98%. Now suppose we did this game over and over again until we found an alpha with actual coverage approximately 95%. So in this case, suppose we did that and we ended up with finding that 97.6% when we asked for a 97.6% interval, we actually got something like a 95% coverage. Then what we can do is set 1 minus alpha to 0.976 using the Fit Model launch dialogue, set alpha option and will get an interval that has been Bootstrap Calibrated to have approximate coverage 95%. This is still an approximation. There is still a simulation component to it, as well as a deeper underlying approximation that is extraordinarily hard to analyze, but it can be made easy to use, and this is where Hadley comes in. Now I'm going to hand it over to him and he will demo the add-in and go over the simulations that he did that show that we are able to get better coverage rates than before by applying Bootstrap Calibration to Satterthwaite intervals on linear combinations of variance components. Take it away Hadley. Thank you very much, Chris, and hello to everyone watching online wherever you are. 
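Here is a small, self-contained toy illustration of the calibration game Chris describes, written in Python. It deliberately uses a crude large-sample (Wald) interval for a normal variance in a small sample, whose actual coverage falls well short of 95%, rather than the mixed-model Satterthwaite interval the add-in calibrates; in the real procedure the fitted variance-component estimates play the role of the "true" values when simulating.

# Toy bootstrap-calibration example: estimate actual coverage of a nominal
# interval by simulation, then shrink alpha until actual coverage is ~95%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, true_var = 8, 4.0            # small sample; truth known here (fitted value in practice)

def wald_interval(x, alpha):
    s2 = x.var(ddof=1)
    se = s2 * np.sqrt(2.0 / (len(x) - 1))      # large-sample standard error of s^2
    z = stats.norm.ppf(1 - alpha / 2)
    return s2 - z * se, s2 + z * se

def coverage(alpha, n_sim=5000):
    hits = 0
    for _ in range(n_sim):
        x = rng.normal(0.0, np.sqrt(true_var), n)
        lo, hi = wald_interval(x, alpha)
        hits += (lo <= true_var <= hi)
    return hits / n_sim

# Play the game: the nominal 95% interval under-covers, so try smaller alphas
for alpha in (0.05, 0.02, 0.01, 0.005):
    print(f"nominal {1 - alpha:.3f} -> estimated actual coverage {coverage(alpha):.3f}")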
So I'm going to start out by showing you how the add-in works and how you can use it to calculate Bootstrap Calibrated confidence limits for random components in Mixed Models in JMP Pro 16. And from there we'll take a step back. We'll see how the add-in makes these calculations and I'll highlight some of the additions to Mixed Models in JMP Pro 16 that allow it to do that. From there, I'll show you the results of some simulation studies to give you an idea about how accurate this interval estimation method is, the Bootstrap Calibration method, and how it compares to some of the other methods for calculating confidence limits, as well as the situations where it's more or less accurate and some of the limitations and things you should be aware of if you're going to be applying it. We'll discuss possibilities for improvements in future work just briefly, and from there, I'll conclude by showing you the new MSA Designer, Measurement Systems Analysis Designer, available from the DOE menu in JMP Pro 16 so that you can quickly and easily design and analyze your own MSA Gauge R&R studies. So let's start out by looking at this data set. This is one that I pulled from the sample data files. I'm going to run this Fit Mixed script here that I've saved. So what we've got here are our random estimates, estimates for our random components. Now, it could be that you want to, for some reason, calculate an intermediate total, for example Operator and Part nested with Operator, or the three of these, you know, Operator and residual. Calculating those is very simple, we simply add these estimates, but what's not so simple is to determine their confidence limits. There's a new feature in Mixed Models that's been added in 16, the Linear Combination of Variance Components feature right here, and so what you can do is you can click that. You can choose the combination of variance components that you're interested in, and you can press Done. So now we have an estimate for those components as well as their confidence limits. So, what I'm going to do now is I'm going to take this one step further and I'm going to calculate the Bootstrap Calibrated Satterthwaite estimates, and I'm going to do that by going to my add-ins and clicking the Bootstrap Calibrated confidence intervals there. From here we can set the number of simulations. 2500 is the recommended and default number; it's also the default number in some of the other simulation platforms in JMP Pro. I'm going to choose this one. But one thing to note is that it takes some time to be able to do this, and so in the interest of time what I'm going to do is I'm going to stop it early. And here we have our calibrated intervals, calibrated upper and lower confidence limits added to the report. So let's take a step back and see what happened there. I'm going to go ahead and add this again. Now, one thing that the add-in does, as soon as you run it, is it adds this simulation formula to the data table, so you can see the simulation formula here. When the add-in is closed, the simulation formula disappears. The simulation formula takes advantage of another feature that's been added to the Mixed Models platform in JMP Pro, and that is the Save Simulation Formula feature here. So what this allows you to do is to save the simulation formula and then to use that, for example, to simulate these values here. So, we can swap out our "Y" with our new simulation formula, and go ahead and run that.
So when you run the add-in, this is all done in the background. But this is how the add-in goes about calculating these intervals. So I'm going to stop this early, once again in the interest of time. And now we see here the samples estimated for each. simulation. And so how the add-in works is it takes all of these. And it calculates new estimates for the upper and lower Satterthwaite intervals from this estimate and this standard error, swapping out different values for alpha. So what we're aiming for 0.05, right? So that we get 95% upper and lower limits, and what it does is it finds an alpha value that results in 95% coverage, that is 95% hits and 5% misses, swaps that in, that's how you get your calibrated intervals. So I hope you enjoyed seeing that. I hope you find it useful. We've done some simulation studies and what we found out is that the intervals, which you can see here for four operators and 12 days as our random components, we've achieved misses of about 7%, so a 92.8 hit ratio. Now this is better than all of the others, including this, so the linear combination, which is simply the standard Satterthwaite interval calculated on the combination of linear components, as well as the Bootstrap quantiles, the bias-corrected intervals in the bias-corrected and accelerated intervals, but as you'll see these intervals improve, all of them, as you increase your number of Operators from 4 to 8 and the number of Days from 12 to 24. So increasing the levels of these random components result in much better, much more accurate estimates for the confidence limits, and so much so that we now have a method here that is equivalent, just, to an alpha value of .05. So. this improvement in performance of course, comes at a cost, and one of those costs is the length of the intervals. And so you can see here, that with our Bootstrap calibrated, well with all of our intervals in fact, that when we have increasing number of Operators, that the length of the interval is much more bundled closer to 0 than it is when you've got smaller number of Operators. You can see that this tails out much further, so that's this blue area here. That's true for all of them, but it's especially true for the Bootstrap Calibrated interval. You can see this long tail here. On average, you're going to get longer lengths using this method, but you have a more accurate method. Exploring that a little bit deeper, you can see here that this increase in length is true for four Operators, as well as eight Operators, and it is significant. Statistically significant. The other thing that I looked at, is the effect of adding repetitions, so the difference between two repetitions and five repetitions, and what you'll see here is that there really is no difference. So looking across the different sets of combinations from four Operators and two reps to four Operators and five reps, about 6 measurements total versus 3 measurements, we really don't gain anything. All of these are equivalent to each other. So that's something to be aware of, that you see improvements in accuracy when increasing the number of Operators, and you don't see improvements when increasing the number of repetitions. One thing that I'd like to mention as a possibility to improve upon these results is the Fractional Random Weight Bootstrap, which we would have liked to have been able to implement for this in time for this conference. 
We weren't able to do that, to take this and to apply it to random variance components, and so we hope to be able to do that in future work and perhaps even see an improvement upon the Bootstrap Calibrated interval. And then the other thing that I'd like to highlight before I go is the new MSA designer that's been added to JMP 16, and so from here what we can do is we can very quickly create our own design in order to be able to perform our own MSA or Gauge R&R analysis. And so let's see, I'll do this with three Operators and Five parts. I'll label these A, B and C. And we'll do one repetition of each. So that's two measurements total. So here we've got a table with our design. What I can do is I can press this button to very quickly send that to the different operators, have them fill out their parts, send that back to me. And then I can add those results together. So I'll just sort this because I've got another table over here where I've done this ahead of time. So I'll just add these values over there. And now from the scripts within the table we can quickly and easily do our own Measurement Systems Analysis and Gauge R&R. So I hope you found this useful. I hope you continue to enjoy the talks at this conference. Thank you very much for listening.  
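For readers without JMP Pro 16 handy, a crossed worksheet like the one the MSA Designer builds can be sketched in a few lines; the operator and part labels, the repetition count, and the randomization seed below are placeholders.

# Build a crossed MSA / Gauge R&R worksheet: every operator measures every
# part, with a fixed number of repetitions, in randomized run order.
import itertools
import pandas as pd

operators = ["A", "B", "C"]
parts = [f"P{i}" for i in range(1, 6)]
reps = 2                                     # two measurements per operator/part cell

runs = [{"Operator": o, "Part": p, "Rep": r + 1}
        for o, p, r in itertools.product(operators, parts, range(reps))]
design = (pd.DataFrame(runs)
            .sample(frac=1, random_state=1)  # randomize the run order
            .reset_index(drop=True))
design["Measurement"] = None                 # to be filled in by each operator
print(design.head())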
Damien Perret, PhD, R&D Scientist, CEA François Bergeret, PhD, Ippon Carole Soual, MS, Ippon Muriel Neyret, PhD, R&D Scientist, CEA   JMP software was implemented at CEA in 2010 by R&D teams who develop nuclear glass formulation. A first communication occurred at Discovery Summit 2011 in Denver, when we explained how we use JMP statistical analysis platforms to compare glass composition domains with a high degree of complexity. Then, many improvements were made by developers to provide JMP with powerful methods for generating mixture DOEs, in order to investigate highly constrained experimental domains. During Discovery Summit 2014 in Cary, we showed how all these efforts enabled us to build even more accurate property-to-composition predictive models. A very innovative methodology was recently developed by glass formulation scientists at CEA in collaboration with Ippon statisticians to predict the glass viscosity. Our approach is based on an automatic and intelligent subsampling of the data, and combines techniques of optimal designs and several predictive methods in JMP and JMP Pro. Predictions appear to be very accurate, compared to those obtained from other statistical models published in the literature.     Auto-generated transcript...   Speaker Transcript Damien PERRET Hello, welcome, and thank you for watching this presentation for the Europe Discovery Summit conference online. My name is Damien Perret. I am an R&D scientist at CEA in France, and I am joined by my colleague and friend François Bergeret, statistician and founder of Ippon Innovation in France. So with François, we are very happy to be here today, and we would like to thank the Steering Committee who gave us the opportunity to present this work, which is about advanced statistical methods applied to glass viscosity prediction with JMP. So let's start with a few words about the French Alternative Energies and Atomic Energy Commission. CEA is a French government organization for research, development and innovation in four areas: defense and security, low-carbon energies, technological research and fundamental research. CEA counts about 20,000 people across nine locations. We have strong relationships with universities through various joint research units, a high number of patents and start-up creations, and a budget of around 5 billion euros. François Bergeret Ippon Innovation works in statistics and data science, including studies, consulting and training. We are very proud to have been general partners for several years. I'm also very happy to present with my friend Damien today. Ippon also proposes advanced solutions for zero defect and process control. I have personally been a JMP user since 1995, with JMP 3. Damien PERRET So our main objective in this work is to create statistical models to predict glass properties, and for this talk today, we focus on the glass viscosity. To do that, experimental data come from both a commercial database and from our own database at CEA. We wanted the algorithms to be coded in JSL and implemented in JMP Pro 15. The response of the model is the glass property of interest, so viscosity for this example, and the factors are the contents of the different glass components. So, here is some background information. Glass is a non-crystalline solid.
It is obtained by a rapid quench of a glass melt, and from a material point of view, a glass is a blend, a mixture of different oxides. The number of oxides is variable, from two or three in a very simple glass to about 30 and even more in the most complex compositions. There is a long tradition in the calculation of glass properties, and the first models were created in Germany at the end of the 19th century. Since then, the amount of published literature in the field of glass property prediction has tremendously increased, so that today we have a huge amount of glass data available in commercial databases, which are also used to predict glass properties. But despite all the efforts that have been made in the past to predict glass properties, challenges remain for the prediction of glass viscosity. This is because glass viscosity is a property that is difficult to predict. First, viscosity is very dependent on physical mechanisms that can occur in the glass melt, depending on the glass composition, like phase separation or crystallization, for example. Also, viscosity is the only property having such a huge range of variation, spanning several orders of magnitude. So here is an example that shows this difficulty. We selected three compositions of SBN glass, which is a very simple glass with only three oxides. And we applied the best-known models from the literature to calculate the viscosity. Then we compared the predicted values with the experimental values we measured with our own device. So you can see that even for a very simple glass, it is not easy to obtain one reliable value for the predicted viscosity. So here is a picture we like to use to give a view of the database, where each dot is one glass in a multidimensional view of the domain of compositions. Data may come from isolated studies, from studies using experimental designs, or from studies where one component is varied at a time. We spent a lot of time in the past applying different machine learning techniques to the data found in the entire database. A classical approach was used with a validation set, but in the end, no statistical model with an acceptable predictive capability was found for the viscosity. So we decided to use a different approach. Instead of using all the data, we think it is better to create a model using data close to the composition where we want to predict the viscosity. So, for example, if we want to predict here on the red dots, one model will be created from the data we have in this area, and a different model will be created if we want to predict the property at another composition. That's why we say that this technique is dynamic: the model depends on the composition, and it is built and fitted where we want to predict. And we say it's automatic because we don't have to do this manually; every step is done by algorithms implemented in the tool. One of the most important points is certainly the determination of the optimal subset of data to create the model. For that we have implemented two methods of subsampling. In the first method, a theoretical or virtual design of experiments is generated around the composition of interest, and then each run of the design is replaced by the most similar experimental data point present in the database, leading to the final training set.
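A minimal sketch of the first subsampling method, as I read the description above, might look like the following: generate a small virtual design around the target composition, replace each virtual run with its nearest neighbor in the measured database, and fit a local model on that training set. The file name, oxide columns, perturbation size, and the plain linear model are placeholder assumptions; a real implementation would use a proper mixture design that respects the constraint that the oxide fractions sum to one, and the models described in the talk.

# Method 1 sketch: virtual design around the target composition, replaced by
# nearest real glasses, then a local model fitted on that training set.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

db = pd.read_csv("glass_database.csv")        # hypothetical database of measured glasses
oxides = ["SiO2", "B2O3", "Na2O"]             # hypothetical composition columns
target = np.array([0.60, 0.15, 0.25])         # composition where we want a prediction

# 1) virtual design: small perturbations around the target composition
rng = np.random.default_rng(0)
virtual = target + rng.uniform(-0.03, 0.03, size=(20, len(oxides)))

# 2) replace each virtual run by the closest real glass in the database
X_db = db[oxides].to_numpy()
nearest = {int(np.argmin(np.linalg.norm(X_db - v, axis=1))) for v in virtual}
train = db.iloc[sorted(nearest)]

# 3) fit a local model on the training set and predict at the target composition
model = LinearRegression().fit(train[oxides], np.log10(train["viscosity"]))
print(model.predict(pd.DataFrame([target], columns=oxides))[0])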
And the second method we have implemented in the tool is based on data sets of different sizes created around the composition of interest. A small data set is generated by the tool, and models are created on this small subset to predict the viscosity. Then bigger and bigger data sets are generated, and the optimal size is evaluated by statistical criteria associated with each subset. François Bergeret Glass viscosity is not easy to predict, so we decided to use different statistical and machine learning methods: polynomial models with transformation; generalized regression using a lognormal distribution, which is very powerful in JMP Pro and can give better results than the polynomial models with transformation; and neural networks, which are very powerful in terms of prediction. As we have two data sets, as mentioned earlier, we have six predictions for each response. The next slide is a schematic view of the tool. Inputs are the composition of the glass and the temperature at which the viscosity has to be predicted. If we look here, the code and the algorithms have been implemented for the two methods we described just before. The strength of the tool is that, instead of getting only one prediction, six values are calculated, with statistical criteria associated with all of them that can be evaluated by the user. Damien PERRET So, here are some of the key parameters. It is very important to take into account as many inputs from the glass experts as possible. For example, we had to create specific algorithms to deal with the nature and the role of the oxides on viscosity. Another point of major importance is related to the origin, and therefore the reliability, of the data. For this, a significant amount of time in this project was spent on the constitution of a reliable database. We also had to implement weights, and we had to study different ways of calculating the distances between the glass compositions. So now it's time... François Bergeret Okay, I'm going to share my screen now to show you a demo of the code, so you should see my screen now. I'm just executing the code; it's a complex JSL program that we have been developing over several months with CEA. So I just executed the code and now I'm going to show it to you. Discovery...so here I'm opening the files for the code. And it's running, okay. The code is executing, so I will comment a little bit. We have several loops in this code. Of course, the first step is to identify the data and the functions. After that we have a loop: first of all, we have what we call the adaptive iteration and the reading of the database. Because it is adaptive, as mentioned by Damien, we are looking for the best subset of data. And we also have here the design of experiments approach where, by optimizing the design, we are getting the right data. After that, and it is running actually, we are predicting the glass transition temperature. Okay, and we have, as I mentioned, three models and, for each model, two databases, so we have a total of six predictions. So, as you see, it takes a little while to execute, something like one minute, and when it is done, we will have all the output of the program for the glass transition temperature and, of course, the viscosity. Using JSL has been very, very useful and, in addition, in terms of users, as you will see with Damien, it's very easy for the experts to use. So Damien, you can talk and stop my sharing. Damien PERRET Okay, so. Just one...
Can you see my screen now? Yes, okay. So this is the general statistical report created by the tool. First, we have the composition of interest of the glass. And then we have, on this graph, the predicted values. On the Y axis, we have the predicted values of the viscosity calculated by the three algorithms and for the two methods, and on the X axis, we have the number of the enlargement for the second method. In red, we have the median of the predictions, which can be sufficient for a non-statistician user, but if we want to investigate the statistical details, we have a lot of information in this report to study the quality of each model. For example, we can check the values of the PRESS for the different models. Here it is for the multilinear regression BIC F model. Here we see that the PRESS values tell us that the prediction with method number one is a little bit better than with the second method, and we also see how the model behaves as the training set is enlarged. We also have different statistical values. For example, we have the R squared value for the different algorithms and for the different models. And here we have even more details on the models. For example, for the first method we can compare the theoretical and the actual design of experiments. We have the prediction formula for the different models. Also, we have some information on the estimates, and we have a lot of information for the second method as well. So at the end, we have a lot of statistical details and information that are very, very useful to the user. And here, at the end, we have the compositions of the most similar glasses in the database, for which we have an experimental value of the viscosity, so this is very, very useful also. Okay, so let's go back to the PowerPoint. So these are the results we obtained. The tool's predictive capability was evaluated by extracting 230 rows from the global database. In this table, we have the relative error of the viscosity prediction for different types of glass and for the global subset of data. Three quantiles are given. The median means that 50% of the predicted values have a relative error that is below the value indicated in the table, and we also have the 75% and 90% quantiles in this table. When we talk about glass viscosity, traditionally we consider that a prediction error around 70% is very good. So we can see that for the majority of the data the model capability is fine, and we were very happy with the results we obtained. As a comparison, here are again the results we obtained for the very simple SBN glass, with only three oxides, when we applied the models available in the literature. We can see that the values for the relative error of prediction were much higher and could vary a lot from one model to another. And again, this was for a very simple glass with only three oxides. In some cases we have errors that are more important, but if we look at the data in detail, here on this graph, with the predicted values on the x axis and the experimental values on the y axis, we see that the biggest errors of prediction are obtained for these two glasses coming from the commercial database SciGlass, from the same reference, which is a patent and for which the experimental error of the equipment is not mentioned.
And also, for all these compositions that have a high aluminum content, we think that crystallization is very likely to occur, and then we can't be totally sure that the experimental values were correct. Finally, we applied our methodology to predict another glass property, the glass transition temperature, which is an important property in glass technology. Here are the results we obtained, which are even better than for the viscosity. Here the overall relative error of prediction is below 5%, which is really good, because we know that this property can vary a lot, depending on the thermal history of the glass and depending on the experimental device. So here the tool's capabilities are very close to the experimental error, which is very nice. François Bergeret Okay, as a conclusion, one important feature of our approach is the dynamic subsampling of the global database: we extract the right information around the composition of interest. In addition, using JSL and JMP Pro, we have automated the machine learning models; generalized regression and neural networks perform very well. According to the CEA experts, accuracy is good, and the approach reveals some unexpected issues. We now plan to extend the models to a bigger database and also to work with Bradley Jones and maybe write a joint publication. Thank you.
Beatrice Blum, Senior Statistician, Procter & Gamble Service GmbH Phil Bowtell, Principal Statistician, Data and Modeling Sciences   With sensors now being economically available, P&G massively expands its use of sensors to develop new and better test methods. Sensors deliver discrete measures over a continuum like time or location often resulting in smooth curves. However, the metrics that we extract from these sensor data are blunt summary statistics like averages, sums and integrals. Those are believed to represent different consumer-relevant product features, but we struggle to establish robust mathematical links. Using historic approaches, a lot of information about the product performance that we measure along the way are not leveraged. We propose to apply Functional Data Analysis (FDA), a mathematical approach to spline fit any type of curves, to extract discriminating curve characteristics representing product features. Using case studies from Baby Care, we show how to turn sensor data into meaningful information. In addition, we compare FDA with PLS in SIMCA to understand when to use each method. We envision that matching these fits with consumer data will enable creation of a product portfolio landscape, empowering us to understand what optimal product performance, the so-called Golden Curve, looks like. Eventually, our goal is to design diapers, pads, razors and more against identified consumer-relevant Golden Curves by optimizing product composition.     Auto-generated transcript...   Speaker Transcript Beatrice Blum Hello, and thanks for joining Phil's and my Discovery presentation today with a glimpse of my fabulous 2021 lockdown hair style.   We will be talking about how we approached some sensor data, why the use of functional data analysis (FDA) and partial least squares   (PLS) in our pursuit to catch the golden curve. My name is Beatrice Blum and I'm a statistician in the data and modeling sciences department of Procter and Gamble, supporting baby and fem care R&D in Germany. My co author is Phil Bowtell from the UK. Phil, do you want to introduce yourself? Phil Bowtell Thank you very much, Bea. Hello, my name is Phil. I'm based in the UK and like Bea, I'm a statistician as part of the data and modeling sciences group. And I support a variety of technical sciences in Europe, including baby care with Bea. Thank you. Beatrice Blum So what we want to cover today is a quick introduction to the data that we have collected and how we're trying to figure out the meaning of the different curve shapes with respect to our consumer responses, Yield 1 and Yield 2.   We will pay particular attention to comparing two analysis approaches to these data (PLS and FDA) and try to understand when to use which.   Note that we assume some knowledge of PCA, PLS and FDA for this talk, but what you really only need to know is the general concepts and data...how data is organized.   But it's very likely that you will still be able to follow the course of this talk, even if you're not familiar with it.   So you may be aware that Procter and Gamble is developing and manufacturing diapers. To improve these diapers and their product performance in the eye of the consumer,   we try to capture and understand the important features of a diaper. Particular in some of our test methods, we apply fluid to the diaper in different locations and under different   protocols or conditions and measure K data curves, as seen here on the left, and P data curves, as seen on the right.   
We assume that these K data curves are somewhat linked to our consumer response called Yield 1, and that these P data curves are related to our consumer response, Yield 2.   So let's first look into the K data curves and analyze these or try to fit these with the help of Functional Data Explorer in JMP. With that, I'll switch to JMP.   So here is my data table in JMP.   It's a very limited data table in terms of columns. We have one column for the 10 products that we have been investigating, A to K.   For each of those products we have run three replicates in our method. I combined the two columns into an ID column and it's consisting of the sample name and the replicate number.   We collect the data, over time, continuous variable and our raw signal is called K raw.   So let's have a look what the K raw is looking like.   I just picked two products, in this case G and I, because their profiles seem to be quite different. What you can see, we have pieces of where the curve steps up jumps up and then it flattens down in a quite...quite smooth behavior.   The stepping up is no big issue for PLS, which was created in terms of trying to model spectral data, while for an FDA (functional data analysis)   that would expect smooth curves and also smooth derivatives, which are probably not given if the curve is just jumping up.   These jump ups are related to a sauce(?) that we apply to the diaper. Can also see the three replicates, so the method seems to be nicely reproducible.   Quite nice. However, we see a lot of noise in our raw data. It sinks again, oscillating down here and we assume that we will model a lot of noise and overfit   just because we have so much oscillation over here. So what we found it indicated to smoothen the curves prior to fitting, and that's what we see down here. We have smoothened the curves by the use of moving average super sample(?) with a window size of 20.   With that I'll go over and try to fit   functional   components to this. So I put my variables into the corresponding roles. Instead of the raw data, I use the smoothened K. I need my ID variable and my ID function. My X is the time over which we measure our K, and eventually what we want to achieve as linking these data to our Yield 1   continuous variable, and try to understand how our predicted curves are related to this consumer response. So I put this into the supplementary role.   Run it, get the original output from the FDE, and usually I just start by fitting B-splines because that's nice and easy and relatively fast.   Can see that this is only taking a couple of seconds, despite us having quite a couple of thousand lines. So we get a result. It doesn't really look bad...that bad from afar, however let's drill in a little.   As already mentioned, when talking about what functional data analysis was developed for, it is expecting smooth curves.   And the B-splines do actually just stitch together in this particular case cubic pieces of splines   and to get around corners like here, where there is no cubic...certainly no cubic curve but a real change in behavior and a turning point,   it has to go around and try to somewhat capture that behavior, but you can see that it's doing a really poor job. It's also not doing a good job in trying to represent these plateaus that we observed at the top.   So, despite being very fast, simple and, in most cases, a really good approach in this particular   context, it's probably not the best to go for B-splines. 
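The preprocessing just described can be reproduced in a few lines outside of JMP; the sketch below simulates a step-and-decay signal in place of a real K curve, smooths it with a moving average of window 20 as in the talk, and fits a single cubic smoothing spline, which, as noted above, struggles around the sharp step changes.

# Moving-average smoothing followed by a cubic smoothing spline fit.
import numpy as np
import pandas as pd
from scipy.interpolate import UnivariateSpline

t = np.linspace(0, 100, 2000)
rng = np.random.default_rng(3)
signal = np.where(t > 20, 5.0, 0.0) + np.where(t > 60, 4.0, 0.0)   # step up at each fluid application
signal = signal * np.exp(-0.01 * np.maximum(t - 20, 0))            # slow decay after the steps
raw = signal + rng.normal(0, 0.3, t.size)                          # measurement noise

smooth = pd.Series(raw).rolling(window=20, center=True, min_periods=1).mean()

# One cubic smoothing spline over the whole curve; s trades off fit vs. wiggliness,
# and a single smooth spline visibly rounds off the step changes.
spline = UnivariateSpline(t, smooth, k=3, s=len(t) * 0.05)
fitted = spline(t)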
Instead of that,   I read a little bit in Help and did a bit of research and found Okay, we should use P-splines if we have profile data, if we have something like spectral data. That's what JMP recommends, so we went for the P-splines.   And because I really saw that we have step changes here, the only way to attract...attract this is by using the step functions.   And since the P-splines take much longer time to fit, I prefit those and we will just have a look at what the results look like.   So, again we look at the actual and predicted lots here and see, oh they're doing a much better job in getting these turning...turning points and in   achieving the step upwards. They also somewhat captured the plateaus round here, but it's already assumed previously yeah, they also still do   capture quite a little bit of the noise. So this is not really smooth. It would be actually super nice to have maybe a B-spline fit this   degradation type style of downwards hill. Maybe just step P-spline in this area where it's really needed, but at least this is a lot better than our B-spline fits. So let's see what's happening.   So it did quite a good job in putting our different products on it, two dimensional score plot. We can clearly see how the replicates of different products group to each other.   There are the A's; we have the H's and so on. Some of them are not so good, so the I's are a little bit further distributed and overlapping with others.   However, we can see that the good reproducibility that we saw in the raw data seems to be playing out well here. We decided to go for four FPCs,   as seen here. And we can see that they're quite nicely predicting our curves.   But, eventually, our goal is obviously to see how our consumer response relate to different curves. So what JMP is doing here in the background, in the generalized regression, is fitting each of those four FPCs by the use of Yield 1.   And with the results from that, we can now see how changes in Yield 1 changes the shape of the curves; could clearly see an upward strand.   So   seems relatively easy to capture what's going on, so this is a really bad product, so it's certainly much up, lot of plateaued up here,   and this seems to be a lot more down. So we have found something where we think that may be close to a golden curve for our Yield 1.   However, when we are looking at the data that we really collected from the consumers, and now not on a continuous K, but we just put them in order on a categorical scale.   We have to see that these four products that came out almost identical from 0.31...   0.031 to 0.034. If we look at the curve shapes and how they change on the left, we will see they're quite different.   So it's not entirely in line; you could even say it's not at all in line with what we've seen in fitting the continuous Yield 1. So through the very one down here, and this would...they look so similar despite having quite a big difference already in consumer response.   And again, this one is also not so different with respect to what we've seen from the continuous one.   With that I return to my slide deck   So back to my slide deck. Here we can see how we fit the data that we extracted from the FDA fits to our Yield 1.   And you can actually see that this is a very, very good model. It's so good that we always had to doubt that this will hold true on the new data. We did the fit by use of auto validation and model averaging as promoted by Phil Ramsey and Tiffany Rao in the Discoveries America 2020.   
The R square with 97 and then R...Press R square cross validation R square of 90, it's just too good for us to believe it's true.   With that let's look at what Phil found when looking at the same data with PLS. Phil Bowtell So, as you say, we have this R squared of 97% with the press R squared of 90. All looks very nice. Let's just see how partial least squares compares with this.   So I've been looking at principal components analysis, partial least squares, which is a tool that we use when we have spectra or curves. It's commonly used because all our inputs are going to be highly correlated and   traditional regression techniques don't deal with that so well. And the first thing we noted was that, when we looked at the score plot that Bea had in the previous slide   on the demo, it looks almost exactly the same as you get in principal components analysis, so that's where we see some common links.   When I run the partial least squares, what I see is I get an R squared of 73%,   not quite as good as 97. And also, if you look at the observed against the predicted,   we do actually see what looks to be an okay fit, but then obviously Product B is having a bit of an impact and undue influence.   And in JMP and in SIMCA we've got cross validation measure Q squared, which is low at 33%. So this isn't really a good model.   This was done on the raw smooth data that we had. There are other transformations you can try, but really we weren't able to build a good model. It's certainly nothing that competes with the FDA.   However, one thing we do get from the model are coefficient estimates. We also get this quantity called VIP, and these in tandem give us an idea of which particular regions of the curves excite or tell us what's going on with the predictions. So if I just overlay   here the VIPs and the coefficients on the raw data plot, the green highlights areas where this is really having a big impact on the predictions, what's contributing towards the model.   The orange is medium, not so much. And the gray is low, and this is actually telling us that the first peak is really not having much of an impact whatsoever from prediction point of view.   So moving on, I'm looking at another set of data. This is the p data curves, and here we have these curves that have been collected.   Four conditions, maybe call them protocols or conditions, at three locations. We also have a fifth protocol or a fifth condition, but this is only taken at Location 1 and that's not plotted here.   And what we have is Location 1 on the left, Location 3 on the right and Condition 1 on top going down to Condition 4 at the bottom.   And one thing to note that these curves are quite similar. We do see some slight deviations.   But one question that was asked is, well, do we need all of these curves? Are they all needed? Or maybe we take a subset and use those to help us understand the data. So what I've done is taken all the products and sequentially plotted them. So I've got Location 1,   Condition 1, all the way up to Location 3, Condition 4 and plotted the data.   And we can see straight away that there are some common trends; we can also see some differences. So we all we always...we see that there are three products here that seemed to lie away from the others.   So we've got some product differentiation. If I look at the different conditions, I can see that, obviously, these products here are certainly changing as we move our change our conditions.   As I look at location, it doesn't seem to be a huge impact. 
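For those who want to try the same kind of analysis outside of SIMCA or JMP, here is a compact sketch of PLS on curve data with VIP scores computed from the fitted model. The X matrix and response are simulated placeholders standing in for the smoothed curves and Yield 1, and the VIP formula used is the standard textbook one, not necessarily the exact computation in SIMCA.

# PLS regression on curves plus Variable Importance in Projection (VIP) scores.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls):
    """Standard VIP computation from a fitted PLSRegression model."""
    t, w, q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
    p, a = w.shape
    ss = np.array([(t[:, i] ** 2).sum() * (q[0, i] ** 2) for i in range(a)])
    w_norm = w / np.linalg.norm(w, axis=0)
    return np.sqrt(p * (w_norm ** 2 @ ss) / ss.sum())

# X: products-by-timepoints matrix of curves, y: consumer response (simulated placeholders)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 400))
y = X[:, 100:150].mean(axis=1) + rng.normal(scale=0.1, size=30)

pls = PLSRegression(n_components=2).fit(X, y)
vip = vip_scores(pls)
important = np.where(vip > 1.0)[0]     # common rule of thumb: VIP > 1 is "contributing"
print("time points driving the prediction:", important[:10])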
Let's see if we can look into this in a little bit more detail.   So with any multivariate data, normally, the first thing you would do is just literally throw it into principal components analysis and see if anything comes out from that. So it's an exploratory data analysis tool.   And if I look at the score plots, I've taken all the data you've just seen, put it into the package, it's come up principal components, color coded by product.   And we can see straight away that products D, G and H seem to lie away from the rest of the products. We've got three products here, seven products over here.   And when we talked to the people that develop these products and make these products, it makes perfect sense. So it's good that the data are actually highlighting something we would expect to see.   I then highlight by the different locations, and I'm not really seeing a pattern here. I think you'd have to be quite adventurous to say there's something going on there.   However, when I color by the different conditions, I do see some pattern emerging. And if I look at the three products that we have here, D, G and H,   I can see, as you go from right to left, we're seeing a shift from Condition 1 to Condition 4, and likewise for the seven products here. Condition 5 is sat in the middle, and again, that's something we would expect because it's actually different measuring device.   So, from an exploratory point of view, we can see these differences. Let's see if we could look this from a more statistical points of view. And for the example I'm going to look at, I'm just going to focus on Location 3 and looking at the four conditions within Location 3.   To do this, I'm going to be using multiblock orthogonal component analysis, which is a bit of a mouthful, so it's just reduced to MOCA.   And, I'm also going to be looking at hierarchical modeling but I'm not going to be doing that...discussing that too much in the context of this talk, I'm going to be focused...focusing on the MOCA and these are two techniques that we find in the stats package SIMCA.   Now the idea here is that we look at blocks of data, and traditionally each block represents a different way of measuring some kind of chemical or some kind of product. So as an example we've got near infrared, infrared and raman spectroscopy.   And the two things that we aim to do with MOCA and with the hierarchical modeling is first of all, assess redundancy.   It might be that I just need near infrared and raman spectroscopy for the prediction and, in fact, if I know near infrared and I know the raman, I can actually predict the infrared or it's redundant.   So that's what going to be looking at, but one thing to note is when I talk about redundancy it doesn't automatically mean I can throw that particular block out.   Because on its own right, it may add to the prediction. So it's a balancing act between redundancy and predictability. As we've seen already, if I look at these charts here, it looks like there may be some redundancy.   So what's actually going on? Well let's think about this in terms of overlap. I've got Location 3 and I've got my four conditions. And to express overlap I've got a wonderful venn diagram and we can really start trying to understand what kind of information we've got.   The first point of information is globally joint information, which is where we have information that's common to all four conditions.   
Then I can have information that's common to only two or three of the conditions, the locally joint information highlighted in orange here. And finally, whatever's left over is the unique information; this is what the different conditions bring to the party in their own right. So what might we expect to see here? Well, the left-hand side would indicate a situation where the four conditions have quite large amounts of independent information, and we would probably think, no, there's not going to be any redundancy here; we have to keep all four. The image on the right, where we have a large amount of globally joint and locally joint information and not so much unique information, may indicate a situation where we do have redundancy, and it might be that we get away with looking at one or two of the conditions rather than all four. In terms of what SIMCA does, it does some modeling: we've got our four conditions and it has fitted four components, two of which are joint components and two of which are unique components. Overall, we explain a lot of the variability in the data; we can see from the numbers at the top that we're explaining nearly 100% of the variability, which is a good thing. The green bars show the globally joint information, and we can see these are really quite large; we've got a high number. That's telling us there's a large amount of overlap between these four different conditions. If I look at the locally joint information in orange, there is some, just between Conditions 2, 3 and 4, but it's not huge. And finally, we can look at the unique contribution that each of the conditions makes, and that's quite small. So here we are pretty much certain there's going to be some redundancy. We can also investigate where the uniqueness comes from in terms of products; the size of the bubble here indicates whether it's unique or not. If we have no uniqueness, we get very small bubbles, so, for example, Products I and F are very small. If we looked at these individually, we would expect to see mostly green bars here. If we look at Product H, we see a little bit more independence; it's telling us there's a little bit of independent information across the conditions. It's not a big bubble though, and in reality it looks like there's a lot of redundancy. So what was then done is we took all 13 possible condition and location combinations, ran the MOCA analysis and also the hierarchical model, and it's telling us that, ideally, we need the first condition at Locations 1 and 3, and the fifth condition at Location 1, which goes somewhat against what we've been saying. We were saying earlier on that the conditions differ a bit and the locations don't differ at all. Well, not quite; there's obviously something going on in terms of predictions. But we have been able to go from 13 different combinations down to four, with a fairly good model, an R squared of 85.5 and a Q squared of 69.3. The cross-validation measure isn't bad; it would be nice if these two were closer, but at least we are modeling our data here, in this case Yield 2, better than we have been able to in the past using this P data. So with that, I shall pass back to Bea. Beatrice Blum Thank you, Phil, for these super interesting insights. Now let's look at what FDA does with the P data and what we can get from that.
Similarly to what Phil showed us, we can see that location as a factor is not having as much of an impact as the four different conditions; the conditions vary the results much more. It is also quite interesting that when we look at the two best performing products over here, with a 74 and a 72, their corresponding prediction curves look quite different, perhaps even more different than what we saw in the K curves. So if we are going for the golden curve and trying to figure out what the best performing profiles really look like, this gives us a hard time, because it's quite difficult to find commonalities between these two curves. The question is really, are there several golden curves, or is an average of the curves that are already performing well what we have to go after? We can't answer that yet. We also can't answer from this whether there is any redundancy, or which location-condition combinations we really need to measure. However, when I again extract the summaries from the FDA and try to build a model to predict Yield 2, I again get a super good model, with an R square of 98 and a press R square of 97. Yes, we already know that this is too good to be true. On the other side, we did try modeling the same Yield 2 with other extracted values from our curves, as they were provided to us by the measurement department. Trying to model those extracted values to predict Yield 2 did not work out at all; the R squares we could achieve were not even close to the ones we get here. So we do think something is going on, and we have made quite a big step forward in understanding how we can model Yield 2. So let's wrap up what we found. We've seen that both methods result in very similar principal components and that they agree in terms of what commonalities they extract from the K and the P curves. FDA is probably a little easier to use, with little data prep. It allows us to predict curve shapes from Yield 1 and Yield 2, and that gives us some idea of what our golden curves may look like. It also indicates which practical measurement factors seem more important to keep. We are able to get really good models for our yields, but we profoundly question these models. At the moment we couldn't extract which location-condition combinations are most relevant to keep; that's something we would like to follow up on with JMP. PLS, on the other side, does give us very useful information about which sections of the curves are most informative in terms of discriminating our products. MOCA and hierarchical PLS additionally point us to the measurement protocols that we need to keep for capturing the most relevant information from our P curves. The PLS models to predict yield appear somewhat more reasonable in terms of goodness-of-fit metrics than the FDA models did. Our combined efforts helped us find clear patterns to differentiate our products. Both PLS and FDA enable us to extract essential features from the traces and to calculate good prediction models for our yields. We also learned which aspects of the curves and which protocols are most meaningful. We managed to model and predict our yields much better than we could in the past, and that's a huge step forward. This is a very good example where too many cooks did not spoil the broth.
Both methods agree up to a certain point, but where they differ, each also provides different additional information. The work is ongoing; we will expand from here. In particular, we will add new independent data to validate or improve the models. We still need to fine-tune which protocols we need to keep to give us the most relevant and least redundant information; that's something where we hope JMP will help by enabling that in the FDE platform. Our final goal is to understand which material compositions of diapers result in which curve shapes, and how those curve shapes relate to consumer yield. In our pursuit of the golden curve, we have made good progress and are excited to eventually fully capture it. With that, we thank you for your attention and are open to questions now.
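For readers who want to try the same kinds of fits on their own curve data, the two JMP platforms discussed above can also be launched from JSL. The sketch below is a minimal illustration, not the presenters' script: the column names (w1, w2, w3 standing in for the many curve columns, and Yield 2 for the response) are hypothetical stand-ins, and the launch roles shown should be verified against a script captured from your own point-and-click run.

Names Default To Here( 1 );
dt = Current Data Table();  // assumes the curve summary table is active

// Exploratory PCA on the (highly correlated) curve columns
Principal Components( Y( :w1, :w2, :w3 ) );

// PLS of the yield response on the same inputs; the platform reports
// R square, cross-validation statistics, and VIPs for each input
Partial Least Squares( Y( :Name( "Yield 2" ) ), X( :w1, :w2, :w3 ) );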
André Caron Zanezi, Six Sigma Black Belt, WEG Electric Equipment Danilo da Silva Toniato, Quality Engineer, WEG Electric Equipment   Quality assurance and customer needs frequently translate into rigorous reliability requirements. Improving product reliability challenges engineers in multiple ways, including understanding cause and effect relationships and developing tests that reproduce customer conditions and generate reliable data without exceeding the product launch deadline. Combining engineering expertise, historical data and lab resources, a design of experiments (DOE) was performed to quantify the product lifetime based on process, product and critical application variables. Performing several analyses using JMP tools, from the DOE platform to the Reliability and Survival modules, the team was able to describe the product lifetime as a function of its critical factors. As a result, an accelerated life test was established that is able to simulate years of product usage in just a few weeks, providing solid evidence of some specific failure modes. With its methods and procedures standardized, the test became a crucial requirement to verify and validate new technologies implemented at WEG motors, optimizing the development process and reducing time to market.   This poster provides information about how we used JMP to analyze data and develop an accelerated life test. The project followed a step-by-step approach. Project charter: understanding the primary and secondary objectives, a multidisciplinary team was formed to share information and knowledge about customer historical data, lab resources, motor reliability, cause and effect relationships, environmental application conditions and reliability data analysis. Historical data analysis: knowing and quantifying the risks of analyzing historical data, the team fitted life distributions to understand the Cycles to Failure (CTF) scale and shape parameters. Mainly, the shape parameter indicates the failure mode to be reproduced in the lab tests, according to the bathtub curve. DOE planning and analysis: in order to reproduce failures, the understanding of motor reliability was supported by cause and effect relationships provided by a Fault Tree Analysis (FTA). The FTA was the source of the critical variables combined into a designed experiment (DOE) to quantify how to accelerate product cycles to failure. Conclusions: the DOE provided a surface profiler indicating the best condition to accelerate product lifetime. The accelerated test also provided a shape parameter that, when compared with historical data, shows an overlap, meaning that the same historical field failures were reproduced under controlled conditions. Implementation: with an accelerated test, the development and innovation process becomes faster while providing important information about product reliability.     Auto-generated transcript...   Speaker Transcript André Zanezi So, hello, everyone. My name is Andre Zanezi, and I am a Six Sigma Black Belt at WEG. I'm here today at Discovery Summit to talk about the development of an accelerated lifetime test to demonstrate and quantify the reliability of washing machine motors. Every company, when developing new technologies and new solutions, faces challenges in improving the reliability of their products. We faced the same challenges.
The project was to analyze and quantify our historical reliability data and to develop a procedure, an accelerated life test in our internal labs, to reproduce our field failure modes. Basically, to do that, we developed our models step by step. As a first step, we took some historical data from our motors and, using the Reliability and Survival modules in JMP, we fit some life distributions for them. By fitting life distributions such as the Weibull distribution, we can understand our motors' reliability and lifetime. In JMP we can also fit different distributions for different failure modes, and we did this for the four main failure modes. We compared and analyzed them and came to understand our motors' lifetime and reliability. Doing that, we were able to quantify the scale and shape parameters, which basically tell us how many cycles were necessary to produce a failure and, according to the bathtub curve, which kind of failure mode we are facing. We also cross-checked with our internal validation KPIs, basically plotting survival curves and comparing them with the internal KPIs to verify that the probabilities and failure ranges were consistent and that our data was reliable. Understanding all these failure modes, we could then develop an internal accelerated test. To do that, we had to understand the physics and the environmental conditions our motors work in. We did this through a fault tree analysis, basically deploying and understanding the cause and effect relationships. From that, we could identify the most critical variables in these cause and effect relationships and again use JMP to design an experiment to quantify the effect of those variables on our response, cycles to failure. Basically, we were trying to reproduce field failures in our labs. We ran several tests and, as a result of our experiments, we could fit models to our data using Fit Model and understand the relationship between the environmental and motor variables and cycles to failure. Through the survival plots and the surface plot we could understand the relationship of those variables with cycles to failure and set a specific operating point to accelerate our motors' lifetime. Then, running batches of samples at that condition, we could fit lifetime distributions to the results of our internal accelerated life test. We were seeing failures, but at the end of these accelerated tests we had to ensure that we were causing the same failures as we had in the historical data. So we came back to the life distributions in the Reliability and Survival module and again fit Weibull distributions, now for the results of our accelerated lifetime tests.
We noted that the shape parameter, which according to the bathtub curve indicates the failure mode, could be crossed with our historical data, and crossing both pieces of information we see an overlap between the shape parameter of the internal test and the shape parameter of the historical data. It basically means that we are now reproducing the same failure modes in our accelerated life test. And that means we can develop products in a faster way, because every time we have a new technology or a new design, we can put it through this accelerated life test and quantify whether we are improving our motors' reliability. We can do this faster than before and develop products more quickly. We also did some technical cross-checks to prove that we are reproducing the same failures, in order to implement this test into the development process. So that was how we used JMP to provide a lot of information and build it into our internal test. It was made possible by really good teams. Please feel free to make contact and send an email if you have any questions. And that's the end.
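As a rough illustration of the kind of fit described in this talk, the Life Distribution platform can also be launched from JSL. The sketch below is an outline rather than WEG's actual script: the column names (Cycles to Failure and a 0/1 Censor indicator) are hypothetical, and only the basic launch roles are shown.

Names Default To Here( 1 );
dt = Current Data Table();  // assumes the cycles-to-failure table is active

// Fit life distributions (Weibull among them) to the failure data.
// The Weibull shape parameter is what maps onto the bathtub curve:
// shape < 1 suggests early-life failures, shape near 1 a roughly constant
// failure rate, and shape > 1 wear-out. Comparing the shape from the
// accelerated test with the shape from field data is the overlap check
// described above.
Life Distribution(
	Y( :Name( "Cycles to Failure" ) ),
	Censor( :Censor )
);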
Victor GUILLER, R&D Engineer, FUCHS Lubrifiant FRANCE   Functional data creates challenges: it generates a lot of measurements, sometimes with redundant information and/or high autocorrelation, sampling frequency may not be regular, and it can be difficult to analyze the information or pattern behind the data. One very common practice is to summarize the information through some points of interest in the curves; maximum/minimum value, mean or other points are commonly chosen. The study’s objective is to realize a mixture design for formulations containing up to three performance additives and analyze the results obtained from tribological equipment (friction coefficient vs. temperature). The first approach considered is to summarize the information through some values of interest: maximum friction coefficient and temperature at the maximum friction coefficient. This simple method enables us to find an optimal area for the formulation. When using the Functional Data Explorer, tribological curves are modelled through a splines mathematical model. The connection between the mixture and the FDOE Profilers enables us to explore the experimental space and predict the tribological response of any formulation. This new approach enables a holistic view of the relevant systems behavior, allowing for increased understanding of more complex interactions typically neglected by conventional evaluation. ERRATUM: At 05:30, the optimal mixture is composed of 70% of additive B and 30% of additive C (not additive A).                    At 17:40, we evaluate the influence of additive B (not additive A).     Auto-generated transcript...   Speaker Transcript Victor GUILLER Hello everyone, so the topic of today is to use Functional Data Explorer in the mixture design in order to predict tribological performance.   So the initial situation I will present you. So first the objective, we want to study and optimize the mixture of three performance additives   with different chemistries in order to find the mixture with the highest performance, so that means the lowest and most stable friction coefficient versus the temperature,   or the highest scuffing temperature. So the total amount of additive is fixed at 5% in oil in order to be able to spot differences between formulations   and to avoid too big differences in viscosity between the additives in formulation. So here we will talk about tribology,   which is the science and engineering of interacting surfaces in relative motions. It includes the study and application of the principles of friction, lubrication and wear.   So as a mixture design, we will do a simplex centroid design. So as you can see here with so with additive A, B and C,   we will use the dots here, the circle dots, for the...for doing the design of experiments is simplex centroid. And we will use the triangle points here   as validation point for the model and, if necessary, these points can be used for augmenting the design and so doing an augmented simplex centroid design.   In order to evaluate performances, we will use this tribometer here. It's a tribometer TE 77.   We will measure the friction coefficient versus the temperature, so it starts from from 40 degrees to 200 degrees   and with the contact points. So it's a ball on plate configuration under a certain normal load in an oscillating movement, as you can see on this figure. So the ball is isolated mechanically against the fixed lower plate and drive mechanisms run inside an oil bath.   
We have some challenges with this test, because for each test there are 4,140 data points recorded by the software. That creates particular challenges: we generate a lot of measurements, so we may have redundant information or high autocorrelation; we may also have irregular sampling times or sampling frequencies; and it can be quite difficult to analyze the information or the pattern behind all of these points. As an example, here is what we can describe as a sea of data, with all the experiments we made for this design of experiments and all the friction coefficient curves versus temperature. The big question is, how can we use these data to predict something? Should we try to use only specific points from these curves, or should we try to extract the relevant features from all the data, from all the curves? The first approach is the traditional one, where we only use some specific points. If we look at the friction coefficient curves versus temperature for the three additives alone, we can spot some interesting points. For example, if we look at Additive C, the friction curve in blue, we can spot a peak, a scuffing peak, so we can record the temperature at the scuffing peak and the maximum friction coefficient recorded at that peak. We can do the same for the other additives; for example, for Additive B we have a higher scuffing temperature and also a lower friction coefficient. We can also see that for some experiments, for example when we look at Additive A, we need to record several values: here we have a first scuffing peak with a quite low friction coefficient, and then a second scuffing peak with a higher friction coefficient. So we can use this information in the DOE, using the specific points in the model. We create a model with these specific points, so there are four responses, and we are able to identify an area of interest in this mixture design. Looking at the different responses, we have quite good models for all of them. And when we look at the mixture profiler at the end, we can see an area of interest at approximately 70% of Additive B and 30% of Additive A. In this area, we have the lowest friction coefficient and the highest scuffing temperature obtained. What about the model accuracy and predictive performance? If we look at the differences between the experimental results and the predictions from the model and the augmented model, we can see that for all of the experiments there is very little difference between the experimental results and the predictions. So in this case there is no need to include the validation points directly in the model, because the simplex centroid design is able to provide an accurate enough prediction for the temperature at the first scuffing peak. As a conclusion, this first approach enables us to identify an optimal formulation area for performance, where we reach the highest scuffing temperature with the lowest friction coefficient, and the formulation in the center of this area is composed of 70% of Additive B and 30% of Additive C.
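For reference, a simplex centroid design in three components consists of the three pure blends, the three 50/50 binary blends, and the overall centroid; the interior check blends serve as validation points. Below is a minimal JSL sketch of such a design table. It is only an illustration of the blend proportions (of the 5% additive package), not the script used to build the actual design.

Names Default To Here( 1 );
// Seven simplex centroid runs: vertices, binary midpoints, centroid (0.333 is approximately 1/3)
New Table( "Simplex Centroid Sketch",
	Add Rows( 7 ),
	New Column( "Additive A", Numeric, "Continuous",
		Set Values( [1, 0, 0, 0.5, 0.5, 0, 0.333] ) ),
	New Column( "Additive B", Numeric, "Continuous",
		Set Values( [0, 1, 0, 0.5, 0, 0.5, 0.333] ) ),
	New Column( "Additive C", Numeric, "Continuous",
		Set Values( [0, 0, 1, 0, 0.5, 0.5, 0.333] ) )
);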
When we do the same analysis approach with the Functional Data Explorer, we have a data table that looks like this, so we have here the ID corresponding to the   DOE experiment number.   Here, the proportion of Additive A, B and C and we record the temperature and we have the friction coefficient as a response here.   But first, what is Functional Data Explorer? It's a platform designed for data are that are functions, signal or series, and it can be used as an exploratory data analysis, so this is our case here,   or as a dimension-reduction technique.   In this case, we will...we will use the Functional Data Explorer in order to create a model and be able to predict friction curves depending on the temperature.   So, in order to do this we go on this data table on analyze, specialized modeling, Functional Data Explorer. We set the ID here as the ID function.   Then we have the temperature as our input and the friction coefficient as our output.   We have also information about the ratio of the additive, so we will save them in the supplementary variables, and we use the validation points from the mixture design, also as a validation here in our model.   And we click on OK, but what we can see first is that we have a lot of variability here at the beginning of the tests.   So first   what we have to do is to remove all the point here with very low temperature, because you won't have scuffing peak,   depending on the additives, but you have a lot of noise because it's initialization of the test. So in order to reach the first good temperature value, so around 40 degrees, you have to augment the temperature from room temperature to 40 degrees, and depending on the slope of this   temperature variation, it can introduce some noise. So first, on   the Functional Data Explorer, we will filter here on the temperature and we will only keep values that are above 45 degrees and then you can see, we will remove the noise at the beginning of the curves and we can set a model that will only explained variability depending on the additives.   So, then, we create our P-Spline model. So first we have to select how much knots we will have in the model. So in this case when we use 158 knots, we have the best   compromise in the modeling.   Looking at the diagnostic plots, we can see that we have some variation and higher residuals when we have higher friction coefficient, which is quite normal because it's unstable situation.   And the interesting part here is that...so JMP is able to provide me the model for all of the curves. For each curves it's composed of mean curve here and some function here   with coefficient linked to each of the function in order to express as accurate as possible all of the different friction curves.   And we also have the nice part that because we have introduced the variables, so Additive A, B and C,   we are able to change here the ratio of the additives in order to look.   So I will do it also here.   Just   opening the P-Spline already.   Done.   With 158 knots. Just to show you the interactivity between the rate...the additive ratio and the curve prediction from the model.   So we have done our filtering before, and so at the end here, I can move...I can have here, for example, more Additive A or decrease the Additive B and see how it impacts   my friction curve. The only problem of this profiler is that you don't have the constraint of the mixture design, so that means you can have more than 100% of the additive. 
So, for example here, I can have 100% of Additive A but also some Additive B and C. But still, it helps us understand the interaction between the additives and the friction coefficient outcomes. If we are interested in the model accuracy and predictive performance, we can have a look, for example, at this curve. You have the experimental data points in blue and the smooth curve in red from the P-spline model, and if we also look at our validation point, you can see that we have a pretty good match between the experimental results and the prediction from the P-spline model. Next, we want to be able to screen through this experimental space from the mixture profiler, but with a direct link showing us what the friction coefficient curves look like depending on where we are in the experimental space. So first we go into the DOE data table and create an ID column that will match the ID column in the Functional Data Explorer data table. Here each experiment from the DOE has the same number as the ID column in the Functional Data Explorer table. Then, on the DOE table, we right click on ID and select Link ID, and on the Functional Data Explorer table we do the same but this time choose Link Reference, referencing it to the previous data table, the one from the DOE. When we have done this, we can open the Fit Group script from the DOE table, the one with the mixture profiler at the end, and we also open the Functional Data Explorer analysis from the Functional Data Explorer table, the one I've already opened; it just takes a few seconds to open. I will put the mixture profiler on the left, and on the right I will put the Functional Data Explorer analysis. The last step is to link these two profilers, so I go to the red triangle, Factor Settings, and Link Profilers, and just make sure the setting is the same in both. OK, so now the two profilers are linked, which means that if I move my point here, and I will show the curves on the bigger screen, I get the prediction of the friction coefficient curves directly on the right. It's a very interactive tool for visually exploring the area, and in our optimal area you can see that the friction coefficient is quite stable and at the same value. So we can now use the mixture profiler to screen the experimental space and see what the predicted friction coefficient curves look like for any point in the experimental space. As a conclusion, it is possible to determine and predict all the friction coefficient curves in the experimental space, and we have a better understanding of the influence of the additives. As an example, starting from the middle here, if I move there, that means I'm increasing the level of Additive A. You can see on the right that we have a steeper slope and a higher scuffing peak as I move closer to Additive A only. If I start from the middle again and increase the level of Additive C, you can see that the slope at the end of the profiler decreases, but we have a very short scuffing peak here at around 80 degrees.
Starting from the middle and going toward Additive A, you can see that we still have a light slope at the end but no scuffing peak here. And when I'm in the area of interest, I have the good aspects of the last two additives: no peak around 80 degrees and a stable friction coefficient in this area. So that is the link between the mixture profiler and the functional design of experiments profiler. The advantages of the Functional Data Explorer approach: first, you analyze all the data points, so even if specific points have similar values, the curves may behave differently. Here, as an example, if you don't look at the friction coefficient values, you might say you have very similar results because you have the same scuffing temperature for the two experiments, but, as you can see, you obtain very different friction coefficients for the two experiments. The second point is that you are able to do an objective analysis, compared with a more subjective, domain-expert approach of selecting the right specific points and rules as responses for the DOE. If we take an example, we may ask ourselves what the correct values for the DOE response are. Should we consider this scuffing peak or this one? Is this scuffing peak too small to be considered? Should we consider this one? And is this scuffing area also interesting for us? It may be quite difficult in some situations to spot only specific values or specific points of interest. And, last but not least, as we have seen, it enables an interactive visualization and predictive modeling of the influence of the additives on the formulation performance. As benefits for FUCHS, this new approach allows an increased understanding of the complex interactions between the additives that may typically be neglected by conventional evaluation, and it gives us the possibility to build more precise predictive models. Thanks a lot for your attention.
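For those who want to script a similar analysis, a very rough JSL outline of the Functional Data Explorer launch described above is shown below. The role keywords are written to mirror the launch dialog (Y Output, X Input, ID Function, Z Supplementary), but they are assumptions rather than a captured script, as are the column names; the reliable way to get the exact syntax is to capture it from a point-and-click run using the enhanced log described in the following talk.

Names Default To Here( 1 );
dt = Current Data Table();  // assumes the stacked curve table (ID, Temperature, Friction Coefficient) is active

// Sketch only: verify these role names against a saved platform script
Functional Data Explorer(
	Y( :Name( "Friction Coefficient" ) ),  // output curves
	X( :Temperature ),                     // curve axis
	ID( :ID ),                             // one function per DOE run
	Z( :Name( "Additive A" ), :Name( "Additive B" ), :Name( "Additive C" ) )  // supplementary mixture ratios
);
// In the platform: filter to Temperature > 45 to drop start-up noise,
// fit a P-spline (158 knots in the talk), then use the profiler and
// Factor Settings > Link Profilers to tie it to the mixture profiler.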
Jordan Hiller, JMP Senior Systems Engineer, SAS Mia Stephens, JMP Principal Product Manager, SAS   For most data analysis tasks, a lot of time is spent up front importing data and preparing it for analysis. Because we often work with datasets that are regularly updated, automating our work using scripted, repeatable workflows can be a real time saver. There are three general sections in an automation script: data import, data curation, and analysis/reporting. While the tasks in the first and third sections are relatively straightforward – point and click to achieve the desired result, and capture the resulting script – data curation can be more challenging for those just starting out with scripting. In this talk we review common data preparation activities, discuss the JSL code necessary to automate the process, and demonstrate how you can use the new JMP 16 action recording and enhanced log to create a data curation script.     Auto-generated transcript...   Speaker Transcript Mia Stephens Welcome to JMP Discovery Summit. I am Mia Stephens, and I am a JMP product manager. I'm joined by Jordan Hiller, who is a systems engineer, and today we're going to talk about automating the data curation workflow. This is the abstract, just for reference; I'm not going to talk about it. We're going to break this talk into two parts. I'm going to kick it off by talking about the analytic workflow and about data curation, what we mean by data curation, and we're going to see how to identify potential data quality issues in your data. Then I'm going to turn it over to Jordan, and Jordan is going to talk about the need for reproducibility. He's going to share a cheat sheet for data curation and show how to curate your data in JMP using the action recorder and the enhanced log. So let's talk about the analytic workflow. It all starts with having some business problem that we're trying to solve. Of course we need to compile data, and you can compile data from a number of different sources and bring the data into JMP. And at the end, we need to be able to share results and communicate our findings with others. Sometimes this is a one-off project, but oftentimes we have an analysis that we're going to repeat. So a core question addressed by this talk is: can you easily reproduce your results? Can others reproduce your results? Or, if you have new or updated data, can you easily repeat your analysis, and particularly the data curation steps, on these new data? That's what we're addressing in this talk. But what exactly is data curation, and why do we need to be concerned about it? Data curation is all about ensuring that our data are useful in driving analytic discoveries. Fundamentally, we need to be able to solve the problems that we're trying to address. It's largely about data organization, data structure, and data cleanup. If you think about issues that we might encounter with data, they tend to fall into four general categories: incorrect formatting, incomplete data, missing data, and dirty or messy data. To help us talk about these issues, we're going to borrow some content from STIPS. If you're not familiar with STIPS, it is our free course, Statistical Thinking for Industrial Problem Solving. It is a course based on seven independent modules, and the second module is exploratory data analysis.
Because of the iterative and interactive nature of exploratory data analysis and data curation, the last lesson in this module is called Data Preparation for Analysis, and we're borrowing heavily from this lesson throughout this talk. Let's break down each one of these issues. Incorrect formatting: what do we mean by incorrect formatting? This is when your data are in the wrong form or the wrong format for analysis. It can apply to the data table as a whole. For example, you might have data stored in separate columns when you actually need the data stored in one column. Or your data might be in separate data tables and you need to concatenate, update, or join the tables together. It can relate to individual variables. For example, you might have the wrong modeling type, or you might have columns of dates that are not formatted as dates, so the analysis won't recognize them as date data. Formatting can also be cosmetic. For example, if you're dealing with a large data table, you might have many columns, column names that are not easily recognizable and that you might want to change, or lots of columns that you might want to group together to make the table more manageable. Your response column might be at the very end of the data table and you might want to move it up. Cosmetic issues won't necessarily get in the way of your analysis, but if you address some of them, you can make your analysis a little bit easier. Incomplete data is when you have a lack of data. This can be a lack of data on important variables; for example, you might not have captured data on variables that are fundamental to solving the problem. It can also be a lack of data on a combination of variables; for example, you might not have enough information to estimate an interaction. Or you might have a target variable that is unbalanced. Say you're studying defects and only 5% of your observations are defects; you might only have a very small subset of your data where the defect is present, and that may not be enough to understand the potential causes of defects. You might also simply not have a big enough sample size to get good estimates. Missing data is when you're missing values for variables, and this can take several different forms. If the missingness is not at random, this can cause a serious problem, because you might get biased estimates. If you're missing data completely at random, this might not be a problem if you're only missing a few observations, but if you're missing a lot of data, it can be problematic. Dirty and messy data is when you have issues with observations or with variables. You might have incorrect values, values that are simply wrong. You might have inconsistency; for example, typographical errors or people entering things differently. The values might be inaccurate; for example, you might have issues with your measurement system. There can be errors and typos, or the data might be obsolete. Obsolete data is when you have data on, for example, a facility or machine that is no longer in service.
The data might be outdated: you might have data going back over a two or three year period, but the process might have changed somewhere in that timeframe, which means those historical data might not be relevant to the current process as it stands today. Your data might be censored or truncated. You can have redundant columns, which are columns that contain essentially the same information, or you might have duplicated observations. So dirty or messy data can take a lot of different forms. How do you identify potential issues? A good starting point is to explore your data and, in fact, identifying issues leads you into data exploration and then analysis. As you start exploring your data, you start to identify things that might cause problems in your analysis. A nice starting point is to scan the data table for obvious issues. We're going to use an example throughout the rest of this talk called Components. This is an example from the STIPS course where a company producing small components has an issue with yield. The data were collected on 369 batches, 15 characteristics have been captured, and we want to use these data to help us understand potential root causes of low yield. If we start looking at the data table itself, there are some clues to the kinds of data quality issues we might have. A really nice starting point (added in JMP 15) is header graphs. I love header graphs. For a continuous variable they show you a histogram, so you can see the centering, shape, and spread of the distribution, and they also show you the range of the values. For categorical data, they show you a bar chart with values for the most populous bars. Let's take a look at some of these. I'll start with batch number. Batch number is showing a histogram, and it's actually uniform in shape. Batch number is probably an identifier, so right off the bat I can see that these data are probably coded incorrectly. Looking at the number scrapped, I can see that the distribution is highly skewed, and I can also see that the lowest value is -6, which makes me question the feasibility of a negative scrap number. Process is another one: I've got basically two bars, a histogram with only two values. As I'm looking at these header graphs, I can also look at the columns panel, and it's pretty easy to see that, for example, batch number, part number, and process are all coded as continuous. When you import data into JMP, if JMP sees numbers, it's automatically going to code those columns as numeric continuous, so these are things we might want to change. We can also look at the data itself. For example, when I look at humidity (and context is really important when you're looking at your data), humidity is something we would think of as continuous data, but I've got a couple of fields with N/A. If you have text in a numeric column, the column is going to be coded as nominal when you pull the data into JMP, so this is something we see right off the bat that we're going to need to fix. I can also look through the other columns. For example, for Supplier, I see that I'm missing some values; when you pull data into JMP, empty cells in categorical data show up as missing values.
I can see that there are some entries where we're not consistent in the way the data were entered. So I'm getting some serious clues about potential problems with my data. If I look at temperature, notice all the dots. Temperature is a continuous variable, and where I see dots, it's indicating missing values. Since temperature is something that's really important for my analysis, this might be problematic. A natural extension of this is to start exploring the data one variable at a time. One of my favorite tools when I'm first starting to look at data is the columns viewer. The columns viewer gives us numeric summaries for the variables we've selected, and if we're missing values, there's going to be an N Missing column. Here I can see that I'm missing 265 of the 369 values for temperature, so this is a serious gap if we think temperature is going to be important in the analysis. I can also see whether I've got some strange values. When I look at the mins and maxes for number scrapped and scrap rate, I've got negative values, and if that isn't feasible, then I've got an issue with the data or the data collection system. It's also pretty easy to see miscoding of variables. For example, facility and batch number, which should probably be coded as nominal, were reporting a mean and a standard deviation. A good way to think about this is that if it's not physically possible to have an average batch number or part number, then these should be changed to nominal variables instead of continuous. Distributions are the next place I go when I'm first getting familiar with my data. For continuous data, distributions allow you to understand the shape, centering, and spread of your data, and you can also see whether you've got unusual values. For categorical data, you can see how many levels you have. Take customer number, for example: if customer number is potentially important, I've got a lot of levels for it, and when I'm preparing the data I might want to combine these into four or five buckets, with an "other" category for those customers where I don't really have a lot of data. For humidity, we see the problem with having N/A in the column: we see a bar chart instead of a histogram. We can easily see what we were looking at in the data for supplier; for example, Cox Inc and Cox, Anderson spelled three different ways, Hersh spelled three different ways. For speed, notice that we've got a mounded distribution that goes from around 60 to 140, but at the very bottom we see a value or two pretty close to zero. It might have been a data entry error, but it's definitely something we'd want to investigate. An extension of this is to start looking at your data two variables at a time, for example using Graph Builder or scatterplots. When you look at variables two at a time, you can see patterns, and you can more easily see unusual patterns that cross more than one variable. For example, if I look at scrap rate and number scrapped, I see that I've got some bands. It might be that you have something in your data table that can explain this pattern; in this case, the banding is attributed to different batch sizes, so this purple band is where I have a batch size of 5,000. I have a lot more opportunity for scrap with a larger batch size than with a smaller batch size.
So that might make some sense, but I also see something that doesn't make sense. These two values down here in the negative range. So it's pretty easy to see these when I'm looking at data in two dimensions. I can add additional dimensionality to my graphs by using column switchers and data filters. This is also leading me into potential analysis, so I might be interested in understanding what are the x's that might be important, that might be related to scrap rate. And at the same time, look at data quality issues or potential issues. So for scrap rate, it looks like there's a positive relationship between pressure and scrap rate. doesn't look like there's too much of a relationship. Scrap rate versus temperature, this is pretty flat, so there's not much going on here. But notice speed. There's a negative relationship, but across the top I see those two values; the one value that I saw on histogram, but there's a second value that seems to stand out. So it could be that this value around 60 is simply an outlier, but it could be a valid point. I would probably question whether this point here down near 0 is valid or not. So we've looked at the data table. We've looked at data one variable at a time. We've looked at the data two variables at a time, and all of this fits right in with the data exploration and leads us into the analysis. There are more advanced tools that we might use (for example, explore outliers, explore missing) that are beyond the scope of this course, or this talk. And when you start analyzing your data, you'll likely identify additional issues. So, for example, if you've got a lot of categories of categorical variables and you try to fit an interaction in a regression model, you know JMP will give you a warning that you can't really do this. So as you start to analyze data ,this this whole process is iterative, and you'll identify potential issues throughout the process. A key is that you want to make note of issues that you encounter as you're looking at your data. And some of these can be corrected as you go along, so you can hide and exclude values, you can reshape you can reclean your data as you go along, but you might decide that you need to collect new data. You might want to conduct a DOE so that you have more confidence in the data itself. If you know that you're going to repeat this analysis or that somebody else will want to repeat this analysis, then you're going to want to make sure that you capture your steps that you're taking so that you have reproducibility. Someone else can reproduce your results, or you can you can repeat your analysis later. So this is where I'm going to turn it over to Jordan, and Jordan's going to Talk about reproducible data curation. Jordan Hiller Okay, thank you, Mia. Hello, I am Jordan Hiller. I am a systems engineer for JMP. Let's drill in a little bit and talk some more about reproducibility for your data curation. Mia introduced this idea very nicely, but let's give a few more details. The idea here is that we want to be able to easily re-perform all of our curation steps that we use to prepare our data for analysis, and there are three main benefits that I see to doing this. The first is efficiency. If you have to...if your data changes and you need to replay these curation steps on new data in the future, it's much more efficient to run it once with a one-click script than it is to go through all of your point-and-click activities over again. Accuracy is the second benefit. 
Point and click can be prone to error, and by making it a script, you ensure accurate reproduction. And lastly is documentation, and this is maybe underappreciated. If you have a script, it documents the steps that you took. It's a trail of breadcrumbs that you can revisit later when, inevitably, you have to revisit this project and remember, what is it that I did to prepare this data? Having that script is a big help. So today we're going to go through a case study. I'm going to show you how to generate one of these reproducible data curation scripts using only point and click. And the enabling technology is something new in JMP 16. It is the enhanced log and the action recording that's found in the enhanced log. So here's what we're going to do, we are going to perform our data curation activities as usual by point and click. As we do this, the script that we need, the computer code (it's called JSL code, JSL for JMP scripting language) it's going to be captured for us automatically in the new enhanced log. And then when we're done with our point-and-click curation, all we need to do is grab that code and save it out. We might want to make a few tweaks, a few modifications, just to make it a little bit stronger, but that part is optional. Okay, so this is a cheat sheet that you can use. This is some of the most common data cleaning activities and how to do them in JMP 16 in a way so as to leave yourself that trail of breadcrumbs, in a way so as to leave the JSL script in the enhanced log. So it covers things like operating on rows, operating on columns, ways to modify the data table, all of our data cleaning operations and and how to do it by point and click. So it's not an exhaustive list of everything that you might need to do for data cleaning, and it's not an exhaustive list of everything that's captured in the enhanced log either, but, but this is the most important stuff here at your fingertips. Alright, so let's go into our case study using that Components file that Mia introduced and make our data curation script in JMP 16 using the enhanced log. Here we are in JMP 16. I will note that this is the last version of the early adopter program for JMP 16, so this is pre release. However I'm sure this is going to be very, very similar to the to the actual release version of JMP 16. So, to get to the log, I'll show it to you here. This looks different if you're used to the log from previous versions of JMP. It's divided into these two panels, okay, a message panel at the top and a code panel at the bottom. We're going to spend some time here. I'll show you what this is like but let's just give you a quick preview, if I were to do some quick activities like importing a file and maybe deleting this column. You can see that those two steps that I did (the data import and deleting the column), they are listed up here in this message panel and the code, the JSL code that we need for reproducible data curation script, is is down here in in this bottom panel. Okay, so that it's really very exciting, the ability to just have this code and grab it whenever you need it just by pointing and clicking is is a tremendous benefit in JMP 16. So in JMP 16, this this new enhanced log view is on by default. If you want to go back to the old version of the log, that simple text log, you can do that here in the JMP preferences section. There's a new section for the log and you can switch back to the old text view of the log, if you prefer. 
The default when you install JMP 16 is the enhanced log and we will talk about some of these other features a little bit later on, further on in our case study. Alright, so I'm going to clear out the log for now from the red triangle. Clear the log and let's start with our case study. Let's import that Components data that Mia was sharing with you. We're going to start from this csv file. So I'm going to perform the simplest kind of import just by dragging it in. Oh, I had a...I had a version of it open already. I'm sorry, let me, let me start by closing the old version and clear the log one more time. Okay, simple import by dragging it into the JMP window. And now we have that file, Components A, with 369 batches, and let's now proceed with our data cleaning activities. I'll turn on the header graphs. And first thing we can see is that the facility column has just one entry, one value in it, FabTech, so there's no variation, nothing interesting here. I'm just going to delete it with a right click, delete the column. And again, that is captured as we go in the enhanced log. Okay, what else? Let's imagine that this scrap rate column at near the end of the table is really important to us and I'd like to see an earlier in the table. I'm going to move it to the fourth position by grabbing it in the columns panel and dragging it to right after customer number. There we go. Mia mentioned that this humidity column is incorrectly represented on import, chiefly due to those N/A alphabet characters that are causing it to come in as a character variable. So let's fix that. We are going to go into the column info with the right click here and change the data type from character to numeric, change the modeling type from nominal to continuous. Click OK. And let's just click over to the log here, and you can see, we have four steps now that have been captured and we'll keep going. Alright, what's next? We have several variables that need to be changed from continuous to nominal. Those are batch number, part number, and process. So with the three of those selected, I will right click and change from continuous to nominal. And those have all been corrected. And again, we can see that those three steps are recorded here in the log. All right, what else? Something else a little bit cosmetic, this column, Pressure. My engineers like to see that column name as PSI, so we'll change it just by selecting that column and typing PSI. Tab out of there to go to somewhere else. That's going to be captured in the log as well. The supplier. Mia showed us that there are some, you know, inconsistent spellings. Probably too many values in here. We need to correct the character values. When you have incorrect, inconsistent character values in a column, think of the recode tool. The recode tool is a really efficient way to address this. So with the right click on supplier, we will go to recode. And let's group these appropriately. I'm going to start with some red triangle options. Let's convert all of the values to title case, let's also trim that white space, so inconsistent spacing is corrected. That's already corrected a couple of problems. Let's correct everything else manually. I'm going to group together the Andersons. I'm going to group together the Coxes. Group the Hershes. Trutna and Worley are already correct with a single categories. And the last correction I'll make is things that are, you know, just listed as blank or missing, I'll give them an explicit missing label here. 
All right, and when we click recode, we've made those fixes into a new column called supplier 2. That just has 1, 2, 3, 4, 5, 6 categories corrected and collapsed. Good. Okay let's do a calculation. We're going to calculate yield here, using batch size and the number scrapped. Right. And yeah, I realize this is a little redundant. We already have scrap rate and yield is just one minus scrap rate, but just for sake of argument, we'll perform the calculation. So I want that yield column to get inserted right after number scrapped, so I'm going to highlight the number scrapped and then I'll go to the columns menu, choose new column. We're going to call this thing yield. And we're going to insert our new column after the selected column, after number scrapped, and let's give it a formula to calculate the yield. We need the number of good units. That's going to be batch size minus number scrapped. So that's the number of good units and we're going to divide that whole thing by the batch size. Number of good units divided by batch size, that's our yield calculation. And click OK. There's our new yield column. We can see that it's a one minus scrap rate. That's...that's good and let's ignore, for now, the fact that we have some yields that are greater than 100%. Okay we're nearly done. Just a few more changes. I've noticed that we have two processes, and they're, for now, just labeled process 1 and process 2. That's not very descriptive, not very helpful. Let's give them more descriptive labels. Process 1, we'll call production process; and process 2, we'll call experimental. So we'll do this with value labels, rather than recoding. I'll go into column info and we will assign value labels to one and two. One in this column is going to represent production. Add that. And two represents experimental. Add that. Click OK. Good. It shows one and two in the header graphs, but production and experimental here in the data table. All right, one final step before we save off our script. Let's say, for sake of argument, that I'm only interested in the data...I want to proceed with analysis only when vacuum is off. Right, so I'm going to subset the data and make a new data table that has only the rows where vacuum is off. I'll do that by right clicking one of these cells that has vacuum off and selecting matching cells. That selects the 313 rows where vacuum is off. And now we'll go to table subset, create a new data table, which we will name vac_off. Click okay. All right, and and that's our new data table with 313 rows only showing data where vacuum is off. So that's the end. We have done all of our data curation and now let's go back and revisit the log and learn a little bit more about what we have. Okay, so all of those steps, and plus a few more that I didn't intend to do, have been captured here in the log. Look over here, we have...every line is one of the steps that we perform. There's also some extraneous stuff, like at one point I cleared out the row selection. I didn't really need to...I don't really need to make that part of my script. Clearing the selected rows, so let's remove that. I'm just going to right click on it and clear that item to remove it. That's good. Okay, so messages up here, JSL code down here. I'd like to call your attention to the origin and the result. This is pretty nifty. Whenever we do a step, whenever we do an action by point and click, you know, there's there's something we do that action on and there's something that results. So that's the origin and the result. 
So, for instance, when we deleted the facility column... well, maybe that's a bad example; let's choose instead changing the column info for humidity. The origin, the thing that we did it on, was the Components A table, and we see that listed here as the data table. When I hover over it, it says Bring Components A to Front, so clicking on that brings us to the Components A table. Very nice. And the result is something that we did to the humidity column: we changed it to data type numeric and modeling type continuous. See that down here. So I can click here, go to the humidity column, and JMP selects the humidity column for us. Notice that the results are shown in green everywhere except this one last result, which is in blue. That's to help us keep track of our activities on different data tables. We did all of these activities on the Components A data table, and in our last activity we performed a subset on the Components A data table, where the result was a new data table called vac_off. And so vac_off is in blue. So we can use those colors to help keep track of things.

All right, the last helpful feature I want to show you here in the log: if you have a really long series of steps and you need to find just one, this filter box lets you find it. Let's say I want to find the subset. There it is. We found the subset data table, and I can get directly to the code that I need. Okay, so this is everything that we need. Our data curation steps were captured. All that we need to do to make a reproducible data curation script is go to the red triangle and save the script to a new script window. The import step, the delete column step, the moving-the-scrap-rate step: all of those steps are here in the script. We have the syntax coloring to help us read the script, and we have all of these helpful comments that tell us exactly what we were doing in each of those steps. So this is everything, this is all that we need, and I'm going to save it. I'll save it to my desktop as... let's call it import and curate components. That is our reproducible data curation script.

So if I were to go back to the JMP home window and close off everything except our new script, here's what we do if I need to replay those data curation steps. I just open the script file and run it by clicking the run script button. It opens the data, does all the cleaning, does the subsetting, and creates that new data table with 313 rows.

Let's imagine now that we need to replay this on new data. I have another version of the Components file. It's called Components B, and it has 50 more rows in it, so instead of 369 rows it has 419 rows. Imagine that we've run the process for another 50 batches and we have more data. So it's called Components B, and I want to run this script on Components B. But you'll notice that throughout the script it's calling Components A multiple times, so we'll just have to search and replace Components A and change it to Components B. Here we go. Edit, search. We will find Components A, replace with Components B, replace all. Fifteen occurrences have been replaced; you can see it here and here. And now we simply rerun the script, and there it is on the new version of the data. You can see it has more rows; in fact, the vac_off table that had 313 rows before has 358 now. All right, so that's a reproducible data curation script that can run against new data. Okay, so here is that cheat sheet once again.
This will be in the materials that we save with the talk, so you can get to it, and it tells you just how to point and click your way through data curation and leave yourself a nice replayable, reproducible data curation script. That script didn't require us to do any coding at all, but I'm going to give you a handful of tips, four tips, that you can use to enhance the scripts a little bit.

The first tip is to insert this line at the beginning of your scripts; it's a good thing to do for all your scripting. Just insert the line Names Default To Here. This is to prevent your script from interacting with other scripts; that's called a namespace collision, and you don't really have to understand what it does, just do it. It's good programming practice.

The second tip is to make sure that there are semicolons in between JSL expressions. The enhanced log is doing this for you automatically; it places that required semicolon in between every step. However, if you do any modification yourself, you're going to want to make sure that those semicolons are placed properly. So just a word to the wise.

The third tip is to add comments. Comments are a way for you to leave notes in the program without messing up the program; they're something the JSL interpreter will not evaluate. There are notes that action recording in the enhanced log has left for you, but you can modify them and add to them if you like. Here are the main points about comments. The typical format is two slashes, and everything that follows the slashes is a comment. You can do that at the beginning of a line or at the end of a line; the interpreter will run this X = 9 JSL expression, but then it will ignore everything after the slashes. If you have a longer comment, you can use the format that starts with /* and ends with */; that encloses a comment. Comments are useful for leaving notes for yourself, but they're also useful for debugging your JSL script. If you want to remove a line of code and make it not run, you can just preface it with those two slashes, and if you want to do that for a larger chunk of code, you can use the /* */ format. So it's good to know how to use comments.

The last tip I'm going to leave you with is to generalize data table references. Do you remember how we had to search and replace to make that script run on a new file name, Components B? We had to change 15 instances in the script. Wouldn't it be nice if we only had to change it once, instead of 15 times? You can make your scripts more robust by generalizing the data table references: instead of using the names, we'll use a JSL variable to hold those table names. Here's what I'm talking about; I'll show you an example. On the left is some code that was generated by action recording in the enhanced log. We're opening the Big Class data table, we're operating on the age column, changing it to a continuous modeling type, and then we are creating a new calculated column. The Big Class table name appears in three places: on the open, where we use it to perform the change on the age column, and over here where we create the new column. To make this more robust and generalized, you need to make three changes. In the first change over here, we are assigning a name. I chose BC; you can choose whatever you want. You'll see DT a lot.
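As an illustration of these tips, here is a small sketch of what the generalized version might look like with the Big Class sample table; the calculated-column formula is just a stand-in, since the talk doesn't spell it out, and the exact column messages may differ from what the enhanced log generates.

```
Names Default To Here( 1 );                    // tip 1: avoid namespace collisions

// tip 4: assign the opened table to a variable so its name appears only once
bc = Open( "$SAMPLE_DATA/Big Class.jmp" );     // change the path here and nowhere else

// use the variable to scope the column instead of Data Table( "Big Class" ):age
bc:age << Set Modeling Type( "Continuous" );   // tip 2: semicolons between expressions

/* tip 3: a block comment can hold a longer note;
   the interpreter ignores everything between these markers */
bc << New Column( "ratio", Numeric, Continuous, // stand-in calculated column
    Formula( :height / :weight )
);
```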
So BC is the name that we're going to refer to the Big Class data table by in the rest of the script. And so when we want to change the age column, BC:age means the age column in the BC data table. Down here, we're sending a message to the Big Class data table, and that's what the double arrow syntax means. We're generalizing that too, so that we use the new name to send that message to the data table. And now, if we need to run this script on a new data table that's named something other than Big Class, here's the only change we need to make: we need to change just one place in the script. We don't have to do search and replace.

Okay, so after those four tips, if you're ready to take your curation script to the next level, here are some next steps. You could add a file picker. It doesn't take much coding to change it so that when somebody runs the script, they can just navigate to the file they want it to run on instead of having to edit the script manually, so that's one nice idea. If you want to distribute the script to other users in your organization, you can wrap it up in a JMP add-in, and that way users can run the script just by choosing it from a menu inside JMP. Really handy. And lastly, if you need to run this curation script on a schedule in order to update a master JMP data table that you keep on the network somewhere, you can use the Task Scheduler in Windows or Automator in macOS to do that.

So, in summary, Mia talked about how to do your data curation by exploring the data and iterating to identify problems. If you automate those steps, you will gain the benefits of reproducibility, and those are efficiency, accuracy, and documenting your work. To do this in JMP 16, you just point and click as usual and your data curation steps are captured by the action recording that occurs in the enhanced log. And lastly, you can export and modify that JSL code from the enhanced log in order to create your reproducible data curation script. That concludes our talk. Thanks very much for your time and attention.
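As a footnote to the file-picker idea in those next steps, here is a minimal sketch of how the top of a curation script could prompt for the file; the window title, filter list, and cancel handling are illustrative assumptions.

```
Names Default To Here( 1 );

// let the user browse to the csv instead of hard-coding the path
path = Pick File(
    "Select the components csv file",   // window title
    "$DESKTOP",                         // starting folder (adjust as needed)
    {"CSV Files|csv", "All Files|*"}    // file filters
);

// stop quietly if the user cancels the dialog
If( path == "", Throw( "No file selected" ) );

dt = Open( path );                      // the rest of the curation steps follow
```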
Aurora Tiffany-Davis, JMP Senior Software Developer, SAS
Josh Markwordt, JMP Senior Software Developer, SAS
Annie Dudley Zangi, JMP Senior Research Statistician Developer, SAS

In this session, we will introduce an exciting feature new in JMP Live 16. You and your colleagues can now get notifications (onscreen and via email) about out-of-control processes. We will demonstrate control chart warnings from the perspective of a JMP Desktop user, a JMP Live content publisher, and a regular JMP Live user. We will point out which aspects of control chart warnings were available before version 16, and which aspects are new. Join us to learn: which JMP control chart warnings platforms are supported; how to control which posts produce notifications; how to control who gets notifications; how to pause notifications while you get a process back under control; and how to review (at a high level) the changes over time for a particular post.

Auto-generated transcript...

Speaker Transcript

Thank you for joining us today to learn about a new feature in JMP Live 16: Control Chart Warnings. You may be thinking: Control Chart Builder has been available in JMP Desktop since version 10, and control chart warnings have been available in JMP Desktop since version 10, so what's actually new here? What's new is that in JMP Live version 16 we now have a way to grab your attention if there's a control chart post that has warnings associated with it; in other words, if there's a process that might have a problem. We do this through the use of onscreen indicators as well as active notifications that can go out to users on screen and by email.

I'd like to now introduce just a few of the people who helped to develop this feature. I am Aurora Tiffany-Davis, a senior software developer on the JMP Live team, and during today's demonstrations I'll be showing you JMP Live from the perspective of a regular user. We also have with us today Josh Markwordt. Josh is a senior software developer on the JMP HTML5 team, and during today's demonstrations he's going to show you the perspective of a JMP Live content publisher. Finally, we have Annie Dudley Zangi. She is a senior research developer on the JMP statistics team, and she's going to be demonstrating the control chart features within JMP Desktop itself. Annie, would you like to get us started?

Thanks, Aurora. Yes, so I'm going to demo this by showing you how it works using a simulated data set based on a wine grape trial that happened in California. What we have here is 31 lots, several cultivars, and yield, brix sugar, and pH. So let's start with Control Chart Builder. First, I'm going to pull in the yield (that's in kilograms), and then I'll pull in the location as a subgroup variable. I don't care so much about the limits; what I am concerned about instead is whether or not we have any particular lots that are going out of control, going above the limits or below the limits. And I care about how each of the cultivars is doing, so I'll pull that into the phase role, which basically subsets all the different cultivars for us, so we can see that we have differences and kind of unique things going on with the different grapes. Next, I'm going to turn on the warnings.
And the easiest way to do that is to scroll down under the control panel and select warnings and then tests. We're going to turn on Test 1, one point beyond the limits, and then Test 5 as well. OK, I see no tests have failed. That's pretty good. And now you might recall we were looking at two other response variables, so I'm going to turn on the Column Switcher so we can look at all three of them; we can just flip through them using the Column Switcher. We started with yield. We'll take a look at sugar. All right, we can see that the Aglianico has very low sugar content, whereas the other four have a higher sugar content. And we can see the different pH levels for each of the five grape varieties. OK, well, we've got these 31 lots in. I think we're ready to publish it. Josh, would you like to show us how to send that up?

Thanks, Annie. So I have the same report up that Annie just showed you, and I'm ready to publish to JMP Live. The first thing I would need to do as a new publisher would be to set up a connection to JMP Live. I go to file, publish, and manage connections. You can see that I have a couple of connections already created, but I'm going to add a new one. First you need to give the connection a name, just to help keep track of multiple connections. The next thing you need is the URL of the server you're trying to connect to, including the port number. Finally, at the bottom of the dialog you can supply an API key, which says it's for scripting access only. You only need this if you are going to be interacting with the server using JSL, which we're going to do later in this demonstration, so I'm going to get my API key from JMP Live.

I'm logged in. I go to my avatar in the upper right-hand corner and select settings to see my user settings. At the top there is some information about my account, including the API key. I click generate new API key and copy it to my clipboard by clicking the copy button; then I can return to JMP, simply paste it in here, and click next. I authenticate to JMP Live, and I'm told that the connection was created successfully and can be saved. It is now present in my list of connections and ready to use for publishing. You only have to do that the very first time you set up the connection; the next time you publish, you can just use it.

So now I can go to file, publish, and publish to JMP Live, and select my connection from the dropdown at the top. Create a new post is selected by default, so I click next. This dialog looks very similar to what it did in 15.2, except now there's an additional checkbox here that says enable warnings. This is present for every warnings-capable report. If I hover over it, it says, "Selecting enable warnings will notify interested parties when this post has Control Chart warnings." I'll get back to who the interested parties are in a moment, but first I wanted to explain what warnings-capable reports are. In JMP 16, only the Control Chart Builder is warnings-capable and able to tell JMP Live about warnings that are present within it. There are plans to expand to other platforms in the future. A Control Chart Builder can be combined with other reports in a dashboard or tabs report, and it can be combined with the Column Switcher, as we're showing in this example.
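For reference, a chart like the one Annie built interactively can also be created with a short script. Here is a rough sketch that assumes column names matching the wine table; the warnings tests and the Column Switcher were added interactively in the demo and are not shown in this JSL.

```
Names Default To Here( 1 );

// the wine trial table is assumed to be the current data table
dt = Current Data Table();

// yield charted by lot location, with a separate phase for each cultivar
ccb = dt << Control Chart Builder(
    Variables(
        Subgroup( :Location ),
        Y( :Yield ),
        Phase( :Cultivar )
    )
);
```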
Some more complex scenarios could cause an otherwise warnings-capable report to not be able to share warnings, and this enable warnings checkbox would be gone. For example, the Column Switcher only works with a single Control Chart Builder; if you try to combine it with multiple control charts in a dashboard, that would no longer be warnings-capable in JMP 16.

So, back to who the interested parties are. I, as the publisher of the report, am an interested party, as well as the members of any group I publish to, if that group is warnings-enabled. So I am going to publish this report to the Wine Trials Group and leave enable warnings checked so that JMP will tell JMP Live about any warnings that are present. The report will come up in JMP Live, and the contents of the report look much like they did in 15 and 15.2. The points have tooltips, you're able to brush and select, and the Column Switcher is active, allowing you to explore the results in multiple columns. Now I'm going to hand it over to Aurora so she can show you some of the new features in JMP Live.

Yeah, thank you, Josh. So Josh just published a post to JMP Live that is warnings-enabled, but that doesn't actually have any warnings going on right now, so I can show you what that looks like from the perspective of a regular JMP Live user. And what it looks like is not a whole lot. There really isn't anything to draw my attention to Josh's post. There isn't any special icon showing up on his post, and I don't have any new notifications. If I open the post itself, open the post details, and scroll down, I will see that there's a new section that did not exist prior to JMP Live version 16, and that's the warnings section. This section is here because, by checking that enable warnings checkbox in JMP Desktop at publish time, the publisher is saying: I think that other JMP Live users are going to care whether or not there are warnings on my post. And so we have a warnings section here, but right now it just tells us a very reassuring message: there are no warnings present. If we scroll down further, we can see the system comment that JMP Live left in the comments stream at publish time, and again, this just tells us a nice reassuring message: this post has zero active control chart warnings. I'll pass it back to Annie now so that she can walk us through the next step of the grape trial.

Thanks, Aurora. So, as I said before, we're getting new data in. We had 31 locations before; now we have 32. The original study added some restricted-irrigation lots so that they could find out how the five different grapes responded to drier conditions. If we take a look at the control chart with these restricted values, we can see that the yield is lower in this new lot that was just added, and in fact, with the Tempranillo grape it is below the lower limit. We can take a look at the sugar to see how that responded, and we can see that the sugar actually went up for our new restricted-irrigation dry spot. The pH wasn't anything abnormal. So I think we need to update this. Josh, do you want to show us how?

Yes, so new in JMP 16 in JMP Live is the ability to update just the data of a report.
This is useful because you don't need to rerun the JSL or recreate the report in JMP and republish; you simply want to update the existing report with new data. This can be done directly from the JMP Live UI by selecting details, scrolling down to the data section, where you can view the data table that is associated with the report, and clicking manage to update it. Click on update data, select update next to the table you want to update, and click submit. You're returned to the report, you will see that it is regenerating, and the updated content shows the warnings that Annie mentioned. Now I'm going to hand it over to Aurora to demonstrate some of the other ways that JMP Live lets you know you have warnings.

Thank you, Josh. OK, so Josh has taken a post that was warnings-enabled, and now he's updated the data on it so that there actually are warnings, so I can show you what that looks like from the perspective of a regular JMP Live user. We can see now that his post looks a bit different than it did before. It has a new red icon on it that draws the eye, and when we hover on that icon it says there are control chart warnings in this post. What that's telling me, in a little bit more detail, is: first, I know that the publisher of this post cares about control chart warnings, because the publisher has chosen to turn on those tests within JMP Desktop. Second, I know the publisher thinks that other JMP Live users might care about control chart warnings on this post, because that publisher has chosen to enable that JMP Live feature. And third, of course, I know that there actually are control chart warnings on the post. I'll see this icon on any post like this, and I'll also see it on a folder if that folder has a post inside of it that fulfills all these same criteria.

If I click on this icon, I am taken to the warnings section of the post details, just like I showed you last time, only now there's more interesting stuff in this section. Now it tells me that there are control chart warnings and which columns those warnings are present on (yield and brix sugar), and it tells me some details about the warnings. But if I want more details, I can scroll down just a bit and click open log. That tells me a lot. It tells me, for every column: how many warnings there are; what that translates to in terms of warning rate; which tests the publisher actually decided to turn on in JMP Desktop; and also specifically which data points failed tests and which tests they failed. I can also copy this to my clipboard.

If I scroll down further to the comments stream, I can see a new system comment. It says the post was regenerated because the post content was updated, and when the post content was updated, there were control chart warnings on the following columns. So you can see here that this comments stream can serve as a kind of high-level history of what's been going on with the post. Right now I'll leave Josh a quick comment saying it looks like reduced irrigation had a big impact.

Now, the icon that I saw on the card would be seen by any JMP Live user, and any JMP Live user, if they opened the post details, would see these system comments and this warnings section. But not just any JMP Live user would get a new notification actively pushed to them; I do have that notification, though.
I can see it up here in my notifications tray, and I also have one sitting in my email inbox right now, and it's very detailed. The email contains all of the information that was present when we clicked open log just a moment ago. Now, why did I get this notification? I got it because I'm a member of the group that the post was published to, and furthermore, the administrator of that group has turned on this JMP Live warnings feature. They've enabled warnings for the group itself, and by doing that, the group admin was telling JMP Live: I think the members of my group are really going to care about control chart warnings, so much so that you should actively push notifications out to them if we get any new control chart warnings on the posts in this group. In other words, my group admin agrees with the publisher; they both want to draw my attention to these potential problems. Now I'll turn it back to Annie so she can walk us through the next part of the grape trial.

Thanks, Aurora. OK, so we last looked at adding the restricted-irrigation lot, and now we have a couple of new lots come in. Nothing special about those. Let's take a look at the graph. What do we see here? Well, we see the restricted irrigation, but nothing special with those. Let's see if anything happened with the sugar. No, we see the two new points at the end after the restricted irrigation, but nothing special there and not a whole lot new. But we do still need to update the graph and update it on the web. So Josh, do you want to show us how we can update it this time?

So I have already demonstrated how you could update the data through the JMP Live UI, but you can also do this through JSL. First, I'm going to declare a couple of variables, including the report ID. The report ID can just be found at the end of the URL after the last slash; it's a series of letters and numbers that identifies the report to replace. There are ways to retrieve the report ID through JSL, which I will show in a moment, but for now we're just going to save that. We're also going to open the updated data set that Annie just showed you so that we can provide it to JMP Live. So if I run these, it opens the data table.

The next thing we need to do is to create a connection to JMP Live. This will use the named connection that I created at the beginning of the demo, Discovery Demo Server, here. I use the New JMP Live command, which will create a JMP Live connection object. I provide an existing connection, and it can prompt if needed, but I've already authenticated. So if I run this, I get a new connection. As I mentioned at the beginning, you can use this connection to search for reports, as well as get a particular report object by ID. I'm going to use the variable that I pasted in to get the report we've been working on. From that result object you can get a scriptable live report that you can examine for a number of pieces of information. Here I grabbed the live report and got the ID, and you can see in the log that the ID I retrieved matches the one I pasted in. I also got the title, and the description is blank because we didn't provide one when we originally published.
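Pieced together from the commands Josh names in this demo, the connection and lookup steps look roughly like the sketch below; the report ID is a placeholder, and the argument and accessor forms are assumptions based on the narration, so check the Scripting Index entry for New JMP Live before relying on them.

```
Names Default To Here( 1 );

// connect using the named connection created earlier in the demo;
// "Connection" as the argument name is an assumption
live = New JMP Live( Connection( "Discovery Demo Server" ) );

// the report ID is the string at the end of the post's URL (placeholder value)
reportId = "xxxxxxxxxx";

// look up the post by ID and get a scriptable live report from the result;
// message and accessor names below follow the demo's wording and are assumptions
result = live << Get Report( ID( reportId ) );
liveReport = result << As Scriptable;
Show( liveReport << Get Title, liveReport << Get URL );
```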
And I also got the URL, the full URL, that I could use either to open the report through the script or for some other purpose, such as creating a larger report that links to it. In preparation for the next step, I'm just going to get the current date and time, which I'm going to use to decorate the title a bit, to prove that we've updated it through JSL. But the key command here is the Update Data command, which lets us update just the data of the report, just like I did through the JMP Live UI. It takes the ID, which here I'm going to retrieve from the live report object, and then takes the data command, to which you provide the new data table that you are uploading, as well as the name of the current data table that you want to replace. That update result object can also be queried to retrieve a number of pieces of information, like whether it was successful, the status you got back, and any error messages, which could be useful in a more automated setting to provide details as to why the publish of the new data failed. So I'm going to run this. And it said that it was successful.

If I bring up my report, I get a popup that says an updated version of this report is available. Now, I can choose to dismiss it and continue looking at the current content I have, but I'm going to say reload, and we see the new data points here without having to refresh the entire page. We go back to JMP. The last thing I want to do is show that other pieces of the report can also be manipulated through JSL. Here I'm simply going to give it a new title. I don't like the one that was provided by default, so I had declared this variable with a new title, and I'm going to append to that the date and time, to help distinguish when this update was done. I'm going to use the Set Title command to send that to the live report and then close my data table to wrap up. Run these and bring up the report; in a moment you'll see the title refresh, both here and in the details. Here it is with the date and time. Now I'm going to hand it back over to Aurora so she can show you more of what happened in JMP Live with this update.

Thank you, Josh. So I can see Josh has updated the title on his post. He has updated a post that was warnings-enabled and had warnings; he's updated it with new data, and the new data, just like the previous data, has control chart warnings. So I can show you what this kind of persistent-warning situation looks like to a regular JMP Live user. I can see here that the icon that draws the eye and says there are control chart warnings in the post is still present in a persistent-warning scenario. If I open the post, and I open the post details, I can see the warnings section; only now it tells me I have warnings on three columns: yield, brix sugar, and pH. If I scroll down to the comments stream, I can see that same notification about the warnings here in the comments stream. And I'd also like to point out that I have a new active notification pushed out to me. I have a new one here, and it's telling me that the new data, just like the old data, does have warnings associated with it. Now I'll turn it back to Annie and she can take us through the next step of the grape trial.

Thanks, Aurora. So last we talked, we were looking at lot #34.
Lots 33 and 34 were added, and now we've got one new lot come in. That's lot #35. Let's see how it looks. Oh my goodness, the yield is way out of control. This is just unbelievable, this is just remarkable. How does the sugar look? Well, the sugar looks about normal, like we would expect. The pH is also about where we would expect. This is something that's clearly going to involve some investigation, but we still need to report it. Josh, would you like to update the web?

So, we've demonstrated that you can update just the data of the report, which is useful when you want to keep the report contents the same and just update the data. But there's also the ability to replace the report, which existed before, and it's still useful if you want to update the contents of the report itself. I realize that in addition to updating the data, I don't really want to have this moving range chart at the bottom; it doesn't really make sense in this context, so I'm going to right click, say remove dispersion chart, and get rid of that. So now the report is ready to be replaced. I go to file, publish, publish to JMP Live, and it looks like it did before, except instead of selecting create a new post, I'm going to decide to replace an existing post and click next. New in JMP 16, we've updated this search window. My report is right at the top of the list, but you also have the ability to search by keyword and restrict the number of reports, if you've published a lot or this was a while ago and you have difficulty finding it. I'm going to pick the report I want to replace and click next. On this screen I get a summary of the existing picture and title. I'm going to update the title, just to draw attention to the fact that I replaced it, and give it a description.

This time I know something might be wrong with the yield, so while the report does have warnings, this time I'm going to decide to uncheck the enable warnings checkbox. Information about the warnings will still be sent to JMP Live and be available at a later time, but I don't want everyone to get notified about the warnings just yet. Click publish. And again, I'm told that my report has been updated and I can reload it. The new information for the title and description appears in the details. I'll hand it back to Aurora so she can show you what else has happened in JMP Live.

Thank you, Josh. So just to summarize again, Josh has taken a post that has control chart warnings in it, but this time when he republished it, he decided not to enable the JMP Live warnings feature. I'm going to show you what that looks like to a regular JMP Live user, because the content publisher has control over whether their control chart warnings are exposed on JMP Live in a way that's going to draw the attention of other users, and Josh decided that that attention really wouldn't be productive right now. So what does it look like to me? It really doesn't look like a whole lot. There is no icon on the card to draw my eye to it, and I don't have a new notification.
If I open the post, open the post details, and scroll down, that warnings section that I've shown you several times before isn't even present, because Josh has said: I don't think other JMP Live users really need to know about the state of the warnings on this post right now. Furthermore, if I scroll down to the comments stream, I can go back all the way to the beginning and see that when the post was published it did not have control chart warnings; then it was updated and it did; it was updated again and it still did. The most recent comment that I see says Josh Markwordt has republished the post, and it doesn't tell me anything one way or the other about control chart warnings. And again, that's because the publisher has control over whether these things are exposed to other JMP Live users. While I'm here, I'll leave a quick comment, because I see in the description that Josh wants us to look at the yield, and it looks very, very off to me, so I'm going to say: could this be a data entry error? Oops, that was my scroll mistake. I'll submit that and then turn it back over to Annie so that they can do some troubleshooting on this process.

Thanks, Aurora. So we went back and we talked with the data entry people, and it turns out they were entering pounds instead of kilograms. As you notice right here, we're in kilograms. So we updated the data, did a little division on it, and now the yield looks more like what we would expect. The sugar and the pH have been unaffected. Josh, would you like to show how to republish?

Yes, so we've shown several ways to update the data. I'm going to go back to the first way, updating it through the JMP Live UI. I'll click on details, scroll down to the data section again, and click manage. Update data. And when I click update, I'm going to select the fixed data that Annie just presented and submit. Go back to the report and see it regenerate. And, like we noted, the yield is back to looking normal. I'm going to leave a comment for Aurora to let her know that we fixed the units, then hand it back to her to show you what has changed in JMP Live. Aurora...

Thank you, Josh. So I can see his post here. I can open it, and right away, looking at the report itself, I can see that things look a lot better on the yield. So I'm curious about what that was. I'm going to scroll down here, and actually I can get there because I notice I have a new notification. What's that about? I click on it and I see that Josh has replied to my comment; that will take me directly to the post as well. And if I scroll down to those comments and look at that reply, I can see: OK, the units were in pounds instead of kilograms, and it's been fixed now. Fantastic. So it looks like the grape trial is back on track and we're making good progress.

I'd like to take a step back now and talk about the different kinds of JMP Live users that there are and how they interact with control chart warnings. We've talked a lot during these demonstrations about the power that Josh had as the content publisher. The content publisher has control over which tests are turned on or not in JMP Desktop, and the publisher also has control over whether or not to enable this JMP Live feature on the post.
But earlier, when I got a notification about control chart warnings, I mentioned that I got it because the post was published to a particular group, so I'd like to show you a little bit more about those groups. If I go to the groups page, I can see the Wine Trials Group that this post has been published to, and I can see that it is warnings-enabled. If I hover over that, it says control chart warning notifications will be sent to members of this group. Let's open that group up. You can see here as well that it's enabled, and because I actually happen to be the administrator of this group, I can change that. If you come over here to the overflow menu, which is these three dots, and click that, I have the option to disable warnings and stop sending these notifications out to my group members.

I can also change it back. If I change it from disabled to enabled, then I get a prompt that says send notifications now. JMP Live is telling me: OK, you've got a group; it's got some posts in it; because you didn't previously care about control chart warnings in this group, there could be posts in this group that already have warnings and none of your members know about it. So now that you do care about control chart warnings in this group, would you like me to go ahead and send out notifications to all of the members of the group about any control chart warnings that already exist on the posts in here? I'll say no for now, because we already know about this particular problem.

But what if I'm not a content publisher and I'm not a group administrator? I'm just a regular JMP Live user, and I'm getting notifications about other people's processes. As with any other kind of notification, I can opt out. I would do that by going up here and clicking on my notification bell icon, then clicking on the settings icon. If I scroll down, I'll see that there is a new type of notification called control chart warnings. I can toggle this on or off to say whether or not I want these notifications at all, and if I do, I can let JMP Live know with what frequency I want to receive emails about them. I think that Josh also has some closing thoughts for us, so I'll turn it over to him. Josh?

Thanks, Aurora. So we demonstrated the new control chart warnings in JMP Live 16 and how they let you notify interested parties about tests that generate warnings in Control Chart Builder. We've shown some new features in the JMP Live UI that draw attention to the warnings and give you details about what occurred, and settings to control the notifications and warnings from the perspective of both the publisher and group admins. We've also shown that there are several ways to update reports and get data into JMP Live 16. You can publish a report from the JMP desktop. You can update just the data, which is a new feature in JMP Live 16, through both the JMP Live UI and JSL. And you can also still republish a report from the JMP desktop to change its contents. I only briefly touched on the JSL capabilities in JMP Live 16, so if you're interested in more details, or in how to take this process and automate it, please see Brian Corcoran's talk on the JMP Community, The Morning Update: Creating an Automated Daily Report to Viewers Using Internet-Based Data.
It takes a control chart warnings example and shows how you might make this a daily process that publishes automatically. Please see our talk on the JMP Community and leave us feedback.

Finally, we wanted to say thank you. We are just a few members of several much larger teams that have worked on this feature. On the JMP desktop side, in statistics, Annie Dudley Zangi and Tonya Mauldin worked on Control Chart Builder. The JMP Live team, led by Eric Hill, contributed to both this feature and many of the other features that we got to show indirectly while giving this demo. The JMP Interactive HTML team, led by John Powell, created the content of the control chart reports in JMP Live. Our UX and design work is done by Stephanie Mencia, and our project manager is Daniel Valente. Thank you. Thank you, everyone. Thank you.
Joshua Lambert, Assistant Professor, University of Cincinnati

During the process of building a regression model, scientists are sometimes tasked with assessing the effects of one or more variables of interest. With an additive regression model, effects (e.g., treatment) are assumed to be equal across all possible subgroups (e.g., sex, race, age). Checking all possible interaction effects is either too time consuming or impossible with a desktop computer. A new JMP add-in implements an algorithm, the Feasible Solutions Algorithm (FSA), which is meant to explore subgroup-specific effects in large data sets by identifying two- and three-way interactions to add to multivariable regression models. This talk gives a short introduction to the FSA, explains how I moved my R package to JSL, and provides a tutorial on how to use the FSA JMP add-in to explore interactions in your own regression models.

Auto-generated transcript...

Speaker Transcript

Joshua W Lambert Hello, everyone. My name is Josh Lambert. I'm an assistant professor at the University of Cincinnati, and I've been a JMP user for about 10 years now. I'm excited to share with you some work I've been doing around exploring interactions in regression models, using an add-in I built around an algorithm I developed called the feasible solution algorithm. So, here are the contents that we're going to talk about today.
I'm going to start off with a little example, which will motivate our discussion as well as what I've been working on; then an overview of the problem and a potential solution. I'll then discuss how I implemented the solution in an R package, how I moved that over to a JMP add-in, and the process I took doing that, and I'll finish with some future endeavors and some things I learned along the way.

So let's motivate our discussion today with a little example. I'm going to call this Tom the Data Scientist. Meet Tom. He's a data scientist, and Tom does a lot of typical data science activities; specifically, Tom builds multivariable regression models in JMP. He mostly deals with tabular data that has many variables, and Tom realizes that these multivariable regression models lack complexity; specifically, they lack interaction terms and quadratic terms. That frustrates Tom, and Tom doesn't want to go right into machine learning. He wishes he had a way of exploring interaction effects in his regression models. He has interactions he'd like to test, but he would have to handcraft those into the fit model platform in JMP, and having to run all of those is going to take a lot of time. For instance, if Tom has 200 variables in his data set and he needs to look for all two-way interactions, that's 19,900 two-way interactions to check (200 choose 2, which is 200 x 199 / 2 = 19,900). That's a few too many. So Tom wonders: is there an algorithmic and data-driven way to explore interaction effects in regression models without needing to handcraft them all by hand and without needing to check all of these possible combinations?

So let's overview this problem in a slightly more mathematically and statistically rigorous way. What we have is a problem of volume and complexity. Volume and complexity are going to continue to grow at a really fast rate: we have more and more data becoming available to us all the time, and the complexity of these data is also growing. The problem with interactions is that there are just too many of them to check. Usually, a data scientist, statistician, or scientist is going to have to walk into the analysis with a known set of interactions to check and then proceed to check them. This is a problem for a number of reasons. You may not have a good idea as to what interactions may or may not exist, and you'd like to be able to explore them in your data without having to go to models such as random forests, or to principal component analysis for some sort of data reduction; you'd really like to be able to do this along the way as you build a regression model. The other problem with random forests and principal components, as well as other machine learning approaches, is that they are often difficult to interpret, and we'd like to keep the interpretability of regression models while we're exploring interactions.

So, typically, in our regression workflow, the statistician spends a lot of time building a parsimonious base model, as I'll call it, with the necessary variables and no interaction effects added. The statistician will spend much time, many resources, and a lot of care on this base model, thinking about whether it is interpretable and whether it makes good contextual sense.
The problem with these base models that don't include complexity is that they assume the effects they're estimating are consistent across all possible subgroups, like sex, race, and age, and that just typically isn't true. This lack of complexity really limits them compared to, say, a machine learning model, which does a really good job of modeling this complexity. So the problem can really be summarized in the following way: is there a way that we can sit neither all the way at traditional multivariable regression models nor all the way at machine learning models, but find a nice sweet spot in the middle, where we keep the nice interpretability of our regression models and add in an interaction or two that we found in this big data, so that we add interesting nuance and complexity to our models that improve predictive performance as well as good contextual sense?

If there is a way of doing that, what we'd really like to do is develop a tool for statisticians, data scientists, and investigators to explore the interactions after they build a base model. The base model happens first, and then the complexity exploration usually happens second. So we have some constraints and preferences if we were to develop a tool like this. We would like it to be based on traditional statistical models, linear and logistic regression, and we'd like the models to remain interpretable. We want to check fewer models if possible, and we'd prefer feasible over optimal. I'll get a little more into what that means later, but in essence it means that we would like there to be a plethora of solutions that are good, what we'll call feasible, rather than just one single one. We'd also like it to be flexible: it could be adapted to work with linear regression, logistic regression, Cox proportional hazards regression, Poisson regression, any type of regression that this framework or tool could be used for. And again, it's going to be a hybrid between traditional statistical methods and machine learning. The results that come out aren't necessarily going to be inferential, but they are going to be exploratory, and they'll motivate and influence what we spend our time on in the future.

So let's now talk about this potential solution, the feasible solution algorithm. The algorithm, sometimes called FSA, or that's what I like to call it, was first discussed in detail in a paper that I wrote in 2018. I'm going to summarize what the algorithm does here. The goal of the algorithm is to identify interactions of order m (so if m were 2, that would be a two-way interaction; if m were 3, a three-way interaction) with a feasible criterion value, that is, a value that is not necessarily the best, but one that might be considered semi-optimal in some way. To do this, we follow these steps. The first thing we're going to do is start off with a random interaction of order m; for our case, let's assume that to be a two-way interaction, so m is 2. We're going to consider all exchanges of one of the variables in the interaction for all the other variables. So, for instance, let's say we pick a random starting place.
We have five variables (this is just a small example to motivate the steps here), and let's say we randomly start at X3, X5, and our criterion, which is R squared for that random starting place, is .5. The way the algorithm works is that it considers exchanging one of the variables, X3 or X5, for any of the others. So our choices are X3 X1, X3 X2, X3 X4, X5 X1, X5 X2, and X5 X4. Notice that all of the possible swaps, as I'll call them, have at least one of the starting-place variables in them, okay? Then, for all of these, we'll fit those models and figure out the criterion for each. What we can see here is that for the X3 X5 model we have .5, and for X3 X1 the criterion is .4. With R squared, we obviously don't want to go to a worse place; we want to go to a better place, a higher R squared, so we would find, out of all the possible choices, the best place to go to. In this example, the best place to go to is swap number three, which is X3 X4. So we would move on to step three and make the best exchange from step two: in that case, we would move to X3, X4, and then we would return to step two and repeat until no improvements can be made. We repeat this process, moving to a place, starting there, and considering all the swaps, until eventually we can't make an improvement. We're going to call the place where we end up a feasible solution.

And we're going to repeat steps one through four to find other feasible solutions. We can do this over and over again, and this process, the feasible solution algorithm, isn't guaranteed to give you the optimal solution, although it can give you the optimal solution some of the time.

So let's talk about some of the byproducts of using this algorithm. These are outlined in a paper that Elliott, a colleague of mine, wrote in 2021, where she describes that feasible solutions are not guaranteed, as I just said, to be optimal for a chosen criterion. That is to say, all optimal solutions are feasible, but not all feasible solutions are optimal. Feasible solutions are a type of semi-optimal solution: they give you a good, feasible criterion value, but not necessarily the optimal one. The criteria of feasible solutions are typically very close to the optimal ones, though, so they tend to be pretty good; they just might not be as good as the best one. If you repeat the process, these four steps, you will get potentially many feasible solutions: if you run the algorithm 10 times, you might get four feasible solutions, or you might get 10. It depends on the data that you're using, as well as the variance and covariance of that data set, so it's a little bit undetermined walking in as to how many solutions you're going to get. That's why we usually encourage users to repeat the feasible solution algorithm many times, because that increases your chances of getting the optimal ones as well as the feasible ones, and makes sure you've adequately searched the space. We have another paper out that describes that (if anybody's interested, you're welcome to reach out to me), a theoretical paper about how many random starts you should do to have a reasonable probability of getting the optimal one.
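To make the swap step concrete, here is a small JSL sketch of one pass of the search for a two-way interaction among five variables; the criterion function is a random stand-in, whereas the actual algorithm fits the candidate regression model and scores it (with R squared, for example).

```
Names Default To Here( 1 );

// One pass of the swap step for a two-way interaction.
// evalCriterion is a stand-in: the real algorithm fits the model with the
// candidate interaction added and returns the criterion (e.g., R square).
evalCriterion = Function( {pair}, Random Uniform() );

vars = {"X1", "X2", "X3", "X4", "X5"};
current = {"X3", "X5"};               // random starting interaction
bestCrit = evalCriterion( current );
best = current;

improved = 1;
While( improved,
    improved = 0;
    // consider swapping either member of the current pair for any other variable
    For( k = 1, k <= 2, k++,
        keep = current[3 - k];        // the member we keep
        For( i = 1, i <= N Items( vars ), i++,
            If( (vars[i] != keep) & (vars[i] != current[k]),
                cand = Eval List( {keep, vars[i]} );
                crit = evalCriterion( cand );
                If( crit > bestCrit,  // maximizing, e.g., a higher R square
                    bestCrit = crit;
                    best = cand;
                    improved = 1;
                );
            );
        );
    );
    current = best;                   // make the best exchange, then repeat
);
Show( current, bestCrit );            // the pair we stop at is a feasible solution
```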
But, in essence, the way this works is that some of these interactions are more attractive; the beta space attracts them more than others. So what you end up getting some of the time is that, even though an interaction is not the optimal one, the data can often lead you to it through the feasible solution algorithm more often than to the optimal one, and that just has to do with how the data are correlated.

So let's talk about the R package and then how I moved that to the JMP add-in. The R package is called rFSA, and I know we might have some people who have used R quite extensively and some who haven't. R is a really nice programming language, and it's often taught in statistics programs, so it was the first place I went when I was working on my dissertation to write up this algorithm and to provide a tool that the community could use to identify and explore interactions in their own data sets. The rFSA package implements the feasible solution algorithm I just talked about for interactions in large data sets, and it supports the optimization of many different criteria, like R squared, adjusted R squared, AIC, interaction p-value, and so on. It also supports different modeling strategies, like linear models and generalized linear models, and can be easily adapted to work for other types of modeling, such as Cox proportional hazards models. And it gives multiple solutions, as it repeats the algorithm as many times as you specify.

I want to talk a little bit about my motivation for moving this package to JMP: the why, the when, and the how. So the first question is, why do this? The first thing is that it's fun to do. I like writing in different programming languages, and I had not had a lot of experience writing in the JMP Scripting Language, which is JMP's statistical programming language, so I wanted to learn it, and I thought it would be fun to take this R package that I spent the greater part of four years on and move it over to JMP, because I really like JMP. I think JMP is great; it's a really great tool for me and my data analysis pipeline. I usually start all my projects in JMP: I explore the data, I plot the data, I look at things, and that gives me a lot of good intuition about the data. And I found that a lot of my colleagues, specifically one of my advisors, primarily work in JMP, and he was always asking me for the FSA package in JMP. So I thought, why not give it to him? He surely deserves it. So I decided that would be a good tool and that, hopefully, other people would get some use out of it.

That leads me to my other point about why to do this: I've gotten a lot of great feedback about rFSA. I've been tracking how many people have downloaded my R package, and over 16,000 people have downloaded it since we put it out there in 2019. I've gotten countless emails from people all around the world, and I want the same thing to be accessible for people in JMP. I hope that through this add-in, people are able to find really cool interactions that change how they interpret data and how they understand it. And then you might ask, well, when was I going to do this?
This isn't exactly something that somebody's paying me to do. You know, JMP's not paying me to do it.   My current position, while I think that they would find it to be interesting, they're not exactly probably gung ho on me spending a bunch of extra time doing this.   But luckily, one thing I do have built into my position is some free time on Fridays. I try to leave Friday afternoons or Friday mornings open to just fun,   fun things that have to do with my job. And I call it Fun Friday Free Time, and that's what I decided to do for the last few months was take my Fun Friday Free Time and spend it building a JMP add-in.   You might ask, well, how was I going to do this? You know, I didn't know JMP scripting language. How was I going to go about doing this? So   the first thing is is, you know, I was going to learn it. So there's a lot of really great resources out there about how to learn JSL.   So there's JMP JSL code support that is within the actual JMP software itself. Just go to help, you can right...go right into   the scripting index that's there. There's countless things that are online   on how to understand JMP scripting language and where to get started. And then there's the community.jmp.com,   which is really great for getting started, where you can ask questions or view other questions that people have asked and borrow the code that they have posted up there publicly for being able...for people to be able to enjoy. And I did that a number of times here, and it really was great.   So I'm going to kind of go through each one of these a little bit more in depth, just so if you're interested in moving an R package over or writing your own algorithm or writing your own JSL code,   you'll have an idea as to where to get started. And then, when I'm done with this I'm going to go into my add-in and specifically what it does and how to use it.   So how do you learn JSL? Well, you can again...   you can learn it through a number of JMP's JSL resources that they have. So they have a scripting guide that's 864 pages, which is linked here,   that you can Google, or you might be able to get these slides after this is over, and be able to get these links. There's a scripting index within JMP, which you can just go to help and then scripting indexes. I put here for you guys, really easy to use and get started.   You can contact JMP. So I first thing I did is I had a contact at JMP. Her name is Ruth Hummel and I said, hey I want to do this, where do I start?   She gave me support and encouragement about the idea. She thought it was great and then she connected me with a JSL code expert,   whose name is Mark Bailey. And Mark was tremendous through this whole process. Mark helped me with just general support around the scripting language, reviewing my code, helping me write parts of it,   get started on part of it. And we took...we went back and forth about 22 different times through email over the last few months. And   the resources that JMP provided me, as far as direct employees who were willing to help me with my project   were tremendous. I mean, I couldn't have asked for anything better. I mean, I've never received this type of support when I was trying to create something for any other platform before. So kudos to JMP for   providing this, and they're the main reason why this exists; it's because they provided fantastic resources.   So community.jmp.com, you can ask questions there, you can get answers, you can borrow code, you can get certified there in   JMP scripting language. 
You can search the Community for anything you want, you can...it's just like a Google, but it's just for JMP.   And you can search anything you want, so you can type in JSL and JSL whatever you want to do, and it's probably somebody out there that's already posted about that. If there's not, you can add that to the discussion board.   And then there's borrowing code, and this is one of the things Mark passed on to me was, hey, borrow code. There's a lot of code out there on the Community website,   and people have shared it for a reason. So I...on the right here is actually something that I borrowed for my add-in that I developed. I wanted users to know where they were   in terms of running the algorithm and how much longer it was going to take or how much progress had already happened.   And so I didn't want to write my own progress bar in JMP, so I just borrowed one that was on the Community board and added it straight into my add-in, so yeah.   This person, Craige Hales, who is a retired staff of JMP, wrote this code and I borrowed it. So thanks, Craige; Thanks, Mark for recommending borrowing the code. It saved me a lot of time, so I really appreciate it and it made the add-in a lot better.   So now let's talk about the add-in finally. So I've talked to you about why this add-in is needed, right. Others have gotten use use out of the R package and why I think JMP users will benefit from it and I've talked to you about how I did it.   And now I want to actually show you how it works, which is hopefully the most fun part of this whole thing. So I moved this whole package, R package, over to JMP in a few months on my Fun Friday Free Time that I have, so that's just a few hours on Friday. And the JMP add-in   currently only works for linear and logistic regression models. I hope to be able to expand it to other models later, but right now, works for linear and logistic regression models. The other thing is that the add-in,   it doesn't have a lot of the fancy bells and whistles all the other built-in JMP modules have, you know. For instance, when you put a categorical variable into the response variable, the personality type doesn't automatically switch to logistic regression.   I haven't gotten around to that. There's a lot of other things I haven't gotten around to that I hope to improve with this package as people use it and either like it and give me feedback around it.   So it does lack some functionality. And the cool thing about the add-in manager is that you can just take your JSL code and there's an add-in...an add-in manager, JMP add in,   that allows you to create your own add-in from your JSL code, so it's really just a simple couple of button clicks, you can take your JSL code and turn it into an add-in that you can share with the whole JMP community.   And I've posted this on the community.jmp.com website for everybody to be able to go out there and access that add-in that I'm about ready to show you   and to access the data set that I'm going to use as well. But this will work with any of your data sets that you have, not just with my example.   So I'm going to give you a live tutorial really quick of this add-in, called...the add-in, I just called it exploring interactions, and   this is all going to be done though via the feasible solution algorithm that I talked about earlier. The example is a linear model   where I'm going to fix two variables, so that would be my base model. 
My base model will have a continuous response variable and two covariates that I want to adjust for. And then what I would like to do is consider second-order interactions, between any of the 10 variables I have in my data set, to add to that model. I'd like to do five random starts, and my criterion, which currently is the only criterion built into the JMP add-in, is to minimize the interaction's p-value. So each of these interactions that we check produces a p-value, and I want my solutions to be the ones that have very small p-values. So it's going to search the space based on the interaction's p-value and go to the places that have the best one, and our results are going to contain a lot of interactions with small p-values. And usually what I'll recommend after you do this type of procedure, the feasible solution algorithm, is that you follow it up by plotting the data, looking at those interactions in your model, and thinking critically about them in a contextual sense, because at the end of the day, we're exploring the data here. This is not inferential in any way. We're just using the signal in the data to point to what interactions may exist in these data. And again, the data and the add-in are posted on the Community website. Alright, so I'm going to stop, get out of this really quick, and pull up the JMP data set that I have. Hopefully you guys can see this. As you can see, just as a really quick overview, this is actually all a bunch of random data that I generated. There's no real structure to this at all. This isn't real data, just randomly generated data in JMP. It has 20 observations here. I've got 10 continuous explanatory variables. I have a continuous response variable, and I have also categorized the response variable as either being greater than zero or less than zero. So if you want to look at logistic regression results, you could do a logistic regression example with the data as well. So, once you get the add-in, you go to the website, you download the add-in, you just double click on it and it installs directly into JMP. It's really easy. And then, once you have your data set open, you go to Add-Ins and then Explore Interactions. And this will pop up here. So we have a couple of things. We'll see all of our variables over here on the left, just like you would with any other module in JMP. And so I'm going to pick my response variable, which is going to be Y, and then I'm going to fix... so this is where I'm constructing my base model. These are all the linear main effects that I'm adding to my model. So in this one, I've got two variables, X1 and X2, that I'm going to add here. Now one of the things you need to do is specify the modeling type, so there are two types in my add-in you can choose from. One is the standard least squares modeling type and one is the logistic regression one. And this isn't going to choose automatically, like I said earlier, based on what you've put in here, so you can totally put in a categorical response variable and it's not going to switch. You have to switch it yourself. That's one of those fancy bells and whistles, hopefully, I'll get to later. Get rid of that. Put this back in here.
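In script form, each candidate fit behind this dialog looks roughly like the sketch below. This is an illustration of the model structure only, not the add-in's internal code: the base model keeps the fixed main effects, one candidate interaction is appended, and the add-in then reads that interaction's p-value as its search criterion.

    // One candidate fit: the fixed base effects X1 and X2 plus a single candidate
    // interaction. The add-in repeats a fit like this for every swap the feasible
    // solution algorithm considers and keeps the candidates whose interaction
    // p-values are smallest. (Illustration only -- not the add-in's internal code.)
    dt = Current Data Table();               // the randomly generated example table
    dt << Fit Model(
        Y( :Y ),
        Effects( :X1, :X2, :X3 * :X5 ),      // :X3 * :X5 is one candidate interaction
        Personality( "Standard Least Squares" ),
        Emphasis( "Minimal Report" ),
        Run
    );
    // For the logistic example, the personality would be "Nominal Logistic" instead.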
Alright, so this is our setup, and this will be fitting the model y equals beta zero plus beta one X1 plus beta two X2. And now down here is where I put in the variables I want to look for interactions between, so I'm going to select all of X1 through X10 and add those over. Then I tell it how many times I want to run the algorithm. Here I'm going to do five times, for the sake of time, and the order of the interactions to consider, so I'm only going to do two. I recommend staying at three or below; it usually gets a lot harder to interpret interactions that are four-way or five-way, but the way that I've set this up, you could go that high if you wanted to. I haven't actually put a limit on it, but just know that the higher you go here, the more time it's going to take to run. So I'm going to hit OK. This is going to take a second. So you can see the code that I borrowed. You can see it came up here and it tells me I'm 20% of the way done, 40%. So these are solutions. Every time that progress bar ticks forward, a feasible solution is being found, and then what pops out at the end is a data table. So, you know, again, this isn't one of those features... I'd like there to be a selection box where you could select the interaction and then hit create model, and it would do all of that for you, just like JMP already does for a lot of different things, but I haven't gotten there yet. This just tells you what the solutions were. So it gives you a summary of what it did. It did five random starts. It tells you what the response variable was, the fixed variables, the interaction variables that it found, and the R squared and adjusted R squared for that model, the root mean squared error, and then the interaction p-value. Okay, so what we can see here is that (if I click on this it'll show me) there are only two solutions: X2 X3 was one interaction and X10 X7 was another. And so now what I can do, though I'm not going to do it now, is go and build this model, reproduce these results, and look into the results to see, okay, is this a good statistical interaction that I found here? You know, what do my leverage plots look like? What do all my different criterion plots look like? And what do my diagnostic plots look like? Those types of things. And then I can begin asking questions, like does this interaction make sense in terms of the parameter estimates and whatnot? It's exploratory, so it's going to give you potential solutions, but you have to follow that up with due diligence and actually think about the results you've gotten and ask questions of those who are contextual experts in the field. That's usually how we use this in the health sciences. We use this with health data, we produce interactions, and we then go to the physicians to ask whether it makes sense that this effect would not be consistent across age or sex or whatever. And they'll usually talk to us about that and think about it. The whole idea is that this will hopefully influence future studies, where we can power the studies to look for these interaction effects in a more well-powered way. Alright, so I'm going to go back now to my PowerPoint. So that was the tutorial. You can see it's pretty easy to use. That was a really small example.
You can ramp this up,   just know, you know, the bigger data sets you have, the longer that whole process that I just did is going to take. It's not going to be 10 seconds long; it's going to be,   you know, potentially, you know, 5-10 minutes long, depending on how big your data set was. I've done...used the algorithm, used the add in to do data sets with 100 variables,   explanitory variables, and it took about that amount of time, about about seven minutes, to run it to do about five or 10 random starts.   Things are faster in the R package than they are in the JMP add-in, but that's because, you know, again, this is my first stab at this and, hopefully, through iterations that will get even faster and better.   So let's talk about future improvement. So I want to allow users to be able to optimize or to   find feasible solutions based on any, you know, any criteria they want. So if they want to find feasible solutions based on optimizing R squared or optimizing AIC or optimizing the misclassification rate,   you know,in logistic regression models, I want to allow them to be able to do that. And it's really not that hard of a thing to do, I just need to go in and do it.   Automatically save or select modeling personalities based on variable selections. That's just simple. You put in a   binary logistic variable, you know, it's going to automatically select logistic regression for you. I don't think that's probably super hard, I could probably do that pretty quickly. I feel confident doing that having gone through this process of moving my R package over.   I want to improve the speed some. There's a lot of things I could do to improve the speed of this.   So, I'd like to be able to do that. A recall button would be nice. I don't have that now. That's one of my favorite features in JMP is being able to click that recall button.   And then I'd like to be able to streamline going from the results that you got from FSA to building a model. So   you can do that already, like when you do forward selection in JMP or Lasso or any of those things, you can just click make model and it brings all the variables over super easily.   I'd like to be able to do that with this feasible solution algorithm, just get those types of results and just click the make model and it automatically does everything for you   so that you can explore them, and, again, hopefully, you make good sense of those interactions that you found.   So some acknowledgments that I'd like to go through. So first is Mark Bailey. I couldn't have done it without you. Thank you so much for your support with the JSL code. Thank you to Ruth   for just getting me in contact with Mark, as well as just being a general, you know,   support person for this project and just providing good feedback. Anne and Larry, I just appreciate   your help with all things that had to do with JMP and with our past relationships and for encouraging me to do this Discovery Summit. It's been really great and I'm excited to meet more people in the JMP community   through it, so thank you for that. You can get in touch with me in a number of ways. You can send me an email at my university email, you can contact me on Twitter if that's the way you like to do things.   And I also am planning on posting this JMP add-in up on github, as well as the Community website, so please feel free to check either of those places for future updates of the add-in.   So yeah, thank you for having me and I'm excited to take your questions during our allotted time, thank you.
Thomas Walk, Large Plant Breeding Pipeline DB Manager, North Dakota State University Ana Heilman Morales, Large Plant Breeding Pipeline DB Manager, North Dakota State University Didier Murillo, Data Analyst, North Dakota State University Richard Horsley, Head of the Department of Plant Sciences, North Dakota State University   Crop breeders, often managing numerous experiments involving thousands of experimental breeding lines grown at multiple locations over many years, have developed valuable data management and analysis tools. Here, we report on more efficient crop evaluation with a suite of tools integrated into the JMP add-in dubbed AgQHub. This add-in provides an interface for users to first query MS SQL Server databases, and then calculate best linear unbiased predictors (BLUPs) of crop performance through the mixed model features of JMP. Then, to further assist in selection processes, users can sort and filter data within the add-in, with filtered data available for building reports in an interactive dashboard. Within the dashboard, users segregate selected crop genotypes into test and check categories. Separate variety release tables are automatically generated for each test line in head-to-head comparisons with selected check varieties. The dashboard also provides users the option to produce figures for quickly comparing results across tested lines and multiple traits. The tables and figures produced in the dashboard can be output to files that users can readily incorporate into variety release documentation. In short, AgQHub is a one-stop add-in that crop breeders can use to query databases, calculate BLUPs, and generate report tables and figures.     Auto-generated transcript...   Speaker Transcript Curt Hinrichs Alright, Tom Walk, with Anna and Didier, with their poster on AgQHub. Tom, take it away. Tom Walk Thank you so much, Curt, and thank you to the JMP community for inviting us to this presentation. We're so glad to be here to show you our work. Today we're going to talk about a tool we're building at North Dakota State University in the Department of Plant Sciences called AgQHub. It's primarily the work of our team, the plant breeding database management team, of myself and Anna and Didier, and Rich Horsley, who's the department chair of plant sciences. And what we've done is we're trying to help the plant breeders, who have this long-established cycle. You can imagine that if you want to improve crops, it's going to take a long process. If you want to do it right and consistently, you have to set up a lot of experiments. You're not just going to get lucky very often and choose the best crop. So what you have to do is set up a lot of crosses and a lot of trials with thousands of lines initially, and from those, you have to go through this decade-long cycle or more and make choices every year about which were the best lines to advance. And all of this would change with environmental conditions, so we have to get the right combination of genes with the environment. And you have to have the right analyses and experimental designs to do that, so this gets very complicated for a plant breeding team. And it's a long process to make any variety selections. And what we want to do is to make the selection processes easier every year. It'd be nice to shorten this whole process, but our more immediate goal is to make the process more efficient, the selections more efficient, at each stage.
And what we've done for that is we're developing this tool, AgQHub, and we're using it with our breeding programs. We have 10 breeding programs within plant sciences at NDSU and over 60 users. We've also incorporated two research extension centers with more variety trials and field sites. And this list is growing. We're trying to add more users and will probably add in more research extension centers. So what's nice about AgQHub, and the reason why we have these users, is the functionality in AgQHub. It allows you to connect to the database directly, and you can see data from decades' worth of experiments. And once you have that, you can do the analysis. You can look at experimental designs, you can view the histories of your varieties, you can look at distributions of data for individual experiments, you can calculate BLUPs and make those predicted values. And once you have those predicted values, you can get those head-to-head comparisons, and once you have those head-to-heads, then we can start building reports. And that's what we're going to get into: we want to be able to make reports by subsetting data and building the tables and visualizations that make it a lot easier for our users. And all of this is done within seconds, with a few clicks in AgQHub, saving up to weeks of time compared to the past, when users were using spreadsheets and workbooks. So just to give you an idea of a workflow in AgQHub, here's one cycle of generating data for reports. Users will go into AgQHub, select the database they want to use, the type of analysis or query they want to run, and the output they want to have. Then they'll click start, and after that, another window will open that prompts them for the parameters for the queries, such as the experiments they want to query, the years, and the traits or treatments that they want to look at. And after they select their parameters and click OK, the data will pop up in these data tables, and the data tables are within the AgQHub add-in, so all the data tables are compiled nicely within these tabs here in AgQHub. And then here are some of the newest features we have: users can select the varieties they're interested in, they can sort, they can do some filtering, and then they can make these filtered variety tables. And with these filtered variety tables, they can export those into their reports, into Excel or other documents. And once you're done with one, you can start over and move on to another set of experiments, and click cancel after you do as many analyses as you wish. We're still working on this; it's a work in progress and we get a lot of great ideas from our users. We're always expanding this to more users and research extension centers. That's been helpful for us in building this up. As we do that, we're looking to compile templates of the release tables used by the programs. With that we can build up output tables that make it easier for the users to produce head-to-head tables and variety release tables. And then we'd also like to make it easier by adding visualizations for making quicker variety comparisons. And Anna has some great ideas about that, with her experience as a plant breeder.
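The core of each cycle described above, a database query followed by a REML mixed-model fit whose random-effect predictions are the BLUPs, can be sketched in JSL roughly as follows. The DSN, SQL, table, and column names are placeholders and assumptions for illustration, not NDSU's actual schema.

    Names Default To Here( 1 );

    // 1) Query trial data from MS SQL Server (placeholder DSN, table, and columns).
    dt = Open Database(
        "DSN=PlantSciDB;",
        "SELECT entry, location, year, rep, yield FROM yield_trials",
        "Yield Trials"
    );

    // 2) Fit a REML mixed model; entries are treated as random, so their
    //    predictions are the BLUPs used for head-to-head comparisons.
    fit = dt << Fit Model(
        Y( :yield ),
        Effects( :location, :year ),
        Random Effects( :entry ),
        Personality( "Standard Least Squares" ),
        Method( "REML" ),
        Run
    );

From the resulting fit, the conditional (BLUP-based) predictions can be saved back to the table from the platform's save-columns options and fed into the head-to-head and release-table reports.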
Finally, we're always looking to make the interface more dynamic, with options that perhaps change as users click things. And with that, that's my talk, but I would like to start this short video to give you a better idea of what AgQHub does. And Curt, thank you for this opportunity again. With that, I do have one more thing I'm excited to show you, and I'm going to request the share screen. I want to show our users one more thing, and this is the newest thing in AgQHub. This is what we're excited about; this is the direction we're going. What we have here is, once the users make their selections and filtering and pick the varieties they want to make tables with, we're making dashboards that open up, and they can select, among their filtered varieties, which ones they want to be check lines. Like, for example, we want these historical varieties to be our check lines, and we want to compare our new test lines against those check lines. And we're going to select the different types of traits we want to look at: the traits we've seen in the field versus traits we measure in a laboratory, or traits such as disease traits. And then we click make tables and it will output these tables. I'm not going to do that right now, to save you a little time. But what it's going to do is output tables for each of these traits, for each of these varieties, and you can see it still needs some work. I need to put the names of the varieties in, but we're working on this, and I'm very excited, and I wanted to show you this before I end. That's the direction we're going, and we're going to build on this dashboard to keep making these tables and outputs for our users so they can put them into better formats, format them further as outputs to Excel and Word or whatnot, and build visualizations where we can do head-to-head comparisons, comparing how this variety does against that variety. So we're building up on these dashboards. We are very excited about this, and I'm so happy to show this and share it with the JMP community. And before I go, I want to thank everyone at North Dakota State University: Anna, Didier, Rich and myself, and all our other users. Thank you so much.
Danilo Toniato, Engineer, WEG Andre Caron Zanezi, Engineer, WEG   Faced with a vast and competitive field, appliance manufacturers are constantly seeking quality improvements, aiming to meet the needs of an increasingly demanding market. A reflection of this requirement can be observed in the progressively tighter specifications for the sound power level (NWS) emitted by the product. To meet and exceed customers' expectations, this work addresses the use of design of experiments (DOE) to analyze multiple electromechanical characteristics and their respective relation to motor noise across the frequency spectrum, specifically in washing machines. Using Graph Builder and DOE in JMP to better fit models for these analyses, it was possible to define the most significant factors for reducing noise without impacting motor manufacturing costs.     Auto-generated transcript...   Speaker Transcript
Danilo Toniato Hello everyone. My name is Danilo Toniato. I'm a mechanical engineer and PhD student, and today I will be presenting work that was performed aiming to improve the vibro-acoustic characteristics of [???] motors for the appliance line. It's undeniable that the market and consumers are increasingly demanding about products, especially when it comes to noise in the home appliance segment. After all, when we get home from an exhausting day at work, what we usually want is to have a quiet, peaceful night with our significant other. So, aiming to exceed our customers' expectations, we [???] started to mitigate the noise generated by our electric motors. Of course, our second objective is that no product cost is increased, and all of this with a DOE analysis to help us. So, shall we begin? First, we analyzed the magnitude spectrograms of reference motors and, according to the user experience during operation of these motors in the application, we listed the frequencies that are most critical, most uncomfortable, for the user's auditory sensation. We also established our working ranges, which in this case were mainly concentrated in the 480 Hz region. We identified, through the research literature, the main root cause and the main factors that could minimize such noise. With this information, a DOE, as you can see here, was developed so that we could quantify the influence of these variables, and so that we could define the levels at which our process can deliver a product with good capability regarding noise levels, without interfering with any other electromechanical characteristics, and without burdening our final consumer. Altogether, there were four continuous numerical factors in a block and [???] factors for our DOE. The results were very positive, and we were able to achieve the desired noise and vibration levels. The factors proved to be significant, but only at the frequencies of interest, as we expected. We achieved a reduction of approximately 60% in the vibration levels, as we can see here in the JMP variability plot. And here, on the left of the screen, we can see from the analysis of variance in the JMP Fit Model that the factors we chose for our DOE are only significant at the frequencies of interest, like 480 Hz, and are not relevant elsewhere. And the reduction in noise levels proved to be even more [???], as we can see here, and even more significant when we compare the surface plots. The first graph shows our initial results before the DOE analysis, and the second plot is the best condition we achieved in the DOE. The colored frequency line marks the frequencies that are critical for our consumers' experience. In conclusion, we were able to understand which of the factors listed in the research literature were the most significant.
Meaning, which combination of these factors and their respective levels would meet the market needs, optimizing our process, optimizing our product, and minimizing our production costs. This optimization that I was talking about was performed using the Profiler tool in Fit Model and was subsequently validated with the production of a larger-scale batch. Now, then, I would like to thank you for watching, and to thank our hard-working team and the sponsors; they were essential to the conclusion of this work. Thank you.
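For readers who want to script the general analysis pattern described above, a blocked model of several continuous factors fit in Fit Model, checked with a variability chart, and then optimized in the Profiler, a minimal hedged sketch follows. The factor and response names are invented for illustration and are not the authors' actual design or data.

    Names Default To Here( 1 );
    // Illustrative only: the data table is assumed to hold one measured vibration
    // response plus four coded continuous factors (A-D) and a Block column.
    dt = Current Data Table();

    // Fit the blocked four-factor model and screen for significant effects.
    fit = dt << Fit Model(
        Y( :Vibration ),
        Effects( :A, :B, :C, :D, :Block ),
        Personality( "Standard Least Squares" ),
        Emphasis( "Effect Screening" ),
        Run
    );

    // A variability chart of the response by block helps visualize the reduction
    // between motor configurations.
    dt << Variability Chart( Y( :Vibration ), X( :Block ) );

The Prediction Profiler can then be opened from the fit report's factor profiling options to pick the factor levels that minimize the response, as described in the talk.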
Isaac Himanga, Senior Analytics Engineer, ADM   Manufacturing processes are highly correlated systems that are difficult to optimize. This correlation provides an opportunity to identify outliers. Here, we demonstrate an entire workflow, which involves:   Obtaining data from the OSISoft historian. Modeling and optimizing with multivariate models that account for process constraints. Using the MDMCC platform. Scoring models online and writing abnormal conditions back to the historian. In practice, many processes are hard to optimize. Sometimes a system cannot be reduced to a set of independent variables that can be tested; in other cases, the process can become unstable under different conditions. To address these issues, we are using new features in JMP 16 to optimize for cost and quality in the latent variable space, while accounting for constraints on observed variables. In order to remain at those optimal conditions, we use the MDMCC platform to quickly identify deviations from normal.   Add-ins are used to minimize the time spent gathering data and scoring models: one to pull process data into JMP from the popular PI historian, one to quickly recreate calculated columns, and a third to score models online and send results to the historian, including contributions from the MDMCC.   Add-ins referenced are available on the JMP Community: Aveva/OSISoft PI Tools  Scripting Tools      Auto-generated transcript...   Speaker Transcript
Isaac Himanga My name is Isaac Himanga, and I'm going to demonstrate a workflow I use at ADM to optimize manufacturing processes using multivariate models. I'll start with pulling data and building a model, then finding realistic optimal conditions, identifying abnormal conditions, and finally scoring the model using current data. There's a lot of other information on the individual platforms here, so instead of discussing the details of each step, I'll only highlight a few commonly used features and instead try to show the whole workflow. I will say most analyses with the amount of detail this one has take a little longer than 45 minutes. So head over to the article for this talk in the JMP Community for more detail and information, including a journal with screenshots of steps I'll move through pretty quickly here. I'll start with a brief overview of ADM and the general workflow. Then I'll put this presentation aside to show you the process in JMP.
You'll see what it looks like to get and clean data, use the profiler to find optimal conditions, use the model driven multivariate control chart platform, and write a script to score new data against that model, and finally, I'll return to this presentation to briefly give one method to continuously score that model using JMP. First, a little about ADM. ADM's purpose is to unlock the power of nature to enrich the quality of life. We transform natural products into a complete portfolio of ingredients and flavors for foods and beverages, supplements, nutrition for pets and livestock, and more. And with an array of unparalleled capabilities across every part of the global food chain, we give our customers an edge in solving the global challenges of today and tomorrow. One of those capabilities is using data analytics to improve our manufacturing processes, including the method I'm about to talk about. I am part of the relatively new focused improvement and analytics center of excellence, and our growing team is invested in techniques, like this one, to help our 800 facilities, 300 food and feed processing locations, and the rest of our company around the world make better decisions using our data. Now, an overview of the workflow. The four steps I'll review today only represent part of the complete analysis. In the interest of time, I'm going to omit some things which I consider critical for every data set, like visualizing data, using validation, variable reduction, corroborating findings with other models, and aligning lab and process data. The four steps are getting data, building a model, scoring that model on demand in JMP, and then scoring the model continuously. JMP has tools to support each step, including an array of database connections, multivariate modeling tools like partial least squares, principal components, and the model driven multivariate control chart platform, and of course, the JMP Scripting Language, or JSL. Let's start with getting data. Despite the many database connections and scripting options available, we needed a quick way to pull data from our process historian, a popular product called PI, without writing queries or navigating table structures. A custom add-in was the answer for most data sets. This add-in was recently posted to the JMP Community. Two more add-ins assist in this process. One, generically called scripting tools, includes an option to quickly get the script to recreate calculated columns; when combined with the save script functionality built into most JMP platforms, analyses can be recreated quickly and scored on demand by a JMP user. The last add-in, called the JMP model engine, is also the newest. It uses a configuration file and information saved to the data table from the PI Tools add-in to get data. It then makes calculations using column formulas or any other JSL script and writes results back to the historian. In the interest of time, I'm going to move very quickly through this last step, but again, I encourage you to look for more details on the JMP Community using the link on the agenda slide of this presentation. Each of these add-ins was overhauled to remove sensitive information, and we're shifting users to the same code that's posted on the Community, so if you have ideas on how to improve them, feel free to collaborate with us on the JMP Community or over on GitHub. With that, let's open JMP.
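Before the demonstration, it may help to see that, under the hood, a PI pull is an ordinary database query. A rough sketch of the kind of call the PI Tools add-in wraps is below; the DSN, table, and tag names are placeholders and assumptions, not the add-in's generated query.

    Names Default To Here( 1 );
    // Placeholder DSN, table, and tag names -- not the add-in's generated SQL.
    sql = "SELECT tag, time, value, status FROM piinterp " ||
        "WHERE tag IN ('FLOW1.PV', 'FLOW2.PV') AND time >= '01-May-2021 12:00:00'";
    dt = Open Database( "DSN=PI_DAS;", sql, "Process Data" );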
Behind the scenes, the JMP PI Tools add-in uses a couple of different PI products to pull data, including SQL DAS, OLEDB and ODBC. The instructions to set this up can be found in the help menu for this platform. Today we're going to pull data for a set of PI tags from May 1, starting at noon, and then we're going to pull another value every day at noon until today. We're going to do this for a certain set of tags; those are listed here in this box. Notice we've actually included what's called a friendly tag, a name that's more descriptive than just the name in the PI historian, which will help us identify those columns later in the data table. There are little question marks around the add-in giving more information, including a description of the friendly tag format that we're going to use today. When I hit run query, it's going to open a data table that looks like this one. It's got a set of columns for all of the numerical values in that data table. It's got a set of columns for any string value, so if you had characters stored in the PI tag, it will pull those. And we can also see the status of each PI tag for each row. Also, in the column properties for that data table, if we open the column info, we're going to see the information that was used to pull that data point, including the PI tag, the call type, and the interval. For more complex queries, like an average or a maximum, we're going to see more information here. I will note that this is real data that's been rescaled, and the column names have been changed, but the analysis that we're going to see today should look and feel very similar to a real analysis using actual data from one of our facilities. Finally, I'll point out that there's a script here saved to the data table called PI data source, which is the same script that's shown in the add-in, and it contains the SQL that's also available here. Again, behind the scenes this uses SQL DAS, or PI DAS, in order to get that information from PI, and these are all the scripts that it's going to run to get that data. We're going to come back and use this again near the end of the talk today. Okay, now that we've got data, we need to clean that data up. We're going to use multivariate tools to do that, specifically principal component analysis. I'll pull all of the numerical values from that data table and put them into the Y columns list, and then, right away, we can see quite a few values that have particularly low scores for component one. If you look at the loadings for the different factors, we can see that component one includes high loadings for all of the different flows in this data set. So that tells me that all of these values over here on the left have low flows across the board for the whole system. Using some engineering knowledge, I can say that this represents downtime for that process, so I'm going to go ahead and hide and exclude all of these values. Now that we've done that, we'll recalculate the principal component analysis, so we'll use Redo to redo the analysis and then close the old one. And now we can see the loadings are perhaps a little bit more meaningful. Principal component three, for example, explains most of the variation in flow 2, and there's a little bit of separation here.
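In script form, that downtime-screening step looks roughly like the following sketch, assuming the pulled table is the current data table and using made-up column names. The score cutoff is something you would read off your own score plot, and the saved-score message mirrors the red-triangle command of the same name.

    dt = Current Data Table();               // the table returned by the PI Tools add-in
    pca = dt << Principal Components(
        Y( :Name( "Flow 1" ), :Name( "Flow 2" ), :Name( "Flow 3" ),
           :Name( "Flow 4" ), :Name( "Flow 5" ) )
    );
    pca << Save Principal Components( 3 );   // adds three score columns to the table

    // The first new column holds the component 1 scores; rows scoring far to the
    // left had low flows across the board (downtime), so hide and exclude them.
    scoreCol = Column( dt, N Cols( dt ) - 2 );
    For( i = 1, i <= N Rows( dt ), i++,
        If( scoreCol[i] < -4,                // cutoff read off the score plot by eye
            Hidden( Row State( i ) ) = 1;
            Excluded( Row State( i ) ) = 1;
        )
    );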
The first three components explain the majority of the variation, so I'm going to use those three components when looking for other outliers in this data set. To do that, I'll open the outlier analysis and change the number of components to three. Then we can see the T squared plot, and I can also open the normalized DModX plot, to see points that either have extreme values for one or more of the scores, or points that have a single column, a single cell, with a value that's unexpected or outside of the expected range based on all the other columns in the data set for that particular row. For now, we're just going to quickly clean this data set by excluding all of the T squared and DModX values that are above the upper control limit. One more thing that's commonly done when cleaning a data set is transforming columns, and I want to show a feature of the scripting tools add-in that makes this a little bit easier than trying to apply a transform through the New Formula Column menu. If I select three columns, or any number of columns, and go to the custom transformation option, which is again loaded as part of that scripting tools add-in, I can select a Savitzky-Golay transformation and hit OK, and it will add three columns to the end of the data table with a formula containing that transformation. I will note that the cleaning we did could have been done directly in PLS. I often use a PCA first, though. Okay, now that we've cleaned our data set, we need to actually build a model to try and predict our process conditions. Maybe another quick note about this data set: we have a flow target up here. Today, our goal is going to be to create a certain amount of this target flow using as little of flow one as possible, while also taking into account some constraints on these quality parameters. So because flow one is what we're primarily interested in, I'm going to switch over to a partial least squares model and use that target flow in the Y and all the other variables as X factors. I'll just accept all the defaults for now, and I'm going to jump ahead a little bit and right away use four factors. When I open those four factors, we'll see that the first two represent the variables that the plant normally changes in order to adjust how fast they run the process. So if they need to make more or less of this target flow, they often change factor one in order to achieve that target rate. Factor 2, on the other hand, relates primarily to these quality parameters, which are actually input quality parameters that we don't have control over. So it's not something that we can change. So even though factors three and four explain relatively small amounts of the variation of our target flow, and relatively small amounts of the variation of our factor one, those are the ones that we actually have control over, and so those are the ones that we're going to be able to use in order to optimize our process. So we've built a model that explains the variation in our data. In order to use that information, we need to save the predictions from this model to new columns in our data set that we're going to use in the prediction profiler in just a few minutes. We'll make use of a few new features that were added in JMP 16, allowing us to save predictions, both for the Y and the X variables, as X score formulas.
And when we open the profiler, I think it'll help to illustrate why that becomes important.   So we've saved all three of the predictions, the X predictions and the T squares, back to our data table. Those should be new columns at the bottom as functions of these X scores.   We can also take a quick look at the distance and T squared plots within the PLS platform, and we see that while there's a few variables that have pretty high DModXs or T squareds, there's nothing too extreme.   These scores are often saved or are always saved with variable names that can become confusing as you save more and more or go through multiple iterations of your model. So the scripting tools contains another function, called the rename columns,   which will allow you to select an example query for PLS. It has a pretty complicated regular expression pattern here,   but notice it outputs a standard set of names that are going to look the same for PLS, PCA and other platforms within the data table.   So in this case I'm actually going to copy a specific prefix, we'll put before all of those columns indicating that this is the model we're going to put online for our quality control and it's revision one of that model.   When I change names, we can see it's it's automatically changed the name for all of these columns in the data set.   So we've built a model explaining the variation in   these columns, but what we haven't done is our original goal of figuring out how to produce a certain amount of flow...of our target flow using as little of flow one as possible. To do that we're going to use the profiler. Notice when we open the profiler,   we can add all of these predicted values, so not the X scores themselves, but the predicted values and the T squared, to this prediction formula section.   And then, when it opens up, we'll see across the X axis, all of our scores in the model or our latent variables.   And we can see when we move one of those scores, it's going to automatically move all of these observed variables together   at the ratio of the loadings in each one of those components. So importantly, take a look at these flows three and four, they always move together. No matter which score we move,   this model understands that these two scores are related. Perhaps one is a flow meter that they have control over, and perhaps a second one   is a second flow meter on that same line, but regardless it doesn't look like the plant is able to move one without moving the other one in the same direction.   So the goal is to find values for each one of these scores that are going to optimize our process.   Before we can do that, we need to tell JMP what it is we're trying to optimize. We need to say that we have a certain flow rate we're trying to hit and certain quality parameters that we're trying to hit.   So we're going to start by telling it that we don't care about the values for most of these columns. So we'll select all of the columns that we saved,   we'll go to standardize attributes, and we're going to add a response limit of none for all of these different columns.   Then we'll go back and we'll adjust that response limit. It can be more descriptive for the columns that we do care about. For example, the flow target will go back to response limit and we'll indicate that we want to match a value of 140 for that column.   Similarly for quality one, we want to hit the value of   20.15.   For the flow one, we want to minimize that value.   
And finally, we need to make sure that the solution that we come to is going to be close to the normal operating range of the system, so we don't want to extrapolate outside the area where we built that PLS.   To do that, we'll use this T squared column, and we'll indicate that we want to minimize the T squared   such that it's below the upper control limit that was shown in that PLS platform. Here we can see the upper control limit was 9.32, so we'll use that as the minimum value here.   What we should see in the profiler now is every value below 9.32 is equally acceptable and, as you go above 9.32, it's going to be less and less desirable.   Under the graph menu I'll open the profiler once again, and once again take all of those predictions and T squared formulas and put them into the Y prediction formula.   And we still see the X scores across the bottom, but now we also see these desirability functions at the right.   And once again the desirability function for T squared starts to drop off above 9.32. The desirability is highest at low values of flow one and we've got targets for both the the target flow and the quality parameter.   Because we've defined all of these, we can now go to maximize desirability.   And that's going to try and find values for each one of those scores, and thus, values for all of the observed variables in our data set that are going to   achieve this target flow and achieve the targets that are in these that we defined earlier.   Notice it came close to hitting the full target, but it looks like it's a little bit low. It did achieve our 20.15   and it was within the T squared bound that we gave it. Most likely JMP thought that this tiny little decrease in desirability was less important than reducing flow, so we can fix that by control clicking in this desirability function and just changing the importance to 10.   Now, if we optimize this again, it should get a little bit closer to the flow target. Before we do that, though, I'm going to save these factor settings to this table at the bottom, so we can compare before and after. And then we'll go ahead and maximize desirability.   Looks like it's finished and we're still within bounds on our T squared.   It has achieved the 20.15 that we asked and it's certainly much closer to 140. So now we could once again save these factor settings to the data table.   And we should now have factors that we can give back to the manufacturing facility and say, hey, here are the parameters that we recommend.   The benefit of using a multivariate analysis like this, we talked about those flows three and four being related earlier,   using this method, we should be able to give the plant reasonable values that they're actually able to run at.   If you tell them to run a high value for three and a low value for flow four, they might look at you and say well that's just not possible.   These should be much more reasonable. Note that not all of these variables are necessarily independent variables that the plant can control. Some of those might be   outcomes, or they might be just related to other variables. In theory, if the plant changes all of the things that they do have control over, the other things should just fall into place.   So now we've optimized this model, the next step is often to make sure, or to verify, are we running normally. So once the optimal conditions are are put in there, it's good to use the same model to score new rows or new data and understand, is the process acting as we expect it to?   
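A rough JSL sketch of that optimization step is below, assuming the Response Limits properties (match 140, match 20.15, minimize flow one, keep T squared under its limit) were already set through Standardize Attributes as just described. The renamed column names are placeholders; if your JMP version scripts the optimizer differently, the same commands are available from the profiler's red-triangle menu.

    // Relaunch the profiler on the saved prediction and T-square formulas and let
    // the desirability machinery search the latent-variable space.
    prof = Current Data Table() << Profiler(
        Y(
            :Name( "QC Rev1 Pred Flow Target" ),
            :Name( "QC Rev1 Pred Quality 1" ),
            :Name( "QC Rev1 Pred Flow 1" ),
            :Name( "QC Rev1 TSquare" )
        )
    );
    prof << Desirability Functions( 1 );     // show the goals set in Response Limits
    prof << Maximize Desirability;           // honors Match Target / Minimize / the T-square cap

With a realistic optimum in hand, the remaining question is whether the process stays near it.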
To do that, we'll use these same prediction scores that we had in...saved from the PLS platform, but this time we're going to use those in the model driven multivariate control chart   under quality and process.   I'm going to use the time and put that in time ID, so instead of seeing row numbers, we're going to see an actual date on all of our charts and we'll put the X scores into the process.   Unfortunately, the range doesn't always work correctly on these charts, so if I just zoom in a little bit, we'll see that here are periods where we had a high T squared and that high T squared was mostly the result of flows two, three, four and one, so all flows in that...   are the high contributors to this T squared. If we click on that and hover the mouse over one of those bars, if I click on that point again, then we'll see a control chart for that individual value...individual   variable with the point that we selected highlighted on that chart.   And I'm going to hide this one, though, and not only look at T squared, we also can do the same thing for DModX or SPE, if you're still looking at SPE. Once again,   this doesn't always work out correctly.   So we'll zoom in on the   DModX values again.   So DModX is going to indicate points that have a high or low value for an individual column, compared to what was expected based on the other rows, the other data in that row.   Here we can see that this point is primarily an outlier due to flow five.   I do find the contribution proportion heat maps to be pretty useful to look and see patterns in old data when, for example, one variable might have been a high contributing factor or acting abnormal for a long period of time or for some some section of time.   So this is a chart that might...that we might want to look at every morning, for example, or periodically to see, is the process acting normally?   You come in, you want to open this up and see, is there anything that I should adjust right now in order to bring our process back under control?   So to do that, we want to recreate this whole analysis from pulling PI data to opening the MDMCC platform and have it all be available at a click of a button.   To do that, we're going to write a JSL script that has three steps. It's going to get the data from PI, we're going to add the calculated columns, and then open the model we have in the multivariate control chart platform.   getting data from PI. If we go back to the original data table that we...   that was opened after the PI tools add-in was run, we can see this PI data source script saved to that table. If we edit that script and just copy it, we can paste that into the new script window.   I'm just going to make one change. Instead of pulling data from May, we'll start pulling data from August 1 instead.   Now we need to add those calculated columns. So remember we...in the PLS platform we use the saved score...save as X score formulas option.   In order to recreate those, we can just select all of the columns in the data table and use the copy column script function that was added again in that scripting tools add-in.   Once we copy the column script, we go back into this new script that we created and we'll paste what are a bunch of new column formulas to again recreate all of those columns.   Finally, model driven multivariate control   chart has an ??? most other platforms, where you can save the script to the clipboard and you can paste that into the same script window.   
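As a quick aside on the two statistics these control charts monitor: for a PLS (or PCA) model with A components, the textbook definitions are usually written as below. JMP's exact scaling constants may differ slightly, so take these as orientation rather than the platform's documented formulas.

    T^2_i = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_{t_a}^2},
    \qquad
    \mathrm{DModX}_i = \frac{1}{s_0}\sqrt{\frac{\sum_{k=1}^{K} e_{ik}^2}{K - A}}

Here t_{ia} is the score of row i on component a, s_{t_a}^2 is the variance of those scores over the training rows, e_{ik} is the X-residual of row i on variable k, K is the number of X variables, and s_0 is the pooled residual standard deviation of the training data. A large T-squared flags a row that is far from the center of the model plane (the flows moving together, but unusually far), while a large DModX flags a row that breaks the correlation structure itself (one variable off on its own, like the flow five and quality three examples here).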
Now, if we imagine it's a new day and we have everything closed, and I want to look at how our process is doing, I would just run this same script. Note that I could start it with   a specific slash slash and then an exclamation point in order to run automatically when the script is opened.   When I hit run, it's going to open a data table that looks just the same as our original table. It's got all of the same columns.   It's added those calculated columns, so let's put these scores in and all the predictions, and it also opened the model driven multivariate control chart platform   where we can see that this recent value for DModX is actually pretty high, so the most recent data has some abnormal values, in this case, for quality three.   So again, quality three looks like it's not at the expected value, based on the other parameters. In this particular case, that might mean,   since quality three is an input parameter and quality one, two and three are often related, that might mean that quality three is a bad lab sample or it could mean that this is a new material that we haven't worked with before.   Okay, finally, let's talk about one method to run this model continuously. So this was recreated on demand,   where we wrote a JSL script to run this, but sometimes it's beneficial   to...or we found it beneficial to write these results back to our PI historian so that they can be used by   the operators at our facilities. So in the last couple of minutes, I want to introduce that add-in, it's called the model engine add-in,   which will quickly score models online. I should note that this should be used for advice only. It's not an acceptable method to provide critical information or for closed loop control.   For that you might consider exporting the formulas from the formula depot and using some other software package to score them.   As mentioned earlier, some of the predictions and information available in this model   has most has the most value at the moment it's happening. So knowing what caused yesterday's problem is great, but knowing what's happening right now   means making decisions with current model results and it allows some problems to be fixed before they become a big deal.   Of course there's many ways to score models, but the power of JMP scripting language, JSL,   provides a way to get predictions and anomaly information in front of operators at our manufacturing facilities using their existing suite of visualization and trending tools that they're already used to.   A pair of model engines, or computers running JMP with a specific add-in that started the task scheduler, are set up to periodically run all the models stored in a specific directory.   All the configuration is done via two types of configuration files, a single engine configuration file and one model config file for each model that's going to be scored.   Let's start with that model config file. Remember how the PI tools add-in saves source information to the data table?   Now that same information can be used to populate the model config file, which tells the model engine how to download a table with a single row containing all of the model inputs that it needs to calculate values from.   Later, the scripting tools add-in quickly save the scripts to recreate columns saved from the PLS and any other platform,   potentially including T squared and DModX contributions that can be saved from the model version control chart platform. 
These new column scripts are also saved in the model config file, or in a separate file in that same directory.   Finally, the engine config file defines how the engine communicates with the data source. Where the PI tools add-in uses OLE DB and SQL queries to get data, the model engine uses the PI Web API to read and write data directly to PI.   By defining a set of functions in the engine config file, this engine can communicate with many other data sources as well.   Notice that a set of heartbeat tags is defined, which allows the data source and other model engines to know the status of this engine.   Each model also has its own set of heartbeat tags, so if one machine stops scoring a particular model, the other engine will automatically take over.   Again, this model engine idea is not intended to be used for critical applications, but I have found that it allows us to move very quickly from exploratory analysis and development to an online prediction or quality control solution.   With that, thank you all for attending. Remember that more information on each add-in and the journal I used today are available in the JMP Community.
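The talk leaves the PI Web API plumbing at a high level. Purely as a sketch of the read-score-write loop (not the add-in's actual code), a minimal Python model engine could look like the following; the base URL, WebIds, and the score() formula are placeholders, and the endpoint paths follow the PI Web API Streams controller as we recall it, so verify them against your own PI Web API documentation.

    import requests

    # Conceptual read/score/write loop; advisory use only, never closed-loop control.
    BASE = "https://my-pi-server/piwebapi"   # hypothetical server
    INPUT_WEBID = "P0abc"                    # WebId of the input tag (placeholder)
    OUTPUT_WEBID = "P0def"                   # WebId of the prediction tag (placeholder)
    session = requests.Session()

    def read_latest(webid):
        # GET the most recent value recorded for a stream
        r = session.get(f"{BASE}/streams/{webid}/end")
        r.raise_for_status()
        return r.json()["Value"]

    def write_value(webid, value):
        # POST a new value back to the historian
        r = session.post(f"{BASE}/streams/{webid}/value", json={"Value": value})
        r.raise_for_status()

    def score(x):
        # stand-in for the prediction formula exported from JMP
        return 2.0 * x + 1.0

    if __name__ == "__main__":
        x = read_latest(INPUT_WEBID)
        write_value(OUTPUT_WEBID, score(x))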
Aishwarya Krishna Prasad, Student, Singapore Management University Ruiyun Yan, Student, Singapore Management University Linli Zhong, Student, Singapore Management University Prof Kam Tin Seong, Singapore Management University   There are several reasons for a flight to be delayed, such as air system issues, weather, airline delays, security issues, and so on. But interestingly, the most frequent reason for a flight delay is not about weather but about air system issues. The Federal Aviation Administration (FAA) usually considers a flight to be delayed when it is 15 minutes or more late in arriving or departing than its scheduled time. Flight delays are inconvenient for both airlines and customers. This paper employs dynamic time warping (DTW) techniques for 54 airports in the US. The study aims to cluster airports with similar delay patterns over time. In addition, the paper builds some explanatory models to explain the similarity between different airports or distances. In this analysis, we aim to use the time-series techniques to discover the similarity in the top 15% busiest American airports. This paper first filters the top 15% busiest American airports and calculates the departure delay rate for each airport and then uses DTW to cluster these airports based on departure similarities. Next, the similarities and differences between clusters are identified. This analysis will help inform passengers and airport officials about departure delays at 54 American airports from January to June 2020.      Auto-generated transcript...   Speaker Transcript ZHONG, LINLI _ Okay let's get started. Hi, everyone. This is the poster of time series data analysis of flight delay in the US airports from January 2020 to June 2020. We are students of Singapore Management University. I'm Linli. YAN Ruiyun I'm Ruiyun. Aishwarya KRISHNA PRASAD And I am Aishwarya Krishna Prasad. Now let's quickly dive in to the introduction of our project. Over to you Linli. ZHONG, LINLI _ Thank you, Ash.   In the left hand side, we can see that there is a line chart. This shows the annual passenger traffic at top 10 busiest US airport and...   in in the...from the graph, we can see that the number of the passengers in each airport experienced a sharp drop. This is because the passengers in airports showed the response to the spread of the COVID-19 in 2020.   And for our analysis, we would like to discover the delay similarity of top 15% of airports in America from features of the delay and geographic location.   time series, dynamic time wrapping, exploratory data analysis.   The time series and DTW are employed to find out the similarities between the clusters, based on the departure delays. EDA is used to draw the geographic map. Okay, let's go back to the data set.   Thank you, this is the data set.   Actually, our data set comes from the United States of Department of Transportation and from our   data preparation in the left hand side, this is the process of our data preparation. We firstly imported the csv file into JMP Pro 16.0.   And then we remove the columns and values which are not really useful for our analysis.   And after that, for the data transformation, we summarize the data for airports from different cities, and then we filter out the   top 54 airports, which is based on the total number of the fights in each airports and calculate the rate of the delay.   
And after the data preparation, we save this file as SAS format and we import the SAS format into the SAS Enterprise Miner 40.1 for our further analysis,   namely the DTW analysis and time series analysis. After DTW process JMP Pro 16.0 was used again by finding out the singularity of different clusters and draw geographic maps.   And this is the introduction for data set. Let's welcome my partner to introduce more about our analysis. Aishwarya KRISHNA PRASAD Thank you, Linli. Now let's dive into the time series and cluster analysis. So we did the time series and cluster analysis using the SAS Enterprise Miner.   So this graph is one of the outputs that we obtained using the DTW nodes in SAS Enterprise Miner. So in the X axis, you can see that, you know, it contains the months from January 2020 to June...to July 2020.   And in the y axis, we can observe that there's a percentage of delays in the flights that we have included in our data set.   Now we can see that there is a sharp spike in the early February and in the late June, which seems to be strongly correlated with around the holiday periods of USA.   But, in general, other than these two spikes...major spikes, we can also see a steady decrease in the number of flight delays in general.   We then performed a time series clustering based on hierarchical clustering and the constellation plot of the same can be observed over here, using SAS. And we chose that...we felt that the number of clusters (7) is the most optimal number of clusters for our analysis.   Now, these are the clusters that are formed by using the TS similarity node of the SAS Enterprise Miner, so let's just take a...quickly take   the instance of Cluster1. So in this Cluster1, it contains mostly the international airports in the US.   So some of these airports are the Denver international airport, the Kansas City international airport, the Washington international airport,   just to name a few. So the delay in these airports are pretty large, as you can see, and this can be attributed to, you know, because this is located in the city that is frequented by tourists.   So similarly, the remaining clusters are formed by this similar behavior of the delays that are experienced in the flights.   Now the clusters that were generated in the previous step was then fed into the JMP Pro, and using the Graph Builder functionality,   we were able to build these graphs. So this graph contains the causes of delays in each of the clusters. So in over here, we can clearly see   which causes of delay is more prominent in each cluster. So for example, as you can see for Cluster1,   the late aircraft delay, that is, the delay caused by the previous flight to the current flight is more prominent compared to the rest. And the same queue follows for the rest of the clusters.   But if you see this cluster, right, so although this visualization in SAS is pretty intuitive,   we felt like for a...for a data set with large number of points, or more number of airports in our case, it would be quite difficult to analyze. So I'm just calling upon my peer Ruiyun to present another approach to analyze the clusters. Over to you, Ruiyun. YAN Ruiyun Okay, geographic location is another part that we focused on. The clusters were formed in SAS then we used Graph Builder feature in JMP Pro 16.0 to generate this map to   show where the different airports are located by cluster. Obviously airports from western and middle US are only included in Cluster 1 and Cluster 3.   
And these two clusters show that cluster is not distributed in a specific region.   Cluster 2, Cluster 4, Cluster 5 and Cluster 6 demonstrate an aggregation of airports with specific region.   Airports from Cluster 2 are mainly concentrated in eastern United States, while the Cluster 5 and 6 are more likely contain the airports of some   tourist attractions, such as Houston, Phoenix, Baltimore, and Honolulu, which are the largest cities of Texas, Arizona, Maryland and Hawaii.   Even more to the point, Phoenix Sky Harbor International Airport is the backbone of national airlines and southwest airlines. That's   one of the key transportation hubs in the southwest America. In addition Cluster 7 is a particular case, as it just has one airport, San Juan airport from Puerto Rico.   We surmised that because of the special geographical location, any flight departing from San Juan airport has a long distance to travel.   And that's all about the geographical analysis and now my partner Aishwarya will give us a conclusion. Aishwarya KRISHNA PRASAD So in conclusion here, we tried exploiting the ease of usage of the DTW nodes in the SAS Enterprise Miner and also the sophisticated visualization and pre processing techniques in JMP Pro 16.0 to perform our time series analysis for our flight data.   So we performed the dynamic time clustering for 54 airports. And these airports were formed into seven clusters, based on the delay patterns during January and June 2020.   We observed that the carrier delay is mostly the main reason for delay in each cluster, while the late aircraft delay is not very far behind on being a major cause of delay in most of the clusters.   As part of the future work, one can include the COVID data points to improve this analysis further and also discuss the correlation between the delay and the cancellation rate of flights.   Thank you so much for listening to us. I hope you liked our presentation.
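The clustering in this poster was done with the DTW and TS Similarity nodes in SAS Enterprise Miner; for readers who want to see the mechanics behind the method, here is a minimal generic sketch in Python (our own function names and toy data, not the authors' pipeline) of pairwise DTW distances feeding a hierarchical clustering that is cut into seven groups.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def dtw_distance(a, b):
        # Classic dynamic-programming DTW between two delay-rate series.
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def cluster_airports(series, n_clusters=7):
        # series: dict of airport code -> 1-D array of delay rates over time
        codes = list(series)
        k = len(codes)
        # condensed pairwise DTW distance matrix for scipy's linkage()
        dist = [dtw_distance(series[codes[i]], series[codes[j]])
                for i in range(k) for j in range(i + 1, k)]
        tree = linkage(dist, method="average")
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        return dict(zip(codes, labels))

    # Example with made-up data for three airports:
    # clusters = cluster_airports({"ATL": np.r_[0.2, 0.3, 0.1],
    #                              "ORD": np.r_[0.25, 0.35, 0.15],
    #                              "JFK": np.r_[0.05, 0.6, 0.4]}, n_clusters=2)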
Zoe Toigo, Signal and Power Integrity Engineer, Microsoft Priya Pathmanathan, Senior Signal and Power Integrity Engineer, Microsoft Martin Rodriguez, Power Integrity Engineer, Microsoft Doug White, Principal Signal and Power Integrity Engineering Manager, Microsoft   The ever-increasing complexity of computer systems demands that electrical power be delivered efficiently to the chip. The design challenge of a power delivery network (PDN) is to provide stable, low-noise voltage through low-impedance paths, which influence overall system performance. Accurate models of a proposed PDN are necessary for initial system architecture decisions and continue to drive layout requirements as the physical design matures. One portion of the PDN design process involves creating a model of the chip's package in a 3D electromagnetic field-solver tool (HFSS). Complex S-parameter models from FEM (Finite Element Method) field solvers are often simplified to circuit element approximations. Previously, input parameters to a two-dimensional circuit approximation of the package were manually fitted until the circuit matched the 3D model. However, custom DOE and response surface fitting in JMP reduced the number of experimental simulations and the development time for model creation and correlation. The prediction profiler revealed the polynomial relationship between 12 factors and six responses. Desirability functions were utilized to determine the values of the factors required to obtain the desired responses. Using this data, predicted responses were correlated to circuit simulations.     Auto-generated transcript...   Speaker Transcript Zoe Toigo Hello, my name is Zoe Toigo. I'm a signal and power integrity engineer at Microsoft, and my project is titled Power Delivery Network Model Prediction and Correlation. The power delivery network for a computer chip consists of all the interconnects, from the voltage regulator module to the pads on the chip and the metallization on the die that locally distributes power and return current. Because it interacts with the whole system, its quality is vital to overall system performance. Design of the PDN is ongoing throughout the entire product design cycle. Early on, we can create models of proposed architectures and give feedback on how they would impact the system. And then, once the design is further refined, we begin an iterative cycle of working with hardware development to refine our models to match their performance and to also provide requirements for the next revisions of the physical design. Earlier this year I was working on modeling a portion of the power delivery network, the chip package. Because this was done in a finite element method (FEM) field solver, HFSS, small changes to the model take a long time to simulate, and so we created a 2D circuit approximation of this model. Because we weren't seeing good fit between the two different models, we turned to JMP to improve this process. We started by creating a 12-factor custom design of experiments, where our 12 factors were values of lumped elements in the circuit, such as resistors and capacitors. The table generated by the DOE was used to run batch simulations of the circuit, and then from each of those simulations we extracted values at six ports on the network, which became the six responses of the DOE. After all of that was finished, we fit the model using least squares, and for each of the responses we saw an R-square fit between 97 and 99 percent.
So we are confident using this model going forward to correlate the 3D and 2D packages. With the prediction profiler, we also applied desirability functions so that we could quickly get to the values of the circuit that would match our 3D HFSS model. For future use, the prediction profiler has the added benefit that it can be tweaked to show the dependencies between the factors and the responses for small changes. This work would not have been possible without the help of my team members, and some theoretical concepts are leveraged from Eric Bogatin's book on power delivery networks. Thanks so much.
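The analysis here used JMP's custom DOE, least squares fitting, and the prediction profiler; as a loose stand-in with synthetic numbers (every array shape and value below is invented, not taken from the project), the fit-a-quadratic-response-surface-and-check-R-square step looks roughly like this in Python:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    # Synthetic stand-in: 150 DOE runs of 12 coded lumped-element factors and
    # 6 responses (e.g., quantities extracted at six ports). Real values would
    # come from the batch circuit simulations.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(150, 12))
    Y = X @ rng.normal(size=(12, 6)) + 0.05 * rng.normal(size=(150, 6))

    quad = PolynomialFeatures(degree=2, include_bias=False)  # main effects, interactions, squares
    XQ = quad.fit_transform(X)
    rsm = LinearRegression().fit(XQ, Y)
    print("average R-square across the six responses:", round(rsm.score(XQ, Y), 3))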
Suling Lee, SMU, Singapore Management University   COVID-19 vaccines play a critical role in the attempt to assuage the global pandemic that is causing surges of infections and deaths globally. However, the unprecedent rate at which it was developed and administered raised doubts about its safety in the community. Data from the United States Vaccine Adverse Event Reporting System, VAERS, has the potential to help determine if the safety concerns of the vaccines are founded. As such, this paper uses the combination of both structured and unstructured variables from VAERS to model the adverse reactions to COVID-19 vaccines. The severity of the adverse reaction is first derived from the variables describing the vaccine recipient outcome following a reaction from the VAERS data sets. Next, unstructured data in the from of text describing symptoms, medical history, medication, and allergies are converted into a Document Term Matrix and these combined with the structured variables helps to build a model that predicts the severity of the adverse reaction. The explanatory model is built using JMP Pro 16 using Generalized Regression Models and Binary Document Term Matrix (DTM), with the model evaluation based on RSquare value of the validation set. The optimal model is a Generalised Regression model using the Lasso estimation method for Binary DTM. The key determinants contributing to the adverse reaction from the optimal model are number of symptoms, period between vaccination onset, how the vaccine are administered, age of patient, and symptoms related to cardiopulmonary illness.       Auto-generated transcript...   Speaker Transcript Peter Polito How are you doing today. Can you hear me. If you are speaking, I am unable to hear you. Hello. test test. Oh Hello Leo can you hear me. yeah sorry about that I think of the technical difficulties yeah. Peter Polito Oh no problem at all. Oh okay it's a nice Jimmy. Peter Polito Oh, where. Are you calling in from. Singapore. Peter Polito Singapore all right, how how late, is it there. About 909. Peter Polito yeah well. yeah kudos to you for. hanging on. so late thanks for making it. Possible no it's all right um yeah I hope I do a good one. Oh. No i've been like high stress about this Oh well, yeah i'm Okay, let me just put on a virtual background. Okay. Peter Polito And I gotta go through just a couple things on my end before we officially start. Okay. Peter Polito just give me a. minute here. yeah. Peter Polito I only bring in my checklist make sure I do everything correctly here. alright. So just to confirm. You are soothingly. yeah and your talk is titled a model for coven vaccine adverse reaction. Yes, is that correct yeah that's right. Peter Polito All right, and then just to make sure you understand this is being recorded for the use and jump discovery summit conference will be available publicly in the jump user community do you give permission for this recording and use. Yes, okay great. your microphone sounds good, I don't hear any background noises is your cell phone off and all that kind of thing anything that might make some random noises. hang on. i'll send it will find them yeah. Peter Polito Okay, and then. We need to check the can you go and share your screen and we'll go through and check the resolution and a few other things. Okay sure. Peter Polito Thank you. i'm sorry, is it Lisa ling or soothingly. My first name issuing nicely yeah. Peter Polito got it okay. um is it okay. 
Peter Polito I don't see it yet oh um you know pie covering it just a moment. That looks good. And if you go to the bottom of your screen does your taskbar actually I don't see your taskbar so we're good. Okay. Peter Polito And let me make sure. It are any programs that might create a pop up like outlook or Skype or any of those are those all closed down and quit. um yeah, I think, so I bet the checking on them. close my kitchen. Okay. yeah good. Peter Polito Okay, and then, are you going to be working just from a PowerPoint or you be showing jump as well. I was worried of the transiting, so I will be destiny it from PowerPoint. Peter Polito Okay, great, then I am going to mute and turn off my camera we are already recording so as soon as we. As soon as you see my picture go away go ahead and start and i'm not going to interrupt for any reason and we'll try and go through it's a 30 minute presentation so let's go through, and I won't even be here it'll be like you're talking to yourself. Okay, so that the Minutes right okay. Peter Polito Are you ready to go. yeah. Peter Polito Okay. All right, and it's just so you know when we actually. Have the discovery summit, if you realize tomorrow that you misspoke or you wanted to present something in a slightly different way. You can be live on the when your presentations going, and you can ask the person presenting a deposit and then you can say you know i'm about to say this, what I what I wanted to convey is that you can kind of like. edit in real time during the presentation so don't stress about getting every word perfectly just relax and and go through it and and i'm sure will be just fine. All right. yeah. Peter Polito All right, i'm gonna mute and turn off my camera and then you go ahead and begin okay. Thanks Peter. Hello, I'm Suling. I'm a master's student at Singapore Management University where I'm currently pursuing a course in data analytics at the School of Computing and Information Systems. So I'm actually here today to present an assignment that has been submitted for my master's in IT for Business program and, more importantly, I want to share my JMP journey so far. So I started using JMP this year and I really fell in love with it because of the ease of use and the range of statistical methods and the visualizations that I could do on it. So, as the beginner using JMP I'm really honored to be here presenting my report and do let me have your feedback, because I feel that I have to so much more to learn yeah. So the motivation for my paper was actually to look at the COVID 19 vaccines, so we know how important they are but at the unprecedented rate at which it was developed and administered has raised some doubts in the community regarding its safety. So we are using data from the United States vaccine adverse event reporting system, yes. So we are using data from there because we find that there's a potential to help determine if the safety concerns on the vaccine are founded. So this paper makes use of both the structured as well as the unstructured data from VAERS to model, the adverse reaction of COVID-19 vaccines. So what is VAERS? So the Center of Disease Control and Prevention and the US FDA have had this system, and it is actually a adverse event system where it collects data. But generally what we see is that VAERS data that cannot be used to determine causal links to adverse events because the link between the adverse event and the vaccination is not established. 
So what we actually see here is that you have people who are reporting, but there, they are people who are reporting the events, but then there is no full of action that is to confirm that these symptoms and events that are reported, are there any link to the vaccine. So why do we still want to use this data? So firstly the data is available and public domain. The data is up-to-date and, more importantly, not all adverse events are likely to be captured during the clinical trials due to low frequency. So usually for clinical trials, they include only the healthy individuals. So special populations, like those with chronic illnesses or pregnant women, these are limited so the they know that VAERS is an important source for vaccine safety. So for more information regarding it, you can look up this link over here. Yeah so the data set used for this study comes from tree data tables that extracted from VAERS. The first one is the VAERS data. It mainly contains information about patients profile and the outcome of adverse events, so what I have here is a little clip from JMP where we have here the symptoms text and you can see that this is just one report based on one person, one one patient. Okay, and the data is quite dirty. There is a lot of useful information in the narrative text, but you can see that there are spelling errors, typos, excessively long or even like a very brief statements. So the next two data sets that we have is the VAERS of vaccination data, as well as the symptoms. So one contains information regarding the vaccine, the other one is extracted from the symptoms text that we can see. Okay so given this accessibility, actually VAERS data has been mined quite a lot by the by quite a number of researchers, but, as you can see that the data is actually very challenging to use as the quality of the report varies. And there's also something that might not be genuine. So review of the power, which shows that some form of manual screening is usually employed to extract the required information. However, this is also quite labor intensive and quite difficult, so this paper aims to showcase the methods to extract the key information using text analysis techniques in JMP and try to do an explanatory model to explain the most important variables involved in this event. So what we did is that for each of these data tables over here we clean them individually and then join them using the VAERS ID. What we did based on the patient outcomes was to derive something that's called a severity rating, I'll talk a little bit about this a bit later. So once the tables are joined there are four narrative texts. One on the allergies, medication, medical history and symptoms. And then we will use text analysis techniques to extract the vectors for the top terms that...will explain the severity rating for each of the text data. And join them in the existing spreadsheet data structured variables on the data set. Okay, and then all this is compiled together and then you put it into model building. Okay so what is this the severity rating all about? It's based on the patient's outcome, the VAERS data has 12 variables that describes the status of the patient. And then, based on this, we have extracted the variables and try to make sense of it, so we came away four levels of severity and then we call this the severity rating. So next we will talk a little bit about how we use JMP Pro Text Explorer platform for text analysis and we start off with the data cleaning. 
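Before the text-cleaning walkthrough that follows, here is a compact sketch of the join-and-derive step just described: merging the three VAERS extracts on their shared ID and turning the outcome flags into a four-level severity rating. The file names and column names follow the public VAERS data dictionary as we recall it, and the exact mapping to levels is this paper's own derivation, so treat the snippet as illustrative rather than the author's code.

    import pandas as pd

    data     = pd.read_csv("2021VAERSDATA.csv", encoding="latin-1", low_memory=False)
    vax      = pd.read_csv("2021VAERSVAX.csv", encoding="latin-1", low_memory=False)
    symptoms = pd.read_csv("2021VAERSSYMPTOMS.csv", encoding="latin-1", low_memory=False)

    merged = data.merge(vax, on="VAERS_ID").merge(symptoms, on="VAERS_ID")

    def severity(row):
        # assumed outcome flags; the four levels are one possible grouping
        if row.get("DIED") == "Y" or row.get("L_THREAT") == "Y":
            return 4          # most severe outcomes
        if row.get("HOSPITAL") == "Y" or row.get("DISABLE") == "Y":
            return 3
        if row.get("ER_ED_VISIT") == "Y" or row.get("OFC_VISIT") == "Y":
            return 2
        return 1              # no outcome flags reported

    merged["SEVERITY"] = merged.apply(severity, axis=1)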
So what we wanted to do was to really extract out the significant terms from the text data. And augment them to the structured variables to build your model. So as you can see, actually, the text data is quite quite messy so what we did was first of all, decide between using the term frequency, what kind of term frequency to use, and then the binary term frequency was selected, as the data shows that there's a significant advantage, of considering, of using it. So next a little bit about the cleaning that came in. So the the text data was first organized using the JMP Pro Text Explorer and we used a useful feature that is in there to add phrases and automatically identify the terms so what you can see it's like terms like white blood cells are kept as a phrase instead of being pulled into white blood and cells, which will not make much sense. And a few other methods as well to use. So one is the standard for combining which stemmed the words based on the word endings and then we also thought to sort the list alphabetically in order to recode like misspelled words or typo errors or what's that similar. Yeah and then the next thing was to use the very handy function to recode all the similar items together yeah. So the next thing we do after cleaning out the text was to look at the was the look at the workflow actually. The workflow is useful for stop word exclusion and to see the effect of the target variable on the terms. So what I did over here was to visualize the most frequent terms by the size and color it based on the severity rating. So you can see that the lighter colors belong to the less severe cases and darker ones are the other most severe ones, and you can like pick up, then the words is quite small. And it really shows that the common symptoms are not serious but we picked up terms by the cerebral vascular incident pulmonary embolism and things like that. So these are related to the most severe adverse event. The next thing we use the term selection, so the term selection is new feature JMP Pro 16 which, which was quite timely. So it is integrates the generalized regression model into text analysis platform so following from the text analysis platform, you can just select this where term selection is. And then, it allows the identification of key terms that are important to the response variable. So our response variable is the severity rating. So why use the generalized regression model? So it is widely used for non normally distributed or highly correlated variables. Where the data are independent of each other and show no other correlation. So this method over here is useful for us because it fits our our data set. And each role that we have inside our data table is a patient and all those are independent of each other. So, and then the most important thing is also that the generalized regression model allows for variable selection, so that is what we want to do because we want to pick up the variables with the highest influence on the response variable. Yeah so a little bit more detail about this regression model there's a few options that we can use over here that's the elastic net, as well as the lasso. So over here are the different thing about these is the lasso tends to select one term from a group of correlated factors, whereas the elastic network net will select a group of terms. So generally, I think that elastic net is used, and then over here that's our choice of the binary term frequency came in. 
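The curation itself (phrases, stemming, recoding) happens interactively inside Text Explorer. As a rough open-source stand-in for the binary document-term matrix that comes out of it, and continuing from the earlier merge sketch, one could write the following; the column name and settings are assumptions, and the n-gram range is only a crude proxy for the phrase handling described above.

    from sklearn.feature_extraction.text import CountVectorizer

    texts = merged["SYMPTOM_TEXT"].fillna("")        # narrative text, assumed column name
    vectorizer = CountVectorizer(binary=True,        # binary term frequency
                                 stop_words="english",
                                 ngram_range=(1, 3),  # rough proxy for phrases like "white blood cells"
                                 min_df=50)           # drop very rare terms
    dtm = vectorizer.fit_transform(texts)             # sparse 0/1 document-term matrix
    terms = vectorizer.get_feature_names_out()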
Okay, so this is the result of the term selection, so you can see that over here that shows you the overview of the (???) and then generalized (???) but more interestingly when you started by the coefficient you can see that these are the top positive coefficients So these are top factors that plays the biggest role in terms of our response variable. And this one over here are the symptoms that plays least role when sort that according to the coefficient. So looking at the results, you can see that cardiac arrest and COVID-19 pneumonia, cerebral vascular accidents just all the terms that affects the response variable. So we can see that terms of more serious nature are related to the heart and lungs okay as versus the more low frequency ones right, which are very, very mild symptoms, really. Okay, so we repeat this whole process for all of the other for all of the other text variables, so we have gone to the the example for symptoms, so there's also the allergies, the medical history as well as the medications that are used, so what we did later was to save the document term matrix. Okay, which is basically the DTM is saves a column to the data table for each time. So you can see over here an example, you mainly a lot of zeros because it's a very sparse matrix so one will indicate the presence of let's take this column over here one will will indicate the presence of (???).. So we save it and repeat the process for the other text analysis. And then, once we have all these terms saved up we moved on to modeling. So therefore modeling was to build in kind of like a validation column so over here, we went to predictive modeling and make validation column. So over here we selected the choices so put it as validation set up 55%. And the whole thing over here was to identify the important variables with severity as a response variable So all in all we have seven structure variables and 55 that were derived from document term matrix and a total of about 31,000 rules. And what we see is that there's an imbalance there because of a severity rating you get an unbalanced data set. So because of this, we done our model evaluation on comparing the R split and the AIC values. So. We use the fit model in JMP so and choosing the generalized regression model again. And we can see that these are the results here so separate models you think the group of generalize linear models, using the penalized regression techniques were prepared. And then we try to fit based on the various characteristics over here, these are all the other other the penalized estimation methods. So of all the models, we can see that the lasso method has the lowest sorry has the highest Rsquare value, and there are other values that quite close as well, so we are going to take a closer look at them. So comparing the maximum likelihood model, as well as the lasso model based on the ROC curve, you can see that actually both of the ROC curves are quite similar. And however the ROC curve for the maximum likelihood model shows that it has the highest severity. Sorry, ROC curve for the maximum likelihood model shows that the ROC value is higher for highest severity rating, and you can see that it's only a slight difference here between both of these. And in general as as you go down the severity rating the area actually do increase and one of the reason is because our data set is very unbalanced. 
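The fits above were done with JMP Pro's Generalized Regression personality and its validation column. A crude stand-in, shown purely to illustrate the lasso-style variable selection on the binary DTM plus a couple of structured columns, is an L1-penalized logistic regression; the column names, the roughly 55% validation split, and the penalty strength are all assumptions, and the class imbalance discussed here would deserve more care in practice.

    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    structured = merged[["AGE_YRS", "NUMDAYS"]].fillna(0).to_numpy()  # assumed columns
    X = hstack([csr_matrix(structured), dtm]).tocsr()
    y = merged["SEVERITY"].to_numpy()

    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.55, random_state=1)
    lasso = LogisticRegression(penalty="l1", C=0.5, solver="liblinear", max_iter=2000)
    lasso.fit(X_tr, y_tr)

    # terms whose coefficients survive the penalty for at least one severity level
    names = np.array(["AGE_YRS", "NUMDAYS"] + list(terms))
    kept = names[np.any(lasso.coef_ != 0, axis=0)]
    print(len(kept), "variables retained; validation accuracy:", lasso.score(X_va, y_va))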
So the severity rating of four, which is the highest level, the most of your level is only about 5% of the total data values okay so overall this actually very little difference between both of these so we choose the one with a slightly larger area. Okay, so our next be turned on to the effects test. So into the report, you can choose to see the effects test, so the effects test is the hypothesis test of the null hypothesis that the variable has no effect on the rest. There was this very nice explanation of the effects test on the JMP Community, I think it was contributed by Mark Bailey. So he talks a bit about how the effects test is actually based on the Type III sum of squares for ANOVA. So we can see that the effects test is very suitable over here because of our data set so it actually tests every term in the model in terms of every term in it. So the main effects are tested in a lack of the interactions between the items and in the light of the other terms, in the light of the other main effects as well. So what do you want to use here is that we see over here is that the effects test is useful for our purpose, as it is for model reduction. And, and it allows us to draw inference of the long list of significant variables. We look at the probability at ChiSquare (???) lowest ChiSquare value taking a cut off alpha value of 0.1. We have a number of independent variable so that's quite a long list of them, and most of them, as you look through most of them actually related to the cardiopulmonary illnesses. So some of them are the effected ones like the number of symptoms, the number of days between the vaccination onset. (???) is the more in which the vaccine centers that by each and then you can see that the rest of them are related somehow another to. cardiopulmonary illnesses, there are some strange ones that I don't come from a medical background, so I don't really understand it either, but you can see that deafness is one of them, so there are some strange results that we can see over here, but in general that's, the picture is that, in terms of the top variables in terms listing variables. Okay besides this, right, what is really interesting is look at the model evaluation so even though what we're doing is to build an explanatory model, I went to look into the predictive model as well because JMP has very nicely put report over there for me to look at the parameter estimates so So I use the Profiler to try to understand the parameter estimates and you can see over here that the values shown are really, really small, so this is the value that you get immediately when you open up the Profiler, so the values here are the average of each variable and you can see that each of these variables Of the each of these values here actually very small, so it means that there's very little effect on the severity based on these coefficients. Based on these as a coefficient of the predicted variables. So what you can see that based on this study over here, you can tell that actually (???) symptoms and its effect has very, very little effect on severity and this is kind of like. kind of a within expectation to see that most of these symptoms and effects, because we are looking at the general picture of the vaccine, we can see that most of these symptoms - medical history, allergies - have very little effect actually on the outcome of the of the vaccine. Okay so. yeah. So a little bit of a conclusion, a few statements as a concluding statement. 
Several decisions were made in the grouping and classification of variables. And although these variables were made to the best of our understanding, especially in the way in which we came up with a severity rating, We perhaps need an expert familiar with vaccines studies or clinical trials to be consulted as to whether or not the severity rating is sufficient to to score the adverse events outcomes. And based on the model building of structured and unstructured data we have identified key factors that varies with the severity in a reaction to a COVID-19 vaccination. However, we're still not the effect of these key variables on the response variable severity is very small, so this is seen by looking at the variables. And then, finally, the document term matrix based on the binary ratings, the binary term frequency was found to be the most effective in representing the weights other terms in the document. And the generalized linear model with the lasso penalized regression technique produced the optimal model. So I hope you enjoyed the very short presentation and do let me know if you have any questions or any feedback, thank you very much. Peter Polito great job. was very. Oh no. I just realized that i'm you know mistake we wanted this life oh gosh. Peter Polito So that this is the exact situation where well you're. So at the actual discovery summit there's going to be a presenter. And so they're going to reach out to you, ahead of time and you just say, I have a mistake on one of my slides, and so in this part comes up it'll just pop he or she will pause it. And then you can share the slide and talk about it and then go right back to the video so don't worry about it at all. This happened quite a bit during last year's discovery summit is not a problem at all. Okay okay. Well gosh oh James. Peter Polito Would you like to fix it and redo it would that make you feel better. I don't know. If it's actually this one here, because this is the wrong box, it will take me a while to actually fix it because I need to retype it over yeah. Peter Polito yeah then it didn't know don't worry about it it'll be a real easy fix and you can do it in real time. Okay okay. Okay yeah. Thanks for sitting, through it, though. Peter Polito yeah no problem is great, I really. Okay. Peter Polito All right, any other questions or comments or anything. yeah i've got one is that um so what's up a link, where I can upload all of my slides and my people and things like that, but there was a mixup with my. email, and I think from tanya about that that when tanya replied me right, I think she missed out on that link, so I thought that the link will be embedded inside one of the recording but I don't suppose you guys got it right. Peter Polito I don't have it, but I will reach out to tanya and asked her to reach out to you directly. To help remedy that. Okay sure thanks very much I think that makes up with my email yeah. Peter Polito Thanks very much. No problem alright well have a have a good night and good rest. yeah you have a good day. Peter Polito Thank you. So much bye bye.  
Amanjot Kaur, Statistician, Perrigo Rob Lievense, Senior Systems Engineer, JMP   Formulation scientists put forth significant hours of work attempting to find an extended release formulation that matches drug release targets. For generic (ANDA) products, the primary objective is to match critical quality attributes (including dissolution over time) to the reference listed drug (RLD). This paper illustrates how functional DOE is an extremely robust and easy-to-use technique for optimizing to a target dissolution profile rapidly with fewer resources. Dissolution data is collected over time during development and in-depth analyses are required to understand the effect of the formulation/process variables on the product performance. Regression models for specific time points (e.g., 1, 2, 6, and 12 hours) have been typically used to correlate the release responses with the formulation/process variables; however, such analyses violate the assumption of independence for the responses. There is great need for robust statistical tools used to determine the levels of inputs needed to get the closest profile of the developmental product to the reference drug. Functional DOE in JMP for model dissolution is a new and critical tool to use within a drug development program.     Auto-generated transcript...   Speaker Transcript Bill Worley or.   Is.   It is.   Thank you so much for tuning into today to watch Rob and I's Discovery presentation. Today we are discussing the topic, a new resolve to dissolve,   which is modeling drug resolution data using functional DOE. First of all let me introduce myself. My name is Amanjot Kaur. I am statistician with Perrigo Company.   And my co author, who's joining me here, is Rob Lievense. First of all, we'll talk about...I'll give you a little bit introduction about the topic that we are discussing today,   followed by introduction to the dissolution testing and why it is important. And then we'll discuss and compare the previous methods that we used   for finding the best input to match the target dissolution profile and compared it with functional DOE that is now available in JMP Pro.   Like I mentioned, I work as a statistician with Perrigo, which is a pharmaceutical company and very, very good...has a large share...market share of over the counter drugs and generic drugs.   We regularly deal with the solid dosage form, which are developed into new products.   Formulation scientists in these pharmaceutical companies, they put in a lot...a lot of time and effort in attempting to find candidate formulation to match target dissolution profile of an extended release   tablets. These days extended release tablets are really...they are more...   more in demand, as compared to immediate release, as you can see in this slide, that if you're taking an immediate release tablet you're in 24 hours, you will be taking   eight to 10 tablets, and as compared to extended release tablets, you will be just taking two tablets in 24 hours. So the extended release tablets are preferred over immediate   release tablets. When I say target dissolution profile that can be a currently marketed drug known as orallly or reference listed drug, or it can be a batch that is used for clinical study, known as a bio batch as well.   And the data that is collected during the formulation development of generic products, it is all submitted to FDA in NDA (abbreviated for new drug application), which is really common in my work.   
The main primary product objective during these this formulation development is to match all the critical quality attributes, including dissolution over time to the ???.   So let's take a minute and see...learn little bit about dissolution testing and why it is important.   When we take any medication what happens in the human body is that solid dosage form, those will release the active drug ingredient, and it will process...and will process the drug out of the body in a given rate. This is shown in the clinical studies.   That a peak results when the maximum amount of drug is present in the blood, how long that those are sustained and the eventual decline of the drug in the bloodstream as it is excreted out of the body.   The laboratory methods that are utilized to monitor the quality of the product, they do not have the same mechanism, but they try to replicate as much as   to the human body. And there are multiple techniques that are used, however, all of...all them typically involve the release of the active drug ingredient   in media, and that is measured as a percentage of the total dose. In the extended release formulas, they utilize formulation scientists, they utilize materials and   processing methods to assume that a specific amount of drug is released quickly enough to be effective, with the slow release required over the time to maintain the drug level against the rate of the excretion.   Now that we know about dissolution testing, now we can take a look at the methods that we used previously to analyze this dissolution data.   Similarity profile...similarity of profiles that is...that was done by graphing average results of candidate batches to the product. We usually use two methods. The first one is F2 similarity criteria,   which is used to compile the sum of square differences of percent released in media of multiple...for multiple time points.   Scientists typically they rely upon their principles and experience to create trial batches that will hopefully be similar to the target, so it's hit and trial method.   In this method a value of 50, or higher than 50, that is desirable to indicate that batches...the batches are, at most, plus or minus 10% different from the target profile at the same time points.   The second approach that we use is more advanced approach that came about through utilization of quality by design in pharmaceutical industry. That is multivariate least squares model that comes from designed experiments. So,   as you may know, least squares methods create an equation for how each input...input influences the dissolution output   for designated time points. So for extended release formulas we typically look at one hour, two hours, six hours and then 12 hour time release.   The prediction profiler that is available in JMP that provides the functionality to determine the input settings to obtain the comparison of the best results for all the solution time points.   So now question is, if we have these capabilities, why we are looking at the new approach?   The problem with these methods that we have today is that F2 similarity trials and multivariate least squares, both methods they treat the   time points of the dissolution profile as independent outputs, and we know that the release of those at one hour that will affect results at the later time points as well.   
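For reference, the f2 similarity criterion mentioned above is the standard regulatory formula (quoted here in its usual FDA/EMA guidance form, not from the slides):

    f_2 = 50 \cdot \log_{10}\!\left( \left[ 1 + \frac{1}{n}\sum_{t=1}^{n} (R_t - T_t)^2 \right]^{-1/2} \times 100 \right)

where R_t and T_t are the mean percent dissolved of the reference and test products at time point t and n is the number of time points; f2 of 50 or higher corresponds to an average difference of no more than about 10 percent at each time point.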
That's why we need a new approach, and, secondly, the functional DOE, it will treat all the time points as dependent time points, and it is an extremely robust and easy-to-use technique to optimize our target dissolution profile rapidly just with few resources.   Let me just quickly show you in one example of development project using multivariate least squares regression method. So, as you can see here in this data table, this is...this was a DOE created for one project, and we have 12   batches (12 batches?) 12 batches in here. The first, the main compression force, polymer A and B there, these are the three input factors, and   different time points at 60 minutes, 120 minutes, 240 and 360. We have all the time points here. And, if we look at the least squares fit here, you can see, our main effects and interaction, they all are pretty significant and if you just scroll down to the   to the end of the report, you'll see a prediction profiler will...where it will give you...where we have all the setting....we have already set goals, what we want and we can maximize our   desirability and it will give you a setting showing you that this is the desired setting that you need to get your desired profile or a match to the target,   if you want to say. So this is what we get in least square fit. My former colleague, Rob Lievense, will show you functional DOE, which we believe is a way...much better way to optimize the formulation or process.   Thanks AJ. You did a really good job of explaining all the work that we did changing to a quality by design culture and getting to the multivariate models.   Now we're ready for the next step. I'm Rob Lievense and I am a senior systems engineer at JMP but I used to be in the pharmaceutical industry for over 10 years and wrote a book on QbD and using JMP.   So I want to show you this topic of functional data exploration, specifically functional data using a DOE.   This works really well for dissolution data.   For functional data analysis, we need to have the table in stacked form, that works best. So I have a minutes column, I have the dissolve for the six samples at each time point, I have the batch and I have my process inputs.   Now what I'm going to want to do is take a look at this first   as dissolution by minutes. So one of the things I can see is, here's my goal, here's all the things that I'm trying with my experimentation. It really becomes obvious that   these are dependant curves. Whatever happens in early time points has influence on later time points, so it really is silly to be able to try to model this by just pulling in the time points and treating them as independent.   We can utilize more of the data that comes from the apparatus in this way.   This helps us develop the most robust function; the more pulls we have, the better it's going to be.   So I'm going to run my functional data exploration here.   We have   the amount dissolved.   We have to put in what we have across X, if it's not in order...row order, but I have minutes, I'm going to throw it in there.   Batch is our experimental ID.   And then these inputs that change as part of our DOE, we're going to put in there as supplementary variables.   What JMP does is it looks at the summary data. We can see that the average function is this kind of release over time, which makes total sense to me.   I also can see that I have a lot of variability, kind of in that   60 to 120 minute time frame, which is fairly common.   
And I have some ability to clean up my data, but I happen to know this data is pretty solid, so I'm not going to mess around with that. What I do need to do is tell JMP which is my target, and my target is my reference listed drop.   Now I'm ready to run a model.   There are various models available, but I've used b-splines with a lot of dissolution data, and it seems to work very, very well.   What JMP is going to do is it's going to find the absolute best statistical fit.   This doesn't make any sense to my data, I know that my concentration of drug in media   grows over time. It never dropped, so having these inflection points within the sections that I picked this function apart make no sense. All these areas are knots, and this is how we break apart a complex function into some pieces, to be able to get a better idea of how to model it.   Well, I can fix this. I know that cubic and quadratic just make no sense, and I happen to know that six knots is going to work quite well, so I'm going to toss that in there. I can put as many as I want.   Now JMP still gives me those nine knots. I need to have some subject matter expertise here. I think I can do this in six. I can see I don't gain a whole lot of model fit by going beyond six.   But I do want enough saturation in this lower area, because this is where dose dumping might occur. This is where I'm really interested in determining   if I'm having any kind of efficacious amount in the bloodstream. So I'm going to set that update.   Now that one's not so great. I'm going to try again.   Alright, so I get a very reasonable fit for this setup and I've got my points really where I want them. And I take a look at that, that makes a lot of sense.   Now what JMP has done mathematically is it's seeing for this average function, there's an early high rate of increase, that's 83% of the explanation of the shape of this curve, what changes the shape of this curve, if you will.   And I also see there's about a 15% influence of a dip. And I can tell you this is likely due to the polymers; I have fast and slow acting polymers so that makes total sense.   And then we have another one that's maybe a very deep dive.   Now we can play around with this if we want, but I'm just going to leave this with the three Eigen functions.   Now this is my functional data analysis, the prediction profile we get is expressed in terms of the functional principal components, which can be somewhat difficult to interpret. We're going to move forward and we're going to launch the functional DOE.   We do that, we can see that our inputs are now the inputs to the process, so we can see how changes to these inputs have an effect on our dissolution.   But what we want to do is, we want to find the best settings. So what we can do is go into the optimization and ask JMP to maximize. And what JMP is going to do   is it's going to find the absolute minimum of the integrated error from the target RLD. That's going to be the absolute closest prediction to our target.   We can see that we have about 1,800 compression force, about 12% Polymer A and about 4% Polymer B.   And as we move these to different points, we can see   what the difference is from target, so this one has about 1.54 for 60 minutes.   And at 120, it drops to negative 1.2.   At 240, we get to negative .7. So it gives you an idea of how far off we are, regardless of where we are on the curve.   It's time for a head to head comparison.   
Since we have more points in our functional DOE, we're going to use this profiler to simulate, because I don't have the ability to make batches, but this is going to be as close an estimation as we can get. In the simulator, we can adjust to allow for what we see happening in the press controllers, as far as the amount of variation we see in main compression force, and we can adjust for some variation in Polymer A and Polymer B. Once we've done that, we can run five runs with the FDOE optimum settings and five runs with the least squares optimum settings, and see how they compare. The simulations allowed us to compare what is likely to happen when we run some confirmation runs. And the thing that we see is that the settings shown for the least squares model, which assumes independence, are really not the settings we need. We have some bias; we don't get a curve that is as close to the target as we possibly could get. When we use the FDOE, our optimum runs are very, very close to the curve, and you can see that our main compression and our polymers are quite a bit different between those two results. Thank you, Rob, for explaining that. So there are some considerations that we need to keep in mind when we're using these methods. First of all, a measurement plan must be established with the analytical or laboratory team to ensure that there are enough early pulls of media to create a realistic early profile. And when we say early profile, it's before 90 minutes or so. Secondly, the accuracy and precision of the apparatus must be established to know the lower limit, as very small amounts may not be measured accurately. And third, the variation of the results within the time points must be known, because high variability (more than about 10% RSD) may require some other methods. So, now that we have established this method, for the next steps we would like to establish acceptance criteria. We found that the model error for the functional DOE seems to be greater at the earlier time points. That may be due to the low percent dissolved and the rapid rate of increase, which creates the high variability. And the amount of model error is critical for the establishment of acceptance criteria. The cumulative contribution of the FPCs is likely too low for practical use, and the integrated error from target might provide evidence for acceptance. And last, creating a sum of squares for the difference from target at important time points could allow for F2 similarity to be used for acceptance. However, some more work is needed to explore this concept. Well, thank you so much for joining us, and we hope that this talk and this approach will be useful in your work. We're going to be hanging out for live questions, but we're very interested in your feedback on this method, especially any ideas on how to establish acceptance criteria.
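For reference, the F2 similarity factor mentioned above is conventionally computed from the reference and test percent-dissolved values at n shared time points as

\[
f_2 \;=\; 50\,\log_{10}\!\left(\left[1 + \frac{1}{n}\sum_{t=1}^{n}\bigl(R_t - T_t\bigr)^2\right]^{-1/2}\times 100\right),
\]

where \(R_t\) and \(T_t\) are the reference and test mean percent dissolved at time point \(t\); values of roughly 50 to 100 are typically taken to indicate similar profiles.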
Laura Castro-Schilo, JMP Sr. Research Statistician Developer, SAS   The structural equation modeling (SEM) framework enables analysts to study associations between observed and unobserved (i.e., latent) variables. Many applications of SEM use cross-sectional data. However, this framework provides great flexibility for modeling longitudinal data too. In this presentation, we describe latent growth curve modeling (LGCM) as a flexible tool for characterizing trajectories of growth over time. After a brief review of basic SEM concepts, we show how means are incorporated into the analysis to test theories of growth. We illustrate LGCM by fitting models to data on individuals' reports of Anxiety and Health Complaints during the beginning of the COVID-19 pandemic. The analyses show that Resilience predicts unique patterns of change in the trajectories of Anxiety and Health Complaints.     Auto-generated transcript...   Speaker Transcript Laura Castro-Schilo, JMP Hi, everyone. I'm Laura Castro-Schilo, and today we're talking about modeling trajectories with structural equation models. We're going to start this presentation by first answering the question of why we would use SEM for longitudinal data analysis. Then we'll jump into a very brief, elevator version of an introduction to SEM. If this is your first exposure to SEM, I strongly encourage you to look for some of our previous presentations at Discovery Summits that are recorded and available for you to watch, so that you can get a better understanding of the foundations of SEM. But even without that, hopefully this brief version will set you up to understand the material that we're going to talk about today. In that introduction, we're going to focus on how we model means in structural equation models. We're going to see that means allow us to extend traditional SEM into a longitudinal framework. Modeling those means will have implications for how our path diagrams look, and we'll also see how those diagrams map onto the equations in our models. We're going to focus specifically on latent growth curve models, even though we can fit a number of different longitudinal models in SEM. Then we'll use a real data example to show how we model trajectories of anxiety and health complaints during the pandemic. And at the end we're going to wrap it up with a brief summary, and I'll give you some references in case you're interested in pursuing longitudinal modeling and want to learn more about this topic.
Singer and Willett are two professors from Harvard's Graduate School of Education, and I think they said it best when they claimed in a popular textbook of theirs that SEM's flexibility can dramatically extend your analytic reach. Indeed, this is probably the most important reason why you might want to use SEM for longitudinal data analysis. Now, specifically when we're talking about flexibility, we're referring to the fact that you can fit a number of different longitudinal models in SEM that can be quantified in terms of fit and compared empirically, so that you can be sure that you're characterizing your longitudinal trajectories in the best possible way. There are a number of different models that we can fit; you can see them listed there, including things like repeated measures ANOVA, which can make some pretty strong assumptions about the data. SEM allows us to relax some of those assumptions and actually test empirically whether those assumptions are tenable. SEM is also really flexible when it comes to extending univariate models into a multivariate context. So if you're interested in looking at how changes in one process influence or are associated with changes in another process, SEM is going to make that very easy and intuitive. Now, we know SEM has a number of nice features, and all of those apply in the longitudinal context as well: things like the ability to account for measurement error explicitly, to model unobserved trajectories by using latent variables, and also to use cutting-edge estimation algorithms for when we have missing data, which actually happens pretty often with longitudinal designs. Another interesting feature is that it allows us to incorporate our knowledge of the process that we're studying. So we'll see that prior knowledge about what we expect the functional form in our data to be can be mapped onto our models in a very straightforward way. But there are also reasons why we should not use SEM for longitudinal analysis. I think, most importantly, the structure of the data is what might limit us the most. In SEM we're going to be required to have measurements that are taken at the same time points across all of our samples. So say, for example, we're looking at anxiety and we have three repeated measures over time. The structure of the data has to be like what I'm showing you here, where we might have anxiety at one occasion, and that's represented as one column, one variable in our data table, and then we have anxiety at a second time point and at a third time point. What this means is that everybody's assessment at the first time point has to have taken place at the same time, and that's not always the case. So there are going to be other techniques that are more appropriate if, in fact, your data are not time structured. We also have to acknowledge the assumption of multivariate normality. SEM might be a little robust to this assumption, but we still need to be very careful with it. And it's also a large-sample technique. So in that data table I just showed you, we really want to have substantially more rows than we have columns in the data, and this might not always be the case.
So just as a reminder if you haven't been exposed to SEM, and also a nice brief intro: in SEM, one of the most useful tools is the path diagram, which is simply a graphical representation of our statistical models. If we know how diagrams are drawn, then it'll be much easier for us to use them to specify our models and also to interpret other structural equation models. These are the elements that form a path diagram, and you can see here that squares or rectangles are used exclusively to denote manifest or observed variables in our diagrams. That's in contrast to unobserved variables, which are always represented with circles or ovals. Now, arrows in path diagrams represent parameters in the model. Double-headed arrows are always going to be used for variances or covariances, and one-headed arrows represent regressions or loadings. In the context of longitudinal data, there's another symbol that is really important, and that is the triangle. The triangle represents a constant, and it's used in the same way that you use a constant in regression analysis, meaning that if you regress a variable on a constant, you're going to obtain its mean. So we model means, and we put constraints on the mean structure of our data, by having a constant in our models. So let's take a look at a simple regression example. If you wanted to fit a simple regression in SEM, this would be the path diagram that we would draw. You can see X and Y are observed variables, we have X predicting Y with that one-headed arrow, and both X and Y have variances. In the case of Y, because it's an outcome, that's a residual variance. We also have to add the regression of Y on the constant if we want to make sure that we get an estimate for the intercept of that regression. So here, this arrow would represent the intercept of Y, and notice that we also have to regress X on that constant in order to acknowledge the fact that X has a mean. Now we can use some labels, so that we can be very explicit about which parameters these arrows represent, and then we can trace the arrows in the path diagram in order to understand the equations that are implied by that diagram. So let's focus first on Y. We can trace all of the arrows that are pointing to Y in order to obtain that simple regression equation: Y is equal to tau one times one (which is just the constant, so we don't have to write the one down) plus beta one times X, plus the residual of Y. Now we also can do the same for X, because in SEM all of the variables that are in our models need to have some sort of equation associated with them. Here we want to make sure that we acknowledge the fact that X has a mean, so we regress it on the constant, and it also has a variance. So again, those path diagrams are a way to depict the system of equations in our models. It's very important to understand that those diagrams also have important implications for the structure that they impose on the variance-covariance matrix of the data and on the mean vector. And I think it's easiest to explain that concept by actually changing the model that we're specifying here. Rather than having a regression model, what I'm going to do is fix all of those edges to zero.
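Written out, with the exact symbols chosen here only for illustration, tracing that simple regression diagram gives

\[
Y_i = \tau_1 \cdot 1 + \beta_1 X_i + \varepsilon_i, \qquad
X_i = \tau_2 \cdot 1 + \delta_i,
\]

with \(\mathrm{Var}(\varepsilon_i) = \psi_Y\) (the residual variance of Y) and \(\mathrm{Var}(\delta_i) = \psi_X\) (the variance of X), so the one-headed arrows carry \(\tau_1\), \(\tau_2\), and \(\beta_1\), while the double-headed arrows carry \(\psi_Y\) and \(\psi_X\).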
So all of these effects, I'm just going to fix them to zero, which is the same as just erasing the arrows from the diagram altogether, and now you can see how the equations for X and Y have changed. This is a very interesting model. It's simple, but it actually has a lot of constraints, because it implies that X and Y each have a variance but that their covariance is exactly zero; there's nothing linking these two nodes. It also implies that the means for both X and Y are exactly zero, because we're not regressing either of them on the constant in order to acknowledge that they have a non-zero mean. So now, if we really want to fit this model to some sample data, that means we have some sample statistics from our data. And the way that estimation works in SEM is that we're going to try to get estimates for our parameters that match the sample statistics as closely as possible, while still retaining the constraints that the model imposes on the data. In this particular example, if we actually estimate this model, we would see that we are able to capture the variances of X and Y perfectly, but the constraints that say the covariance is zero and the means are zero will still remain. And so the way in which we figure out whether our models fit the data well is in fact by comparing this model-implied covariance and mean structure to the actual sample statistics; we can look at the difference between those and obtain our residuals. These residuals can be further quantified in order to obtain metrics that allow us to figure out whether our models fit well or not. Okay, so that's our intro to SEM, and these are going to be the concepts that we use throughout the presentation in order to understand how we model trajectories with SEM. Now, what better way to start talking about trajectories than to imagine some data that actually have some trajectories. So I want you to think for a second: how anxious are you about the pandemic? Imagine that question had been asked of you early in 2020, when the pandemic first started. Perhaps a group of researchers approached you, asked this question, then came back a month later and asked you the same question again, and maybe came back a couple of months later and asked about your anxiety once more. So we might obtain these data from a sample of individuals, and the data would be structured in the way that is presented here, where each of those time points would be a different variable in the data. Now let's imagine that we're interested in looking at some of the trajectories from that sample, and we want to plot them so that we can start thinking about how we would describe these trajectories. So let's take three individuals. This is going to be a fabricated example just to illustrate some concepts, but imagine that the first individual gives us the exact same score of three at each of the time points that we asked this question. And maybe in this example anxiety ranges from zero to five, where five means you're more anxious about the pandemic. So the trajectory of this person is perfectly flat; it's a very simple trajectory. And maybe for individuals two and three we get the exact same pattern of responses.
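To make that concrete (again with illustrative symbols), the constrained model just described implies the mean vector and covariance matrix

\[
\mu(\theta) = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad
\Sigma(\theta) = \begin{pmatrix} \psi_X & 0 \\ 0 & \psi_Y \end{pmatrix},
\]

and the residuals used to judge fit are simply the differences between these model-implied moments and the sample mean vector and sample covariance matrix, \(\bar{x} - \mu(\hat\theta)\) and \(S - \Sigma(\hat\theta)\).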
And so, if this were real and we had to describe these trajectories to an audience, it would actually be really easy to do, because we could just say there's zero variability in the trajectories of individuals, and describing a flat line would do the rest. We can use the equation of a line to say anxiety at each time point takes on these values, and we would clarify that the intercept for this line is equal to three and the slope is zero, so we really have just described that flat line. That would be really easy to do, but of course this is a very unrealistic pattern of data, so we're not expecting to observe this in the real world. So let's imagine a different set of trajectories where there's actually some variability in how people are changing. In this case, we could still find an average trajectory, a line of best fit through these data. But if we only used the equation of that line to describe the data, we would really be missing the full picture. It would not do a very good job of showing that some individuals, like number one, are increasing, whereas individual three is decreasing. So instead we have to add a little more complexity to the equation we saw earlier, in order to account for the variability in the intercept and the slope. Again, if we had to describe this to an audience, one thing we can do in this equation is add a subindex i to represent the fact that anxiety for each individual at each time point can take on a different value. Notice that the intercept and the slope in the equation also have that i, indicating that we can have variability in the intercept and the slope, and we can still use the average trajectory to describe the average line, such that the intercept can still be three and the slope zero. But notice that we add these additional factors that capture the variability of the intercept and the slope; specifically, these are the values for each individual, expressed as deviations from the average trajectory. And we'll see that we're going to have to make some assumptions about those factors in terms of their distribution, which should be normal with a mean of zero and an unrestricted covariance matrix. But even these trajectories are quite unrealistic, because I'm showing you perfectly straight lines, and when we get real data it's never going to look that perfect. Indeed, these three trajectories are much more likely to look like this, where even if we are assuming that there is an underlying, unobserved linear trajectory, those are not the trajectories we observe. In other words, we have to acknowledge that any data you observe at any given time point is going to have some error. So we capture that error in our equation, and we'll make some assumptions about that error being normally distributed. Again, the idea is that we have these unobserved, error-free trajectories, and that's not what we really get when we observe the individual assessments in our data. So our equation is going to describe that average trajectory, and it's also going to describe the individual trajectories as departures from the average line.
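Putting those pieces together (with symbols chosen here only for illustration), the linear growth equation being described is

\[
\text{Anxiety}_{ti} = \bigl(\beta_0 + b_{0i}\bigr) + \bigl(\beta_1 + b_{1i}\bigr)\,\text{time}_t + \varepsilon_{ti},
\qquad
\begin{pmatrix} b_{0i} \\ b_{1i} \end{pmatrix} \sim N\!\left(\mathbf{0},\;
\begin{pmatrix} \psi_{00} & \psi_{01} \\ \psi_{01} & \psi_{11} \end{pmatrix}\right),
\qquad
\varepsilon_{ti} \sim N(0, \sigma^2),
\]

where \(\beta_0\) and \(\beta_1\) describe the average trajectory and \(b_{0i}\), \(b_{1i}\) are each individual's deviations from that average intercept and slope.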
Alright, so everything that we have described so far is actually what's known as a linear latent growth curve model in SEM. And if this looks like a mixed effects or random coefficients model, if you're familiar with those, it's because it is actually very, very similar. Now, we only have three time points here, so this is a very simple linear growth curve, but we can still have more complex models that incorporate some nonlinearities if, in fact, we have more time points that allow us to capture those nonlinearities. We can do that with polynomials, and there are other ways to capture nonlinearities in growth curve models as well. Today, though, we're going to keep it very simple and stick to linear models. All right, now I want to bring it all together by showing you how the equations of that linear latent growth curve model can be mapped to a path diagram that can be used to fit our structural equation models. We're first going to start with the simplest equations here, the equations for the intercept and the slope. Remember that the intercept and slope represent unobserved values, unobserved growth factors, so we're going to use latent variables, these ovals, to represent them in our path diagram. Notice that the intercept is equal to a mean plus that variance factor, and that is why we regress the intercept on the constant in order to obtain its mean, and we also have this double-headed arrow in order to represent the variability in the intercept. We do the same for the slope. Notice that we also have a double-headed arrow linking the intercept to the slope, and that is to represent the covariance, the assumption we make over here. It just means that we're going to acknowledge that individuals who perhaps start higher on a given process might have an association with how they change over time, and that is what this covariance allows us to estimate. Now, ultimately, what we're modeling is our observed data, our observed measurements of anxiety, and so here is the full path diagram that would characterize the linear growth curve. I'm going to focus on one anxiety time point first, that first time point. Again, using the idea of tracing the path diagram, we can see how anxiety at time one is equal to one times the intercept, which is right here in this equation, plus zero times the slope, so this part just falls out, plus that error term. In other words, what we're saying is that anxiety at time one is simply going to be the intercept of that individual plus some error. Then we can do the same, tracing the path diagram, to see the equation for anxiety at the second time point. You can see that it's, once again, the intercept plus one times the slope, so that's basically the initial value of that person, the intercept, plus some amount of change. And then the third occasion, by tracing this, implies that we have a starting point, which is the intercept, plus now two times the slope. So notice how, for these latent variables, the factor loadings are fixed to known values, and we are fixing those values to something that forces these trajectories to take a linear shape.
So here the factor loadings of that slope are basically the way in which time is coded in the data, and this is the reason why everybody in SEM actually needs to have the same time point for a measurement: everyone that has a value of anxiety at time one is going to have that same time code, which is embedded in the way in which we fix these factor loadings. Alright, so this particular specification can actually work perfectly fine if we have, for example, yearly assessments of anxiety. But notice what I'm emphasizing here is that there's equal spacing between the time points, and that's important because, in order for this to really be a linear growth curve, there needs to be equal spacing. Obviously these could be weekly assessments, or assessments that are taken every month, and that's fine; this is going to work out great. Now, it could be that you don't have equal spacing, and that can also be handled fine in SEM, as long as everybody has the assessment at the same time point. So here's an example where there's one month of spacing between the first measure of anxiety and the second one, but then from the second to the third there were two months. What we have to do is fix the last slope loading to three instead of two; notice we jump from one to three, and that's what assures us that we still have a linear trajectory here. Alright, so it's time for the demo, and what I want to share with you are some data that come from the COVID-19 Psychological Research Consortium. It's a group of universities that got together and wanted to start collecting longitudinal data to understand the extent of the damage that the pandemic is having on people's mental health and even their physical health. And so we have three waves of data. These are from a subsample of the UK, and just like I showed you in that previous slide, the repeated measures are in fact from March 2020, then a month later in April, and then two months later in June. We're going to be looking at repeated measures of anxiety. The anxiety scores could range from zero to 100, where 100 means higher anxiety. And then we're also going to look at health complaints over time. Those could range from zero to 28, where a higher score represents more health complaints. And we're going to look at one time-invariant variable, which is resilience; this one was assessed at the beginning, in March 2020. Okay, so let's take a look at the data. I have the data right here, and notice we have a unique identifier for each of our individuals, so each row represents a person. There's some missing data there that we're not going to worry about right now. But notice we have some demographic variables, and then further to the right here we have our data on anxiety, and those are the repeated measures that we're going to focus on first. Now, I do want to say that initially you would want to plot your data with some nice longitudinal graphs, but we're going to skip straight into the modeling because I want to make sure we have time to show you how to use the SEM platform for these models. So I'm going to go to Analyze, Multivariate Methods, Structural Equation Models.
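Before the demo continues, the unequal-spacing specification just described can be written in matrix form (symbols illustrative) as

\[
\begin{pmatrix} y_{1i} \\ y_{2i} \\ y_{3i} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 3 \end{pmatrix}
\begin{pmatrix} \eta_{\text{Int},i} \\ \eta_{\text{Slope},i} \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \\ \varepsilon_{3i} \end{pmatrix},
\]

with slope loadings of 0, 1, 2 being the equal-spacing default and 0, 1, 3 reflecting the one-month and then two-month gaps between March, April, and June.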
And I'm going to use those three anxiety variables, click on Model Variables, and then OK in order to launch the platform. Notice that, as a default, we already see a path diagram drawn here on the canvas, and we can make changes to that diagram in a number of ways. I usually use the lists on the left, the From and To lists, where we can select the nodes in the diagram and link them with one-headed or two-headed arrows. I can show you here: by selecting them, we can make some changes. I can click Reset here on the action buttons to get us back to that initial model, and we can also add latent variables by selecting our observed variables in the To list and then adding a latent variable with that plus button. Now, the nice thing for us today (and I'm sorry about my dog barking in the background; we probably have some mail being delivered) is that we have this really useful model shortcut menu. If we click on it, we're going to see that there's a longitudinal analysis menu with a lot of different options for growth curves. So let's start with the intercept-only latent growth curve. Here the model that's being specified for us is one where each of our anxiety measures is only specified to load onto an intercept factor. So this is one of those models where there's only a flat line, but we have a variance on the intercept, acknowledging that individuals have flat lines but could have different intercepts. Now, we don't know if this model is going to fit the data well. In many instances it won't, because it's a no-growth model; nevertheless, it's actually quite useful to fit this model as a baseline so that we can compare our other models against it. We do label the model No Growth as a default when you use that shortcut. So I'm going to click Run, and very quickly we can see the output here. There are two fit indices that are really important for SEM; these are over here. The CFI is something that we want to have as close as possible to one, and you can see this is pretty low; usually you want .9 or higher, at the least. And the RMSEA we want to be at most .1; we really want it to be as close to zero as possible. This one is very high, so, not surprisingly, it's a poor-fitting model, and we're not even going to look at the estimates from it, because we know it doesn't fit very well. But we're going to leave it there, because it's a good baseline to compare against. Going back to the model shortcuts, we can look at the linear growth curve model. When I click that, I automatically get that slope factor added, and notice that the factor loadings are there; as a default, we fix them to zero, one, and two. Now, the way this shortcut works is that it assumes that your repeated measures are in the platform in ascending order. That's really important, because if they're not, these factor loadings are not going to be fixed to the proper values.
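For reference, these two fit indices are commonly defined (one standard formulation, not quoted from the talk) as

\[
\text{CFI} = 1 - \frac{\max(\chi^2_{M} - df_{M},\,0)}{\max(\chi^2_{M} - df_{M},\ \chi^2_{B} - df_{B},\,0)},
\qquad
\text{RMSEA} = \sqrt{\frac{\max(\chi^2_{M} - df_{M},\,0)}{df_{M}\,(N-1)}},
\]

where \(M\) is the fitted model, \(B\) is a baseline (independence-type) model, and \(N\) is the sample size; hence the rules of thumb of CFI near 1 (at least about .9) and RMSEA of at most about .1.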
In fact, here you can see that June is fixed to two, but I know that there are two months between April and June, so I'm going to come in here and make the change by selecting this loading, clicking on Fix To, and fixing it to three, because I know that's what I need to really have that linear growth curve. And that's it; we're ready to fit the model, so I'm going to click Run. Notice what a great improvement in the fit indices we have. The CFI is nearly perfect and the RMSEA is definitely less than .1, so this is a very good-fitting model, and we can now look at the parameter estimates to try to understand the trajectories of anxiety. The first thing we can see is the means of the intercept and the slope. They are statistically significant, and they tell us the overall trajectory in the data: on average, individuals in March started with an intercept of about 67 units, and over time, on average, they're decreasing by about five and a half units every month. Because of the way that the slope factor loadings are coded, we know that this estimate represents the amount of change from one month to the next. Some of the most interesting estimates in this model are the variabilities of the intercept and the slope. Notice they're also substantial in this model, which basically means that, yes, we have that average trajectory, but not everybody follows it. Some individuals can be increasing, while others are decreasing and others might be staying flat. So a natural question at this point is: what are the factors that help us distinguish between those different patterns of change? That is a question that is really easy to tackle in this framework, and we're going to do it by bringing in factors that predict the intercept and slope. On the red triangle menu, I can click on Add Manifest Variables, and let's take a look at resilience as a predictor. I'm going to click OK, and by default resilience has a variance and a mean, and that's okay, because I want to acknowledge it has a non-zero mean and variance. But I want it to be a predictor, so I'm going to select it in the From list, and I'm going to select intercept and slope in the To list. We're going to add a one-headed arrow to link them together and get the regression estimates, so we can understand whether resilience explains differences in how people are changing. So I'm just going to click Run here, and we see that this is, in fact, a very good-fitting model. And it has some really interesting results, because it shows that the estimate of resilience predicting the intercept, that initial value of anxiety, is in fact statistically significant and negative. It can be interpreted like any regression coefficient, meaning that for every unit increase in resilience, this is how much we should expect the intercept of anxiety to change. So the more resilient you were in March, the more likely you are to have a lower score for your anxiety intercept in March. That's really interesting, but resilience in this model does not seem to have an effect on how you're changing over time. Okay, that's really interesting, but I really want to get to the idea of fitting multivariate models in SEM, so let's go back to the data.
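As a quick sanity check on those numbers (using the approximate estimates quoted above and the 0, 1, 3 slope loadings), the model-implied average anxiety scores are roughly

\[
\hat{E}[\text{Anxiety}_{\text{March}}] \approx 67, \qquad
\hat{E}[\text{Anxiety}_{\text{April}}] \approx 67 - 1(5.5) = 61.5, \qquad
\hat{E}[\text{Anxiety}_{\text{June}}] \approx 67 - 3(5.5) = 50.5.
\]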
And I've already specified ahead of time, and saved as a script, a linear univariate model of health complaints over time. So we have an intercept and we have a slope, and I fit this model; you can see it fits very well too, so we can look individually at both anxiety and health complaints over time. It is oftentimes a good idea to start by looking at the univariate models first. As a reminder, health complaints could range from zero to 28, and we can see that the average trajectory, according to the means here, is described by an overall intercept of about four, with increases over time of about .3 units. In this case, there seems to be significant variability in the intercept but not for the slope, so people are generally changing in the same way: overall, individuals seem to be increasing by .3 units every month in their health complaints. Okay, so now let's use this red triangle menu, and once again we're going to click Add Manifest Variables, but what we're going to add are all three repeated measures of anxiety. I'm going to click OK, and as a default we get the means and variances of anxiety, but I don't want the means of anxiety to be freely estimated. What I really want is for the means to be structured through the intercept and slope factors. So I have to select those edges and remove them, so that instead what I'm going to start building interactively here is a linear growth curve that looks just like this one, but for anxiety. I'm going to start by selecting all three measures here, and I'm going to name this latent variable Intercept of Anxiety and click plus. Now there's the intercept factor, but notice that as a default we fix the first loading to one for any latent variable. Because we want this to take on the meaning of an intercept, we actually want to fix these other two loadings to one as well, so I'm going to click here and fix those to one. Now we have to add the slope, so I select all three of them, I'm going to say Slope of Anxiety, and I click plus. Now that slope is over here. Again, as a default, we fix this first loading to one, but I know that I want to code this in a way that the first factor loading is zero, so I'm simply going to select that factor loading and click delete to get rid of it, because that's the same as fixing it to zero, and then I'm going to fix this loading to one. And that last loading needs to be three, in order to have that linear growth. Now we're almost done. Remember that the most interesting question we'll be able to answer in this bivariate model is about the association of growth factors across processes. So we're going to select all of these nodes in the From and To lists and link them with double-headed arrows. Those are going to represent the covariances across all of these factors, and the last thing we need is to add the means of the intercept and slope for anxiety. So we're going to click over here, and that's it; we're ready to fit our bivariate model. I'm going to click Run. Notice it runs very quickly. The model fits really, really well, and these mean estimates, once again, describe the trajectories for each of the two processes. I'm going to hide them for now, so that we can interpret some of the other estimates with a little more ease. I think there are some really interesting findings here.
You can see these values are in a covariance matrix, so we could actually change this to show the standardized estimates, just so that we can interpret these covariances in a correlation metric. What's really interesting is to see that there are positive, significant associations between the intercept, that is, the baseline starting values of individuals in their health complaints, and how they're changing in their anxiety over time. In other words, the higher your intercept, your initial value of health complaints, the more likely you are to have higher rates of change in anxiety. We also see a positive association between the baseline values of health complaints and anxiety. And there's another positive association here that's really interesting, because this is a positive association between the rates of change. So the more you're changing in health complaints, the more likely you are to be changing in your anxiety; if you're increasing in one, you're increasing in the other, so that's really insightful. Again, we can still come back and add a little more complexity by trying to understand the different patterns of change in this model, so we can go to Add Manifest Variables and look at how resilience impacts all of those growth factors. I simply add it as a predictor here very quickly. The models do start to get a little cluttered, so we're going to have to move things around to make them look a little better, but this is ready to run. It runs very quickly, it fits really well, and we could hide some of these edges, like the means and even the covariances, for now, just so that it's easier to interpret these regression effects. You can see that resilience has a negative association with both health complaints and anxiety at the first occasion. In other words, the more resilient you are in March, the more likely you are to have lower values in health complaints and in anxiety, so that's really cool. We also see here that, for the rates of change, the prediction of the rate of change in anxiety is not significant, but it is significant for health complaints. This line really should be solid, because you can see that there is a significant negative association between resilience and the rate of change in health complaints, such that the more resilient you are, the more likely you are to be decreasing in health complaints over time. That's really interesting, especially when you tie a well-being or mental health aspect, like resilience, to something more physical, like health complaints. Alright, so we're running out of time, but the very last thing I want to show you here, just because I really want to show you the extent to which SEM is flexible and can answer all sorts of interesting questions, is a model I fit that is a bit more complex, where I'm looking at three different predictors of all of those growth factors. I also brought in measures of loneliness and depression in June, at the last occasion. What I did here, again, is I left this with all the edges, just so that you could really see the full specification of the model. But I can hide some of the edges, just to make it easier to understand what's happening. What I did is I added loneliness and depression, and I'm trying to understand how the patterns of growth predict those outcomes. So here you see those regressions.
And we're also adding some interesting predictors, like the individual's age and the number of children in the household, in addition to resilience, as we saw before. I could spend a long time unpacking all of the interesting results that are here. Solid lines represent significant effects, so you can see that your patterns of growth in health complaints significantly predict depression at that last month, in June. I find that fascinating, and you can also see how resilience in this case has a number of different significant effects on how people are changing over time. Here is an interesting effect, where for every unit increase in resilience, we expect the rate of change in health complaints to decrease by .02 units. It's a small effect, but it's still significant, so it's really interesting. And there are a number of things that you could explore just by looking at the output options. At the very bottom here, I included the R squares for all of our outcomes, and you can see we're not explaining that much variance in the intercept and slope factors, so that means there's still a lot more that we can learn by bringing additional predictors into this model. Okay, so let's go back to our slides; I want to make sure that we summarize all the great things that we can achieve with these models. Growth curve models allow us to understand the overall trajectory and individual trajectories of change over time. They allow us to identify key predictors that distinguish between different patterns of change in the data, and they allow us to examine the effects that those growth factors have on outcomes. And when it comes to multivariate models, it's really nice to see how changes in one process can be associated with changes in a different process. Now, it's important to remember that in our illustration the data were observational, so we cannot make causal inferences, and also that we were using manifest variables for anxiety, even though anxiety is an unobservable construct. So just be aware that, if we had experimental data, we could make causal inferences, and we could also have specified latent variables for anxiety, so that we had more precision in our anxiety scores. Alright, so even though we cannot make causal inferences, I think it's fair to say that resilience appears to be a key ingredient for well-being, and I want to make sure that this is the take-home message today, because as the months continue to pass during this pandemic, we all need to find ways to foster our resilience, so that we can deal with whatever comes as well as we can. And with that, I want to make sure that you have some references in case you want to learn more about longitudinal modeling, and I thank you for your time.
Ryan Parker, Sr Research Statistician Developer, JMP   In JMP Pro 16, the Direct Functional PCA (DFPCA) modeling option has been added to the Functional Data Explorer (FDE) to provide a way to perform functional PCA without first fitting basis function models. This approach not only makes larger functional data more tractable, but it also provides a more hands-off approach to analyzing functional data. This presentation details how DFPCA works and presents examples that highlight how and when to use DFPCA to analyze functional data.     Auto-generated transcript...   Speaker Transcript
Ryan Parker, JMP Well, thank you for coming today. My name is Ryan Parker. I'm a senior research statistician developer at JMP, and I'm here to talk about direct functional PCA. It's a new way of analyzing functional data. I just want to acknowledge that our chief data scientist, Chris Gotwalt, has played a major role in not only the development of FDE but also this new tool, as well as our test team of Rajneesh, whose work usually goes unnoticed but is a big part of why we are where we are today. I'm not going to assume that you have even used FDE, or that you even know what functional data is, so I'll start off at the beginning to make sure we're all on the same page. Functional data: really, we think of it as anything that has an input X. In this case we have some temperature data; our input is the week, so these are measured every week of the year, and our output Y is this temperature. You could sample at a finer resolution, maybe every day or every hour, but the general idea is that you haven't completely sampled the whole function; you have to work with some sampling of it. And really, in all the cases that we care about, although you can use one function, we also want to think about having multiple functions. So we've got multiple weather stations where, every week, we've captured the temperature. Although we have every week filled in here, we don't necessarily have to with FDE; you can have some missing sample points, or the functions can be sampled at different time locations. Your input also doesn't have to be time in the traditional sense. Maybe the input is temperature and you've got a clarity measurement as your output; we support that. It's really this mapping of an input to an output. Or maybe you're in a spectral setting where your input is the wavelength and your output is the intensity. In the example I'll show to illustrate direct functional PCA, we have multiple streams of data, multiple functions, so not only do we have different functions, we've got different outputs: a charge, piston force, voltage, and a vacuum, and we want to bring all of these together and analyze them. These types of data helped motivate the development of Functional Data Explorer.
And there are two primary questions that we usually use it to answer. The first is in a functional design of experiments setting. In this setting, we have factors that we want to try to relate to our response function. In this case, we're really interested in how we get this response function to be shaped a certain way. So we can link up these factors to the function; here, we wanted our function to remain in the green specification area for as long as we can, and FDE can help do that. The other common case is what we call functional machine learning, and in this case we want to think of our functions more as inputs to something. So in our process, maybe we have a final result or a final yield, in the case of this fermentation process data. We want to summarize the shapes of these functions and use them as inputs to a predictive model to help figure out what is going to give me the best result. And really, the big game that you try to play here is functional PCA, and what functional PCA does is summarize our data. So I have here a really simple example where it's probably pretty easy to see that the functions all have different slopes, and it may be a little harder to see that the means are a little different for each one. The goal is to use this simple case to motivate how we can summarize these functions and then expand to more complex situations. When we do this decomposition, we'll get orthogonal eigenfunctions that are going to explain as much of the function-to-function variation as possible. Once we have those, we can use them to extract summaries from all these functions that we can use in predictive modeling. So we take some really complex shape (in this case it's not super complex) and we're going to summarize it down to a couple of data points from the 10 or so that we have here. This is what an eigenfunction looks like for these data. We pulled out two. The first explains around 77% of the variation and gives the most weight to the very beginning and to the very end; this is, as you might expect, quantifying the slope of these data. The second eigenfunction gives equal weight over the whole input, and that's quantifying the difference in the mean. So now we want to take these eigenfunctions, use them with our data, and get a quick summary that we can use to explain the differences between these functions. Taking Function 2 as an example, we multiply this function times the first eigenfunction, and we're going to get a function that has a lot of negative numbers. So if you think about taking the integral of this, adding up all those negative numbers, you're going to get a fairly large negative number. So we can see how this first component is really capturing the differences. Number 10 was a large positive number, and we can see back in our data that it also had that kind of large positive slope. It's a similar idea for the second component, where we have higher versus lower overall averages. And once we have these two things, we can then go back to our original function. Adding in an overall mean, we just take the first functional principal component score, looking here at Function 1.
We multiply it by the first eigenfunction, and then we add in the second functional principal component score multiplied by the second eigenfunction, and that's going to recreate this first one. In a similar process, using the scores with the second function allows us to reconstruct or approximate those functions. So you can see where, okay, if we are able to build models for these FPC scores, we can understand how DOE factors change them, in which case that's changing the shape of our output function, as in that first scenario we looked at. So let's get into why direct functional PCA; again, what motivated us to do this? If you have used FDE before, the modeling options we have are considered basis function models. In these cases, they're really smoothing the data first: we fit the model, and we get a smooth function that we can operate with. But part of the problem is that we may have a lot of things that we have to tune. In the case of B-splines, we have to pick the best degree of polynomial to use in the spline, and how many knots we should have. We give you some defaults for that, but we also allow you to change the locations of those knots. This is great and works for a lot of cases, but as you get larger sample sizes or more complex functions, this can take a lot of time, and it may be intractable to really tune all the locations of these knots. The way FDE works now is we fit the model and perform FPCA on the coefficients of those models, and there's this nice relationship where our eigenfunctions are in the same form as the model of our data. So there are a lot of nice things that come with it, but whether it be costly computation or models that just don't fit well, we needed another approach. The previous approach smooths the data first; now we're thinking, okay, let's just take the data as they are and operate directly on that, and then from there let's smooth the eigenfunctions that we get. This isn't a fair apples-to-apples comparison, but with B-splines compared to direct FPCA, you'll tend to notice that the eigenfunctions you get are a little smoother with direct FPCA, and that's by nature of the way we wanted it: to be a little smoother. Is this little artifact here really that important? The direct FPCA says maybe it isn't. I think it's really captured in the last eigenfunction, where it doesn't explain a lot of the variation; we're getting some weird bumps here. An expert maybe ought to analyze this and say, no, that's actually real, but most likely we shouldn't be giving a lot of weight to this eigenfunction, and probably not using it. The algorithm we use to fit the direct functional PCA model is similar in spirit to the Rice and Silverman method, mainly in that it's an iterative process. First, I should mention that the data need to be on a regular grid. If you do not have your data on a regular grid, we will interpolate it directly for you. We also have some data processing step options, called Reduce, that you can apply to finely control the grid that we operate on. In our procedure, we'll take one eigenfunction; we can just ask for the first component.
fit a smoothing model to it, and then ask for the next one. Once we've smoothed the next one, we make some adjustments so that we get the orthogonality properties we want. The idea is that we've taken a problem where we had to smooth a lot of different functions just to get models to work with, and now we focus our effort on smoothing the eigenfunctions one at a time. There are far fewer of those, which makes this technique much faster for large data sets than the existing solutions in JMP. The example I'm going to go over is in a manufacturing process, and just to give you an idea of the speed-up: in the best-case scenario, where you already knew the exact P-spline model you wanted to use for these data, it would take about three times as long to fit those models as it takes the out-of-the-box direct FPCA solution. In practice you don't know what those models are going to be, so you spend time fitting multiple models, and now you've taken far longer than direct FPCA would. This example looks specifically at a step where we're bonding glass to a wafer. There's a vacuum surrounding the process, there are some tools, it's all sitting on a chuck, the process runs, and unfortunately about 10% of the wafers get destroyed. This happens in the middle of the process, and you don't find out until weeks later that they were destroyed. We have sensors collecting data through this process, and our goal is to use that data to identify wafers we can get rid of early, or at least a subset we can get rid of, so we don't spend any more money on them. So now I'll go to JMP. I have a journal, and all these sources will be available on the Community page, but I'll open up this data set and launch Functional Data Explorer from Specialized Modeling under the Analyze menu. Let's go through each of these columns. We have the wafer ID, which just groups our different functions. The condition, whether it was good or bad, we want to keep and use later, so we'll put it in as a supplementary variable; that tells FDE that when we save things later, it should bring that column along. Then there are the charge, flow, piston force, vacuum, and voltage streams from this process. After launching FDE, we can scroll through and see, as I showed earlier, all the different types of data we have. It may well be that no single model fits every one; the same model that fits charge maybe doesn't fit flow as well. But direct FPCA looks at each of them individually, so it handles that for us. Before I go to that, I want to show the Reduce option I mentioned. There are three tabs here: you can put the data directly on an evenly spaced grid, you can bin the observations, or you can remove every nth point to thin it out. These data are already on a grid, so we don't really have to do anything, but we could say, all right, let's do this, and by default it gives you half of the original data set.
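As an aside, the core direct FPCA recipe just described (put the functions on a common grid, center them, extract eigenfunctions, then smooth the eigenfunctions) can be sketched outside JMP in a few lines. This is a plain NumPy/SciPy illustration under those assumptions, with an arbitrary moving-average smoother and component count standing in for JMP's own fitting; it is not JMP's algorithm or its defaults:

```python
# Rough sketch of direct functional PCA on gridded data (illustration only, not JMP's algorithm).
import numpy as np
from scipy.ndimage import uniform_filter1d

def direct_fpca(Y, n_components=2, smooth_window=5):
    """Y: (n_functions x n_gridpoints) array of responses on a common, evenly spaced grid."""
    mean_fn = Y.mean(axis=0)                                  # overall mean function mu(t)
    Yc = Y - mean_fn                                          # center each function
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)         # decomposition via SVD
    phi = Vt[:n_components]                                   # raw (unsmoothed) eigenfunctions phi_k(t)
    phi = uniform_filter1d(phi, size=smooth_window, axis=1)   # smooth each eigenfunction
    q, _ = np.linalg.qr(phi.T)                                # re-orthonormalize after smoothing
    phi = q.T
    scores = Yc @ phi.T                                       # FPC scores c_ik, one row per function
    explained = s[:n_components] ** 2 / (s ** 2).sum()        # variance explained by the raw components
    return mean_fn, phi, scores, explained
```

With the wafer example, Y would be one sensor stream, say charge, stacked across wafers once the Reduce step has put everything on the same grid.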
Back in JMP: now you've taken it down and the shapes are still fairly the same. In this case we don't have to do it, it's still fast, but if your data are either not already on a grid or you just have a lot of it, Reduce lets you keep the key features, so it's really something worth exploring. Since we have multiple functions, I'll go to this FDE Group option and launch direct functional PCA. This takes a few seconds, because it's fitting a model for each one. Here's charge: the fit works reasonably well, and we have diagnostics available. We have a model selection option that lets you change the number of FPC scores; it has identified four as being best for this particular case, and for most of the others it actually picks just one. If you have used FDE before, you'll see these familiar functional summaries, but there's nothing else: there's no prior model. Functional PCA is the model, and we're focused entirely on that instead of the other things we previously had to tune. Then we have our score plots and profilers. Scrolling through these, we can see that we're seemingly fitting these fairly well. Piston force, and as I said, a lot of these end up picking just one score; it's saying that one score explains almost all the variation in that case. Okay. Good. And voltage, the last one. If we go back to the group option, we can save the summaries for all of these functions. So fairly quickly we've used FDE to load our data, take all of those functions, and summarize them down into just a few numbers each. The ones we are most interested in are the FPC scores, because they summarize the variation in the shapes. But there's still information in the things people used before FDE: the mean, the standard deviation, and other summaries. Those still have value, so by default we bring them along too. We also bring along every script you had in your original data table, and we add these profiler scripts so you can launch them and see what changing the FPC scores means, which helps build some insight. What we want to do now is predict that condition, good or bad. I'm going to use Generalized Regression, just because I think it's a really good method for not only fitting these models but also interpreting them. But you could really use anything: you could fit a neural network or any other model. Once you've saved this table, you're free to use the rest of JMP however you feel you can model it best. So I'll take all of these summaries and do a factorial to degree two. We're trying to predict the final condition, we'll use the validation data set, and we're targeting whether it was bad or not. This is probably the longest computing section of the demo: we'll do a Lasso by default using this validation column, and it takes around 5 to 10 seconds.
You can always stop it early if you want to, but we're giving it quite a good set of features and looking at interactions between them to try to figure out the best way to predict this condition. It's just building the report now, and once you get this model you'll also see the terms it decided didn't matter. I'm personally a fan of looking at the parameters on a centered and scaled basis, to help understand differences in magnitude. Some of these, like this one FPC score, are really not informative, whereas FPC 2 and 3 for the charge are more helpful, so you can sort them and see. And as I described, things like simple summaries of means or minimums or standard deviations are giving information, but we also see there's definitely a lot of information in the summaries of the shapes of these functions, the FPC scores. On the other side, this flow standard deviation seems to be very important, and another reason I like this is that if you are in charge of this process and you have control over the flow standard deviation, maybe there's a part of the process you can actually improve before you ever build a model to discard wafers. So let's say we've done that and this is our model; we want to build a heuristic, good or bad: at what point do I just want to go ahead and discard a wafer? Let's save the columns for the prediction formula, and this gets saved back to that summaries table. GenReg gives us probability bad, probability good, and most likely condition. We could just stop here and say, okay, if the probability is above .5, let's look closer at it, or if it's above, I don't know, .75, maybe it's not worth continuing due to the cost of the rest of the process. You might pick that probability based on the real-world implications. Or we could let a partition model help us figure that out, so now we'll use this probability as a factor. I could eyeball it by hand, but let's have a model help me figure out how to group up these conditions. We'll take this and maintain our validation data set so we're not double dipping. Now all of our blues are the bads and the reds are the goods, and thankfully, in general, most of them are good. Let's do a split: if the probability of being bad is less than .1, it's very likely good; most of our bads are in the greater-than-.1 group, which makes sense. Split again, and now a lot of them are in the region where the probability is over .25. We'll split one more time, and at least for the training data it fits pretty well: all of the ones above .6 are bad. You really don't expect that to hold in reality, and the validation R square does highlight that, like most models, you do better on the training set than on the validation data set. But it gives us something concrete: I kind of felt like .75 was good, but this is saying maybe we really need to focus on the ones that are .6 or higher. And so that's the practical payoff of the model.
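As another aside, the downstream workflow just shown (penalized regression on the saved FPC scores and summaries, then a simple probability cutoff) could be sketched outside JMP roughly like this. The data below are a made-up stand-in, and a cross-validated L1 logistic regression plus a shallow decision tree stand in for the Generalized Regression Lasso and the partition step from the demo:

```python
# Sketch: predict bad wafers from FPC scores and summaries, then pick a discard threshold.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: rows = wafers, columns = FPC scores and simple summaries (means, std devs, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (rng.random(400) < 0.1).astype(int)            # about 10% bad, as in the example

# L1-penalized logistic regression plays the role of the Lasso fit in Generalized Regression.
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear").fit(X, y)
p_bad = model.predict_proba(X)[:, 1]               # predicted probability of "bad"

# A shallow tree on that probability mimics the partition step used to pick a cutoff.
tree = DecisionTreeClassifier(max_depth=2).fit(p_bad.reshape(-1, 1), y)
print(tree.tree_.threshold[tree.tree_.feature == 0])   # candidate probability cutoffs
```

In practice you would hold out a validation set, as the demo does with the validation column, rather than scoring the training data.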
Now, these were simulated data, but in the real-world case this was very helpful, and things like interpreting whether we can improve our process were also helpful, not only here but in other sample data sets that we have. Let's go back to the slides. Okay, so I went through and summarized what functional PCA is, what motivated this new direct FPCA approach, and showed you how to use it in an example where we're trying to discard parts early in a manufacturing process. Some final tips: the fast computing makes this great for large data sets. In some ways you can just start there and ask what direct FPCA thinks; it's so fast that you don't have to fiddle with model controls, so if you have large data it's a great place to start. But it's not perfect. I showed some diagnostics, and you can see when it's not fitting well; like any model, just because it was fast doesn't mean it was good. So make sure you're identifying possible issues, because maybe you need a different approach. We have other basis function models we're working on for very particular types of data where even this approach doesn't do as well as those very specific models. And remember that the data must be on a regular grid; try to use Reduce to help you control that grid. If things seem too slow or the results don't really make sense, maybe what we do by default isn't as good for your data as what you can do yourself. Thank you so much for coming, and I'll answer any questions if anyone has anything.
Rich Newman, Statistician, Intel Don Kent, Data Analytics and Machine Learning Manager, Intel   We have a set of responses that follow some continuous, unknown distribution with responses that are most likely not independent. We want to determine the simultaneous 95% upper or lower bound for each response. As an example, we may want the lower y1 and y2 and upper y3 and y4 bounds such that 95% of the data is simultaneously above y1 and y2 and below y3 and y4. Finding the 95% bound for each response leads to inaccurate coverage. The solution: a method to calculate the simultaneous 95% upper or lower bound for each response using nearest neighbor principles by writing a JMP script to perform the calculations.     Auto-generated transcript...   Speaker Transcript rich n Hello, my name is Rich Newman, and I'm a statistician at Intel. Today I'll be presenting on a JMP script that determines a simultaneous 95% bound using a K-nearest neighbor approach. This presentation is co-authored by Don Kent, who's also at Intel, and both of us are located in Rio Rancho, New Mexico. I'd like to start today's presentation by motivating the problem; from there, I'll share some possible solutions and ultimately land on the solution we went with. Along the way, I'll provide some graphs to help further illustrate the points, and then finally I'll show the JMP add-in that we use to solve the problem and some screenshots illustrating the script. The motivating problem: we're designing a device, and we need to know the worst-case set of four resistance and four capacitance values that we see. Worst case for us is defined as the low resistance/high capacitance and the high resistance/low capacitance combinations. So for clarity, we have eight variables, and we may need the simultaneous bounds of the four low resistance values and the four high capacitance values so we can use them to help us design the device. Worst case may be defined as 95% confidence or 99% confidence, and ultimately that's up to the user. To illustrate this problem, let's use just one resistance and one capacitance value. I have resistance on the X axis and capacitance on the Y axis, and we want to know two things: that yellow star, the worst low resistance/high capacitance combination, and that purple pentagon, the high resistance/low capacitance combination. With respect to our problem, I want to point out that we recognize these eight responses are not independent; there are some correlations among them. Furthermore, each of these responses may or may not follow a normal distribution or the multivariate normal distribution. We ask these types of questions frequently. In other words, we do not want to solve this once, be done with it, and never deal with it again; we get asked these questions often, so we really need a robust solution that's easy to use. In our case, we're very fortunate that we tend to have relatively large data sets, at least 400 points, and typically 1,000 points. And for us, a practical solution works as long as it has some statistical methodology behind it. So if I go back to that previous graph, it's not like I'm going to throw a dart at the graph and say that wherever it lands is going to be our worst-case bound; you want a little more meat behind it. But I do want to point out that we don't necessarily have a fixed definition of worst case, whether it's 99% or 95%.
And we just know it's better to be a little bit conservative, to make sure we're designing a device that's really going to work and not have any issues in the future. Okay, I want to share a completely made-up example, just to illustrate that this type of problem can happen in any industry. Imagine we made adjustable desks for the classroom, and we want our desk to work for 99% of the population. To do that, we need two things: a person's height and a person's weight. Now, JMP comes with some sample data, and one of those data sets is called Big Class. Big Class has some students in it along with their heights and weights, so we can use that data set to help us determine the height and weight bounds that capture 99% of the population. If I look at this graph here on the bottom right, each point represents one student's height and weight combination. Okay, going back to our problem, our current approach, which we believe can be improved, is to independently find 95% or 99% prediction bounds for resistance and capacitance. In this example, where I'm just looking at one resistance and one capacitance, we would find two separate bounds. As an example, we would find the 95% prediction bound for the resistance, which is 4.52 and 4.97, designated by the darker blue lines. Then we would find the 95% prediction bounds for capacitance, which are 15.6 and 16.5, designated by the greenish-blue lines, and then we find the combinations that give us that yellow star and purple pentagon, which are our worst case. Now we have some concerns with this approach. The first concern is around the Type I error rate. When I find a 95% bound for resistance 1 and a 95% bound for capacitance 1, keep in mind I have eight variables, so overall I know my confidence level is not 95%, and what it actually is will depend on the correlation among the variables. We can get over this hurdle by making an alpha adjustment, but there's another hurdle that is a bigger concern for us. What if we were interested in the high/high combination, designated by this yellow circle? In this particular example, you can see we don't have any data near this worst-case bound, so if we were to use it, it would be extremely conservative. When we go to design our device, there's a cost and a time element associated with it, so we want to be a little bit conservative, but not so conservative that we would use this yellow circle when we really have no data points around it. Okay, there are some alternative approaches that are easily done in JMP that we considered as solutions to this problem, and the first one is density ellipses. This is found in the Fit Y by X platform: if I hit the red triangle on Bivariate Fit, I can choose Density Ellipse, and in this case I chose a 95% ellipse and get that red ellipse on the graph. When JMP provides this density ellipse, if you look at the bottom right-hand corner of the presentation, it gives the mean, standard deviation, and correlation of the two variables. What JMP does not provide is the equation of the ellipse. Now, this is a hurdle we can get over; we just have to do some math to solve it. But the bigger hurdle is what happens when you have more than two variables.
In this case, JMP doesn't have an easy option for us to solve the problem. We could do pairwise ellipses, just two variables at a time, but we're going to have the same alpha problem, and it's going to be pretty difficult to pick out which points to use as our worst-case bound. There's also one other minor concern with this approach: if we were interested in that high/high corner, that yellow circle, which point on the ellipse do we choose? Again, I think that's a hurdle we can get over, but with more than two variables it's a hurdle that's pretty tricky, and we're not sure it is easily solved. All right, there's another approach that's easily done in JMP, and that's principal components. What principal components does is create new variables, which JMP will label Prin 1 and Prin 2, such that the new variables are orthogonal to each other, and we can use that orthogonality to help us find our worst-case bounds. This is found in the Multivariate Methods platform: go to Analyze, Multivariate Methods, Principal Components, and we can ask for these principal components. The concern with the principal components approach is that the math gets quite difficult when there are more than two variables. Furthermore, in theory, principal components tries to reduce dimensionality. In other words, if I had eight variables that I wanted to find these worst-case simultaneous bounds on, JMP may come back and say, okay, we found three principal components that really explain what's going on. In that case we have three equations and eight unknowns, which puts us in a difficult place to solve the problem, and for that reason we wouldn't use the principal components approach. So where does that leave us? Our goal is to find the simultaneous worst-case bound, that low/high, high/low, low/low, or high/high corner combination. We'd like to use JMP to help us solve the problem. It has to be able to handle three or more variables. Each variable may or may not be normal, and we expect some correlations. The good news is we tend to have relatively large data sets. We want to make sure that if we ask for a corner, there's data around it, so we're not stuck in a situation with no data. And again, an easy, practical solution may be sufficient. Okay, what I want to do now is explain a concept and then show you how that concept is used to solve our problem. There's a concept called the K-nearest neighbors approach, and the idea is that you have a point, you find the distance from that point to every other point, and then you determine the point's k nearest neighbors, the k points with the shortest distances. To make sense of this, let's look at an example. Let's focus on the point highlighted in dark red; its coordinates are 4.51 and 15.66. If I take the point in blue, whose coordinates are 4.67 and 15.54, I can find the distance between those two points. The idea of nearest neighbors is that for the point in red, I can find the distance from it to every other point in the data set, sort the distances from smallest to largest, and then pluck off the ones that I want.
So, for example, if I wanted to know the two nearest neighbors (k=2) to the point in red, I can see they're the two points in pink; those are the closest to the red one. Okay, what I want to do now is talk through the solution at a high level, and then slow down and walk through the steps. Our solution is based on having a large data set. We first find the median +/-3 standard deviations for each variable (and I want to point out that you could also use the mean), and in doing so we define what we call our targeted corners, or desired corners, based on the low or high end of each variable. Then we find the distance from each point to that targeted corner and sort the distances from smallest to largest. In our case, for our needs (and I'll explain this a little more in an upcoming slide), we take the k nearest neighbors to the targeted corner; in general, you can collect the k neighbors that represent your desired confidence. Then we take the average of those k neighbors, and that becomes our solution. So here's the idea. We start with our two variables and find the median +/-3 standard deviations (we could also use the mean). We call those our targeted corners: if I'm interested in the low/high corner and the high/low corner, the yellow star and the purple pentagon, those start as what we call our defined targeted corners. The next step is to take all the data points and find the distance from every point in the data set to those targeted corners. Once we have those distances, we sort them from smallest to largest. Then, in this particular example, we find the k nearest neighbors closest to the targeted corners. Just as an illustration, you can see the five points in yellow are the five points closest to the yellow star, and the five points in purple are the ones closest to the purple pentagon. Then we take the average of those five points, respectively, and these black ellipses represent what we would use as our worst-case values, which we then use to help us design our device. Now I want to show you those points relative to the density ellipse and relative to the targeted corners; the yellow star and the purple pentagon are what our original method gave. The density ellipses aren't bad, and they do a little better, but what's really nice about this particular solution is that we have data points near it, exactly as it was designed. Furthermore, it's not too conservative for us, so we don't have to pay extra cost when we design the device, and we didn't have to worry about the distributions of the data; the correlations are really not a concern in how we solve the problem. All right, let's say you were interested in the high/high and low/low bounds, the light blue pentagon and the darker blue star. This method works there as well, and what you see in the black ellipses, our solution, is that again we have data points nearby. For us this is a wonderful approach because, especially relative to our current approach, it's not too conservative. It may be a little conservative, but not as conservative as the pentagon and the star.
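As a rough illustration of the procedure just described (a plain Python sketch, not the authors' JSL add-in), the whole nearest-neighbor corner calculation fits in a few lines; the k = 5 neighbors and the 3-standard-deviation multiplier mirror the example above:

```python
# Sketch of the nearest-neighbor worst-case bound (illustration of the idea, not the JMP add-in).
import numpy as np

def knn_corner_bound(data, directions, k=5, multiplier=3.0):
    """data: (n_points x n_vars) array; directions: +1 for a 'high' corner, -1 for a 'low' corner."""
    directions = np.asarray(directions)
    med = np.median(data, axis=0)
    sd = data.std(axis=0, ddof=1)
    corner = med + directions * multiplier * sd          # targeted (desired) corner
    dist = np.sqrt(((data - corner) ** 2).sum(axis=1))   # distance from every point to that corner
    nearest = np.argsort(dist)[:k]                       # row indices of the k nearest neighbors
    bound = data[nearest].mean(axis=0)                   # average of the neighbors = worst-case values
    z = (bound - med) / sd                               # neighbor Z-score: how many SDs from the median
    return corner, nearest, bound, z

# Example: low resistance / high capacitance corner for two columns [R, C]:
# corner, rows, bound, z = knn_corner_bound(data, directions=[-1, +1], k=5)
```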
Okay, earlier I made some comments about how we approach it, and there are some choices, so I want to discuss the choice of k and whether we should average. To me, k may be based on your confidence level, your sample size, and your philosophy, and let me explain. As an example, let's say I had 1,000 data points and I wanted to be 95% confident. In that case, I can take the 25th closest distance for each of the two corners: that's 2.5% out toward one corner and 2.5% out toward the other, and together I've captured my 95% confidence. So I could just take the 25th closest distance and be done; that's one approach. I could also take the 23rd, 24th, 25th, 26th, and 27th distances and average those five values; that's another approach. So there are a couple of different ways to handle it. For our particular needs, again, we have very large sample sizes, we want to be a little conservative, and we're not driven by 95% or 99% confidence. So just for illustration purposes, the orange circles on the graph on the right may represent, as an example, the 95% confidence bound, whether that's the average of five points or that 25th closest distance. Instead of using that approach, we would actually take the average of the first 25 points, and in doing so we end up with the black ellipses; you can see they move out, making the bound a little more conservative. We do that by design, to be a little more conservative. Again, it's a choice, and for us what's nice is that it's not as conservative as those desired corners from our current approach, so we get some conservatism in there without being grossly conservative. All right, so what are the pros and cons of this approach? The positives are that we do not need to know the distribution of the variables, we can easily handle some correlation or dependence among variables, we can easily handle multiple variables (especially more than two), we know there's data close to the solution (partly thanks to the large data size), and we can build a script and an add-in in JMP to easily perform the calculations. The negative is that it does require a decent-sized data set, because if you want a 99% confidence level, for example, a really high confidence level, you need lots of data. All right, so this is what our add-in looks like, this user interface. You have the possible variables in the upper left-hand corner. On the high side you enter, for example, that we want the high values of the resistances; that's in the green highlighting. In the purple highlighting we can add values on the low side, so in this example we'd say we want the low combination of the capacitances. The next thing you have to do is enter the confidence that you want. We have a recall button, which is a nice convenience for people, and we have our team logo, which makes the item look nice and professional. Once you run this add-in, it triggers the scripts, and the output is a JMP table. In this JMP table, the first thing I want to point out, in this green highlighting, is that for all eight variables we're getting the median, the standard deviation, and whether it was the low side or high side we were interested in.
And so from there we build the desired corner: the median minus three standard deviations for the low side and the median plus three standard deviations for the high side. Again, in purple now, this is our targeted corner. The next thing we do is find the distances from all the points to that desired corner, and then, in this example, we find the five points that are closest to it; that's the neighbor values column, a vector of five values. The neighbor indices I'll explain a little more on the next slide. Then, in blue, we take the average of those five nearest neighbors, and that becomes our solution to the problem. So that column in blue holds the worst-case values that we would use to help us design the device. We also have a column called Neighbor Z-score. It takes that neighbor average, our solution, and works backwards to see how many standard deviations away it is from the median. The reason we do that is that our original approach was to take roughly the median plus or minus three standard deviations, and what we're finding is that to get what we want, we can actually use a much smaller multiplier. So this just helps us see how conservative, or overly conservative, our current method is; it's not used in any calculations, other than helping us understand. All right, I mentioned the neighbor indices. In the upper right I highlighted in purple the 109 and 126. Those correspond to rows of the data. When you run this add-in, you get your JMP table, it tells you which five rows represent the five nearest neighbors, and it also selects them in your original data set. What's nice about that is that it makes it easy to color code: earlier I showed you the example with the five yellow and five purple points, and it's easy to change those colors right after running the add-in. All right, so this is what the output looks like for our eight variables. You can see the green points represent the low resistance/high capacitance values and the red points represent the high resistance/low capacitance values. I just want to point out, in the bottom right part of this graph that I've highlighted in purple, that the green and red points are not the most extreme points for any given variable, and that's the simultaneous aspect of this problem. It's really solving the problem across all eight variables, so for some variables the solution may be extreme and for others it may not be, and that's fine with us. In doing so, it's helping us understand the relationships between our variables, and this again is a graphical display of what our final solution would be, our worst-case values. Okay, going back to that Big Class data set just to illustrate the add-in (and I want to recognize that this is a small data set, used just for illustration purposes), I can run the script and ask for the high/high side, and I can run it and ask for the low/low side. In doing so, again, like before, I get the median and the standard deviation and use them to find the bound, which is my desired corner; then I find the five neighbors that are closest to that bound.
And you can see in green, those are the actual values, and they're plotted on the graph in yellow and purple. The neighbor indices refer to the rows, and that's what allows me to color code my data quickly. The neighbor average is our yellow star and our purple pentagon, and that is our solution to the problem. Again, we compute the Z-score just so we have an internal idea of how conservative this method is. Okay, at this point I just want to highlight some points of our script. When we built the script, as you can see in the upper right, we built this panel box. In light blue is where we ask the user to input the variables to be found on the high side and the variables to be found on the low side. In the copperish color is where we get the input for the percentile, and that's actually a number. And you can also see, on lines 222 and 223, where we're building a recall function. All right, one of the things we like to do is data quality checks: we want this to be as mistake-proof and error-proof as we possibly can. So, for example, you have to input the confidence level as a number, and then we check that it's between zero and 100, or the user gets an error message telling them to change their input. Likewise, at the bottom, we need something in the low list or the high list in order to run, so we have checks to make sure that something has been input. All right, the next thing we have is something called the sumcols function. What the sumcols function does is loop through the columns that were passed in and return a dictionary, or associative array, with the information that's important to us. If I look at the bottom right, in orange it collects information on the data type, how many data points there were, the median, and the standard deviation. In purple it does the calculation to get our desired or targeted corner, which we call our bound. And in blue, that AA is our associative array, which stores all the information we need for the next step. All right, from here we start to calculate those distances. You can see in the blue highlight, that's the low side, and right under it is the high side: we find the distance from each point to that targeted corner. We have to do the math shown earlier, squaring and then taking the square root, but in essence we're just finding the distance from each point to the three-sigma corner. Then at the bottom we go through the process of sorting, ordering those distances, and plucking off the ones we need. And I want to point out the orange highlighting on lines six, seven, and eight. Whenever we write our scripts, we do our best to include comments. Sometimes the person working on a script gets pulled to something else and someone else has to finish it, so it's really nice to have these comments so someone else can take over and understand what's going on.
Furthermore, these comments are nice even if you're the only one working on the script: if you have to go back to it years later, you remember what each step is doing. So we do our best to put in comments. All right, finally, we run through the low and the high side and return some values in a dictionary. You can see in blue that we pluck off the nearest neighbors; that's the vector that came out in the output. Then we take the average of them, and that's our solution; that's happening in orange. And, as an example, in purple is where we find our Z-score. So once we have all those distances and have sorted them, we pluck off the information we need to build the table that is our output. All right, so putting all this together: our motivating problem is that we wanted to find these simultaneous worst-case bounds to help us design a device, and our current solution is too conservative; it costs us money and time. When we go to solve this problem, we know the data may or may not follow the multivariate normal distribution and that our data are not independent, and the frequency of these questions really requires a simple solution, preferably in JMP. So our solution was to build a JMP add-in that's easy to use, uses the k-nearest neighbors concept, produces output that is easy to understand, and helps us quickly build graphs that we can color code to show others. All right, thank you very much.
Wenzhao Yang, Statistician, Dow Chemical Company   EPDM is a synthetic rubber widely used in applications such as transportation, infrastructure, sports, leisure, and appliance. Dow, as a leading manufacturer of EPDM, continuously innovates in the development of EPDM products and applications to achieve superior properties, including color stability property in automotive weatherstrip. In this Dow case study, the color stability properties of different EPDM rubbers were repeatedly measured over time (repeated measures). The objective of this study is to develop fundamental understanding of EPDM weatherstrip discoloration mechanism and validate hypotheses on EPDM microstructure factors. Efficient DOE strategy and proper statistical models are developed for cause and effect conclusion. We analyzed the data using two methods: linear regression and random coefficient regression. Linear regression completely pools the data by assuming a common variance for all samples across time. Random coefficient regression incorporates the sample-specific effects and provides more inference in variability between samples over time. We identified significant structure effects for color stability property by comparing different methods. In this poster, we demonstrate the power of DOE and statistical modeling for research and fundamental study.     Auto-generated transcript...   Speaker Transcript   Hello everyone, my name is Wenzhao Yang and I'm a statistician at Dow. Today I'm going to talk about statistical DOE and modeling development for repeated measures in rubber research. Before we talk about the methods, I want to give a little background for this talk. EPDM is a synthetic rubber widely used in applications such as transportation, infrastructure, sports, leisure, and appliance. Dow is a leading manufacturer of EPDM, and we continuously innovate and develop our EPDM products to achieve superior end-use properties in the applications just mentioned. This talk is focused on the EPDM-based automotive weatherstrip application, where one of the key performance metrics is called the color stability property. Color stability is measured repeatedly over time on the same experimental unit, which is defined as repeated measures in statistics, and time dependency may exist between repeated color measures, which is known as autocorrelation. A new technical development in our work is a Monte Carlo simulation-based DOE strategy for repeated measures that assesses the statistical power of detecting active effects prior to data collection. Let's move on to the objective and the methods. The objective in this application is to develop a fundamental understanding of the EPDM weatherstrip discoloration mechanism and validate hypotheses on EPDM polymer microstructure factors for the color stability property. The color stability test experiment follows industrial rubber manufacturers' standards, as shown in Figure 1. We start with a list of synthesized EPDM polymers with different microstructures. Then we blend them with the other formulation ingredients, held consistent, under the same process conditions. After compounding, curing, and sample preparation, samples are aged in a weathering chamber. Delta E, calculated from the LAB color measurements, is the critical performance metric for the color stability property: it quantifies the difference between the initial color and the color at different aging times of a cured sample.
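For reference, the standard CIE color-difference formula (presumably the Delta E used here) compares the L, a, b readings at aging time $t$ with the initial readings:

$$
\Delta E(t) = \sqrt{\big(L(t)-L_0\big)^2 + \big(a(t)-a_0\big)^2 + \big(b(t)-b_0\big)^2}
$$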
We developed a D-optimal DOE to select a representative subset from our available polymers. We used Monte Carlo simulation to evaluate the number of repeated measures needed to obtain 80% statistical power for detecting the main and interaction effects. The collected data have unequal time intervals among the repeated color measurements. Therefore, we developed a random coefficient model, RCM, which incorporates the sample-specific effects and provides more inference on the variability between samples over time. We also compared the RCM with a linear regression model, which completely pools the data by assuming a common variance for all samples across time. With the methods described, here are the key results of this work. Figure 2 shows the Monte Carlo simulation-based power analysis results for the main and interaction effects under different scenarios. If we expect a medium autocorrelation level between adjacent time points, about .5, the number of repeated measures should be at least nine per cured sample. If the autocorrelation between adjacent time points is very high, about .9, the statistical power drops significantly for most of the effects. Since we selected relatively large time intervals between repeated color measures for all DOE samples, we assume the repeated measures will have a medium level of autocorrelation; therefore, nine repeated color measures per cured sample were collected for this DOE. A general RCM is shown in Figure 3, where we have sample-specific random intercept and slope effects in addition to the main and interaction effects of the EPDM structural factors from the linear regression model. The random effects covariance parameter estimates table shows that there are significant differences in the starting Delta E and in the rate of change of Delta E among the cured samples. This indicates that it is really important to account for variability between samples over time. The profiler for the RCM shows that the confidence intervals around the prediction lines for our input factors are relatively narrow compared to the scale of Delta E in our data. Figure 4 shows that treating the data as independent in a least squares model can severely inflate the degrees of freedom, as shown in the top graph, so we would be overconfident about the significance of the model effects compared to the RCM, where we account for the time dependence in the data. The model prediction plot and the residual plot in Figure 5 show that the RCM has a good model fit and meets our model assumptions. Our conclusion for this work: we identified dominant microstructure factors and a significant interaction between two microstructure factors, suggesting an alternative polymer design. We developed a fundamental understanding of the EPDM weatherstrip discoloration mechanism and demonstrated the power of statistical DOE and modeling using JMP to support the development of new EPDM rubbers with superior color stability.
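For reference, a random coefficient model of the kind described above can be written roughly as follows (a sketch of the structure, not the fitted model; the exact fixed-effect terms depend on the DOE factors):

$$
\Delta E_{ij} = \big(\beta_0 + b_{0i}\big) + \big(\beta_1 + b_{1i}\big)\,t_{ij} + \mathbf{x}_i^{\top}\boldsymbol{\gamma} + \varepsilon_{ij},
\qquad
(b_{0i}, b_{1i})^{\top} \sim N(\mathbf{0}, \mathbf{G}), \quad \varepsilon_{ij} \sim N(0, \sigma^2)
$$

Here $\mathbf{x}_i$ holds the EPDM microstructure main effects and interactions for cured sample $i$, and $\mathbf{G}$ captures the sample-to-sample variation in starting Delta E (random intercept) and in its rate of change (random slope).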
Stan Young, CEO, CGStat Warren Kindzierski, Epidemiologist, University of Alberta Paul Fogel, Consultant, Paris   Researchers produce thousands of studies each year where multiple studies addressing the same or similar questions are evaluated together in a meta-analysis. There is a need to understand the reliability of these studies and the underlying studies. Our idea is to look at the reliability of the individual studies, as well as the statistical methods for combining individual study information, usually a risk ratio and its confidence limits. We have now examined about 100 meta-analysis studies, plus complete or random samples of the underlying individual studies. We have developed JMP add-ins and scripts and made them available to facilitate the evaluation of the reliability of the meta-analysis studies, p-value plots (two add-ins), Fisher’s combining of p-values (one script). In short, the meta-analysis studies are not statistically reliable. Using multiple examples, our presentation shares our results that support the observation that well over half of claims made in the literature are unlikely to replicate.     Auto-generated transcript...   Speaker Transcript Stan Young I'm going to present today JMP add-ins and scripts for the evaluation of multiple studies. Or you can call this how to catch p-hackers, people cheating with statistics, or why most science claims are wrong. I'm first going to describe the puzzle parts and how they fit together. Most science claims actually fail to replicate. I'm contending, and my co-authors are contending, that this is due to p-hacking and that it's a major problem. We're going to use meta-analysis and P-value plots to catch them, so this is how to catch a crook. The JMP add-ins and scripts compute P-values from risk ratios and confidence limits, which come from a meta-analysis, and then Fisher's combining of P-values, an old technique which is similar to meta-analysis technology. And then we're going to present a P-value plot, and we have a small script that will clean that up and make it into a presentable picture. Well, we see a bunny in the sky. There are lots of clouds in the sky, and if you sit out on a nice day and look around, you can probably find a bunny. So this is a random event; the bunnies are not actually in the sky, of course.
Gelman and Loken published a small article called "The Statistical Crisis in Science," saying that people are using statistics incorrectly, and that's what we'll talk about today. Let's run an epidemiology experiment. We have 10-sided dice, red, white, and blue, and they become the digits of a P-value (.046 in this particular case). Let's actually watch this happen in front of our eyes. Now we have a P-value. It's random. You did it yourself. It's so much fun, why don't we do it one more time? Red, white, and blue. Now, if you do that 60 times, you can fill in the table here. This is a simulation that I did with 10-sided dice, and you can see the P-values in four columns and 15 rows, so that's 60 P-values. My smallest P-value is .004. With 60 P-values you can work out the probabilities: about 95% of the time (1 - .95^60 is roughly .95), you will have at least one P-value less than .05. Here's one done by my daughter; she had three P-values less than .05, circled here. And running an epidemiology experiment is so easy that even my wife Pat can do it, and she has three potential papers here that she could write based on rolling dice and spinning a good story. P-values have an expected value attached to them. On the right, we have P-values of .004, .016, and .012, and attached to each P-value is a normal deviate; you can see that my .004 would have a normal deviate of 2.9, and so forth. On the left, we have the sample size, the expected smallest P-value, and its deviation. So if we had 400 questions that we were looking at, the expected P-value would be .00487 and the deviate would be 2.968. It's this deviation that is carried from the base experiments into the calculations of a meta-analysis, and we'll see that as we proceed. How many claims in epidemiology are true? I published a paper in 2011 where I took claims that had been made in observational studies, and for each claim I found a randomized clinical trial that looked at exactly the same question. In the 12 studies, there were 52 claims that could be made and tested. If you look at the column under positive, zero of those 52 claims replicated in the correct direction: the epidemiologists had gone 0 for 52 in terms of their claims. There were actually five claims that were statistically significant, but they were in the direction opposite of what had been claimed in the observational studies. We're going to look at a crazy question: does cereal determine human gender? If you eat breakfast cereal, are you more likely to have a boy baby? Well, that's what the first paper said. This paper was published in the Royal Society B, which is their premier biology journal. These three co-authors made the claim that if you eat breakfast cereal in and around the time of conception, you're more likely to have a boy baby. Two of my cohorts and I looked at this and asked for the data. We got the data and then published a counter to the first paper saying that cereal-induced gender selection is most likely a multiple-testing false positive. There were food questionnaires at two time points, one just before conception and one right around the time of conception, and there were 131 foods in each of those questionnaires, making a total of 262 statistical tests. If you compute P-values for those tests, rank order them from smallest to largest, and plot them against the integers (that's the rank at the bottom), you see what looks like a pretty good 45-degree line.
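Why a 45-degree line? If the null hypothesis is true for every test, the p-values behave like independent uniform draws on (0, 1), and the $i$-th smallest of $n$ such p-values has expectation

$$
\mathbb{E}\big[p_{(i)}\big] = \frac{i}{n+1},
$$

so the ordered p-values plotted against their rank fall near a straight line.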
So we're looking at a uniform distribution. Now, their claim came from the lower left of that plot: they said, well, here's a small P-value, and the small P-value says eating breakfast cereal will lead to more boy babies. Pretty clearly a statistical false positive. P-value plots: we're going to use P-value plots a lot. On the left, we see a P-value plot for a likely true null hypothesis, long-term exercise training in the elderly and mortality risk. On the right, we have smoking and lung cancer, and we see a whole raft of P-values tracking pretty close to zero all the way across the page and a few stragglers up on the right. So the right-hand picture is evidence for a real effect, and the left-hand picture supports the null hypothesis of no effect. Let's talk about meta-analysis, because I am going to use it during the course of this lecture. On the left, we see a funnel with lots of papers dropping into it. The epidemiologist, or whoever is doing the meta-analysis, picks what they think are high-quality papers and uses those for further analysis. On the right, we see the evidence hierarchy: a meta-analysis over many studies is considered high-level information, and down at the bottom are expert opinion and so forth. The higher you go up the pyramid, people contend, the better the evidence. We're going to look at two example meta-analysis papers. The first paper is by Orellano; it looks at air pollutants, nitric oxide, ozone, small particles and so forth, and asthma. It gathered data from all over the world, and it was sponsored by the WHO, so this is high-quality funding, a high-quality paper. It's a meta-analysis, and we're going to see if its claims hold up. The bottom one was really funny: patterns of red and processed meat consumption and risk for, basically, lung cancer and heart attacks. There's been a lot in the nutrition literature saying that you really shouldn't be eating red meat; we're going to see if that makes sense. Let's go back and look at the puzzle parts and see how they fit together. We know from the open literature that most claims fail to replicate; this is called a crisis in science. P-hacking is a problem: P-hacking is running lots of tests, trying this and trying that, and then, when you find a P-value less than .05, you can write a paper. We're going to use meta-analysis and P-value plots to catch the people that are basically P-hacking, and I call P-hacking sort of cheating with statistics; others have described it a little differently. We're going to be using JMP add-ins written by Paul Fogel that allow us to start with a meta-analysis paper and quickly and easily produce a P-value plot. We're also going to describe Fisher's combining of P-values, and we have a small script that will take the two-way plot that comes out of JMP and clean it up so that the left and right margins are more attractive. Well, here's long-term exposure to air pollution causing asthma, so they say. On the left, we have what's called a tree diagram; it looks sort of like a Christmas tree. The mean values are given as the dots, and the confidence limits are the whiskers going out on either side. On the left are the names of the papers that were selected by these authors, and on the right are the risk ratios and the confidence limits.
Now you can often just scrape the risk ratios and confidence limits off and drop them into a JMP data set, and we see the JMP data set on the right. The P-value plot add-in written by Paul Fogel then converts those risk ratios and confidence limits into P-values. Here we see that it's been done: the confidence limits give you the standard error; with the risk ratio and the standard error, you can compute a Z statistic; and from the Z statistic you get a P-value. On the right, we see a P-value plot coming out, just as it does with a couple of clicks in JMP. Here again, on the left we have the rough P-value plot, and on the right, after using a small script, we add the zero line, add the dotted line for .05, and clean up and expand the numbers on the X and Y axes. So we can now look at and judge all the studies that were in that meta-analysis, but we see a rather strange thing: there are a few P-values under .05 and then a lot of P-values going up, so we have an ambiguous situation. Some of the P-values look like the claim is correct, and others look like simply random findings. Let's take a look at Fisher's combining of P-values. On the left, we have the formula: you take each P-value, take the natural log, sum them up, and multiply by -2. That gives you a Chi-square, which you look up in a table. For this meta-analysis, we have the P-values; under each P-value, we have -2 ln(p), and you see that the few small P-values add substantially to the summation. If you think about it a little, this summation, like all sums, is subject to outliers shifting the balance considerably, so the few small P-values (.0033, .0073) add dramatically to the Chi-square summation. In fact, the summation is not robust: one extreme outlier can tip the whole equation. Keep in mind that P-hacking can lead to small P-values, and also that scientists quite often won't even publish a study that comes out non-significant, while if they find something significant they typically will. So there's publication bias: the literature is littered with small P-values, and that's the tip of the iceberg; under the iceberg are all the publications that could have happened but were not statistically significant. We're now going to look at an air pollution study. Mustafic, in 2012, published a paper in JAMA looking at the six typical air pollutants: carbon monoxide, nitrous oxide, small particles (that's PM2.5), sulfur dioxide, etc. If you look at these P-value plots, all of them essentially look like hockey sticks: there are a number of P-values less than .05, but then a substantial number of P-values go up along a 45-degree line, indicating no effect. We're contending that we're looking at a mixture of significant studies and non-significant studies, and we're further contending that the significant studies largely come from P-hacking. There are other ways that can arise, but we think P-hacking is the thing.
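The two calculations just described, backing a p-value out of a risk ratio and its 95% confidence limits and then Fisher's combining of p-values, can be sketched outside JMP like this (a plain Python illustration of the arithmetic, not the add-in or the script themselves; the example values at the end are made up):

```python
# Sketch: risk ratio + 95% CI -> two-sided p-value, then Fisher's chi-square combination.
import numpy as np
from scipy import stats

def p_from_rr(rr, lcl, ucl):
    """Back out a two-sided p-value from a risk ratio and its 95% confidence limits."""
    se = (np.log(ucl) - np.log(lcl)) / (2 * 1.96)   # CI width on the log scale gives the standard error
    z = np.log(rr) / se                              # Z statistic for log(RR) against 0
    return 2 * stats.norm.sf(abs(z))                 # two-sided p-value

def fisher_combined(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k degrees of freedom under the null."""
    p_values = np.asarray(p_values)
    chi2 = -2 * np.log(p_values).sum()
    return chi2, stats.chi2.sf(chi2, 2 * len(p_values))

# Made-up example:
# p = [p_from_rr(1.20, 1.05, 1.37), p_from_rr(0.98, 0.90, 1.07)]
# chi2, combined_p = fisher_combined(p)
```

As the talk points out, the sum is not robust: a single very small p-value contributes a large -2 ln(p) term and can dominate the total.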
We counted the number of outcomes, predictors, and covariates, and if you multiply that all out, the median number, which we call the search space, is 12,288. That means the authors, in the median, had roughly 12,000 opportunities to get a P-value of less than .05. They had substantial leeway to get a statistically significant result. (There's a short sketch of this kind of counting just after this passage.) We're going to look at the same simple counting for the nutrition studies. Here we have 15 studies, listed by the first authors on the papers, and across the top you see outcomes, predictors, covariates, tests, models, and search space. Take Dixon, for example: if they looked at three health outcomes, those are the outcomes, and if they had 51 foods in their study, those are the predictors. If they use covariates as well, that adds substantially to the search space. You can see that Dixon, theoretically, had 20 million possible analyses that he could have done, and if you look down the search space column, you can see substantial numbers of possible analyses for the other studies too. The nutrition studies were done with cohorts. A cohort is a group of people that is assembled, measured and questioned initially, followed over time, and then examined for health effects. Each of these cohorts has a named data set, and once a cohort is assembled, it can be used to ask, you know, a zillion questions. If the P-value for one of those questions comes out significant and the authors feel like they can write a paper, quite often they do. So, in the last two columns, we see the numbers of papers that arose; we used Google Scholar and actually checked it out. These are the numbers of papers that have appeared in the literature associated with each of these cohorts. I'll add that we've looked at a lot of these papers, and in none of them is there any adjustment for multiple testing or multiple modeling. Nutritional epidemiology and environmental epidemiology are the ones we're talking about here. Nutritional epidemiology uses a questionnaire, the food frequency questionnaire (FFQ). An FFQ can cover some number of foods; initially they started off with 61 foods, and I've seen FFQ studies with 800 foods. I wouldn't want to be the one to fill out that questionnaire. Given here are the numbers of papers that use FFQs over time. The technique was invented in the mid-1980s, and since 1985 there have been 74,000 FFQ papers; based on our looking at them, none of these papers adjusted for multiple testing, and all of them had substantially large statistical search spaces. For environmental epidemiology, I simply did a Google search for the words "air pollution" in the title of the paper, and we see that over time there have been 28,000 to 29,000 papers written about air pollution. So far as I know, based on a lot of looking, none of these papers adjust for multiple testing or multiple modeling, and essentially all of them have very large search spaces. Meta analysis usually goes under the name "systematic review and meta analysis," and starting in 2005, journals have used that term in the title of the paper. So starting in 2005, I asked whether the words "systematic review" and "meta analysis" appeared in the title, and you can see that it started off low.
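On the search-space counting mentioned above: the talk doesn't spell out the exact multiplication, so the rule in the sketch below is an assumption on my part, namely outcomes × predictors × 2^covariates, on the idea that each covariate can be either in or out of a model. With 3 outcomes, 51 predictors, and 17 covariates (the covariate count here is purely illustrative), that rule lands near the 20 million figure quoted for Dixon, but the counting the authors actually used may differ.

```python
# Hypothetical counting rule for the analysis "search space" -- an assumption,
# not a quote from the papers discussed in the talk.
def search_space(outcomes: int, predictors: int, covariates: int) -> int:
    """Count possible analyses: every outcome-predictor pair, times every
    subset of covariates that could be included in the model (2**covariates)."""
    return outcomes * predictors * 2 ** covariates

# Illustrative numbers loosely matching the Dixon example in the talk
print(search_space(outcomes=3, predictors=51, covariates=17))  # about 20 million
```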
There were about 1,500 papers in that first five-year period, and by 2021 there have been a total of 27,000 papers, so this is a cottage industry. These papers can be turned out relatively easily; a team, often in China, of five to 15 people can turn out one meta analysis per week, and their pay is rated on how many papers they publish, and so forth. Roughly half of these studies are observational and half come from randomized clinical trials. All the ones that we've looked at so far, particularly the observational studies, have this hockey-stick look: some P-values that are small and a bunch of P-values that look completely random. I will say that essentially all of these studies are funded by your tax dollars, or somebody's tax dollars; they are very lavishly funded by the public purse, is one way to say it. Many claims have no statistical support. The base papers do not correct for multiple testing and multiple modeling, and the base papers have large analysis search spaces. We've seen examples from environmental epidemiology and nutritional epidemiology, and based on the evidence, I would say most of these claims are unreliable. I say, and others have said too, that we have a science and statistics disaster, and with the use of meta analysis and P-value plots, these claims can be checked and either verified as true or not. Here we have four situations of smog. The upper left is London 1952; the right is Los Angeles 1948. The lower left is Singapore, and I don't remember the exact year for that; and then we have Beijing. Those last two are recent. In the case of the London fog, statistical analysis of daily deaths indicates that upwards of 4,000 deaths occurred in a three- or four-day period in London in 1952. That instigated the interest by epidemiologists in what was the killer. In the other three pictures, there was no reported increase in deaths during those time periods. The difference is that in London they were burning coal for heating and everything else, there was a temperature inversion, and the contention is that acid in the air was carried by particles into the lower lungs, and susceptible individuals, usually the old and the weak, died in increased numbers. But that is not happening now around the world. There is all kinds of pollution around the world, and we don't see spikes in death rates associated with it. Scams? Ha ha, I love scams. Are these scams? Air pollution kills. Any claim from an FFQ study, for example, that if you drink coffee, you're more likely to have pancreatic cancer; that's one of the claims that's probably not true. Natural levels of ozone, do they kill? I think that's not true either. Environmental estrogens. Any claim coming from a meta analysis using observational studies. Keep in mind, the evidence in the public literature is that 80% of science claims, usually coming from universities, fail to replicate when tested rigorously, for example in randomized clinical trials. Emotion and scare: the whole aim of practical politics is to keep the population menaced and scared, and hence clamorous to be led to safety, by menacing it with an endless series of hobgoblins, all of them imaginary. Flim flam: deception, confidence games involving skillful persuasion or clever manipulation of the victim. So H.L. Mencken in the 1930s said that practical politics is largely scare politics.
We finish up with the authors of this talk. I'm Stan Young, and I can be reached at genetree@bellsouth.net. Warren Kindzierski is a Canadian epidemiologist, and he's been working with me very closely for the last couple of years on papers based on the things we've seen here today. Paul Fogel is a very interesting statistician who lives in Paris, and he was responsible for writing the add-ins and scripts that we use. I will say that the scripts will allow you to go from a meta analysis to an evaluation in probably a little less than an hour. I'd like to recommend the National Association of Scholars report, Shifting Sands Report 1; this URL will get you there. The report is long and involved, but the statistics are simple, and it talks about air quality and health effects. And with that, I'll also make the following offer: if someone watching this wants to try out what's going on, we will give you the scripts and add-ins for JMP, and I will even help you look at and interpret your particular analysis of a meta analysis. So with that, I'll stop, and I'm prepared to answer any questions. Thank you.