Wayne Levin, President, Predictum
Farhan Mansoor, Software Engineer, Predictum

Innovation in industry requires the contributions of analytical knowledge – more specifically, formal and informal experimental data and predictive models – to product and process design. However, analytical knowledge is often stored, unmanaged, in isolated sources for use only by its creators. By our estimation, analytical knowledge is typically regenerated on average about 40 percent of the time, simply because prior, relevant knowledge was not made accessible to the people who could make effective use of it. Regenerating analytical knowledge carries higher risks, incurs unnecessary costs and delays the achievement of business goals. The future success of technical problem solving and process innovation requires a modern knowledge management strategy. Companies that adopt such a strategy will dramatically reduce the time and effort in the daily work of engineers and scientists, not only preventing the needless duplication of experiments but also extending the use of past experiments and improvement initiatives. Wayne and Farhan will present a use case to demonstrate the power of managing analytical knowledge through CoBase, an enterprise-level knowledge management system that enables engineers and scientists to share access to, and collaborate on, past analyses and relevant supporting data.

Auto-generated transcript...

Wayne Levin Well, thanks very much for joining us. My name is Wayne Levin, and joining me in our presentation about knowledge management for faster problem solving and reduced time to market in engineering and science is my friend and colleague, Farhan Mansoor. Farhan is going to help me with the demonstration part of this. Why don't we start with a little agenda here. I'm just going to give a quick introduction to us (we're Predictum, as a company), just so you know a little bit about us. And the real focus is going to be how to improve productivity in science and engineering. That's what we're here to talk about today. So, a little bit about us. Our goal is to accelerate problem solving, improvement, and research and development. We do that through analytical training and consulting, and we build integrated analytical systems. We're going to look at an example of that today, with CoBase. Just so you know, we're also a JMP partner, and have been for a long time now; we've been associated with JMP for close to 25 years. Predictum as a company started in March 1992, so almost 29 years. So that's a little bit about us; now let's get to the matter at hand. I want to talk about how managing knowledge, not just data, as an asset will dramatically improve productivity among your researchers, engineers, and business analysts. And so I'm going to start by asking you a question: how often does someone regenerate what was already known in your company? If you had to put that on a scale between zero and 100, what would you say? I'd just like you to plant a number in your head, okay? There's no need to confess. So it's some problem that was solved, but it was solved before, and others have dealt with that problem; or some insight, some relationship or association between variables, and chances are others have probably made that discovery already. That's what I mean by regenerating knowledge.
Do you have a number in mind? Well, we've been asking companies, dozens of companies now, since the fall of last year, and this is what we've seen: it's on average about 40%, with a pretty big range as far as that goes. So nobody really has a solid number for that, and I think that actually speaks to the problem, that we don't have a solid answer. But when thinking about it, we have some people who just roll their eyes and go, all the time. We had one company a few weeks ago where three people were on the call. One said 80%, one said 60%, one said, oh, at least 50. And we do get some who are lower; the lowest I've ever heard is about 20%. Why do we tolerate that? In what other aspect of business would we tolerate something like that? So, regenerating what was already known imposes higher risks, it costs money, it delays objectives, and it's really a lost opportunity to accelerate problem solving, improvement and R&D. I want to make sure we're clear: this is the problem we're trying to tackle here. Now, you probably know that expression. I'm not going to fill in the blank; I'll leave it to you. Blank happens. So you've probably experienced these situations: the phone rings, an email comes in, there's some customer issue or some production problem, or some obstacle has surfaced and it's causing delays in a new product introduction or in process development and characterization. And you're frantic, right? You've got to take care of this, and you're thinking, okay, who knows how to solve this? Who's got that knowledge? Basically you're trying to identify the right people, and that can be difficult; sometimes it's hard to do that. Or you can identify some good candidates and they're not available, or they've retired, or they're away on vacation, or they've been reassigned and you're not allowed to talk to them anymore. They're simply gone; they're just no longer with the organization. How do you handle that? This is a problem we want to avoid, and this is why we say companies typically manage their materials and spare parts better than they do their knowledge. I hope that doesn't sound too harsh, but I'd like you to think about that for a moment. I was asking you earlier about how much knowledge is regenerated, and you probably don't have a hard number for that, but I'll bet you probably have a pretty good number on what your work-in-progress inventory is, or spare parts inventory, or raw materials inventory. You can probably find somebody and... So why is it that we manage those assets very well, but we don't manage knowledge very well? And that's because knowledge is... well, this is where it's at, right? The brain. We like to say the brain is a great knowledge creator, but it's a lousy knowledge container. We can't access it. We can't index it. It walks out the door at the end of the day. So knowledge cannot accumulate if the brain is the primary storage device for knowledge. And we believe that companies should always accumulate knowledge. So let's look at what's involved in knowledge creation. First, what is knowledge? What do we mean by knowledge? Knowledge is what allows us to predict. At its very core, we're talking about a prediction formula, because we use a prediction formula to predict.
And so we can associate height to weight, or height, age, and sex to weight, and all of us who use JMP are familiar with these things. So that's the third component, if you will: a predictive model, a prediction formula. But when we have that, we would like to know something about the analytical method or the process that was used to generate that prediction formula, and of course, those of us with JMP have those; we save them as scripts so that we can regenerate that formula. The formula itself would be saved as a column in the data table, and that's one of the things we do with the analysis. So that's what's happening with number two, and of course the primary thing, the first point, is just data. But as I said at the beginning, data is not knowledge. Data is data. It's the raw material that we get from instrumentation; it helps us understand what's going on with products or with processes. So we really need these three components. Knowledge is what allows us to predict, but we would like to go back a couple of steps as well, to really have the full context of that knowledge. So let's think about knowledge creation. We have engineers, scientists, analysts; I'm going to think of it from a scientist's point of view. They are using their instruments, doing their work, collecting their data. Now, we find all too often it's kept on spreadsheets, but of course we're talking JMP data tables as well, and that's a terrific thing. They collect the data, they do their analyses, and this is why we call it personal computing. That picture of the computer there is just meant to remind us that personal computing has been with us a long time, and it's relevant, it's important, it's still the case. Analysis does happen in a brain. It's a personal endeavor that really can't be split up, at its core. So we've got a bunch of people who are doing this, and as a result, this work is typically siloed. Where do these files go? Some people will save them on SharePoint or shared network drives. That's great, but primarily they end up on a laptop, and so the work is inherently siloed, just because that's the nature of it. And researchers can't easily access the experience of others. If we've got a problem we want to solve and we're thinking about it as an experiment, we might call together a group of, let's say, five people in a room for an hour, or ten people for an hour, and brainstorm what factors to include, what levels to go to, and all that. It's a good and worthy activity, but wouldn't it be better not to start from zero and not to occupy those people? That's a good amount of time being used. So what we'd like to do is be able to take the experience saved in these JMP data tables and related files (and Farhan's going to show us this) and make it easy to put it in a database. This is what we call CoBase, and that way, new initiatives don't start from zero. Anybody can go in and look up who else has looked at a particular problem or a particular area. And this way, they'll never unknowingly pay for the same insights more than once. This is an important point as well.
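As a way to picture the three components Wayne describes (the data, the method that produced the model, and the prediction formula itself) bundled into one identifiable package, here is a minimal, hypothetical sketch in Python. The field names, file names and tag values are illustrative assumptions only, not CoBase's actual data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KnowledgeBundle:
    """One packaged unit of analytical knowledge: data, method, and model."""
    data_table: str                  # path to the JMP data table (or an export of it)
    analysis_scripts: List[str]      # saved scripts that regenerate the analysis
    prediction_formulas: List[str]   # formula columns produced by the analysis
    factors: List[str] = field(default_factory=list)
    responses: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)

# Hypothetical example: a deposition experiment packaged so it can be indexed
# and found later (response name and script name are made up for illustration).
bundle = KnowledgeBundle(
    data_table="Exp 18-11-01.jmp",
    analysis_scripts=["Fit Least Squares"],
    prediction_formulas=["Pred Formula Film Uniformity"],
    factors=["Deposition Rate", "Temperature"],
    responses=["Film Uniformity"],
    tags=["CVD Improvement", "Series One"],
)
print(bundle.factors)
```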
I find too often in experimentation, in the work that folks do, because it's siloed, they don't know that what they may be seeing is inconsistent with what others have seen in the past. This way they can look and see whether maybe they're dealing with Type I or Type II errors, and it gives them another dimension that really should be considered, that historical dimension that's often not brought forward, because typically it can't be. So when we talk about capturing, preserving and reusing knowledge, obviously there's data, like I said. Many companies will have databases, of course, so they'll have LIMS systems, and this is terrific; this is good for preserving data. Some will have a formal document management system, and if not, they'll have some way of organizing and filing reports: PowerPoints, standard operating procedures, the results of the analyses that engineers, scientists and other analysts are doing. But that's where we want to focus. We want to talk about preserving that knowledge creation work and making it identifiable, making it so others can find it. If we want to manage knowledge like an asset, it requires that you bundle it, first of all, with data and reports (that would be a good idea), as much of that as is relevant, and package it. Because when it's packaged, it's identifiable. I keep something on my desk here. I bought a USB cable from Amazon. It came in this box; I hope this shows up all right. It's got a barcode there. It's an asset. It was to Amazon, and it is to me now, and it's identifiable, so we can search for it. And if you can search for it, then you can retrieve it, and if you can retrieve it, then you can reuse it. You can reference it, you can challenge that knowledge (we think all knowledge is open to be challenged, of course), and we can improve on that knowledge. And finally, managing it as an asset means that we keep it secure, that the knowledge is under your control. Okay, so we have a couple of products in this area. We're going to switch over to the demonstration part of this talk. First is SashLab. We're not going to demonstrate that, but if you want to read more about it, you can download these slides and read up on it, and of course you can contact us. What it is, is a virtual lab or a digital twin. It allows you to experiment or check things out virtually before making changes or experimenting physically. What CoBase is designed to do is capture everyday research, experimentation, analyses and improvement initiatives, whether they're formal or informal. It puts them into a database and tags them. Farhan is going to show us the tagging, and remember, I said the knowledge needs to be identifiable, so tagging is one way; indexing by factors, by responses and by domains is another. We'll talk a bit about that, and this way anybody can go and look up the knowledge that was generated by others. And both of these applications, as it says at the bottom here, capture explicit knowledge as an asset. You want to keep it explicit. We don't want to just have opinions or notions about what's happening; when we ask someone, we'd like to say, hey, show me the data, the method by which it was modeled, and the resulting model. It's hard that way.
It's explicit. So we've got the model, we've got the development method, we have the underlying data, and we make them available for reuse by others. This way it avoids the delays and costs, and I'm going to say the anxiety, associated with searching for and regenerating knowledge. So if we could, Farhan, why don't you take over the screen, and we'll begin a demonstration here. While Farhan's bringing that up, let me just add one of the things we hear from people when we talk to them: it's hard to search for things to begin with, but it's really hard to search for things that you don't know exist, because that kind of search can go on forever. This is kind of what I mean by the anxiety. It's tiring to have to deal with that, and when we are dealing with re-search, we would like that research not to have to involve searching for past knowledge. We want to make that easy. So, Farhan, you've got the CoBase interface, the primary GUI, open? Farhan Mansoor Yes, the homepage, yep. Wayne Levin So Farhan, let's go and do a search, because I've got a problem here. Just before we do this, let me describe a little of what's going on. We're not going to go into all the detail here; we just don't have the time for it. The example we're using is a manufacturing process. Down in the bottom left, we have the various steps in semiconductor manufacturing, and it's just a way of grouping factors. So what I'm interested in is, let's say, the deposition step here. And I'm curious: if we look at deposition, let's look at the deposition rate, which is a factor here. I just want to know, has anybody looked at this in the past, deposition rate, let's say between 700 and 3,000 angstroms? There could be a hundred reasons why I want to know this, and I just want to know, well, what did they look at? What were the other factors they were looking at? What were the responses? I just want to see what's there, because I want to understand this factor. So go ahead, Farhan, you fill this in... great. You clicked on search, and there we are. We've got a bunch of files; we've got about seven files there, and just looking at them, one goes back as far as 2016. So this is work that people have done previously; they've uploaded it to CoBase. Now, it can be hard to look at all those files at once. We can download them, and we will do that momentarily, but on the right, we're looking at them at a glance. Farhan, let's look at the parameter distributions first of all. We see that three of these JMP data tables also included argon flow, which was looked at over the same levels in each, and backside flow was involved in a couple. There's deposition rate, so there are seven across there, and we see the different levels there, but they're all between 700 and 3,000, and there are some others as well. We can also look at just some statistical summaries. We're going to be adding more here, but the idea is just to be able to look at some things at a glance, so we get an idea of what's going on. So on the left, we have R squares; on the right, we have root mean square errors.
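To make the kind of lookup Wayne describes concrete, here is a small, hypothetical Python sketch of searching a parameter index by factor name and level range. The table layout, the file names other than Exp 18-11-01, and all the values are made up for illustration; the talk does not show CoBase's actual schema or query logic.

```python
import pandas as pd

# Hypothetical parameter index: one row per (file, parameter) pair, recording
# the range of levels that parameter took in that study.
index = pd.DataFrame([
    {"file": "Exp 16-03-12.jmp", "parameter": "Deposition Rate", "low": 800, "high": 2500},
    {"file": "Exp 18-11-01.jmp", "parameter": "Deposition Rate", "low": 700, "high": 3000},
    {"file": "Exp 18-11-01.jmp", "parameter": "Temperature",     "low": 350, "high": 450},
    {"file": "Exp 19-02-07.jmp", "parameter": "Argon Flow",      "low": 20,  "high": 60},
])

def find_studies(param, low, high):
    """Return files whose levels for `param` fall inside [low, high]."""
    hits = index[(index["parameter"] == param)
                 & (index["low"] >= low)
                 & (index["high"] <= high)]
    return hits["file"].tolist()

print(find_studies("Deposition Rate", 700, 3000))
# ['Exp 16-03-12.jmp', 'Exp 18-11-01.jmp']
```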
Each of the vertical arrangements of dots relates to a JMP file, so I'm just looking at Exp 18-11-01. There you are, Farhan, yeah. So those are three models that have been produced, three prediction formulas that have been saved, and we see the R squares vary quite a bit there. Farhan, why don't we download that file and have a look at what's there. There we go. Awesome. By the way, the modeling type also happens to be shown: bivariate, or fit least squares if they're blue. So we get an idea of what it is. So there we go. Why don't we just run those three scripts there, because the prediction formula is there, and how it was generated is there. Farhan's just rerunning it. And of course the data is there, so we really have the full context of just what was being done. We can see there what the work was. Thanks, Farhan, you're arranging it on the screen. I think you ran one twice by the looks of it; they look identical. Farhan Mansoor Oh yes. Wayne Levin Yeah, but that's all right. Even if we just look at these two, that's fine. Notice that the one that's significant involves temperature. In the one that isn't, over on the left of it, temperature's not involved, and that may explain why it's not a significant model. We could go any number of ways with this, but I hope you get the idea: we want to be able to search for something, quickly find out what's available, get at a glance what was going on, and then look at it as much as we want. We may want to now take this data; maybe it solves a problem, maybe the problem I'm dealing with is solved right here. So boom, I don't need to do anything further; I've got my problem solved. Or maybe I want to augment this design and add some other runs or some other factors. Or maybe it's just, hey, I see that I need to involve temperature when I go forward, and I may not vary it, but I know I'm going to keep it at a particular level as I go forward, to improve the power of subsequent studies that I may do. So you see what I'm saying: I'm drawing knowledge from the past, so I'm not beginning from zero. That's the idea. Why don't we do another quick lookup, Farhan, to show another way of looking things up. There are various other ways, but I think a common way would be by tags. The tags are completely customizable, so we've got, what do we have there, analyst, project, and so on. You've probably had this; I know I've had this: somebody new comes in, we assign some work to them, and we say, hey, why don't you look it up. Farhan, let's do it by project, and let's say, hey, why don't you look up the CVD improvement project, and there's another one that's kind of like it, YieldPlus. Go and look at the data and the analyses that were done with those, and then go ahead and click search, and bang: there are the files that were associated with it. What you're seeing here, by the way, are all JMP files, but we're going to show you that on the upload, you can include non-JMP files as well. So you may want to include some pictures, pictures of defects, pictures of equipment, instructions, other documents, anything you want, and they would be listed there as well. I'm sorry, I'm pointing to my screen.
They'd be listed there as well. So there are other ways to do searches, more comprehensive searches if you will, but I think you get the idea. And I'd like to ask, Farhan, would you mind taking us through the upload? Yeah, why don't you talk us through that, okay? Farhan Mansoor Let's go to the upload interface. What I'm going to do is add some sample files to demo the upload process. You can upload both JMP files and any other kind of non-JMP file, so I'll show you the difference in the process for both. What I'm doing is picking some PDFs, docx files, spreadsheets and one JMP file. Now, if your file format is JMP, then CoBase will parse some information out of the file; for example, it will list the column names, as well as the units, models, data set, that kind of information. It will also try to guess a standard name. I will show at the end how to set this up, but admin users can set up standard names for various parameters, and users can choose the standard parameters from here. If they already exist in the system, CoBase will try to guess. So, for example, in this case temperature has been associated with the temperature parameter that exists in CoBase right now. Argon flow doesn't have an assigned parameter right now, because it's new to the system, so the user can go and pick a standard name. If I know that Argon FLW is the same as argon flow in the deposition step, I can select that one as my standard parameter. Now, this is optional. Users don't have to do it, but if they standardize their parameters, their columns, then it just makes it easier to search for various things. They could also come back and do it later, but right now it's an optional process. Wayne Levin One of the keys around this, just to make clear: the column names on the left are from the JMP data table that Farhan identified. And in order to search for something, we have to have a standard; we have to agree to a standard. So basically what CoBase is allowing us to do, which is what the "Co" stands for, is collaborate asynchronously with colleagues. There are a lot of "co" words here; we're cooperating. So we have to agree on the names, and what we're trying to do is make it really easy, when you go to upload, to assign the proper names. Farhan just said that it's optional. That's true, because we want to make sure that people upload. We know if we make it difficult, they're not going to do it. They'll say, oh, I'll do it later today, and then they won't. Then it's, I'll get to it tomorrow, and then they don't. That type of thing. The other thing we can do is identify tables that have been uploaded without standard names in them; that would be an admin function, and they can be corrected later. So the nomenclature, the standard names, are essential. If we're going to cooperate, we have to agree to a standard, so that's a big part of it. And something we just added recently, Farhan, you just mentioned: the units of measurement.
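Farhan describes CoBase guessing which standard parameter an uploaded column corresponds to. Purely as an illustration of that idea (the talk does not reveal how CoBase actually performs the matching), here is a hypothetical Python sketch that normalizes column names and suggests the closest standard name; the standard-name list is assumed.

```python
import difflib

# Hypothetical standard parameter names an administrator might have defined
# for the deposition domain.
STANDARD_NAMES = ["Argon Flow", "Backside Flow", "Deposition Rate", "Temperature"]

def normalize(name: str) -> str:
    """Lowercase and strip everything but letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def guess_standard(column: str):
    """Suggest a standard parameter name for an uploaded column, if any."""
    normalized_standards = [normalize(s) for s in STANDARD_NAMES]
    matches = difflib.get_close_matches(
        normalize(column), normalized_standards, n=1, cutoff=0.6
    )
    if not matches:
        return None  # new to the system; the user picks a standard name manually
    return STANDARD_NAMES[normalized_standards.index(matches[0])]

print(guess_standard("Temperature"))  # -> 'Temperature'
print(guess_standard("Argon FLW"))    # -> 'Argon Flow'
print(guess_standard("Thickness"))    # -> None
```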
So that's part of this as well. If you're uploading, you'll be reminded of what the unit of measurement is, or what it should be, and we facilitate changing that if, indeed, somebody measured in millimeters but the standard is centimeters or nanometers or whatever. So anyway, we want to facilitate that standardization. What else should we mention here? There's the tagging you can do, comments... go ahead, Farhan. Farhan Mansoor Yeah, so you can add a comment at the file level, or you can add a comment at the batch level and have general notes about the entire upload. The tags you can also add here; these are the preexisting tags. So if I want to add a tag, I can add a Series One technology tag, things like that. Wayne Levin Okay, why don't we upload it. Farhan Mansoor Yep, let's upload this. Wayne Levin So just so you know what this consists of: on the back end, it's a SQL Server database, and on the front end, it's a JMP add-in, so we're running this in JMP, obviously, and it's installed like any other add-in. There's some configuration that needs to happen, but CoBase can be installed and up and running literally in minutes. It really just depends on you. You have a script to install the database, you double-click to install the add-in, you do your configuration, and boom, you're up and running. So it's pretty easy to do that. Oh, we're up there. Why don't we do a quick search, yeah. Farhan Mansoor So once it finishes uploading, it will give you a batch ID for reference, so you can also look it up by that. So if I do a search for that... Wayne Levin Of course you can look it up based on the factor names and so on as well, but we're just going to put the batch ID in there, so we're focused on this. Farhan Mansoor So these are the files I just uploaded. You can see the similar plots, the parameter distribution plots; if there are any models, they will show up here, and since it has only one JMP file, there's only so much to show. We can download the files as well, so if I download a JMP file, it will open within JMP, but if it's a non-JMP file, for example a doc file, it will open with your default doc viewer, so in my case it would be Microsoft Word. Wayne Levin Right, so this flows back into the original search. If we search, hey, who's looked at argon flow, or what have you, we not only get the JMP file like what we see here, but we'd see these other files associated with it as well; they would come back at you as well. So that's a little bit about uploading. Again, we're trying to facilitate the standardization, and we're trying to make it easy, really easy, to do. Now, of course, you'd have a bunch of CoBase users out there, and you'd also have a few people who have administrative privileges. Why don't we have a quick look at that, Farhan? We won't get too deep into this; if you want to see or talk more about it, we can do that during the questions, or you can contact us afterward at any point. So, are you... there we go. Farhan Mansoor On it. Yeah. Wayne Levin I want to say just briefly how setting up the parameters is done here. Farhan Mansoor Yeah, so on the left you see all the steps, or, well, we call them domains. These could be your production steps or product components or subcomponents.
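Wayne mentions the architecture (a SQL Server database behind a JMP add-in) and Farhan mentions the batch ID that each upload receives. Purely to illustrate that bookkeeping idea, here is a toy Python sketch that uses sqlite3 as a stand-in for the database; the table names and columns are hypothetical and are not CoBase's actual schema.

```python
import sqlite3

# In-memory toy database standing in for the SQL Server backend.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE batches (batch_id TEXT PRIMARY KEY, uploaded_by TEXT, note TEXT);
CREATE TABLE files   (file_id  INTEGER PRIMARY KEY,
                      batch_id TEXT,
                      name     TEXT,
                      is_jmp   INTEGER,
                      FOREIGN KEY (batch_id) REFERENCES batches (batch_id));
""")

# Record one upload batch and its files (hypothetical names).
con.execute("INSERT INTO batches VALUES (?, ?, ?)", ("B-0042", "farhan", "demo upload"))
con.executemany(
    "INSERT INTO files (batch_id, name, is_jmp) VALUES (?, ?, ?)",
    [("B-0042", "deposition_runs.jmp", 1),
     ("B-0042", "defect_photos.docx", 0)],
)

# Look an upload back up by its batch ID, as shown in the demo.
for name, is_jmp in con.execute(
        "SELECT name, is_jmp FROM files WHERE batch_id = ?", ("B-0042",)):
    print(name, "JMP file" if is_jmp else "non-JMP file")
```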
And if I click on one of the steps here, it will show you all the parameters that currently exist in this particular domain and also the subdomains. Admin users can come and add new parameters or edit existing ones, so this creates those standard names that users can then select during upload. The admin users can also assign a standard unit, so on the right side you see all the standard units associated with the standard parameters. Wayne Levin Right, and then there are the tags there. I'm just going to go back to the parameters for a moment. You can change the names of these parameters. I'm sorry, we missed a little something there; we could show you that. Remember, we uploaded a table and we changed the name of argon flow. Well, when we download that table, it will have the correct name. And if we ever decide, for whatever reason, that we want to change some of these standard names over time (you may decide something else is a little more descriptive, or you may just want to change it), you can do that. You can change them here in the admin panel, and that will make the changes within the system, so that now you can search based on those new names and the history will still be brought forward. So we've added that flexibility. It was one of the most difficult things, maybe the most difficult thing, in terms of building CoBase to begin with. I'm sorry, Farhan, I was taking you away, but let's look at the tags, just so they get a sense of that. Farhan Mansoor You have a set of tags that admin users can create. Here you can add new tag types or new tags inside tag types; right now we can see a few examples here, for example technology tags, study type tags, things like that. Wayne Levin Yeah, for technology, we have one company who said, look, we have different eras, different technologies, and we don't want to throw away stuff that was done for a prior technology. So they wanted to be able to name that, and indeed they are able to do that. It's obvious you'd probably want to tag by analyst, so you can go by somebody's name, or by some project ID. Those are pretty obvious tags, but you can create any tags that you want, and you can add tags anytime you want, as they occur to you. So that's the demo side of this that we wanted to show you. I hope that gives you a flavor for it, and again, we welcome any questions or comments that you may have. I'm going to switch it back over to my screen. Thank you, Farhan. And see if I can get this. Okay, so we're happy to entertain any questions or thoughts that you may have. Oh goodness, I'm sorry, we're going to have to edit this out; this is the wrong slide, so I'm going to back up. If you have any questions or comments, the slides will have our contact information when you download them. Feel free to reach out. We'd be happy to do a more extensive demonstration, or talk about some challenges or problems you may have and how we might be able to solve them with CoBase. And I really appreciate your interest. Thank you.
Sunday, March 7, 2021
Laura Castro-Schilo, JMP Senior Research Statistician Developer, SAS
James R. Koepfler, JMP Research Statistician Tester, SAS

This presentation provides a detailed introduction to Structural Equation Modeling (SEM) by covering key foundational concepts that enable analysts from all backgrounds to use this statistical technique. We start with comparisons to regression analysis to facilitate understanding of the SEM framework. We show how to leverage observed variables to estimate latent variables, account for measurement error, improve future measurement and improve estimates of linear models. Moreover, we emphasize key questions analysts can tackle with SEM and show how to answer those questions with examples using real data. Attendees will learn how to perform path analysis and confirmatory factor analysis, assess model fit, compare alternative models and interpret all the results provided in the SEM platform of JMP Pro.

Auto-generated transcript...

Laura Castro-Schilo Hello, I'm Laura Castro-Schilo, and welcome to this session, where we're going to learn the ABCs of structural equation modeling. Our goal today is to make sure that you have the tools needed for specifying and interpreting models using the Structural Equation Models platform in JMP Pro 16. We're going to do that first by giving a brief introduction to SEM, telling you what it is and, particularly, drawing on the connections it has with factor analysis and regression analysis. Along the way, we're going to learn how path diagrams are essential tools for SEM. We're going to try to keep that introduction fairly brief, so we can really focus on some hands-on examples. Prior to those examples, I'm going to introduce the data that we're going to use for the demo; these data are about perceptions of COVID-19 threats. Looking at those data, we're going to start learning how we specify and interpret our models, and we're going to do that by answering specific questions. Those questions are going to lead us to talk about very popular models in SEM, one being confirmatory factor analysis and another multivariate regression analysis. And to wrap it up, we're going to show you a model where we bring both of those analyses together, so you really can see how SEM is a very flexible framework where you can fit your own models. Okay, so SEM is a framework where factor analysis and regression analysis come together. On the factor analysis side, we gain the ability to measure even those things that we cannot observe directly, also known as latent variables. And from the regression side, we're able to examine relations across variables, whether those are observed or unobserved. So when you bring those two together, you get SEM, which, you can imagine, is a very flexible framework where all sorts of different models can be fit. Path diagrams are really useful tools in SEM, and the reason is that the systems of equations, which can be fairly complicated, can actually be represented through these diagrams. As long as we know how to draw the diagrams and how to interpret them, we're able to use them to our advantage. So here we have rectangles, and those are used exclusively for representing observed variables.
Circles are used to represent latent variables, double-headed arrows are used for representing both variances and covariances, and one-headed arrows are for regression or loading effects. There's another symbol that's often used in path diagrams, and it's outside the scope of what we're going to talk about today, but if you come across it, I want to make sure you know it's there, and that is a triangle. Triangles are used to represent means and intercepts, and so there are all sorts of interesting models we can fit where we model the mean structure of the data, but again, we're not going to have time to talk about those today. Now, when it comes to path diagrams, I think it's useful to think of the building blocks of SEM models, so that we can use those to build complex models. One of those would be a simple linear regression. So here you see we have a linear regression where Y is being regressed on X, and notice both X and Y are in these rectangles, in these boxes, because they are observed variables. We're using the one-headed arrow to represent the regression effect, and the two-headed arrows that start and end on the same variable represent, in this case, the variance of X, and in the case of Y, the residual variance of Y. If a double-headed arrow were to start at one variable and end at the other, then that would be a covariance. Now, in SEM any variable can be both an outcome and a predictor. So in this case, Y could also take on the role of a predictor if we had a third variable Z, where Y is predicting Z. We can build sequential effects, these types of sequential regressions, as many as you need, depending on your data. Another building block would be that of a confirmatory factor model, which is basically the way we specify latent variables in SEM. This particular example is a very simple one-factor, one-latent-variable confirmatory factor model where the circle represents the unobserved latent variable. Notice that the latent variable has one-headed arrows pointing to the variables that we do observe, in this case W, X and Y. The reason that variable points to those squares is because in factor analysis, the idea is that the latent variable causes the common variability we observe across W, X and Y. This is really important to understand, because it's often confused with principal components from a principal components analysis perspective. So I think this is a good opportunity to draw the distinctions between latent variables from a factor analytic perspective and components from a PCA perspective, so I'm going to take a little bit of a tangent to explain those differences. In this image, the squares represent the variables that we measured, those observed variables. Notice I'm using different amounts of blue shading on those variables to represent the proportion of variance that is due to what we intended to measure, sort of the signal, the things that we wanted to measure with our instruments. The gray shaded areas are the proportion of variance that is due to any other sources of variance. That can include measurement error, but it can also include systematic variance that is unique to each of those measurement instruments.
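For readers who prefer equations to diagrams, the two building blocks Laura describes can be written out in standard SEM notation. This is a generic textbook sketch, not output from the platform; the symbols (β, λ, ζ, δ, φ, ψ) follow common conventions and are not named in the talk.

```latex
% Simple linear regression as a path model: a one-headed arrow from x to y,
% plus double-headed arrows for Var(x) and the residual variance of y.
\[
  y = \beta x + \zeta, \qquad \operatorname{Var}(x) = \phi, \qquad \operatorname{Var}(\zeta) = \psi
\]
% One-factor confirmatory model: the latent variable xi "causes" the common
% variability in its indicators; the deltas are the uniquenesses.
\[
  w = \lambda_w \xi + \delta_w, \qquad
  x = \lambda_x \xi + \delta_x, \qquad
  y = \lambda_y \xi + \delta_y
\]
```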
And so, in the case of factor analysis, the latent variable captures all of that common variability across the observed variables, and that's why we're using this solid blue to represent the latent variable. That's in contrast to what happens in principal component analysis, where the goal is dimension reduction. In PCA, the component is going to explain the maximal amount of variance from the dimensions of our data, and so that means the principal component is often going to be a combination of the variance that's due to what we wanted to measure and some other sources of variance. All right, and so again, the diagram also illustrates the causal assumption, the fact that latent variables are hypothesized to cause the variability in their indicators, in the observed variables, and that's why those one-headed arrows are pointing toward the observed variables; that's not the case in PCA. Alright, so I think this is a useful distinction to make: when we're talking about latent variables in SEM, very often what we're talking about is latent variables from a factor analysis perspective. Okay, so here I've chosen to show you a path diagram that belongs to a model that's already been estimated. We have all of the values here on these arrows because those are all estimates from the model, and I think this diagram does a good job of illustrating why one might use SEM. First, we see that we have unobserved variables. Right here, Conflict is an abstract construct that we can't necessarily observe directly, so we're defining it as a latent variable by leveraging the things that we do observe; in this case, we have three survey questions that represent that unobserved Conflict variable. We are also able to account for measurement error. The way latent variables are defined in SEM assures us that we are, in fact, accounting for measurement error, because those latent variables are only going to capture the common variance across all of these observed variables. Also notice that we are able to examine sequential relations in SEM. We have this unobserved Conflict variable, but we're also able to see how this Support variable influences Work, how this Work variable in turn influences the latent variable, and ultimately how Conflict, the unobserved variable, can predict all sorts of other outcomes. These sequential relations are very useful and very easy to estimate in SEM. Another good reason to use SEM is that in JMP Pro, our platform uses cutting-edge techniques for handling missing data. So even if you have a simple linear regression and that's really all you need, if you have missing data, SEM makes sure that everything that's present is used for estimation, and that can be very helpful as well. If what I've said so far piques your interest and you plan on learning more about SEM, without a doubt you're going to find a lot of terminology that is unique to the field. Like anything else, there's jargon we need to become familiar with, and this diagram is also useful for introducing some of that jargon. First, we've been talking about observed variables or measured variables; in SEM those are often called manifest variables. We have latent variables, which we discussed already.
But we also have this idea of exogenous variables, and those are the ones that only predict other variables. In our model here, we have only two of those, and they are in contrast to endogenous variables. Every other variable here is an endogenous variable, because they have other variables predicting them. We also have latent variable indicators, and these are the variables that are caused by the latent variables. The residual variance that is not explained by the latent variable is called the uniqueness; these are often also called unique factor variances. Remember that this is the combination of systematic variance that is unique to that variable and measurement error. I find it useful when people are learning SEM to shift their focus on what the model is really doing. In other words, realizing that we're doing multivariate analysis of a covariance structure (and also means, but remember that we're not talking about means today), realizing that what we're actually analyzing is the structure of the covariances of the data, helps us wrap our heads around SEM a lot more easily, because it has implications for what we think the data are. So, for example, we can have our data tables, where each row represents a different observation and each column is a different variable, and we can definitely use those data to launch the SEM platform in JMP. But in the background, behind the curtain, what the platform is doing is looking at the covariance matrix of those variables, and that is, in fact, the data being analyzed. This also has implications for the residuals; oftentimes when we think about residuals in SEM, those are with respect to the covariance matrix we're analyzing. And this is also true for the degrees of freedom, which are going to be with respect to this covariance matrix. Right, and so I want to make sure I give you a little taste of how SEM works in terms of its estimation. The way we start is by specifying a model, and thankfully, in JMP Pro, we have a really friendly user interface where we can specify our models directly with path diagrams, rather than having to list complex systems of equations. You simply draw the path diagrams that imply a specific covariance structure. So the diagrams imply a covariance structure for the data, and then during estimation, we try to obtain model estimates that match the sample covariance matrix as closely as possible, given the model-implied constraints. Once we have those estimates, we can plug them into the model-implied covariance matrix and compare those values against the sample covariance matrix, and the difference between them allows us to quantify the fit of the model. So if we have large residuals, by looking at the difference between these two covariance matrices, we know that we have not done a very good job of fitting our model. Alright, so in a nutshell, that's how SEM works, and I'd like to take the next part of the presentation to introduce the data that we're going to use for our demo. I do think it's easier to learn new concepts by getting our hands on some real data, real-world examples.
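To make the "analyzing a covariance structure" idea concrete, here is the textbook maximum-likelihood setup that most SEM software uses. The talk does not spell out JMP Pro's estimator or options, so treat this as a generic sketch of the logic Laura describes, not a statement of the platform's internals.

```latex
% The factor model implies a covariance structure for the p observed variables:
\[
  \Sigma(\theta) = \Lambda\,\Phi\,\Lambda^{\top} + \Theta
\]
% Estimation picks theta so that Sigma(theta) reproduces the sample covariance
% matrix S as closely as possible, e.g. by minimizing the ML discrepancy:
\[
  F_{\mathrm{ML}}(\theta) = \ln\lvert\Sigma(\theta)\rvert
    + \operatorname{tr}\!\bigl(S\,\Sigma(\theta)^{-1}\bigr) - \ln\lvert S\rvert - p
\]
% The minimized discrepancy gives the chi-square test of exact fit, with
% degrees of freedom counted against the covariance matrix being analyzed:
\[
  \chi^{2} = (N-1)\,F_{\mathrm{ML}}(\hat{\theta}), \qquad
  df = \tfrac{p(p+1)}{2} - \text{(number of free parameters)}
\]
```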
So the data that we're going to use actually come from a recently published article in the journal Social Psychological and Personality Science; it was published in the summer of 2020. The authors wanted to answer a very simple question: how do perceived threats of COVID-19 impact well-being and public health behaviors? It's a simple question, except for the fact that perceived threat of COVID-19 is a completely new construct. It's a very abstract idea. What is perceived threat of COVID-19, and how do you measure it? Because this is something that has never been measured before, the authors had to engage in a very careful study where they developed a survey to measure those threats. Developing a survey is not easy; we need to make sure that the questions in the survey are reliable and valid. So they had to go through that process, and we're going to see how they did that in a minute. Now, in their study they found that there are two types of threats they could measure. One they called realistic threats, and those are things that threaten our financial and physical safety. The other type of threat they called symbolic; those are things that threaten our social and cultural identity. It's also important to say this sample was from the United States population. They sampled over 1,000 individuals, and their questions pertain exclusively to the United States population. What we see here is the integrated COVID-19 threat scale, the questionnaire they developed after going through three different studies. They found that those two threats could be measured with a handful of items. They asked their participants to answer how much of a threat, if any, the coronavirus outbreak is for your personal health, the health of the US population as a whole, your personal financial safety, and so on. And for symbolic threat, the questions were how much of a threat the virus is for what it means to be an American, American values and traditions, and the rights and freedoms of the United States population as a whole, and so on. So you can see the differences in what these threats represent. We had access to these data, and we're going to use them to answer very specific questions. First, how do we measure these perceptions of COVID-19 threat? We're going to focus on the two threats they identified. This is going to lead us to talk about confirmatory factor analysis and assessing a measurement model, to make sure we can figure out whether the questions in the survey are, in fact, reliable and valid. Notice we're going to skip over a very important first step, which is exploratory factor analysis; that's something one would do before using SEM. You would run an exploratory factor analysis and then come to SEM to confirm the structure of the previous results. The authors of this article definitely did that, but we're going to focus on the steps we would follow using SEM. The second question is, do perceptions of COVID-19 threat predict well-being markers and public health behaviors? This question is going to lead us to talk about multiple regression and path analysis within SEM. And the last question is: are the effects of each type of threat on the outcomes equal?
And this actually allows us to show a very cool feature of SEM, which involves setting equality constraints in our models and conducting systematic model comparisons to answer these types of questions. Alright, so it's time for the demo, and I already have... let's see... Oops, how do I get out of here? It's not time for questions yet. I just want to exit the screen and I can't seem to do it. Okay, here we go. I already have the data table open in JMP right here. In these data, you can see there are 238 columns; that's because the authors asked a number of different questions of 550 participants in this case; this is one of their three studies. The first ten columns in the data correspond to those 10 questions we saw in their threat scale, and those are the ones we'll use first to do a confirmatory factor analysis. So we're going to click Analyze, go to Multivariate Methods, Structural Equation Models, use those 10 variables, click Model Variables, and then click OK to launch the platform. Notice that on the right-hand side we immediately see a path diagram. That diagram already has all of the features that we discussed earlier: each of the variables is in a rectangle, indicating that they're observed variables, and each of them has a double-headed arrow that starts and ends on itself, representing the variance of each of those variables. Now, if I right-click on the canvas, there's a Show menu, and notice that the means and intercepts are hidden by default. I'm going to click on this just to show you that we do, in fact, have means estimated by default for all of these variables. We're not going to talk about those, so we're going to keep them hidden, but I do think it's important to know that the default model we start with when we launch the platform is one where all of the variables have variances and means estimated. Now, we have a List tab, and if we click on that, we see the exact same information that we have in the diagram, but in list form. All of the parameters are split by type, so we have all of the variances here and all of the means over there. Right, we have a Status tab, which basically tells us about the specific model we have specified right now; it gives us a bunch of useful information about that model. We have details about the data, our sample size, the degrees of freedom, and we also have these identification rules. You can click on them if you want to learn a little bit more about them; it gives you a little description to the right. But what's really helpful to know is that this icon for the Status tab is constantly changing, depending on the changes we make and the specification of the model we have. Oftentimes, if we have an advanced application of SEM, this icon might be yellow, and when we have a bad error, some important mistake, that icon is going to be orange with an X, basically indicating that there's an error. So it can be very useful for identifying mistakes as we are specifying our models. Now, on the left side of the user interface, we see that we can specify the name of our model, which is very helpful for keeping track of our workflow.
And we also have these From and To lists. These lists provide a very useful way to link variables, using either a one-headed arrow or a two-headed arrow. So here, for example, if I want to link these, I can click that button, and very quickly I've drawn a path diagram; it's a very efficient way to specify models. I'm going to click Reset here just to go back to the model we had upon launching the platform, but know that the From and To lists are basically ways in which we can draw the diagrams. Okay, in this case we have all of the observed variables listed here, but I know that we want to use those variables to specify latent variables. Now, the first five variables here are the ones that correspond to the items in that survey for the realistic threat. So I'm going to add a latent variable to the model by going down to this box here, where it says Latent1, and I'm going to change the name to Realistic, because I want these five variables to be the indicators of a realistic threat latent variable. By clicking on this button, I immediately get that latent variable specified. And notice the first loading for this realistic threat latent variable has a 1 on this arrow, which represents the fact that the parameter is fixed to the value of 1. We do this because we need to set the scale of the latent variable. Without this constraint, we would not be able to identify the model, and so by default we place that constraint on the first loading of the latent variable. We could also achieve the same purpose by fixing the variance of the latent variable to 1. Which one we do is really a matter of choice, but as a default, the platform will fix the first loading to 1. Okay, so we have a realistic threat latent variable, and the other five variables here are the ones that correspond to the symbolic threat questions. So I'm going to select those, click here, type Symbolic, and click the plus button to add that symbolic threat. Okay, so we're almost done, but notice that this model here is implying that realistic and symbolic threats are perfectly uncorrelated with each other, and that's a very strong assumption, so we don't want to do that. For the most part, confirmatory factor models allow the latent variables to covary with each other, so I'm going to select them here, and I can click this double-headed arrow to link those two nodes. But I can also do it directly from the path diagram: if I right-click on the latent variable, I can click on Add Covariances, and right there I can add that covariance. So it's a pretty cool way; you can do it with the list, or you can do it directly on the diagram, whatever your choice is. So our model is ready to be estimated. I'm going to change the name to 2-Factor CFA, and we can go ahead and run it. And you can see, very quickly, we obtain our estimates and they're all mapped onto the diagram, which is pretty cool. But before we interpret those results, I want to make sure we focus on this model comparison table. The reason is that this table provides a lot of information about the fit of the model, and we want to make sure the model fits well before we interpret the results. The first thing to notice here is that we have three models in this table, and we only fit one of them.
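Written as equations, the two-factor CFA Laura just specified looks like the sketch below. The indicator numbering and symbols are assumed for illustration (x1 through x5 for the realistic items, x6 through x10 for the symbolic items); the fixed 1s are the scale-setting constraints she describes, and the φ terms are the freely estimated factor variances and covariance.

```latex
% Realistic threat (indicators x1..x5) and symbolic threat (x6..x10), with the
% first loading of each factor fixed to 1 to set the latent variable's scale:
\[
  x_{1} = 1\cdot\xi_{R} + \delta_{1}, \qquad
  x_{j} = \lambda_{j}\,\xi_{R} + \delta_{j} \;\; (j = 2,\dots,5)
\]
\[
  x_{6} = 1\cdot\xi_{S} + \delta_{6}, \qquad
  x_{k} = \lambda_{k}\,\xi_{S} + \delta_{k} \;\; (k = 7,\dots,10)
\]
% The added two-headed arrow frees the covariance between the two factors:
\[
  \operatorname{Var}(\xi_{R}) = \phi_{RR}, \qquad
  \operatorname{Var}(\xi_{S}) = \phi_{SS}, \qquad
  \operatorname{Cov}(\xi_{R},\xi_{S}) = \phi_{RS}
\]
```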
The reason we have three is because the first two models, the unrestricted and independence models, are fit by default upon launching the platform. We fit these models on purpose to provide a baseline for what a really good-fitting model and a really bad-fitting model look like, and we use those as a frame of comparison for our own specified models. Let me be a little more specific. The unrestricted model (I'm going to show you with the path diagram) is one where every variable is allowed to covary with every other variable. Notice that the Chi-square statistic, which is a measure of misfit, is exactly zero, and the reason is that this model fits the data perfectly. Remember, our data here are really the covariance matrix of the data, and we have zero degrees of freedom because we are estimating every possible variance and covariance in the data. So this is the best possible scenario: we have no misfit, but we're also estimating every possible estimate from the data. The other end of the spectrum is a model that fits really badly, and that's what the independence model is. If I show you with the path diagram, our default model, where we only have variances and means for the data, is exactly what the independence model is. That is essentially a model where nothing is covarying with anything else, and you can see the Chi-square statistic for that model is in fact pretty large, because there's a lot of misfit; it's almost 2,000 units, but we do have 45 degrees of freedom because we're estimating very few things from the data. So again, these two models basically provide the two ends of the spectrum: on one end, a really good-fitting model, and on the other, a really poor-fitting model, and we're going to use that information to compare our own model against those. So, if we look at our model, notice the Chi-square statistic is not zero, but it is only 147 units, which is a lot less than 2,000. We have 34 degrees of freedom, so we do have some misfit. When we look at the test for that Chi-square, it is a significant Chi-square statistic, so it suggests that we have statistically significant misfit in the data. However, the Chi-square statistic is influenced by sample size, and in this case we have 550 observations. Usually, when you have 300 or more observations, it's very important to look not only at the Chi-square statistic, but also at some special fit indices that are unique to SEM and allow us to quantify the fit of the model; those are the values over to the right here. The first fit index is the comparative fit index, and that index ranges from zero to one. You can see the unrestricted model has a one; that's the best-fitting model. The independence model has zero, because it's the worst-fitting model. And our model actually has a CFI of .93, about .94. That represents the proportion of improvement over the independence model. Another way to say that is that our model fits about 94% better than the independence model does, so that's actually pretty good. Usually we want CFI values of .9 or higher; the closer to one, the better.
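The comparative fit index has a simple closed form; plugging in the approximate values quoted in the talk reproduces the roughly .94 Laura reports, so this is just the arithmetic behind the table. The exact formula variant JMP Pro reports is not stated in the talk; this is the common definition.

```latex
% Comparative fit index, with the independence model as the baseline B:
\[
  \mathrm{CFI} = 1 -
    \frac{\max(\chi^{2}_{M} - df_{M},\, 0)}
         {\max(\chi^{2}_{M} - df_{M},\; \chi^{2}_{B} - df_{B},\, 0)}
\]
% Using the approximate values quoted in the talk
% (model: chi-square ~147 on 34 df; independence: ~2000 on 45 df):
\[
  \mathrm{CFI} \approx 1 - \frac{147 - 34}{2000 - 45}
              = 1 - \frac{113}{1955} \approx 0.94
\]
```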
Now the root mean square error of approximation is another fit index, but for that one, although it also ranges from zero to one, we want very low values. Notice the unrestricted model has a value of zero and the independence model has .27. We usually want values of .1 or lower for acceptable models, and ours is about .08, which is actually pretty good. We also have confidence intervals for this particular estimate, and you can see those values are also below .1, so this is a good-fitting model. Once we know the model fits our data well, we can go ahead and interpret it. By default, the estimates shown are the unstandardized parameter estimates, but for factor analysis it's much more useful to look at the standardized solution, so I'm going to right-click on the canvas and show the standardized estimates. Now the values here are in a correlational metric, so we want them to be as close to one as possible, because they represent the correlation of the observed variable with the latent variable, and notice that for both realistic and symbolic threat the values are pretty good. We don't want them any lower than about .4, so these values are good. Another thing that is really useful, and unique to JMP Pro, is that any variable that is endogenous, that has predictors pointing at it, is shaded. Notice there's a little bit of gray inside these squares; that shading is proportional to the amount of variance explained by the predictors, so it allows us to see very quickly which variables are having their variance explained well. In this case, these three variables are filled the most with that darker gray, suggesting that the symbolic threat latent variable is doing a pretty good job of explaining the variance of those three observed variables. We also see that the two latent variables are correlated at about .4, which is an interesting finding. There's all sorts of output we could look at in the red triangle menu, but I'm going to focus on one option called Assess Measurement Model. This is where we find a lot of statistics that quantify the reliability and the validity of our constructs. If we click there, we get this nice little dashboard. The first information is indicator reliability, which quantifies the reliability of each of the questions in the survey, and we provide a plot showing all of these values. Notice we have a red line here for a threshold of what we hope to have; we want at least that much reliability in each of our items. Now, these types of thresholds need to be interpreted with our own critical thinking, because this particular item, for example, is below the threshold but still pretty close to it, so we're not going to throw it out. We can still consider it relatively reliable, and it's still a good indicator of this latent variable. Again, just interpret the thresholds here with caution. But one thing that is apparent from this plot is that the symbolic threat latent variable appears to have more reliable indicators than the realistic threat one.
They're both pretty good, though; we're simply doing a better job of measuring the symbolic one. The values to the right are reliability coefficients for the composites. In other words, they quantify the reliability of the latent variable as a whole, and there are two types of reliability. I'm not going to get into the details of their differences, but notice these values range from zero to one and we want them to be as close to one as possible. We also provide plots with a threshold indicating the minimum desired amount of reliability, and in this case both realistic and symbolic threat have good reliabilities. The other visualization we have here is a construct validity matrix. Keep in mind that when you're trying to measure something you don't observe directly, it's very hard to figure out whether it really is what you intended to measure. Are you really measuring what you wanted? That's what this information allows us to determine. The visualization portrays the upper triangle of this matrix, and let me briefly explain what the values represent. Below the diagonal we have the correlation between the latent variables; that's about .4. The diagonal entries represent the average amount of variance that the latent variables extract from their indicators, and you want those values to be as high as possible. Above the diagonal we have the squared correlation, in other words the amount of overlapping variance between the latent variables. The key to interpreting this matrix is that we want the values on the diagonal to be higher than the values above and to the right of the diagonal, and the visualization makes it very easy to see that we do, in fact, have larger values on the diagonal. That is good evidence of construct validity. So everything here suggests that realistic and symbolic threats are, in fact, latent variables that are valid and reliable, and the survey seems to do a good job of measuring both of them. A next step might be to grab the five questions that represent realistic threat and create an average across them, so that we have one measure representing realistic threat, and to do the same for the other five variables that represent symbolic threat. Just for illustration, I have already created those variables, so let's go to Analyze, Multivariate Methods, Structural Equation Models, and I'm going to look for those averaged variables. Realistic and symbolic threat here are the averages across the columns for each of those sets of items. I'm going to model those in addition to a measure of anxiety, a measure of negative affect (negative emotions), and, lastly, a measure of adherence to public health behaviors, and we're going to click OK to launch the platform. From the diagram buttons here we can go into a customize menu and change all sorts of aspects of the diagram, which is really great. Right now I'm just going to increase the width of these nodes so that we can read what's inside them.
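For readers who want to see the kind of check the construct validity matrix above is doing, here is a minimal sketch with made-up standardized loadings; these numbers are placeholders, not the survey's actual estimates:

```python
import numpy as np

# Hypothetical standardized loadings for the two factors (placeholders only).
loadings = {
    "Realistic": np.array([0.65, 0.70, 0.72, 0.60, 0.68]),
    "Symbolic":  np.array([0.75, 0.80, 0.78, 0.70, 0.74]),
}
r_latent = 0.40                 # correlation between the two latent variables
shared = r_latent ** 2          # squared correlation: overlapping variance

# Average variance extracted (AVE): mean squared standardized loading per factor.
for name, lam in loadings.items():
    ave = float(np.mean(lam ** 2))
    verdict = "supports validity" if ave > shared else "warrants a closer look"
    print(f"{name}: AVE = {ave:.2f} vs shared variance {shared:.2f} -> {verdict}")
```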
And what I'm going to do is fit a model where both realistic and symbolic threats are predictors of these interesting outcomes; they're markers for anxiety, negative affect, and also the public health behavior. So we're going to link these with one-headed arrows to specify the predictions, and we're going to investigate whether these effects are, in fact, significant. Now notice I'm not fully done specifying this model yet, because in this particular model there's no connection between realistic and symbolic threats, and it would be a very strong constraint to say that those two things aren't covarying at all. So we always want to make sure we include covariances between our predictors and also between the residual variances of our outcomes. We could specify those directly from the From list; in this case I'm going to use Add Covariances from this menu to link realistic and symbolic threats, and I'm going to use the lists to add covariances between the residuals of these outcomes. Now we have a fully and correctly specified model. This is often called path analysis, but it's basically a simultaneous collection of regression models, and so we're going to run it. Notice from the model comparison table that our model has zero degrees of freedom and the Chi-square is zero. Having zero degrees of freedom means our model is not testable, because we've extracted all the information we could have extracted from our data. That's essentially what we do when we fit a regression model, so there's no problem with that; just know that you can't interpret this Chi-square and say, oh, my model fits so well. It fits perfectly because you've extracted everything you could have extracted from the data. So anyhow, it's just like a regression model. Alright, if we look at the results, there's all sorts of important information we could interpret, but I'm going to focus on a couple of things. First, notice our diagrams are fully interactive, which is really cool, and I'm just moving things around to focus on a couple of effects. I'm going to hide the variances and covariances in this model so we can focus on the results from the regression models, from the path analysis. Notice here that realistic and symbolic threats both have a positive effect on anxiety. That's really interesting, and the arrows are solid because the effects, as you can see in the table of parameter estimates, are statistically significant; if they were not significant, the arrows would show up as dashed. So the path diagram conveys a lot of information. We have positive, significant effects on anxiety, and that's interesting, of course, but so far all we've done is fit regression models in a simultaneous way. In fact, if we go back to the data table, I have a script here from Fit Model where I use that same anxiety outcome and the same two predictors, realistic and symbolic threats, and simply estimate a multiple regression model. The reason I wanted to show you this is that the parameter estimates here are exactly the same values as we obtained from SEM, and that's no surprise, because, in fact, we are doing a simultaneous regression.
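To illustrate that last point, the separate-regression side of the comparison can be sketched with ordinary least squares on simulated stand-in data; the variable names mirror the demo but the numbers are invented, and fitting the saturated path model to the same data would return the same slopes:

```python
import numpy as np

# Simulated stand-in data; not the survey data from the demo.
rng = np.random.default_rng(1)
n = 550
realistic = rng.normal(size=n)
symbolic = 0.4 * realistic + rng.normal(scale=0.9, size=n)
anxiety = 0.5 * realistic + 0.2 * symbolic + rng.normal(size=n)

# Multiple regression of anxiety on the two threat scores, as Fit Model would do.
X = np.column_stack([np.ones(n), realistic, symbolic])
beta, *_ = np.linalg.lstsq(X, anxiety, rcond=None)
print(beta)   # intercept and the two regression effects
```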
So up until this point you might wonder what SEM buys you, because technically you could run three separate Fit Model analyses with these three outcomes and obtain the same information we've obtained so far. However, if you have missing data, you still want to use SEM, because then all of the data are used rather than dropping rows. And if you use SEM, you're also able to answer additional questions that are pretty interesting. In this case, we might wonder whether the effect that realistic threat has on anxiety is statistically greater than the effect that symbolic threat has on anxiety. So far we know that they're both significantly different from zero, but are they significantly different from each other? That is a question we can answer with the SEM platform. Going back to our model specification panel, we can select both of those effects, and up here in the action buttons we have the Set Equal button. If we press that, notice we get a little label implying that both of these effects are set to be equal; they're going to be estimated as one. If we change the name to Equal Effects and run this model, we obtain the fit statistics for that model with the equality constraint, and notice we've now gained one degree of freedom, so all of a sudden we have a testable model. We can use the model comparison table to select the two models we want to compare against each other and then click Compare Selected Models. Now we obtain a Chi-square difference test, so we're able to compare the models statistically and see how much misfit our equality constraint induces. Here we can see it's about 8.6 units in the Chi-square metric, and the p-value for that is, in fact, significant, so this suggests that setting the equality constraint induces a significant amount of misfit in the model. And because we know the Chi-square is influenced by sample size, we also have the differences in the fit indices we discussed. For the CFI, we usually don't want this to increase by more than .01; in fact, .01 or higher is not so good. And for the RMSEA, you don't want this to be any higher than .1. So all of the evidence here suggests that setting the equality constraint leads to a significantly worse-fitting model. In other words, going back to the model that fit best, we're now able to say, based on that Chi-square difference test, that the effect realistic threat has on anxiety is significantly higher than the effect symbolic threat has on anxiety. Those types of questions could be addressed with other parts of this model as well; SEM affords a lot of flexibility by allowing us to test the equality of different effects within the model. Okay, in the interest of time I'm going to close this out, but I do want to show you one more thing. So far we've seen a confirmatory factor model and a path analysis, where we're doing a multivariate regression analysis, but we can actually use both of those concepts in one model. I have a script that I've already saved in my data table, and you can see that what I'm doing in this model is actually estimating latent variables.
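The p-value for the chi-square difference test described above can be reproduced directly from the reported difference of about 8.6 on 1 degree of freedom; a minimal sketch:

```python
from scipy import stats

delta_chisq = 8.6   # increase in misfit from the equality constraint (approximate)
delta_df = 1        # one fewer free parameter

p_value = stats.chi2.sf(delta_chisq, delta_df)
print(round(p_value, 4))   # roughly 0.003, so the constraint significantly worsens fit
```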
I'm modeling latent variables for both symbolic and realistic threats, using the original items from the survey, from the questionnaire. By doing this, instead of creating averages across the columns, I'm actually modeling the latent variables, and that allows me to obtain regression effects amongst latent variables that are unbiased and unattenuated by measurement error, because I'm obtaining a more valid, purer measure of symbolic and realistic threats. So here we are estimating sequential relations, and my model is a lot more complex. I'm not going to get into the details of the model, but just know that by modeling latent variables and looking at the relations between them, we're really able to get the best functionality from SEM, because the associations between those latent variables are going to be better estimated. I've already run this model, and you can see the results down here. Notice there are a few edges with dashed arrows, indicating that those effects are not significant. We also see how powerful the shading visualization is: we're able to explain some proportion of the variance of adhering to public health behaviors, and it seems like we're doing a better job of explaining variance in positive affect than in any of the other outcomes here. So again, it's the best of both worlds, being able to specify our latent variables and also model them directly in the platform. With that I'm going to stop the demo, but I will point out that on the JMP Community website we have supplementary materials that James Koepfler has created. They are really great materials with a lot of tips on how to interpret our models and how to use the model comparison table, basically all the notes you would have wanted to take during this presentation; you can get them in the supplementary materials. And with that I am ready to open it up for questions.
Laura Lancaster, JMP Principal Research Statistician Developer, SAS Jeremy Ash, JMP Analytics Software Tester, SAS Chris Gotwalt, JMP Director of Statistical Research and Development, SAS   Uncontrolled model extrapolation leads to two serious kinds of errors: (1) the model may be completely invalid far from the data, and (2) the combinations of variable values may not be physically realizable. Using the Profiler to optimize models that are fit to observational data can lead to extrapolated solutions that are of no practical use, without any warning. JMP Pro 16 introduces extrapolation control into many predictive modeling platforms and the Profiler platform itself. This new feature in the Prediction Profiler alerts the user to possible extrapolation or completely avoids drawing extrapolated points where the model may not be valid. Additionally, the user can perform optimization over a constrained region that avoids extrapolation. In this presentation we discuss the motivation and usefulness of extrapolation control, demonstrate how it can be easily used in JMP, and describe details of our methods.     Auto-generated transcript...   Speaker Transcript Hi, I'm Chris Gotwalt. My co-presenters, Laura Lancaster and Jeremy Ash, and I are presenting a useful new JMP Pro capability called Extrapolation Control. Almost any model that you would ever want to predict with has a range of applicability, a region of the input space where the predictions are considered to be reliable enough. Outside that region, we begin to extrapolate the model to points far from the data used to fit it, and using the predictions from the model at those points could lead to completely unreliable results. There are two primary sources of extrapolation: statistical extrapolation and domain-based extrapolation. Both types are covered by the new feature. Statistical extrapolation occurs when one is attempting to predict using a model at an x that isn't close to the values used to train that model. Domain-based extrapolation happens when you try to evaluate at an x that is impossible due to scientific or engineering constraints. The example here illustrates both kinds of extrapolation at once. Here we see a profiler from a model of a metallurgy production process. The prediction readout says -2.96 with no indication that we're evaluating at a combination of temperature and pressure that is impossible, in a domain sense, for this machine to attain. We also have statistical extrapolation, as the point is far from the data used to fit the model, as seen in the scatterplot of the training data on the right. In JMP Pro 16, Jeremy, Laura and I have collaborated to add a new capability that can give a warning when the profiler thinks you might be extrapolating; or, if you turn extrapolation control on, it will restrict the set of points you see to only those that it doesn't think are extrapolating. We have two types of extrapolation control. One is based on the concept of leverage and uses a least squares model; this first type is only available in the Pro version of Fit Model least squares. The other type we call general machine learning extrapolation control, and it is available in the Profiler platform and several of the most common machine learning platforms in JMP Pro. Upon request, we could even add it to more. Least squares extrapolation control uses the concept of leverage, which is like a scaled version of the prediction variance.
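In the usual least squares notation (added here for reference, not quoted from the slides), the leverage of a new point $x_0$ with training model matrix $X$ is

$$h(x_0) = x_0^\top \left(X^\top X\right)^{-1} x_0,$$

which is the prediction variance at $x_0$ divided by the error variance $\sigma^2$.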
Least squares extrapolation control is model-based, and so it uses information about the main effects, interactions, and higher-order terms to determine extrapolation. For the general machine learning extrapolation control case, we had to come up with our own approach. We wanted a method that would be robust to missing values and linear dependencies, fast to compute, and able to handle mixtures of continuous and categorical input variables, and we also explicitly wanted to separate the extrapolation model from the model used to fit the data. So when we have general extrapolation control turned on, there's only one supervised model, the one that fits the input variables to the responses we see in the profiler traces. The profiler comes up with a quick-and-dirty unsupervised model to describe the training set Xs, and this unsupervised model is used behind the scenes by the profiler to determine the extrapolation control constraint. (I'm having to switch because PowerPoint and my camera aren't getting along right now for some reason.) We know that risky extrapolations are being made every day by people working in data science, and we are confident that the use of extrapolations leads to poor predictions and ultimately to poor business outcomes. Extrapolation control places guardrails on model predictions and will lead to quantifiably better decisions by JMP Pro users. When users see an extrapolation occurring, they must decide whether the prediction should be used or not, based on their domain knowledge and familiarity with the problem at hand. If you start seeing extrapolation control warnings quite often, it is likely the end of the life cycle for that model and time to refit it to new data, because the distribution of the inputs has shifted away from that of the training data. We are honestly quite surprised and alarmed that the need for identifying extrapolation isn't better appreciated by the data science community, and we have made controlling extrapolation as easy and automatic as possible. Laura, who developed it in JMP Pro, will be demonstrating the option next. Then Jeremy, who did a lot of the research on our team, will go into the math details and the statistical motivation for the approach. Hello, my name is Laura Lancaster and I'm here to do a demo of the extrapolation control that was added to JMP Pro 16. I want to start with a fairly simple example using the Fit Model least squares platform. I'm going to use some data that may be familiar, the Fitness data in the sample data, with Oxygen Uptake as my response and Run Time, Run Pulse and Max Pulse as my predictors. And I want to reiterate that in Fit Model, fit least squares, the extrapolation metric used is leverage. So let's go ahead and switch to JMP. Now I have the Fitness data open in JMP, and I have a script saved to the data table to automatically launch my fit least squares model. I'm going to go ahead and run that script; it launches the least squares platform, and I have the profiler automatically open. We can see that the profiler looks like it always has in the past, where the factor boundaries are defined by the range of each factor individually, giving us rectangular bound constraints. When I change the factor settings, because of these bound constraints, it can be really hard to tell if you're moving far outside the correlation structure of the data. And this is why we wanted to add extrapolation control.
So this has been added to several of the platforms in JMP Pro 16, including fit least squares, and to get to it you go to the Profiler menu. If I look here, I see there's a new option called Extrapolation Control. It's set to Off by default, but I can turn it to either On or Warning On. If I turn it to On, notice that it restricts my profile traces to only go to values where I'm not extrapolating. If I turn it to Warning On, I see the full profile traces, but I get a warning when I go to a region that would be considered extrapolation. I can also turn on extrapolation details, which I find really helpful, and that gives me a lot more information. First of all, it tells me that the metric I'm using to define extrapolation is leverage, which is the case in the fit least squares platform. The threshold used by default is maximum leverage, but this is something I can change, and I'll show you that in a minute. I can also see the extrapolation metric for my current settings; it's this number right here, which will change as I change my factor settings. Anytime this number is greater than the threshold, I get the warning that I might be extrapolating; if it goes below, I no longer get that warning. The threshold itself does not change unless I adjust it in the menu, so let me go ahead and do that right now. I'm going to go to the menu and choose Set Threshold Criterion. In fit least squares you have two options for the threshold: initially it's set to maximum leverage, which keeps you within the convex hull of the data, or you can switch to a multiplier times the average leverage, which is the number of model terms over the number of observations. I want to switch to that threshold. The multiplier is set to 3 by default, so this is 3 times the average leverage, and I click OK. Notice that my threshold changes; it actually got smaller, so this is a more conservative definition of extrapolation. I'm going to turn extrapolation control back to On to restrict my profile traces, and now I can only go to regions where I'm within 3 times the average leverage. We have also implemented optimization that obeys the extrapolation constraints, so if I turn on Set Desirabilities and do the optimization, I get an optimal value that satisfies the extrapolation constraint; notice that this metric is less than or equal to the threshold. Now let's go to my next slide, which compares, in a scatterplot matrix, the optimal value with extrapolation control turned on and with it turned off. This is a scatterplot matrix I created with JMP, and it shows the original predictor variable data as well as the predictor values for the optimal solution using no extrapolation control, in blue, and the optimal solution using extrapolation control, in red. Notice how the unconstrained solution in blue violates the correlation structure of the original data for Run Pulse and Max Pulse, thus increasing the uncertainty of the prediction, whereas the optimal solution that used extrapolation control is much more in line with the original data. Now let's look at an example using the more general extrapolation control method, which we refer to as the regularized T squared method.
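Before moving on to the regularized T squared, here is a minimal sketch of the two leverage thresholds just demoed, assuming a training model matrix X with n rows and p model terms is already in hand; this is illustrative code, not the JMP implementation:

```python
import numpy as np

def leverage(X: np.ndarray, x_new: np.ndarray) -> float:
    """Leverage of a new prediction point given the training model matrix X."""
    xtx_inv = np.linalg.inv(X.T @ X)
    return float(x_new @ xtx_inv @ x_new)

def leverage_thresholds(X: np.ndarray, multiplier: float = 3.0):
    """Return (max leverage, multiplier times the average leverage)."""
    n, p = X.shape
    hat_diag = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    max_leverage = float(hat_diag.max())   # convex-hull style default
    avg_rule = multiplier * p / n          # average leverage equals p / n
    return max_leverage, avg_rule
```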
As Chris mentioned earlier, we developed the regularized T squared method for models other than least squares models. So we're going to look at a neural model for the Diabetes data, which is also in the sample data. The response is a measure of disease progression, and the predictors are the baseline variables. Once again, the extrapolation metric used for this example is the regularized T squared that Jeremy will describe in more detail in a few minutes. I have the Diabetes data open in JMP and a script saved of my neural model fit, so I'm going to go ahead and run that script. It launches the neural platform, and notice that I am using the random holdback validation method. I just want to note that anytime you use a validation method, the extrapolation control is based only on the training data, not your validation or test data. I have the profiler open, and you can see that it's showing the full traces; extrapolation control is not turned on. Let's go ahead and turn it on, and I'm also going to turn on the details. You can see that the traces have been restricted and that the metric is the regularized T squared. The threshold is 3 times the standard deviation of the sample regularized T squared; Jeremy is going to talk more about what all of that means in a few minutes. I just want to mention that when we're using the regularized T squared method there's only one choice of threshold, but you can adjust the multiplier: if you go to Extrapolation Control, Set Threshold, you can adjust this multiplier, but I'm going to leave it at 3. Now I want to run optimization using extrapolation control, so I'm just going to maximize and remember, and now I have an optimal solution with extrapolation control turned on. So let's look at our scatterplot matrix, just like before, with the original data as well as the optimal values with and without extrapolation control. This is a scatterplot matrix of the Diabetes data that I created in JMP. It has the original predictor values, the optimal solution using extrapolation control in red, and the optimal solution without extrapolation control in blue. You can see that the red dots appear to be much more within the correlation structure of the original data than the blue, and that's particularly true when you look at LDL versus total cholesterol. Now let's look at an example using the profiler under the Graph menu, which I'll call the graph profiler. It also uses the regularized T squared method, and it allows us to use extrapolation control on any type of model that can be created and saved as a JSL formula. It also allows us to have extrapolation control on more than one model at a time. So let's look at an example for a company that uses powder metallurgy technology to produce steel drive shafts for the automotive industry. They want to find optimal settings for their production that minimize shrinkage and also minimize failures due to bad surface conditions. We have two responses: shrinkage, which is continuous and for which we fit a least squares model, and surface condition, which is pass/fail and for which we fit a nominal logistic model. Our predictor variables are some key process variables in production, and once again the extrapolation metric is the regularized T squared.
So I have the powder metallurgy data open in JMP, and I've already fit a least squares model for my shrinkage response and a nominal logistic model for the surface condition pass/fail response, and I've saved the prediction formulas to the data table so that they're ready to be used in the graph profiler. If I go to the Graph menu, Profiler, I can load up the prediction formula for shrinkage and the prediction formulas for the surface condition, and click OK. Now I have both of my models launched in the graph profiler, and before I turn on extrapolation control you can see that I have the full profile traces. Once I turn on extrapolation control you can see that the traces shrink a bit, and I'm also going to turn on the details, just to show that indeed I am using the regularized T squared here. What I really want to do is find the optimal conditions that minimize shrinkage and minimize failures with extrapolation control on; I want to make sure I'm not extrapolating, so that I find a useful solution. Before I can do the optimization, I need to set my desirabilities. They're already correct for shrinkage, but I need to set them for the surface condition: I'm going to maximize passes and minimize failures. OK. Now I should be able to do the optimization with extrapolation control on, so I'll do maximize and remember, and now I have my optimal solution with extrapolation control on. Let's look once again at a scatterplot matrix of the original data, along with the solution with extrapolation control on and the solution with extrapolation control off. This is a scatterplot matrix of the powder metallurgy data that I created in JMP. It also has the optimal solution with extrapolation control as a red dot and the optimal solution with no extrapolation control as a blue dot, and once again you can see that when we don't use extrapolation control, the optimal solution is pretty far outside the correlation structure of the data. We can especially see that with ratio versus compaction pressure. So now I want to hand the presentation over to Jeremy to go into a lot more detail about our methods. Hi, so here are a number of goals for extrapolation control that we laid out at the beginning of the project. We needed an extrapolation metric that could be computed quickly with a large number of observations and variables, and we needed a quick way to assess whether the metric indicated extrapolation or not; this was to maintain the interactivity of the profiler traces, and we needed it to perform optimization. We wanted to support the various variable types available in the profiler, which are essentially continuous, categorical, and ordinal. We wanted to utilize observations with missing cells, because some modeling methods include these observations in the model fit. We wanted a method that was robust to linear dependencies in the data, which occur when the number of variables is larger than the number of observations, for example. And we wanted something that was easy to automate without the need for a lot of user input. For least squares models, we landed on leverage, which is often used to identify outliers in linear models. The leverage for a new prediction point is computed according to this formula. There are many interpretations of leverage.
One interpretation is that it's the multivariate distance of a prediction point from the center of the training data. Another interpretation is that it's a scaled prediction variance, so as a prediction point moves further away from the center of the data, the uncertainty of the prediction increases. We use two common thresholds from the statistical literature for determining whether this distance is too large. The first is maximum leverage; prediction points beyond this threshold are outside the convex hull of the training data. The second is 3 times the average of the leverages, which can be shown to be equivalent to three times the number of model terms divided by the number of observations. And as Laura described earlier, you can change the multiplier of these thresholds. Finally, when desirabilities are being optimized, the extrapolation constraint is a nonlinear constraint, and previously the profiler allowed constrained optimization only with linear constraints. This type of optimization is more challenging, so Laura implemented a genetic algorithm. If you aren't familiar with these, genetic algorithms use the principles of evolution to optimize complicated cost functions. Next, I'll talk about the approach we used to generalize extrapolation control to models other than linear models. When you're constructing a predictive model in JMP, you start with a set of predictor variables and a set of response variables; some supervised model is trained, and then a profiler can be used to visualize the model surface. There are numerous variations of the profiler in JMP: you can use the profiler internally in modeling platforms, you can output prediction formulas and build a profiler for multiple models, and, as Laura demonstrated, you can construct profilers for ensemble models. We wanted an extrapolation control method that would generalize to all these scenarios, so instead of tying our method to a specific model, we use an unsupervised approach, and we only flag a prediction point as extrapolation if it's far outside where the data are concentrated in the predictor space. This allows us to be consistent across profilers, so that our extrapolation control method will plug into any profiler. The multivariate distance interpretation of leverage suggested Hotelling's T squared as a distance for general extrapolation control; in fact, some algebraic manipulation will show that Hotelling's T squared is just leverage shifted and scaled. This figure shows how Hotelling's T squared measures which ellipse an observation lies on, where the ellipses are centered at the mean of the data and the shape is defined by the covariance matrix. Since we're no longer in linear models, this metric doesn't have the same connection to prediction variance, so instead of relying on the thresholds used for linear models, we make some distributional assumptions to determine whether the T squared for a prediction point should be considered extrapolation. Here I'm showing the formula for Hotelling's T squared; the mean and covariance matrix are estimated using the training data for the model. If P is less than N, where P is the number of predictors and N is the number of observations, and if the predictors are multivariate normal, then T squared for a prediction point has an F distribution. However, we wanted a method that generalizes to data sets with complicated data types, like a mix of continuous and categorical variables, data sets where P is larger than N, and data sets with missing values.
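The T squared distance referred to above, in notation added here, with training mean $\bar{x}$ and covariance estimate $S$:

$$T^2(x_0) = (x_0 - \bar{x})^\top S^{-1} (x_0 - \bar{x}).$$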
So instead of working out the distributions analytically in each case, we used a simple, conservative control limit that we found works well in practice: a three-sigma control limit using the empirical distribution of T squared from the training data. And, as Laura mentioned, you can tune this multiplier. One complication is that when P is larger than N, Hotelling's T squared is undefined; there are too many parameters in the covariance matrix to estimate with the available data, and this often occurs in typical use cases for extrapolation control, like partial least squares. So we decided on a novel approach to computing Hotelling's T squared that deals with these cases, and we're calling it a regularized T squared. To compute the covariance matrix we use a regularized estimator originally developed by Schafer and Strimmer for high-dimensional genomics data. It's just a weighted combination of the full sample covariance matrix, which is U here, and a constrained target matrix, which is D. For the lambda weight parameter, Schafer and Strimmer derived an analytical expression that minimizes the MSE of the estimator asymptotically. Schafer and Strimmer proposed several possible target matrices; the one we chose was a diagonal matrix with the sample variances of the predictor variables on the diagonal. This target matrix has a number of advantages for extrapolation control. First, we don't assume any correlation structure between the variables before seeing the data, which works well as a general prior. Also, when there's little data to estimate the covariance matrix, either due to small N or a large fraction missing, the elliptical constraint is expanded by a large weight on the diagonal matrix, and this results in a more conservative test for extrapolation control. We found this was necessary to obtain reasonable control of the false positive rate. To put this more simply, when there's limited training data, the regularized T squared is less likely to label predictions as extrapolation, which is what you want, because you're more likely to observe covariances by chance. We have some simulation results demonstrating these details, but I don't have time to go into all of that; instead, on the Community web page we put a link to a paper on arXiv, and we plan to submit it to the Journal of Computational and Graphical Statistics. This next slide shows some other important details we needed to consider. We needed to figure out how to deal with categorical variables; we simply convert them into indicator-coded dummy variables, which is comparable to a multiple correspondence analysis. Another complication is how to compute Hotelling's T squared when there's missing data. Several JMP predictive modeling platforms use observations with missing data to train their models, including Naive Bayes and Bootstrap Forest. These formulas show the pairwise deletion method we used to estimate the covariance matrix. It's more common to use row-wise deletion, which means all observations with missing values are deleted before computing the covariance matrix; that's simplest, but it can result in throwing out useful data if the sample size of the training data is small. With pairwise deletion, observations are deleted only if there are missing values in the pair of variables used to compute the corresponding entry, and that's what these formulas show. It seems like a simple thing to do.
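A minimal sketch of the ingredients just described: pairwise-deletion covariance, shrinkage toward a diagonal target, and a three-sigma style limit on the training T squared values. The lambda weight below is a fixed placeholder rather than the Schafer-Strimmer analytical estimate, and this is illustrative code, not the JMP Pro implementation:

```python
import numpy as np
import pandas as pd

def regularized_t2(train: pd.DataFrame, x_new: np.ndarray, lam: float = 0.3) -> float:
    """Regularized T squared of a new point relative to the training predictors."""
    U = train.cov().to_numpy()          # pandas uses pairwise deletion for missing cells
    D = np.diag(np.diag(U))             # diagonal target: sample variances only
    S = lam * D + (1.0 - lam) * U       # shrink the full covariance toward the target
    diff = x_new - train.mean().to_numpy()
    return float(diff @ np.linalg.inv(S) @ diff)

def control_limit(t2_train: np.ndarray, k: float = 3.0) -> float:
    """One reading of the three-sigma limit on the empirical training T squared values."""
    return float(t2_train.mean() + k * t2_train.std())
```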
Pairwise deletion just uses all the data that's available, but it can actually lead to a host of problems, because different observations are used to compute each entry. This can cause weird things to happen, like covariance matrices with negative eigenvalues, which is something we had to deal with. Here are a few advantages of the regularized T squared that we found when comparing it to other methods in our evaluations. One is that the regularization works the way regularization normally works: it strikes a balance between overfitting the training data and over-biasing the estimator, which makes the estimator more robust to noise and model misspecification. Next, Schafer and Strimmer showed in their paper that regularization results in a more accurate estimator in high-dimensional settings. This helps with the curse of dimensionality, which plagues most distance-based methods for extrapolation control. Also, the fields that have developed methodology for extrapolation control often have both high-dimensional data and highly correlated predictors; for example, in cheminformatics and chemometrics the chemical features are often highly correlated. Extrapolation control is often used in combination with PCA and PLS models, where T squared and DModX are used to detect violations of correlation structure; this is similar to what we do in the model-driven multivariate control chart. Since this is a common use case, we wanted an option that didn't deviate too far from these methods. Our regularized T squared provides the same type of extrapolation control, but it doesn't require a projection step, which has some advantages: we found that this allows us to better generalize to other types of predictive models. Also, in our evaluations we observed that if a linear projection doesn't work well for your data, for example when you have nonlinear relationships between predictors, the errors can inflate the control limits of projection-based methods, which leads to poor protection against extrapolation; our approach is more robust to this. Another important point is that we found a single extrapolation metric was much simpler to use and interpret. And here is a quick summary of the features of extrapolation control. The method provides better visualization of feasible regions in high-dimensional models in the profiler. A new genetic algorithm has been implemented for flexible constrained optimization. Our regularized T squared handles messy observational data, cases like P larger than N, and continuous and categorical variables. The method is available in most of the predictive modeling platforms in JMP Pro 16 and supports many of their idiosyncrasies. It's also available in the profiler under Graph, which really opens up its utility, because you can operate on any prediction formula. As a future direction, we're considering implementing a K-nearest-neighbor-based constraint that would go beyond the current correlation structure constraint. Often predictors are generated by multiple distributions, resulting in clustering in the predictor space, and a K-nearest-neighbors-based approach would enable us to control extrapolation between clusters. So thanks to everyone who tuned in to watch this, and here are our emails if you have any further questions.
Ron Kenett, Chairman, KPA Group and Samuel Neaman Institute, Technion Christopher Gotwalt, JMP Director, Statistical R&D, SAS   Data analysis – from designed experiments and generalized regression to machine learning – is being deployed at an accelerating rate. At the same time, concerns with reproducibility of findings and flawed p-value interpretation indicate that well-intentioned statistical analyses can lead to mistaken conclusions and bad decisions. For a data analysis project to properly fulfill its goals, one must assess the scope and strength of the conclusions derived from the data and tools available. This focus on statistical strategy requires a framework that isolates the components of the project: the goals, data collection procedure, data properties, the analysis provided, etc. The InfoQ Framework provides structured procedures for making this assessment. Moreover, it is easy to operationalize InfoQ in JMP. In this presentation we provide an overview of InfoQ along with use case studies, drawing from consumer research and pharmaceutical manufacturing, to illustrate how JMP can be used to make an InfoQ assessment, highlighting situations of both high and low InfoQ. We also give tips showing how JMP can be used to increase information quality by enhanced study design, without necessarily acquiring more data. This talk is aimed at statisticians, machine learning experts and data scientists whose job it is to turn numbers into information.     Auto-generated transcript...   Speaker Transcript   Hello. My name is Ron Kenett. This is a joint talk with Chris Gotwalt, and we basically have two messages that should come out of the talk. One is that we should really be concerned about information and information quality. People tend to talk about data and data quality, but data is really not the issue. We are statisticians, we are data scientists; we turn numbers, data, into information, so our goal should be to make sure that we generate high-quality information. The other message is that JMP can help you achieve that, and in surprisingly effective ways. So by combining Chris's expertise with an introduction to information quality, we hope these two messages will come across clearly. If I had to summarize what it is that we want to talk about, after all, it is all about information quality. I gave a talk at the Summit in Prague four years ago, and that talk was generic; it covered the journey from quality by design to information quality. In this talk we focus on how this can be done with JMP, so it is more detailed and technical than the general talk I gave in Prague. You can watch that talk; there's a link listed here, and you can find it on the JMP Community. So we're going to talk about information quality, which is the potential of a specific data set to achieve a specific goal with a given empirical method. In that definition we have three components: a certain data set, the data; the goal of the analysis, what it is that we want to achieve; and how we will do it, that is, with what methods we're going to generate information. That potential is assessed with a utility function.
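Written out, with $g$ the goal, $X$ the data, $f$ the empirical analysis method, and $U$ the utility, the definition just described is

$$\mathrm{InfoQ}(f, X, g) = U\!\left(f(X \mid g)\right).$$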
And I will begin with an introduction to information quality, and then Chris will take over, discuss the case study, and show you how to conduct an information quality assessment. Eventually this should answer the question of how JMP supports InfoQ; those are the take-away points from the talk. The setup for this is that we encourage what I call a lifecycle view of statistics. In other words, not just data analysis: we should be part of the problem elicitation phase and the goal formulation phase, which deserves a discussion. We should obviously be involved in the data collection scheme, whether it's through experiments, surveys, or observational data. We should also take time for the formulation of findings, and not just pull out printed reports of regression coefficient estimates and their significance; we should discuss what the findings are. Operationalization of findings means asking, OK, what can we do with these findings, and what are their implications? This needs to be communicated to the right people in the right way, and eventually we should do an impact assessment to figure out: we did all this; what has been the added value of our work? I talked about this lifecycle view of statistics a few years ago; it is the prerequisite, the perspective, for what I'm going to talk about. So, as I mentioned, information quality is the potential of a particular data set to achieve a particular goal using given empirical analysis methods. This is identified through four components: the goal, the data, the analysis method, and the utility measure. In a mathematical expression, the utility of applying f to X, conditioned on the goal, is how we define InfoQ, information quality. This was published in the Royal Statistical Society Series A in 2013 with eight discussants, so it was amply discussed; some people thought it was fantastic and some had a lot of critique of the idea, so it is a wider-scope consideration of what statistics is about. In 2016, Galit Shmueli and I also wrote a book called Information Quality, and in that context we did what is called deconstruction. David Hand has a paper called Deconstruction of Statistics; this is the deconstruction of information quality into eight dimensions. I will cover these eight dimensions; that's my part of the talk, and then Chris will show how this is implemented in a specific case study. Another aspect that relates to this is another, more recent book of mine, from a year ago, titled The Real Work of Data Science, where we talk about the role of data scientists in organizations; in that context we emphasized the need for the data scientist to be involved in generating information quality that meets the goals of the organization. So let me cover the eight dimensions; that's my intro. The first one is data resolution. We have a goal: we would like to know the level of flu in the country or area where we live, because that will impact our decision on whether to go to the park, where we could meet people, or to a jazz concert. And that concert is tomorrow.
If we look up the CDC data on the level of flu, that data is updated weekly, so we could get the red line in the graph in front of you; data from a few days ago, maybe good enough for our goal, maybe not. Google Flu Trends, which is based on searches related to flu, is updated continuously, online, so it will probably give us better information. For our goal, the blue line, the Google Flu Trends indicator, is probably more appropriate. The second dimension is data structure. To meet our goal we're going to look at data, and we should identify the data sources and the structure of those sources. Some data could be text, some could be video, and some could be, for example, a network of recommendations; this is an Amazon picture showing how, if you look at a book, you get some other books recommended, and if you go to those other books, you get more recommendations. So the data structure can come in all sorts of shapes and forms: text, functional data, images. We are not confined to what we used to call data, which is what you find in an Excel spreadsheet. The data could also be corrupted, could have missing values, could have unusual patterns that would be something to look into, patterns where things are repeated; maybe some of the data is just copy and paste, and we would like to be warned about such possibilities. The third dimension is data integration. When we consider data from these different sources, we're going to integrate them so we can do some analysis, with linkage through an ID, for example. But in doing that we might create some issues, for example disclosing data that normally should be anonymized. Data integration will allow us to do fantastic things, but if the data is perceived to have privacy exposure issues, then the quality of the information from the analysis we're going to do is going to be affected. So data integration should be looked into very carefully. This is what people used to call ETL: extract, transform and load. We now have much better methods for doing that; the Join option in JMP, for example, offers options for doing it. Temporal relevance is pretty clear: we have data, it is time-stamped somehow, and if we're going to do the analysis later, after the data collection, and if the deployment we consider is even later, then the data might not be temporally relevant. In a common situation, if we want to compare what is going on now, we would like to make that comparison against recent data or data from just before the pandemic started, not 10 years earlier. Official health statistics used to be two or three years behind in terms of timing, which made it very difficult to use official statistics in assessing what is going on with the pandemic. Chronology of data and goal is related to the decision we make as a result of our goal. If, for example, our goal is to forecast air quality, we're going to use some predictive models on the Air Quality Index reported on a daily basis, which gives us a one-to-six scale from hazardous to good.
There are values representing levels of health concern: zero to 50 is good; 300 to 500 is hazardous. Chronology of data and goal means that we should be able to make a forecast on a daily basis, so the methods we use should be updated on a daily basis. If, on the other hand, our goal is to figure out how this AQI index is computed, then we are not really bound by the timeliness of the analysis; we could take our time, because there's no urgency in getting the analysis done daily. Generalizability, the sixth dimension, is about taking our findings and considering where they could apply in more general terms, to other populations, other situations. This can be done intuitively: marketing managers who have seen a study on one market, call it A, might already understand the implications for market B without data, and physicists can make predictions based on mechanics, on first principles, without data. So some generalizability is done with data; this is the basis of statistical generalization, where we go from the sample to the population. Statistical inference is about generalization: we generalize from the sample to the population. And some generalization is domain based, in other words using expert knowledge, domain expertise, not necessarily with data. We have to recognize that generalizability is not achieved with statistics alone. The seventh dimension is construct operationalization, which is really about what it is that we measure. If we want to assess behavior or emotions, what can we measure that will give us data reflecting behavior or emotions? The example I typically give is pain. We know what pain is; how do we measure it? If you go to a hospital and ask the nurse how they assess pain, they will tell you, we have a scale, 1 to 10. It's very qualitative, not very scientific, I would say. If we want to measure the level of alcohol in drivers on the road, it will be difficult to measure directly, so we might measure speed as a surrogate. Another part of operationalization is the other end of the story. The construct is what we measure, which reflects our goal; the end result is that we have findings and we want to do something with them, to operationalize them. That is what action operationalization is about: what you do with the findings. When presenting on a podium, we used to ask three questions, and these are very important questions to ask. Once you have done some analysis, you have someone in front of you who says, oh, thank you very much, you're done, you, the statistician or the data scientist. This takes you one extra step, getting you to ask your customer these simple questions: What do you want to accomplish? How will you do that? And how will you know if you have accomplished it? We can help answer, or at least support, some of these questions. The eighth dimension is communication. I'm giving you an example from a very famous old map from the 19th century, which shows the march of Napoleon's army from France to Moscow, to Russia.
You see the width of the path indicates the size of the army, and then in black you see what happened to them on their way back. So basically this was a disastrous march. We can relate this old map to existing maps, and there is a JMP add-in, which you can find on the JMP Community, to show you with dynamic bubble plots what this looks like. So I've covered very quickly the eight information quality dimensions. My last slide puts what I've talked about into a historical perspective, to give some proportion to what I'm saying. I think we are really in the era of information quality. We used to be concerned with product quality in the 17th and 18th centuries. We then moved to process quality and service quality. This is the short memo proposing a control chart, from 1924, I think. Then we moved to management quality. This is the Juran trilogy of design, improvement and control. Six Sigma (define, measure, analyze, improve, control) is the improvement process of Juran, and Juran was the grandfather of Six Sigma in that sense. Then in the '80s, Taguchi came in. He talks about robust design. How can we handle variability in inputs by proper design decisions? And now we are in the age of information quality. We have sensors. We have flexible systems. We are depending on AI and machine learning and data mining, and we are gathering very big numbers, which we call big data. Information quality should be a prime interest. I'm going to try to convince you, with the help of Chris, that we are here, and that JMP can help us achieve that in a really unusual way. What you will see at the end of the case study that Chris will show is also how to do an information quality assessment on a specific study, and basically generate an information quality score. So if we go top down, I can tell you this study, this work, this analysis scores maybe 80%, or maybe 30%, or maybe 95%. And through the example you will see how to do that. There is a JMP add-in to provide this assessment. It's actually quite easy. There's nothing really sophisticated about that. So I'm done and Chris, after you. Thanks, Ron. So now I'm going to go through the analysis of a data set in a way that explicitly calls out the various aspects of information quality and show how JMP can be used to assess and improve InfoQ. So first off, I'm going to go through the InfoQ components. The first InfoQ component is the goal, so in this case the problem statement was that a chemical company wanted a formulation that maximized product yield while minimizing a nuisance impurity that resulted from the reaction. So that was the high level goal. In statistical terms, we wanted to find a model that accurately predicted a response on a data set so that we could find a combination of ingredients and processing steps that would lead to a better product. The data are set up in 100 experimental formulations with one primary ingredient, X1, and 10 additives. There's also a processing factor and 13 responses. The data are completely fabricated but were simulated to illustrate the same strengths and weaknesses of the original data. The date each formulation was made was also recorded.
We will be looking at this data closely, so I won't elaborate beyond pointing out that they were collected in an ad hoc way, changing one or two additives at a time rather than as a designed or randomized experiment. There are a lot of ways to analyze this data, the most typical being least squares modeling with forward selection on selected responses. That was my original intention for this talk, but when I showed the data to Ron, he immediately recognized the response columns as time series from analytical chemistry. Even though the data were simulated, he could see the structure. He could see things in the data that I didn't see. I found this to be strikingly impressive. It's beyond the scope of this talk, but there is an even better approach based on ensemble modeling using fractionally weighted bootstrapping. Phil Ramsey, Wayne Levin and I have another talk about this methodology at the European Discovery Conference this year. The approach is promising because it can fit models to data with more active interactions than there are runs. The fourth and final component of information quality is utility, which is how well we are able to achieve our goals, or rather, how do we measure how well we've achieved our goals? There's a domain aspect, which in this case is that we want a formulation that leads to maximized yield and minimized waste in post processing of the material. The statistical analysis utility refers to the model that we fit. What we're going for there is least squares accuracy of our model in terms of how well we're able to predict what would result from candidate combinations of mixture factors. Now I'm going to go through a set of questions that make up a detailed InfoQ assessment as organized into the eight dimensions of information quality. I want to point out that not all questions will be equally relevant to different data science and statistical projects, and that this is not intended to be rigid dogma but rather a set of things that are a good idea to ask oneself. These questions represent a kind of data analytic wisdom that looks more broadly than just the application of a particular statistical technology. A copy of a spreadsheet with these questions, along with pointers to the JMP features that are the most useful for answering a particular one, will be uploaded to the JMP Community along with this talk for you to use. As I proceed through the questions, I'll be demoing an analysis of the data in JMP. So Question 1 is: is the data scale used aligned with the stated goal? The Xs that we have consist of a single categorical variable, processing, and the 11 continuous inputs. These are measured as percentages and are also recorded to half a percent. We don't have the total amounts of the ingredients, only the percentages. The totals are information that was either lost or never recorded. There are other potentially important pieces of information that are missing here. The time between formulating the batches and taking the measurements is gone, and there could have been other covariate level information that is missing here that would have described the conditions under which the reaction occurred.
Without more information than I   have, I cannot say how important   this kind of covariate information   would have been. We do have   information on the day of the   batch, so that could be used as   a surrogate possibly. Overall we   have what are, hopefully, the most   important inputs, as well as   measurements of the responses we   wish to optimize. We could have   had more information, but this   looks promising enough to try   and analysis with. The second   question related to data   resolution is how reliable and   precise are the measuring devices   and data sources. And the fact   is, we don't have a lot of   specific information here. The   statistician internal the   company would have had more   information. In this case we   have no choice but to trust that   the chemists formulated and   recorded the mixtures well. The   third question relative to data resolution is is the data   analysis suitable for the data   aggregation level? And the   answer here is yes, assuming   that their measurement system is   accurate and that the data are   clean enough. What we're going   to end up doing actually is   we're going to use the   Functional Data Explorer to   extract functional principal   components, which are a data   derived kind of data   aggregation. And then we're   going to be modeling those   functional principal components   using the input variables. So   now we move on to the data   structure dimension and the   first question we ask is, is the   data used aligned with the   stated goal? And I think the   answer is a clear yes here. We're   trying to maximize   yield. We've got measurements for   that, and the inputs are   recorded as Xs. The second data   structure question is where   things really start to get   interesting for me. So this is   are the integrity details   (outliers, missing values, data   corruption) issues described and   handled appropriately? So from   here we can use JMP to be able   to understand where the outliers   are, figure out strategies for   what to do about missing values,   observe their patterns and so   on. So this is this is where   things are going to get a little   bit more interesting. The first   thing we're going to do is we're   going to determine if there are   any outliers in the data that we   need to be concerned about. So   to do that, we're going to go   into the explore outliers   platform off of the screening   menu. We're going to load up the   response variables, and because   this is a multivariate setting,   we're going to use a new feature   in JMP Pro 16 called Robust   PCA Outliers. So we see where   the large residuals are in those   kind of Pareto type plots.   There's a snapshot showing where   there's some potentially   unusually large observations. I   don't really think this looks   too unusual or worrisome to me.   We can save the large outliers   to a data table and then look at   them in the distribution   platform and what we see kind of   looks like a normal distribution   with the middle taken out. So I   think this is data that are   coming from sort of the same   population and there's nothing   really to worry about here,   outliers-wise. So once we've   taken care of the outlier   situation we go in and explore   missing values. So what we're   going to do first is we're going   to load up the Ys as...into the   platform, and then we're going   to use the missing value   snapshot to see what patterns   they are amongst our missing   values. 
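Outside of JMP, a rough analogue of the missing value snapshot can be produced with pandas; the file name and the Y1 to Y13 column naming below are hypothetical stand-ins, not the actual study data.

```python
import pandas as pd

df = pd.read_csv("formulations.csv")                         # hypothetical file name
responses = [c for c in df.columns if c.startswith("Y")]     # assumed Y1..Y13 naming

miss = df[responses].isna()
print(miss.sum().sort_values(ascending=False))               # missing count per response

# Row-wise missingness patterns: a '1' marks a missing cell. Counting how often
# each pattern repeats is a crude version of the snapshot's horizontal clusters.
patterns = miss.apply(lambda row: "".join("1" if v else "0" for v in row), axis=1)
print(patterns.value_counts().head())
```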
It looks like the missing values tend to occur in horizontal clusters, and there are also the same missing values across rows. So you can see that with the black splotches here. And then we'll go apply an automated data imputation, which goes ahead and saves formula columns that impute missing values in the new columns using a regression type algorithm that was developed by a PhD student of mine named Milo Page at NC State. So we can play around a little bit and get a sense of how the ADI algorithm is working. So it's created these formula columns that are peeling off elements of the ADI impute column, which is a vector formula column, and the scoring impute function is calculating the expected value of the missing cells given the non missing cells, whenever it's got a missing value. And it's just carrying through a non missing value otherwise. So you can see 207 in Y6 there. It's initially 207, but then I change it to missing and it's now being imputed to be 234. So I'll do this a couple of times so you can kind of see how it's working. So here I'll put in a big value for Y7, and that's now been replaced. And if we go down and we add a row, then all missing values are there initially and the column means are used for the imputations. If I were to go ahead and add values for some of those missing cells, it would start doing the conditional expectation of the still missing cells using the information that's in the non missing ones. So our next question on data structure is: are the analysis methods suitable for the data structure? So we've got 11 mixture inputs and a processing variable that's categorical. Those are going to be inputs into a least squares type model. We have 13 continuous responses and we can model them individually using least squares. Or we can model functional principal components. Now there are problems. The input variables have not been randomized at all. It's very clear that they would muck around with one or more of the compounds and then move on to another one. So the order in which the input variables were varied was kind of haphazard. It's a clear lack of randomization, and that's going to negatively impact the generalizability and strength of our conclusions. Data integration is the third InfoQ dimension. These data are manually entered lab notes consisting mostly of mixture percentages and equipment readouts. We can only assume that the data were entered correctly and that the Xs are aligned properly with the responses. If that isn't the case, then the model will have serious bias problems and have problems with generalizability. Integration is more of an issue with observational data science problems and machine learning exercises than with lab experiments like this. Although it doesn't apply here, I'll point out that privacy and confidentiality concerns can be identified by modeling the sensitive part of the data using the to-be-published components of the data. If the resulting model is predictive, then one should be concerned that privacy requirements are not being met. Temporal relevance refers to the operational time sequencing of data collection, analysis and deployment, and whether gaps between those stages lead to a decrease in the usefulness of the information in the study.
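As a side note on the imputation step above: the conditional expectation that the scoring function computes can be written, under a multivariate normal working model with the mean and covariance partitioned into observed (o) and missing (m) blocks, as the standard conditional-mean formula

\[
\hat{y}_m = \mu_m + \Sigma_{mo}\,\Sigma_{oo}^{-1}\,\bigl(y_o - \mu_o\bigr),
\]

which is a sketch of the idea rather than the exact ADI algorithm; ADI involves additional estimation details not shown here.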
In this case, we can only simply   hope that the material supplies   are reasonably consistent and   that the test equipment is   reasonably accurate, which is an   unverifiable assumption at this   point. The time resolution we   have when the data collection is   at the day level, which means   that there isn't much way that   we can verify if there is time   variation within each day.   Chronology of data and goal is   about the availability of   relevant variables both in terms   of whether the variable is   present at all in the data or   whether the right information   will be available when the model   is deployed. For predictive   models, this relates to models   being fit to data similar to   what will be present at the time   the model will be evaluated on   new data. In this way, our data   set is certainly fine. For   establishing causality, however,   we aren't in nearly as good a   shape because the lack of   randomization implies that time   effects and factor effects may   be confounded, leading to bias   in our estimates. Endogeneity,   or reverse causation, could   clearly be an issue here, as   variables like temperature and   reaction time could clearly be   impacting the responses, but have   been left unrecorded. Overall,   there is a lot we don't know   about this dimension in an   information quality sense.   The rest of the InfoQ   assessment is going to be   dependent upon the type of   analysis that we do. So at this   point I'm going to go ahead and   conduct an analysis of this data   using the Functional Data   Explorer platform in JMP Pro   that allows me to model across   all the columns simultaneously   in a way that's based on   functional principal components,   which contain the maximum amount   of information across all those   columns as represented in the   most efficient format possible.   I'm going to be working on the   imputed versions of the columns   that I calculated earlier in the   presentation. And I'm going to   point out that I'm going to be   working to find combinations of   the mixture factors that achieve   as closely as possible in a   least square sense, an ideal   curve that was created by the   practitioner that maximizes the   amount of potential product that   could be in a batch while   minimizing the amount of the   impurities that they   realistically thought a batch   could contain. So I begin the   analysis by going to the analyze   menu, bring up the Functional   Data Explorer. This has rows as   functions. I'm going to load up my   imputed rows, and then I'm going   to put in my formulation   components and my processing   column as a supplementary   variable. We've got an ID   function, that's batch ID. Here I   get in. I can see the functions,   both the overlay altogether, and   I can see the individual functions.   Then I can load up the target   function, which is the ideal.   And that will change the   analysis that results once I   start going into the modeling   steps. So these are pretty   simple functions, so I'm just   going to model them with   B splines.   And then I'm going to go into my   functional DOE analysis.   This is going to fit the model   that connects the inputs into   the functional principal   components and then connect all   the way through the   eigenfunctions to make it so   we're able to recover the   overall functions as they   changed, as we are varying the   mixture factors. 
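Schematically, the functional DOE setup just described expands each batch's response curve in functional principal components and then regresses the scores on the mixture and processing factors; in rough notation (a sketch of the idea, not JMP's exact parameterization):

\[
y_i(t) \;\approx\; \hat{\mu}(t) + \sum_{k=1}^{K} s_{ik}\,\hat{\phi}_k(t),
\qquad s_{ik} = f_k(x_i) + \varepsilon_{ik},
\]

where \(\hat{\mu}(t)\) is the mean curve, \(\hat{\phi}_k(t)\) are the eigenfunctions, \(s_{ik}\) are the FPC scores for batch \(i\), and each \(f_k\) is the regression model fit to the DOE factors \(x_i\).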
The functional principal component analysis has indicated that there are four dimensions of variation in these response functions. To understand what they mean, let's go ahead and explore with the FPC profiler. So watch this pane right here as I adjust FPC 1, and we can see that this FPC is associated with peak height. FPC 2 looks like it's kind of a peak narrowness; it's almost like a resolution principal component. The third one is related to kind of a knee on the left of the dominant peak. And FPC 4 looks like it's primarily related to the impurity, so that's what the underlying meaning is of these four functional principal components. So we've characterized our goal as maximizing the product and minimizing the impurity, and we've communicated that into the analysis through this ideal or golden curve that we supplied at the beginning of the FDE exercise we're doing. To get as close as possible to that ideal curve, we turn on desirability functions. And then we can go out and maximize desirability. And we find that the optimal combination of inputs is about 4.5% of Ingredient 4, 2% of Ingredient 6, 2.2% of Ingredient 8 and 1.24% of Ingredient 9, using processing method two. Let's review how we've gotten here. We first imputed the missing response columns. Then we found B-spline models that fit those functions well in the FDE platform. A functional principal components analysis determined that there were four eigenfunctions characterizing the variation in this data. These four eigenfunctions were determined via the FPC profiler to each have a reasonable subject matter meaning. The functional DOE analysis consisted of applying pruned forward selection to each of the individual FPC scores using the DOE factors as input variables. And we see here that these have found combinations of interactions and main effects that were most predictive for each of the functional principal component scores individually. The Functional DOE Profiler has elegantly captured all aspects into one representation that allows us to find the formulation and processing step that is predicted to have desirable properties, as measured by high yield and low impurity. So now we can do an InfoQ assessment of the generalizability of the data and the analysis. So in this case, we're more interested in scientific generalizability, as the experimenter is a deeply knowledgeable chemist working with this compound. So we're going to be relying more on their subject matter expertise than on statistical principles and tools like hypothesis tests and so forth. The goal is primarily predictive, but the generalizability is kind of problematic because the experiment wasn't designed. Our ability to estimate interactions is weakened for techniques like forward selection and impossible via least squares analysis of the full model. Because the study wasn't randomized, there could be unrecorded time-order effects. We don't have potentially important covariate information like temperature and reaction time. This creates another big question mark regarding generalizability. Repeatability and reproducibility of the study is also an unknown here, as we have no information about the variability due to the measurement system.
Fortunately,   we do have tools like JMP's   evaluate design to understand   the existing design as well as   augment design that can greatly   enhance the generalization   performance of the analysis.   Augment can improve information   about main effects and   interactions, and a second round   of experimentation could be   randomized to also enhance   generalizability. So now I'm   going to go through a couple of   simple steps to show how to   improve the generalization   performance of our study using   design tools in JMP. Before I   do that, I want to point out   that I had to take the data and   convert it so that it was   proportions rather than in   percents. Otherwise the design   tools were not really agreeing   with the data very well. So we   go into the evaluate designer   and then we load up our Ys and   our Xs. I requested the ability to   handle second order interactions.   Then...yeah, I got this alert   saying, hey, I can't do that   because we're not able to   estimate all the interactions   given the one factor at a time   data that we have. So I backed   up. We go to the augment   designer, load everything up,   set augment. We'll choose and I-   optimal design because we're   really concerned with   predicted performance here.   And I   set the number of runs to 148.   The custom designer requested   141 as a minimum, but I went to   148 just to kind of make sure   that we've got all ability to   estimate all of our interactions   pretty well. After that, it   takes about 20 seconds to   construct the design. So now   that we have the design, I'm   going to show the two most   important diagnostic tools in   the augment designer for   evaluating a design. On the   left, we have the fraction of   design space plot. This is   showing that 50% of the volume   of the design has   a prediction variance that is   less than 1. So 1 would be   equivalent to the residual   error. So we're able to get   better than measurement error   quality predictions over the   majority of the design. On the   right we have the color map on   correlations. This is showing   that we're able to estimate   everything pretty well. There's   some...because of the mixture   constraint, we're getting some   strong correlations between   interactions and main effects.   Overall, the effects are fairly   clean. And the interactions are   pretty well separated from one   another, and the main effects   are pretty well separated from   one another as well. After   looking at the design   diagnostics, we can make the   table. Here, I have shown the   first 13 of the augmented runs   and we see that we've got...we   have more randomization. We don't   have use of the same main effect   over and over again streaks.   That's evidence of better   randomization and overall the   design is going to be able to   much better estimate the main   effects and interactions having   received better, higher quality   information in this second stage   of experimentation. So the input   variables, the Xs, are accurate   representations of the mixture   proportion, so that's a clear   objective interest. The   responses are close surrogate   for the amount of the product   and amount of impurity that's in   the batches. We're pretty good on   7.1. there. The justifications   are clear. After the study, we   can of course go prepare a   batch that is the formulation   that was recommended by the FDOE   profiler. Try it out and see if   we're getting the kind of   performance that we were looking   for. 
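To make the fraction of design space diagnostic concrete, the quantity it summarizes is the relative prediction variance x0'(X'X)^{-1}x0 of a least squares prediction at a point x0. A small numpy sketch with a placeholder design matrix (not the actual augmented design) is below.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(size=(148, 12))   # placeholder model matrix for the augmented design
x0 = rng.uniform(size=12)         # placeholder model expansion of one candidate point

# Relative prediction variance Var(yhat(x0)) / sigma^2 for least squares.
rel_var = x0 @ np.linalg.inv(X.T @ X) @ x0

# The fraction of design space plot shows the distribution of this quantity over
# points sampled from the design region; values below 1 mean the prediction is
# more precise than a single run's residual error.
print(rel_var)
```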
It's very clear that trying out a confirmation batch like that would be the way to assess how well we've achieved our study goals. So now we come to the last InfoQ dimension, communication. By describing the ideal curve as a target function, the Functional DOE Profiler makes the goal and the results of the analysis crystal clear. And this can be expressed at a level that is easily interpreted by the chemists and managers of the R&D facility. And as we have done our detailed information quality assessment, we've been upfront about the strengths and weaknesses of the study design and data collection. If the results do not generalize, we certainly know where to look for the problems. Once you become familiar with the concepts, there is a nice add-in written by Ian Cox that you can use to do a quick quantitative InfoQ assessment. The add-in has sliders for the upper and lower bounds of each InfoQ dimension. These dimensions are combined using a desirability function approach into an overall interval for the InfoQ over on the left. Here is an assessment for the data and analysis I covered in this presentation. The add-in is also a useful thinking tool that will make you consider each of the InfoQ dimensions. It's also a practical way to communicate InfoQ assessments to your clients or to your management, as it provides a high level view of information quality without using a lot of technical concepts and jargon, and it does so with an easy to use interface. The add-in is also useful as the basis for an InfoQ comparison. My original hope for this presentation was to be a little bit more ambitious. I had hoped to cover the analysis I had just gone through, as well as another simpler one, one where I skip imputing the responses and just do a simple multivariate linear regression model of the response columns. Today, I'm only able to offer a final assessment of that approach. As you can see, several of the InfoQ dimensions suffer substantially without the more sophisticated analysis. It is very clear that the simple analysis leads to a much lower InfoQ score. The upper limit of the simple analysis isn't that much higher than the lower limit of the more sophisticated one. With experience, you will gain intuition about what a good InfoQ score is for data science projects in your industry, and you will pick up better habits, as you will no longer be blind to the information bottlenecks in your data collection, analysis and model deployment. This was my first formal information quality assessment. Speaking for myself, the information quality framework has given words and structure to a lot of things I already knew instinctively. It has already changed how I approach new data analysis projects. I encourage you to go through this process yourself on your own data, even if that data and analysis is already very familiar to you. I guarantee that you will be a wiser and more efficient data scientist because of it. Thank you.
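The exact arithmetic inside the add-in isn't shown in the talk, but the desirability-style combination it describes can be sketched as a geometric mean of the eight dimension ratings; the scores below are hypothetical and for illustration only.

```python
import numpy as np

# Hypothetical 0-1 ratings for the eight InfoQ dimensions.
scores = {
    "data resolution": 0.7, "data structure": 0.8, "data integration": 0.6,
    "temporal relevance": 0.7, "chronology of data and goal": 0.5,
    "generalizability": 0.4, "operationalization": 0.7, "communication": 0.9,
}

# Geometric mean: one very weak dimension drags the overall score down,
# which is the point of a desirability-type combination.
overall = np.prod(list(scores.values())) ** (1 / len(scores))
print(f"Overall InfoQ score: {overall:.0%}")
```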
Christel Kronig, Senior Analytical Scientist, Dr Reddy's Laboratories EU Ltd Andrea Sekulovic, Scientist Formulation Process Development, Dr. Reddy's Laboratories Ltd.   A key aspect of the development of generic drugs is that sameness to the innovator product must be demonstrated. In this study, the objective was optimisation of a milling process to generate a drug product with particle size attributes all in the same range as that of the innovator product. Two different modelling techniques were evaluated to model the particle size attributes over time: Functional Data Explorer versus Fit Curve. The curve parameters or Functional Principal Components (FPC) were then modeled as a function of the process parameters. Finally, the models obtained were used to predict the particle size attributes over time and identify combinations of process parameters likely to generate a drug product of the desired quality. A verification experiment was performed which resulted in a product with particle size attributes matching the requirements.     Auto-generated transcript...   Speaker Transcript Christel Kronig Well hi everyone, and thank you for joining this talk. My name is Christel Kronig and I'm a scientist at Dr. Reddy's in Cambridge in the UK. I helped with the data analysis on this project. My colleague, who also worked on this study, is Andrea Sekulovic; she's based at Dr Reddy's in the Netherlands and she's a formulation scientist. And so today I'm going to talk to you about the optimization of a milling process to match the drug product quality attributes of the innovator. And so the first part of the presentation will be talking to you about the process development, the objective of the project, what the study involved and what modeling options we considered, and then in the second part, I will look at the workflow that we developed in JMP for this study. Okay, so the objective of this study was really to understand the relationship between the process parameters for the milling process and the quality attributes for the drug product that we're making, so our responses. So we wanted to obtain a predictive model that we could use for scale up and also to optimize the conditions that we would need for this process. So there were several responses that we had to examine as part of the study, and they are particle size attributes. So we looked at micron and span, and by studying the innovator product, we also knew what the range needed to be to make sure we had a product that was within the specification and similar to the innovator product. So the profile of the responses would vary based on the milling time and the milling process parameters. To find the optimum conditions, we needed to optimize these parameters to make sure that we would have product in the design range that met the specification requirement. The process parameters were the milling speed, the flow, the size; we also had a loading parameter, excipient percent, API concentration and, of course, the time for milling. So the process development was... we started with making some initial batches. We didn't start directly with doing a design of experiments. We looked at the data that needed to be collected with those first few batches that were made and we looked at modeling options.
And then the team in the Netherlands decided to do an I-optimal design, so they looked at three parameters and time and performed some initial modeling, and after those first data sets, decided to add three additional parameters. So we augmented this design, adding 10 additional experiments. So the final data set that we looked at for the optimization had 38 batches that included six parameters and time, and this is what we used for the optimization for this study. So after that we then made some confirmation batches to check if the new settings would generate products that meet the requirements. Okay, so what modeling options did we consider for this project? So the default option for the team was really to model the response at selected time points. And it's easy to do that in the standard software, and the disadvantage of course is it's not possible to predict the outcome at other time points, and the optimum may be in between specific time points. So modeling of the profile over time enables greater understanding of how the process parameters affect the profile of the response over time, so you're more likely to reach an optimum. But for this, of course, you need more advanced modeling capabilities. And so we looked first at Fit Curve, which is available in JMP. And for our initial data set that worked quite well, so this is one of the functions, this Biexponential 4P, which appeared to be a good fit for most of the batches that we'd made initially when modeling some of the responses, and there's an example on the right of how this type of curve fitted our data quite well. And one of the issues we encountered is that it didn't work for all the batches, so, for example, in some cases we didn't have enough time points. On the left there are not enough time points to fit that model; you would need a minimum of five, for example, for this particular model. On the right we have one where we have enough time points, but that particular type of curve doesn't fit this particular batch very well. So it was difficult using this for the larger data set that we had, and so, for that reason, we didn't continue with that approach. And we also looked at Functional Data Explorer. So you can see here, looking at this platform for 10 different batches, you can see those on the left and how the profile over time was fitted, so there was no issue with lack of data points here. This gave a good view of differences between batches. So, for example, in green on the graph on the right hand side, you can see the fast batch is in this part of the space and the slow batch appears in a different part of the space, and that highlights the difference between those profiles, which is perhaps not so obvious when you look at the graph on the left hand side. So what you get with the Functional Data Explorer is it breaks down the profile over time into different principal components, so the FPC values that we see here, and this is what you then use for the next part of the workflow. So this is available in JMP Pro; I forgot to say this. So, first, before I show you what it looks like in JMP, I just wanted to take you through what the workflow looks like, what we are trying to do.
So we're starting with...apologies for that...we're starting with a time table which has the critical quality attributes, our responses; we have our time points and, for each of those time points, the critical quality attributes, and also the process parameters that were used to generate the batches. So you then take that data and you use FDE to get a model, and this will mean that you can express your CQAs, your responses, as a function of time and those functional principal components, FPCs. So the output of that will be that you then get a summary table. For each batch you have the CPPs and the FPC, the functional principal component, and then you can apply standard modeling to get the predictive model to express those FPCs as a function of your process parameters. And then the final step is to import that model back into the original table, so you can then express your responses as a function of time and your process parameters, which allows you to use this to find the optimum conditions and to make confirmation batches using those models. And what I've got to say is that for the modeling, and I'll show you this in JMP, we use this model validation strategy for designed experiments; that's something I presented three years ago at Discovery Summit. I won't go into the details of that, but that's there for reference if you want to look that up. So okay, so I'll now take you through what this workflow looks like in JMP and I'll switch over to JMP, so...just find my JMP journal. Okay, going to move this here. Okay, so we'll first start with the original data table, so we have a number of batches that were made, so 38 batches. Each batch has a number of time points. For example, the first batch here, we have 10 different time points. You have your six columns which are the process parameters. And then we have two responses for each of those data points. So the first thing to show you is to look at the data table and visualize that data set, what that looks like. So I have a script here using Graph Builder, which gives very quickly a good overview of what the data looks like. So, for example, you have one of the responses with the milling time here at the bottom, and you can straightaway see that the profiles are quite different depending on the batches; some are steeper than the others, some are very shallow, and also some were collected over a longer or shorter period of time. So we'll now look at the profile in a bit more detail and look at the two modeling approaches that we talked about previously in the slides. So the first one is using Fit Curve, so that's under the Analyze platform, under Specialized Modeling. So if I select Fit Curve, I'm going to pick my milling time as my X and one of my responses, and then I'm going to select batch, so I'm going to do that for each batch. Click OK. I then have, for each batch, a profile, the response over time. And then I'm going to use one of the models that is stored, you know, that JMP has already, and this is this Biexponential 4P, which I talked about before, so I won't go through the differences, you know, the different models, but I know this is one of the ones that we looked at previously for our data. So for example, for this batch it fits okay, but not brilliantly for some of the data points.
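For readers following along outside JMP, a biexponential decay of this general shape can be fit with scipy; the parameterization and the simulated data below are illustrative assumptions, not JMP's exact Biexponential 4P model or the study's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def biexp4(t, a, b, c, d):
    # One common four-parameter biexponential form.
    return a * np.exp(-b * t) + c * np.exp(-d * t)

# Simulated particle-size-style decay over milling time, for illustration only.
t = np.linspace(0.5, 48.0, 12)
rng = np.random.default_rng(0)
y = biexp4(t, 6.0, 0.8, 3.0, 0.05) + rng.normal(0.0, 0.05, t.size)

params, _ = curve_fit(biexp4, t, y, p0=[5.0, 0.5, 2.0, 0.1])
print(dict(zip("abcd", np.round(params, 3))))
```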
For this one, there's not enough time points so you don't,   you know, you can't really use that, but for some of the batches,   so this one, for example, that fitted really well, so you had the four coefficient, you know, estimates and they were statistically significant.   So what you would do, then, is just export all that data and get a summary table in the same way that we're going to do for the Functional Data Explorer but... So I won't do any more using the fit curve in this demo, but the same approach could be used.   So let's go to the Functional Data Explorer, so it's also in specialized modeling in JMP Pro.   And I'm going to select my response and my milling time and also my ID is my batch number.   So I'm going to not explain again in a lot of detail, bearing in...bearing in mind the time we have,   what modeling to use for this type of data. I know B-Spline works quite well with my data set, so this is what I'm going to use.   And as you can see JMP fitted a model for each of the batches that we have in our data table, seems to fit quite well.   So if I look further down, you can see that it's broken the profile into different components.   And you can see on the graph here where the batches are in spec. Now for this particular response, FPC1 here the top actually explains 96% of the variation in the data, which is pretty good so we wouldn't need, in this case, to   look at FPC2 and FPC3. In this instance, we probably only need to keep the first one. It wouldn't be the case for, for example, for the other response that we have in this data set, but here I'm going to restrict the number of FPCs to one.   And then I'm going to   export my summary data, so when I click on save summary, I have a new table appear   in JMP. This is so, you can see, 38 rows, so I have one row per batch, I have my batch number, I have my FPC value, and I have some prediction formula.   So I'm going to close this table now and use one I've prepared earlier, which has got the   FPC for all the responses that I want to look at in my data set.   Let me do that and switch back to my journal. Okay so we're now at step two, where I have a summary table that I've prepared and I have the columns for each of my responses for my FPCs.   So the first thing I want to do here is to use this   validation   technique for DOE, where I'm going to create extra rows, which I'm going to use for the validation report. So I gave you the reference in the slides if you want to understand more about that technique, you can do that. So we're then going to   fit a model for one of my FPCs.   We shall want to examine as a function of some of my process parameters.   And click on run.   And I'm going to use a stopping rule which is minimum AICc. Click on Go.   So JMP has found several process parameters and also interactions that it's found important. You can see the R squared and R squared adjusted   look good and you get also by using these...adding this extra R square validation, which also indicates that it's looking good and the model hasn't over fitted, for example. So I'm going to click on make model   and   then   I have a model where I can see that milling speed and size was important and some of the interaction terms also important, so what I need to do next is save the prediction formula.   And I can also save the script for when I want to do that again later on and save that to my data table.   So I'm going to close this window. 
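A rough outside-JMP sketch of this step, regressing a saved FPC score on the process parameters with forward selection and a minimum-AICc stopping rule, might look like the following; the file name, column names and the exact AICc bookkeeping are assumptions rather than the platform's implementation, which also used the DOE-based validation column mentioned above.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("fpc_summary.csv")                     # hypothetical summary table
candidates = ["milling_speed", "flow", "size",          # hypothetical column names
              "loading", "excipient_pct", "api_conc"]
y = df["FPC1"]

def aicc(fit):
    k = fit.df_model + 2              # slope terms + intercept + error variance
    n = fit.nobs
    return fit.aic + 2 * k * (k + 1) / (n - k - 1)

selected, best = [], None
while True:
    remaining = [c for c in candidates if c not in selected]
    if not remaining:
        break
    trials = [(aicc(sm.OLS(y, sm.add_constant(df[selected + [c]])).fit()), c)
              for c in remaining]
    score, term = min(trials)
    if best is not None and score >= best:
        break                          # stop once AICc no longer improves
    best, selected = score, selected + [term]

print("Selected terms:", selected, "AICc:", round(best, 1))
```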
So you would need to do this exercise of fitting the model for each of the FPC values that you have in your data table. And the last thing that we need to do is to use our prediction formula, so this is the formula I have just saved for FPC1. And this is the prediction formula that came from FDE earlier on. So if I right click, this is what it looks like. So I have the FPC1 for each batch, and then I have those extra columns which are functions of time. So what I need to do now is, instead of FPC1, the actual value, I'm going to use my prediction formula, which now is a function of my process parameters. Just literally replace that in the formula here. I'm going to click apply. OK, so again, you will need to do that for any of the models that you generate and then save those in your data table. So I'm going to close this and we'll go on to the next step in the journal. Okay, so what you need to do, then, is import back these formulas that we've just saved into the original table, so we can see how the responses vary as a function of time and the process parameters. So I'm going to open my time data table and I'm going to open my summary table with my model. So I'm going to select those columns with the models and I want to copy those columns into my original table, which I've now lost. There it is. So I'm going to click paste columns now. And I just want to double check that it's copied my formulas across. So, for example, if I go there, yes, my formula has been copied. So what I now have is the model which predicts the response as a function of the time and the process parameters. So I'm again not going to save this but use the final table, which has all the models that I need to then do the optimization. So this is the last step. So I now have again my process parameters, my milling time, the two responses and then the prediction formulas, which would have come across from the summary table. So I'm going to use these two columns now with the profiler. And what I need to do also is look at the factor grid, so the team wanted to set the API concentration, excipient and size to specific settings that they wanted to find the optimal conditions for. So if you lock those settings, click OK. And then you can use the desirability functions to set the specification limits, which I have done already. And then maximize desirability, and that gets JMP to find the best conditions to provide product that would meet the requirements that were set, that we wanted. So this is the technique that we used. So I'm going to come out of JMP now and go back to the slides just to show you what the outcome of this workflow was. So I'm going to switch back to the screen. Okay. So out of the 38 batches that we made, that the team in Holland made, there were only four where we had at least one time point where both of those results were within the range that was set. But you can see in the table here on the right that for the span response, in all four cases this was very close to the upper specification limit, and the team were really interested in finding conditions that would generate product where that particular response was, you know, well within the range whilst maintaining the other response also within the target range. And yeah, we had a great result. The model with the conditions that were selected predicted a span of 1.63.
The actual result for this batch was 1.78 and that was the lowest   span that was achieved of all the batches made. So the team were really happy with that result. So despite the slight underestimation of the model, this was still a pretty good result. And you can see in the screen here where   this batch appears in green and it's completely to the left of all the other batches.   And this is why, you know, we were able to achieve a good result. I guess it was, you know, using a slightly different combination of parameter that enabled this result to be achieved.   So just a conclusion really that the Functional Data Explorer in JMP Pro worked really well for this application. It yielded a good predictive model and the best result to date.   We couldn't make use of the fit curve approach so well, despite the initial promising   results that we'd seen at the beginning of the study, and we couldn't use that for the whole data set, but   nevertheless, the team was convinced of the value of looking at profiles over time and the value of this approach. And of course you can apply this to other types of data, for example in formulation, you know, in vitro release or in API development reaction conversion, for example.   And so, this is the end of the presentation, thank you to colleagues in the Netherlands that were involved with milling lots of batches, and   to Andrea who's the co author on this presentation for great teamwork. And we had good interactions between both sides so that led to some great results. So thank you for listening and hope you enjoyed the presentation and enjoy the rest of Discovery.
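The profiler step in this workflow, maximizing desirability over the process parameters subject to specification limits, can also be sketched in code. Everything below (the prediction formulas, specification ranges and coded factor bounds) is a made-up placeholder standing in for the models exported from the workflow, shown only to illustrate the desirability idea.

```python
import numpy as np
from scipy.optimize import differential_evolution

# Placeholder prediction formulas for the two particle-size responses as
# functions of three coded process parameters (speed, flow, size).
def predict_micron(x):
    speed, flow, size = x
    return 3.0 - 0.8 * speed + 0.3 * flow + 0.5 * size

def predict_span(x):
    speed, flow, size = x
    return 2.4 - 0.5 * speed - 0.2 * flow + 0.4 * speed * size

def desirability(value, low, high):
    # 1 at the middle of the specification range, falling to 0 at the limits.
    mid, half = (low + high) / 2.0, (high - low) / 2.0
    return max(0.0, 1.0 - abs(value - mid) / half)

def negative_overall_desirability(x):
    d_micron = desirability(predict_micron(x), 1.0, 3.0)   # hypothetical spec range
    d_span = desirability(predict_span(x), 1.2, 2.0)       # hypothetical spec range
    return -np.sqrt(d_micron * d_span)                     # geometric mean of the two

bounds = [(0.0, 1.0)] * 3                                  # coded factor ranges
result = differential_evolution(negative_overall_desirability, bounds, seed=1)
print("Best settings:", np.round(result.x, 2), "desirability:", round(-result.fun, 2))
```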
Clay Barker, JMP Principal Research Statistician Developer, SAS Paris Faison, JMP Statistical Tester, SAS Ernest Pasour, JMP Principal Software Developer, SAS   The Text Explorer platform in JMP is a powerful tool for gaining insight into text data. New in JMP Pro 16, the Term Selection feature brings the power of the Generalized Regression platform to Text Explorer. Term Selection makes it easy to build and refine regression models using text data as predictors, whether the goal is to predict new observations or to gain understanding about a process. In this talk, we will provide an overview of this new feature and look at some fun examples.     Auto-generated transcript...   Speaker Transcript Clay Barker Thank you, my name is Clay Barker. I'm a statistical developer in the JMP group and today I'm going to be talking about a new feature in JMP pro for 16. It's called term selection and I've worked on it with my colleagues Paris Faison and Ernest Pasour. So text data are becoming more and more common in practice. We may have customer reviews of a product we make, we may have descriptions of some events or maintenance logs for some of our equipment. And in this example here on my on my slides, this is from the aircraft incidents...incident data set that we have in sample data. And every row of data is a description of some airline incident, a crash or other other kind of malfunction. So if you've never used the Text Explorer platform, it was introduced in JMP 13 and we primarily use it to help summarize and analyze text data. So it makes it easy to do common tasks like looking at the most frequent or most commonly occurring words and phrases, and it makes it easy to do things like cluster the documents or look for themes, like topic analysis. And everything is driven by what's called the document term matrix. So what is the document term matrix? It's easiest to think of it just as a matrix of indicator variables for whether or not each document includes a particular word or phrase. And we may weight that by how often the word occurs in each document. So each document is a row in our document term matrix and each column is a word. So for a really simple example here, that first line is "the pizza packaging was frustrating." So we have a 1 for packaging, a 1 for pizza and a 1 for frustrating and 0 for all the other words. And, likewise, for the second line, we have a 1 for smell, great, taste, and pizza and 0s elsewhere. It's just a simple summary of what words occur in each document. I've also mentioned there's a there's a couple variations of the document term matrix. The easiest is the binary; it's just ones and zeros. But we may want to include information about how often each word occurs, so here's another slightly longer sentence where pizza appears multiple times. In the binary document term matrix, that pizza column is still a 1 because we don't care how often it occurs. For ternary it's 2, because the ternary is zeros, ones and twos. It's 0 if the word doesn't occur, it's 1 if it occurs once, and it's 2 if it occurs more than once. Regardless of if it occurs four times here, we still coded as a 2. And then frequency is really simple. It's just the number of times each word occurs in that document. So those are three really simple ones, we also offer TF-IDF weighting, which is kind of like a relative frequency, but you can learn more about these weighting schemes in Text Explorer platform. So what's the next step? What is the next thing you might want to use next? 
You might want to use the text to help with some outcome. So here I've made up some furniture reviews, and there's a star rating that's associated with every review, right. So we might think that those reviews give us some clues about why someone would rate a product higher or lower. So in this very first line, "the instructions were frustrating" and the user only rated us a 3. If that pattern happens a lot, and a lot of the lower ratings are associated with instructions, that might tell us that we need to improve the instructions that we ship with our furniture. So the simple idea is to use those in a regression model. We're going to take our document term matrix, we're going to combine it with our response information, and we're just going to do a regression. And that will help us understand why customers like or don't like our furniture, or whatever product we're making. It can help us classify objects based on their descriptions, we'll see some of that later, or maybe we want to understand why some machines fail or not based on some of their maintenance records. And the way we do this is really easy. We're really just making a regression model where each of our Xs, or our predictor variables, is one of the columns in the term matrix. So if we're modeling product ratings, that's a simple linear regression, and if we're modeling a binary outcome, like whether or not a customer would recommend our product, we model that outcome instead. Either way, the document term matrix possibly has hundreds of words, and it would make sense to not use them all. Not all of the data are going to be useful, so we're going to apply a variable selection technique to our regression model and we'll get a simpler model that's easy to interpret and that fits well. And here at the bottom it's easy just to visualize combining those indicator variables with our response rating, our star rating. So our solution that is going to be in JMP 16 is to bring regression models into the Text Explorer platform. If you've ever used the generalized regression platform to do variable selection, we're essentially embedding that platform inside of Text Explorer. So if you have JMP Pro, you'll see a Term Selection item in the red triangle menu for Text Explorer. This is what we're really excited about. It makes it easy and quick to build and refine these kinds of regression models. So when you launch this platform, and we'll go through some demos in just a minute, but just quickly, at the very beginning of the launch, it's just asking for information about our response and the kind of model that we want to fit. So it can handle both continuous responses, like a star rating, and when you specify a continuous response, it provides a filter so that you can filter out some of the rows based on the response. And when we have a nominal response column, we select the target level. So in this case, would you recommend our product, yes or no. We're going to be modeling the recommendation equal to no. And if we have a multiple level response like blue, green, yellow, we'll be picking the level that we want to model, and we'll see an example of this in just a minute. Then, after you specify the response, we're going to give the platform some information about the kind of model we want to fit. So when we do variable selection, are we going to use the elastic net or the lasso? Those are both variable selection techniques built into generalized regression.
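To ground the idea, here is a tiny stand-in using scikit-learn: a binary document term matrix plus an L1-penalized logistic regression that zeroes out most word coefficients. The reviews, labels and the choice of scikit-learn are illustrative assumptions; this is not how Term Selection or Generalized Regression is implemented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up furniture reviews and recommend (1) / not recommend (0) labels.
reviews = [
    "the instructions were frustrating and the paint was scratched",
    "great table, easy assembly, clear instructions",
    "paint smell everywhere, frustrating assembly",
    "sturdy chair, great value",
    "missing screws, frustrating experience",
    "easy to assemble and looks great",
]
recommend = [0, 1, 0, 1, 0, 1]

# Binary document term matrix: 1 if the word appears in the review at all.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(reviews)

# The L1 penalty shrinks most word coefficients exactly to zero, a rough
# stand-in for lasso-style term selection.
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
model.fit(X, recommend)

for word, coef in zip(vectorizer.get_feature_names_out(), model.coef_[0]):
    if coef != 0:
        print(f"{word}: {coef:+.2f}")
```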
And how are we going to do validation? Do we want to use the AIC or BIC? Additionally, we can specify details about our document term matrix. We can select the weighting and also select the maximum number of terms that we want to consider. So if we want to consider the top 200 most frequently occurring words, that's that's what it's set to now at 200. So what happens when you launch it and you hit that run button? Basically it does everything you need to do. It sets up the data properly behind the scenes so that you don't have to worry about it. It does variable selection, so now we have a small subset of words that we think are useful for predicting our response, and it presents the results in an easy to interpret report, and it's it's quite interactive, as we'll see in a moment. So you used to be able to do this by saving the document term matrix to your data set, launching a generalized regression platform yourself, but you don't have to do that anymore. It's it's all it's all in terms selection now. So that model specification, how do we know what, how do we know what to select? The estimation methods available are the elastic net and the lasso. The easy way to remember the difference between those is that the elastic net tends to select groups of correlated predictors. So in this setting, when our predictors are all words, that means that it will tend to select groups of words that tend to occur together. So if you think of instruction and manual, those words occur together frequently because of instruction manuals. So those two predictors would be highly correlated and the elastic net would probably select both of them, whereas the lasso would probably select instruction or manual but not necessarily both. Validation methods, we have the AIC and the BIC. So sort of the rule of thumb is that the AIC tends to overselect models and the BIC tends to underselect. So in our specifications, it's just that the AIC will tend to select a bigger set of words, while the BIC will select a smaller set of words. And personally I tend to use the AIC a lot in the setting because I would rather have more words than necessary than fewer but really that's a that's a matter of preference. And the document term weighting... the document term matrix weighting, I mean. It really depends on your problem. So in this example, the the word paint occurs in the review multiple times. That could mean that the reviewer was very... took...the paint was very important to that reviewer and that may be meaningful in our regression model, so we would want to use a weighting like frequency instead of binary. So then, what once you launch term selection, you'll end up in a place like this, where we have a summary of all the documents on the left and all of the words that we've selected on the right. And I'm just going to skip over this for now; it's easier to see when we start doing a demo. Another thing that we think is very useful is that we have a summary, so you can you can use term selection to fit a sequence of models and then they're all summarized at the top, and you can switch back and forth between them. And again we'll see that in just a moment. So let's just take a look at the platform. So first we'll take a look at this aircraft incident data set that's in JMP sample data folder. So every row in our table is an incident. And we know how much damage there was to the aircraft, what kind of injuries there were, and we have a description. 
So this last column is a description of exactly what happened in the incident. So we'll launch Text Explorer, and if you've never used Text Explorer before, on the left we have the most frequently occurring terms, and on the right we have them most frequently occurring phrases, so groups of words. So we want to use these words to maybe understand which...what causes a crash to cause more, you know, sustain more damage or maybe more serious injuries. So we're going to go to the red triangle menu and ask for term selection. So now we see this launch that we were looking at just a moment ago, and I want to learn more about damage, right. And in particular I want to...I might want to discriminate between incidents where the aircraft was destroyed versus less damage, so we'll select the target level to be destroyed. I'll mention that this early stopping, that kind of is a time saving feature. So when we do variable selection, it can be quite time consuming for much bigger problems. And if you leave this checkbox checked, it'll sort of say we think we have a model that's good enough; we're going to go ahead and stop early. I tend to uncheck that unless I know I have a much bigger problem. So I'll leave it with elastic net, AIC and we'll hit run. So now what's happening in the background is it created the data that I needed to fit this regression model, and it's fitting the regression model behind the scenes, and now I have my summary. So every every document is summarized on this left panel, so this is the first aircraft incident, this is the second, which is blank. So this first one is, the airplane contacted a snowbank on the side of the runway. And we have we have all the important words highlighted, so the blue words have negative coefficients, the orange words have positive coefficient. And if we're interested in the words, we can look at this panel on the right. I tend to like to sort by the coefficient so you can just click at the top. So, in this instance, words with negative coefficients are less likely to have been destroyed. So if the document contains the word landing, if the plane has made it to the point where it's landing, it's probably not a very serious incident. So these are less likely to be destroyed. On the flip side, we can look at the most positive terms, and these are the words that are most associated with the aircraft being destroyed. And those are words like fire, witnesses. If they're interviewing witnesses to describe the incident that probably means whoever was in the plane wasn't able to do an interview. So fire, radar, spinning, night, these are all words that are highly associated with with being destroyed. Now, maybe we're also interested in the injury level, and specifically, we're going to look at the worst injury levels. And we'll do term selection again. And I'll just accept that for to keep it quick. So now we can do the same sorts of things, but now looking at injury level instead. So if the if the interview includes the captain or landing, these are probably not very serious in terms of the injury level. But if if again if it mentions the radar, night, mountains, these are much more serious words. So we can kind of quickly go back and forth between these two...these two responses and see which words are predictive of those two injury level and damage level. So let's take a look at another example. So this is one that I really enjoy. Earlier, I was talking about pizza in my examples. Now we're going to talk about beer. 
So these are obviously things that are near and dear to my heart. So what I've got here is...I downloaded a description of every beer style, according to some beer review body. So every single beer has a description, and it says where the beer came from. So we have...the United States, Germany, Ireland. Use term selection to see...
Peng Liu, JMP Principal Research Statistician Developer, SAS Jian Cao, JMP Principal Systems Engineer, SAS   This talk will provide a comprehensive review of major updates in two time series-related platforms. More specifically, the updates include a forecasting performance-based model selection method, enhanced functions for studying the recently added state space smoothing models, and analysis capabilities using Box-Cox transformed time series. We will explain the motivations behind development efforts to help identify interesting use cases of the new features. We will present a few examples to illustrate some of the many possibilities for how these new features can be used. JMP 16 represents a major upgrade for time series platforms. Equipped with the new features, JMP opens the door to many intriguing new discoveries in time series analysis.     Auto-generated transcript...   Speaker Transcript This talk is to highlight some ... time series platforms. Three are from time series platforms and ... do we need Box-Cox transformed time series? Let's take a look at the data ... also known as the airline passenger data set. The original series is ... model from. Why? Let's take a look at a plot ... getting larger. And this series cannot be handled by the ... in the second picture. So the variation does not change with the ... time series ... So in the literature people will say, well, we will transform ... of the transformed scale, in this case here, it's the log scale. Sending it to the inverse ... transform. In the past ... streamline the whole process. What you need to do is to put ... need to do the models, make forecasts, then the software ... will put log passengers into Y, but now we don't have to. We ... to enter the Box-Cox transformation parameter Lambda. Zero means it's a log ... the red triangle menu and click either ARIMA or Seasonal ARIMA ... 12 for the seasonal part. Without intercept. Click ... forecast taking care of the inverse transformation. The ... will show in this plot, and the forecast had ... models is a workhorse in the Time Series Forecast platform. They can fit and forecast a lot ... performance is somehow comparable to the forecasting ... study why this type of model works and why some ... type of model into the Time Series platform, which is designed to ... a function of the unknown, unobserved state. Here at ... variables and the error term by either additive operations ... state is the level state time series. The trend state forms a ... state, and also one of the previous seasonal states. And ... the previous trend state will trend to the next trend state, and the level state ... point to another time point. And there are more state transitions than is ... series into Y and click OK. To fit this type of model, we ... set, I'm going to enter 12 for period, and I'm going to click the Select Recommended button.
From the additive error models and ... this particular set, I'm going to click Constraint Parameters ... recommended models to fit these time series and ... model with smaller AIC, and my eyes are on the first two ... models. And let me overlay the forecasts ... from the original time series more nicely. So my preference would be ... difference? Let me open the first one, MAM. Let's go down below. This ... this one, component states. This is special for this ... the first letter. And the trend is additive by ... the second part of this report are the state component ... part is the prediction of this specific state. The period of the time series ... has an increasing pattern in the past. It keeps increasing ... series and the pattern continues toward the future, and this ... observed, but the forecast is flat. This bothered me. Now let's look at the second ... state component graph. The level is increasing in the past, had ... future. This is a more reasonable plot that I can accept. So is it ... on to the second slide. This slide and then the next ... on interpreting the forecasts from this type of model. Here I would like ... up. I listed half of them here. Oh, nearly half. So let's focus ... some increasing trend that will taper off towards the end. And on the other hand, we can ... see from the forecast using this type of model. If seasonality is not involved, when I ... the first one, this is a flat forecast. If the seasonality ... have a linear increase pattern, and so on and so forth, similar to the others. Now ... it's merely increasing. After applying the ... the multiplicative seasonality on top of our increasing ... this type ... different types of shapes, flat ... we get those different shapes. So I re-enter ... what we eventually see in the forecast. You have the flat patterns or ... parameters. So I separated these parts and also I ... trend will usually look flat, we will get an increasing pattern in the level state, when it's linear and when it's curved. It all depends on how this ... increasing or decreasing in the level exponentially. So this ... think of it as a compound interest rate: if the level state increases ... they make forecasts, they try not to overshoot or undershoot the forecast ... how to interpret the forecast from state ... second one, none of these models are stationary. They are ... So if you are considering these time series, things ... third one, if you just see that time series, not ... a result in the next slide that will fit ... compare across types of models, be careful. This slide is to show how ... is the forecast. And similarly, I plot my ... apply these types of state space smoothing models to stationary time series?
Here I simulate a ... models to this time series, the best model turns out to be an ... rather different because it is a random walk model and the ... feature in this presentation, forecast on holdback. This feature allows you ... one is from another model. And then you can compare these ... to activate this feature. Then I need to specify the length of the holdback ... click Select Recommended, and check Constraint ... portion of the series, we listed the holdback ... by default, but you can always change the metrics you ... reports are similar to those from the analysis results without activating this ... let me summarize what we have learned from ... performance over the holdback data. But those criteria are ... process. We see it is rather different from how we use ... part of the model fitting process, so this is something ... holdback to evaluate different models based on their forecasting performance. So we ... column is the time series indicator, Y is the time series ... summarize the data set, either by time or by time series ... specification or change the model selection strategy, we ... change the selection in the first combo box to forecasting performance. Then we can choose forecasting performance ... we want to forecast. But you can change to any ... using the training time series, select the best ... series platform. First, analyze Box-Cox transformed time series. The second one is fit state ... as well, and using it as a model selection method. Thank you very much.
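The steps described in this talk can also be sketched by hand outside of JMP. The snippet below, which assumes a hypothetical CSV of the airline passenger series, applies a Box-Cox transformation with lambda = 0 (a log), fits the classic seasonal ARIMA(0,1,1)(0,1,1)12 without intercept, back-transforms the forecasts, and scores them on a one-year holdback; JMP 16 automates the transformation bookkeeping and the holdback comparison inside the platform.

# Hand-rolled sketch: log transform (Box-Cox, lambda = 0), seasonal ARIMA fit
# on a training window, inverse-transformed forecasts, accuracy on a holdback.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = pd.read_csv("air_passengers.csv")["passengers"]    # hypothetical file and column
log_y = np.log(y)                                      # Box-Cox with lambda = 0

holdback = 12                                          # hold back the last year
train, test = log_y[:-holdback], log_y[-holdback:]

# "Airline" model: ARIMA(0,1,1)x(0,1,1)12, no intercept
fit = SARIMAX(train, order=(0, 1, 1), seasonal_order=(0, 1, 1, 12), trend="n").fit(disp=False)

fc = np.exp(fit.forecast(steps=holdback))              # forecasts back on the original scale
actual = np.exp(test)
mape = np.mean(np.abs(fc - actual) / actual) * 100
print(f"holdback MAPE: {mape:.1f}%")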
Hadley Myers, JMP Systems Engineer, SAS Chris Gotwalt, JMP Director of Statistical Research and Development, SAS   The calculation of confidence intervals of parameter estimates should be an essential part of any statistical analysis. Failure to understand and consider “worst-case” situations necessarily leads to a failure to budget or plan for these situations, resulting in potentially catastrophic consequences. This is true for any industry but particularly for pharmaceutical and life sciences. Previous work has explored various methods for generating these intervals: Satterthwaite, Parametric Bootstrap and Bias-Corrected (Myers and Gotwalt, 2020 Munich), and Bias-Corrected and Accelerated (Myers and Gotwalt, 2020 Cary), which were all seen to have error rates that were too high for the small samples typical in DOE situations. Therefore, we make use of the new “Save Simulation Formula” feature in JMP Pro 16 in an add-in that improves upon these by allowing users to perform a “Bootstrap Calibration” on the Satterthwaite estimates. The add-in also includes the ability to do this for linear combinations of random components, taking advantage of another addition to JMP Pro 16. Further, we investigate a new version of the fractionally weighted bootstrap that respects the randomization restrictions of variance component models, as an alternative to the parametric bootstrap, using the new “MSA Designer” debuted at this conference.     Auto-generated transcript...   Transcript Hello, my name is Chris Gotwalt. My co-presenter Hadley Myers and I are presenting an add-in for obtaining improved confidence intervals on sums or linear combinations of variance components. This is part of a series of talks we have given as we work on improving and evaluating several approaches. Obtaining confidence intervals on sums of variance components is important in quality because it provides an uncertainty assessment on the repeatability plus the reproducibility of our measurement system. The problem is that when we ask for a 95% confidence interval, there are approximations involved and the actual interval coverage can be as low as 80%. In our previous studies, we found that two methods have improved coverage rates, parametric bootstrapping and Satterthwaite intervals, but it was still less than 95% in small samples. The earlier version of the add-in implemented the parametric bootstrap as a stopgap and Elizabeth Claassen implemented the Satterthwaite intervals in the Fit Mixed platform natively in JMP Pro 16. I want to stop here to give Elizabeth Claassen credit for making interval estimation of linear combinations of variance components so much easier and JMP Pro 16. She also greatly extended the Mixed Model output, which has made this presentation vastly easier. I'm also hoping that this presentation will serve as an inspiration to others to check these new Save commands out so they can get more from JMP Pro’s Mixed Model capabilities. Now we're going to combine the two approaches using a technique called Bootstrap Interval Calibration that was introduced by Loh in a 1991 Statistica Sinica article. Bootstrap Calibration is a very general procedure for improving the coverage of confidence intervals that can be applied to almost any parametric statistical model. I'm going to introduce the basic idea of Bootstrap Interval Calibration in the simplest terms that I can, and hand the mic over to Hadley, who's going to demo the add-in and discuss our simulation results. To make this simple, let's make it specific. 
Consider a very small nested Gauge R&R-type study where we want to estimate the total variation. We collect the data and run a nested variance components model with an Operator effect, a Part within Operator effect, and a residual effect. The software reports a Satterthwaite-based interval on the total. It's well known that this is an approximation that assumes a “large” amount of data is present in order for the actual coverage of the interval to be close to 95%. In small samples, the actual coverage, the probability that the interval procedure generates intervals that actually contain the true value of the estimated quantity, will tend to be less than 95%. Thing is the actual interval coverage is a complicated function of the design, true values of the functions, and a long list of other assumptions that are hard or impossible to verify. What we can do though is used the fitted model and their parameters to do a parametric bootstrap. When we do this, we know the true value of the quantity we are estimating because we were simulating using that value. We can do the simulation thousands of times. We apply the same model fitting process to all the simulated samples. We can collect the intervals from JMP and calculate how often they contain the generating value of the quantity that you were interested in. In this case we were interested in the sum of all the variance components, so the true value is 4.515. Suppose we took our original data set, took the estimates, use the Save Simulation Formula that is comes from Fit Mixed, and generated a large number of new data sets, and applied the same model fitting process that we applied here to each of them, and we collected up all of the confidence intervals that were reported around the total. After having done this, suppose that that...the estimated coverage, the estimated number of times that these intervals actually contained the truth, turned out to be 88%. So we wanted that 95% interval, but the Bootstrap procedure is telling us that the actual coverage is closer to 88%. Now we can play a little game and we can repeat the Parametric Bootstrap using a 99% interval this time. So we go through that process, we redo all the bootstrap intervals and when we did the 99% interval we get an actual coverage of approximately 98%. Now suppose we did this game over and over again until we found an alpha with actual coverage approximately 95%. So in this case, suppose we did that and we ended up with finding that 97.6% when we asked for a 97.6% interval, we actually got something like a 95% coverage. Then what we can do is set 1 minus alpha to 0.976 using the Fit Model launch dialogue, set alpha option and will get an interval that has been Bootstrap Calibrated to have approximate coverage 95%. This is still an approximation. There is still a simulation component to it, as well as a deeper underlying approximation that is extraordinarily hard to analyze, but it can be made easy to use, and this is where Hadley comes in. Now I'm going to hand it over to him and he will demo the add-in and go over the simulations that he did that show that we are able to get better coverage rates than before by applying Bootstrap Calibration to Satterthwaite intervals on linear combinations of variance components. Take it away Hadley. Thank you very much, Chris, and hello to everyone watching online wherever you are. 
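Here is a small, self-contained toy illustration of the calibration game Chris describes, written in Python. It deliberately uses a crude large-sample (Wald) interval for a normal variance in a small sample, whose actual coverage falls well short of 95%, rather than the mixed-model Satterthwaite interval the add-in calibrates; in the real procedure the fitted variance-component estimates play the role of the "true" values when simulating.

# Toy bootstrap-calibration example: estimate actual coverage of a nominal
# interval by simulation, then shrink alpha until actual coverage is ~95%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, true_var = 8, 4.0            # small sample; truth known here (fitted value in practice)

def wald_interval(x, alpha):
    s2 = x.var(ddof=1)
    se = s2 * np.sqrt(2.0 / (len(x) - 1))      # large-sample standard error of s^2
    z = stats.norm.ppf(1 - alpha / 2)
    return s2 - z * se, s2 + z * se

def coverage(alpha, n_sim=5000):
    hits = 0
    for _ in range(n_sim):
        x = rng.normal(0.0, np.sqrt(true_var), n)
        lo, hi = wald_interval(x, alpha)
        hits += (lo <= true_var <= hi)
    return hits / n_sim

# Play the game: the nominal 95% interval under-covers, so try smaller alphas
for alpha in (0.05, 0.02, 0.01, 0.005):
    print(f"nominal {1 - alpha:.3f} -> estimated actual coverage {coverage(alpha):.3f}")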
So I'm going to start out by showing you how the add-in works and how you can use it to calculate Bootstrap Calibrated confidence limits for random components in Mixed Models in JMP Pro 16. And from there we'll take a step back. We'll see how the add-in makes these calculations and I'll highlight some of the additions to Mixed Models in JMP Pro 16 that allow it to do that. From there, I'll show you the results of some simulation studies to give you an idea about how accurate this interval estimation method is, the Bootstrap Calibration method, and how it compares to some of the other methods for calculating confidence limits, as well as the situations where it's more or less accurate and some of the limitations and things you should be aware of if you're going to be applying it. We'll discuss possibilities for improvements in future work just briefly, and from there, I'll conclude by showing you the new MSA Designer, Measurement Systems Analysis Designer, available from the DOE menu in JMP Pro 16 so that you can quickly and easily design and analyze your own MSA Gauge R&R studies. So let's start out by looking at this data set. This is one that I pulled from the sample data files. I'm going to run this Fit Mixed script here that I've saved. So what we've got here are our random estimates, estimates for our random components. Now, it could be that you want to, for some reason, calculate an intermediate total, for example Operator and Part nested with Operator, or the three of these, you know, Operator and residual. Calculating those is very simple, we simply add these estimates, but what's not so simple is to determine their confidence limits. There's a new feature in Mixed Models that's been added in 16, the Linear Combination of Variance Components feature right here, and so what you can do is you can click that. You can choose the combination of variance components that you're interested in, and you can press Done. So now we have an estimate for those components as well as their confidence limits. So, what I'm going to do now is I'm going to take this one step further and I'm going to calculate the Bootstrap Calibrated Satterthwaite estimates, and I'm going to do that by going to my add-ins and clicking the Bootstrap Calibrated confidence intervals there. From here we can set the number of simulations. 2500 is the recommended and default number; it's also the default number in some of the other simulation platforms in JMP Pro. I'm going to choose this one. But one thing to note is that it takes some time to be able to do this, and so in the interest of time what I'm going to do is I'm going to stop it early. And here we have our calibrated intervals, calibrated upper and lower confidence limits added to the report. So let's take a step back and see what happened there. I'm going to go ahead and add this again. Now, one thing that the add-in does, as soon as you run it, is it adds this simulation formula to the data table, so you can see the simulation formula here. When the add-in is closed, the simulation formula disappears. The simulation formula takes advantage of another feature that's been added to the Mixed Models platform in JMP Pro, and that is the Save Simulation Formula feature here. So what this allows you to do is to save the simulation formula and then to use that, for example, to simulate these values here. So, we can swap out our "Y" with our new simulation formula, and go ahead and run that.
So when you run the add-in, this is all done in the background. But this is how the add-in goes about calculating these intervals. So I'm going to stop this early, once again in the interest of time. And now we see here the samples estimated for each. simulation. And so how the add-in works is it takes all of these. And it calculates new estimates for the upper and lower Satterthwaite intervals from this estimate and this standard error, swapping out different values for alpha. So what we're aiming for 0.05, right? So that we get 95% upper and lower limits, and what it does is it finds an alpha value that results in 95% coverage, that is 95% hits and 5% misses, swaps that in, that's how you get your calibrated intervals. So I hope you enjoyed seeing that. I hope you find it useful. We've done some simulation studies and what we found out is that the intervals, which you can see here for four operators and 12 days as our random components, we've achieved misses of about 7%, so a 92.8 hit ratio. Now this is better than all of the others, including this, so the linear combination, which is simply the standard Satterthwaite interval calculated on the combination of linear components, as well as the Bootstrap quantiles, the bias-corrected intervals in the bias-corrected and accelerated intervals, but as you'll see these intervals improve, all of them, as you increase your number of Operators from 4 to 8 and the number of Days from 12 to 24. So increasing the levels of these random components result in much better, much more accurate estimates for the confidence limits, and so much so that we now have a method here that is equivalent, just, to an alpha value of .05. So. this improvement in performance of course, comes at a cost, and one of those costs is the length of the intervals. And so you can see here, that with our Bootstrap calibrated, well with all of our intervals in fact, that when we have increasing number of Operators, that the length of the interval is much more bundled closer to 0 than it is when you've got smaller number of Operators. You can see that this tails out much further, so that's this blue area here. That's true for all of them, but it's especially true for the Bootstrap Calibrated interval. You can see this long tail here. On average, you're going to get longer lengths using this method, but you have a more accurate method. Exploring that a little bit deeper, you can see here that this increase in length is true for four Operators, as well as eight Operators, and it is significant. Statistically significant. The other thing that I looked at, is the effect of adding repetitions, so the difference between two repetitions and five repetitions, and what you'll see here is that there really is no difference. So looking across the different sets of combinations from four Operators and two reps to four Operators and five reps, about 6 measurements total versus 3 measurements, we really don't gain anything. All of these are equivalent to each other. So that's something to be aware of, that you see improvements in accuracy when increasing the number of Operators, and you don't see improvements when increasing the number of repetitions. One thing that I'd like to mention as a possibility to improve upon these results is the Fractional Random Weight Bootstrap, which we would have liked to have been able to implement for this in time for this conference. 
We weren't able to do that, to take this and to apply it to random variance components, and so we hope to be able to do that in future work and perhaps even see an improvement upon the Bootstrap Calibrated interval. And then the other thing that I'd like to highlight before I go is the new MSA designer that's been added to JMP 16, and so from here what we can do is we can very quickly create our own design in order to be able to perform our own MSA or Gauge R&R analysis. And so let's see, I'll do this with three Operators and Five parts. I'll label these A, B and C. And we'll do one repetition of each. So that's two measurements total. So here we've got a table with our design. What I can do is I can press this button to very quickly send that to the different operators, have them fill out their parts, send that back to me. And then I can add those results together. So I'll just sort this because I've got another table over here where I've done this ahead of time. So I'll just add these values over there. And now from the scripts within the table we can quickly and easily do our own Measurement Systems Analysis and Gauge R&R. So I hope you found this useful. I hope you continue to enjoy the talks at this conference. Thank you very much for listening.  
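For readers without JMP Pro 16 handy, a crossed worksheet like the one the MSA Designer builds can be sketched in a few lines; the operator and part labels, the repetition count, and the randomization seed below are placeholders.

# Build a crossed MSA / Gauge R&R worksheet: every operator measures every
# part, with a fixed number of repetitions, in randomized run order.
import itertools
import pandas as pd

operators = ["A", "B", "C"]
parts = [f"P{i}" for i in range(1, 6)]
reps = 2                                     # two measurements per operator/part cell

runs = [{"Operator": o, "Part": p, "Rep": r + 1}
        for o, p, r in itertools.product(operators, parts, range(reps))]
design = (pd.DataFrame(runs)
            .sample(frac=1, random_state=1)  # randomize the run order
            .reset_index(drop=True))
design["Measurement"] = None                 # to be filled in by each operator
print(design.head())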
Damien Perret, PhD, R&D Scientist, CEA François Bergeret, PhD, Ippon Carole Soual, MS, Ippon Muriel Neyret, PhD, R&D Scientist, CEA   JMP software was implemented at CEA in 2010 by R&D teams who develop nuclear glass formulation. A first communication occurred at Discovery Summit 2011 in Denver, when we explained how we use JMP statistical analysis platforms to compare glass composition domains with a high degree of complexity. Then, many improvements were made by developers to provide JMP with powerful methods for generating mixture DOEs, in order to investigate highly constrained experimental domains. During Discovery Summit 2014 in Cary, we showed how all these efforts enabled us to build even more accurate property-to-composition predictive models. A very innovative methodology was recently developed by glass formulation scientists at CEA in collaboration with Ippon statisticians to predict the glass viscosity. Our approach is based on an automatic and intelligent subsampling of the data, and combines techniques of optimal designs and several predictive methods in JMP and JMP Pro. Predictions appear to be very accurate, compared to those obtained from other statistical models published in the literature.     Auto-generated transcript...   Speaker Transcript Damien PERRET Hello, welcome, and thank you for watching this presentation for the Europe Discovery Summit conference online. My name is Damien Perret. I am an R&D scientist at CEA in France, and I am joined by my colleague and friend François Bergeret, statistician and founder of Ippon Innovation in France. So with François, we are very happy to be here today, and we would like to thank the Steering Committee who gave us the opportunity to present this work, which is about advanced statistical methods applied to glass viscosity prediction with JMP. So let's start with a few words about the French Alternative Energies and Atomic Energy Commission. CEA is a French government organization for research, development and innovation in four areas: defense and security, low-carbon energies, technological research and fundamental research. CEA counts about 20,000 people across nine locations. We have strong relationships with universities through various joint research units, a high number of patents and start-up creations, and a budget of around 5 billion euros. François Bergeret Ippon Innovation works in statistics and data science, including studies, consulting and training. We are very proud to have been general partners for several years. I'm also very happy to present with my friend Damien today. Ippon also proposes advanced solutions for zero defect and process control. I have personally been a JMP user since 1995, with JMP 3. Damien PERRET So our main objective in this work is to create statistical models to predict glass properties, and for this talk today, we focus on the glass viscosity. To do that, experimental data come from both a commercial database and from our own database at CEA. We wanted the algorithms to be coded in JSL and implemented in JMP Pro 15. The response of the model is the glass property of interest, so viscosity for this example, and the factors are the contents of the different glass components. So, here is some background information. Glass is a non-crystalline solid.
It is obtained by a rapid quench of a glass melt, and from a material point of view, a glass is a blend, a mixture of different oxides. The number of oxides is variable, from two or three in a very simple glass to about 30 and even more in the most complex compositions. There is a long tradition in the calculation of glass properties, and the first models were created in Germany at the end of the 19th century. Since then, the amount of published literature in the field of glass property prediction has tremendously increased, so that today we have a huge amount of glass data available in commercial databases, which are also used to predict glass properties. But despite all the efforts that have been made in the past to predict glass properties, challenges remain for the prediction of glass viscosity. This is because glass viscosity is a property that is difficult to predict. First, viscosity is very dependent on physical mechanisms that can occur in the glass melt, depending on the glass composition, like phase separation or crystallization, for example. Also, viscosity is the only property having such a huge range of variation, spanning several orders of magnitude. So here is an example that shows this difficulty. We selected three compositions of SBN glass, which is a very simple glass with only three oxides. And we applied the best-known models from the literature to calculate the viscosity. Then we compared the predicted values with the experimental values we measured with our own device. So you can see that even for a very simple glass, it is not easy to obtain one reliable value for the predicted viscosity. So here is a picture we like to use to give a view of the database, where each dot is one glass in a multidimensional view of the domain of compositions. Data may come from isolated studies, from studies using experimental designs, or from studies where one component is varied at a time. We spent a lot of time in the past applying different machine learning techniques to the data found in the entire database. A classical approach was used with a validation set, but in the end, no statistical model with an acceptable predictive capability was found for the viscosity. So we decided to use a different approach. Instead of using all the data, we think it is better to create a model using data close to the composition where we want to predict the viscosity. So, for example, if we want to predict here on the red dots, one model will be created from the data we have in this area, and a different model will be created if we want to predict the property at another composition. That's why we say that this technique is dynamic: the model depends on the composition, and it is built and fitted where we want to predict. And we say it's automatic because we don't have to do this manually; every step is done by algorithms implemented in the tool. One of the most important points is certainly the determination of the optimal subset of data to create the model. For that we have implemented two methods of subsampling. In the first method, a theoretical or virtual design of experiments is generated around the composition of interest, and then each run of the design is replaced by the most similar experimental data point present in the database, leading to the final training set.
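A minimal sketch of the first subsampling method, as I read the description above, might look like the following: generate a small virtual design around the target composition, replace each virtual run with its nearest neighbor in the measured database, and fit a local model on that training set. The file name, oxide columns, perturbation size, and the plain linear model are placeholder assumptions; a real implementation would use a proper mixture design that respects the constraint that the oxide fractions sum to one, and the models described in the talk.

# Method 1 sketch: virtual design around the target composition, replaced by
# nearest real glasses, then a local model fitted on that training set.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

db = pd.read_csv("glass_database.csv")        # hypothetical database of measured glasses
oxides = ["SiO2", "B2O3", "Na2O"]             # hypothetical composition columns
target = np.array([0.60, 0.15, 0.25])         # composition where we want a prediction

# 1) virtual design: small perturbations around the target composition
rng = np.random.default_rng(0)
virtual = target + rng.uniform(-0.03, 0.03, size=(20, len(oxides)))

# 2) replace each virtual run by the closest real glass in the database
X_db = db[oxides].to_numpy()
nearest = {int(np.argmin(np.linalg.norm(X_db - v, axis=1))) for v in virtual}
train = db.iloc[sorted(nearest)]

# 3) fit a local model on the training set and predict at the target composition
model = LinearRegression().fit(train[oxides], np.log10(train["viscosity"]))
print(model.predict(pd.DataFrame([target], columns=oxides))[0])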
And the second method we have implemented in the tool is based on data sets of different sizes created around the composition of interest. A small data set is generated by the tool, and models are created on this small subset to predict the viscosity. Then bigger and bigger data sets are generated, and the optimal size is evaluated by statistical criteria associated with each subset. François Bergeret Glass viscosity is not easy to predict, so we decided to use different statistical and machine learning methods: polynomial models with transformation; generalized regression using a lognormal distribution, which is very powerful in JMP Pro and can give better results than the polynomial models with transformation; and neural networks, which are very powerful in terms of prediction. As we have two data sets, as mentioned earlier, we have six predictions for each response. The next slide is a schematic view of the tool. Inputs are the composition of the glass and the temperature at which the viscosity has to be predicted. If we look here, the code and the algorithms have been implemented for the two methods we described just before. The strength of the tool is that, instead of getting only one prediction, six values are calculated, with statistical criteria associated with all of them that can be evaluated by the user. Damien PERRET So, here are some of the key parameters. It is very important to take into account as many inputs from the glass experts as possible. For example, we had to create specific algorithms to deal with the nature and the role of the oxides on viscosity. Another point of major importance is related to the origin, and therefore the reliability, of the data. For this, a significant amount of time in this project was spent on the constitution of a reliable database. We also had to implement weights, and we had to study different ways of calculating the distances between the glass compositions. So now it's time... François Bergeret Okay, I'm going to share my screen now to show you a demo of the code, so you should see my screen now. I'm just executing the code; it's a complex JSL program that we have been developing over several months with CEA. So I just executed the code and now I'm going to show it to you. Discovery...so here I'm opening the files for the code. And it's running, okay. The code is executing, so I will comment a little bit. We have several loops in this code. Of course, the first step is to identify the data and the functions. After that we have a loop: first of all, we have what we call the adaptive iteration and the reading of the database. Because it is adaptive, as mentioned by Damien, we are looking for the best subset of data. And we also have here the design of experiments approach where, by optimizing the design, we are getting the right data. After that, and it is running actually, we are predicting the glass transition temperature. Okay, and we have, as I mentioned, three models and, for each model, two databases, so we have a total of six predictions. So, as you see, it takes a little while to execute, something like one minute, and when it is done, we will have all the output of the program for the glass transition temperature and, of course, the viscosity. Using JSL has been very, very useful and, in addition, in terms of users, as you will see with Damien, it's very easy for the experts to use. So Damien, you can talk and stop my sharing. Damien PERRET Okay, so. Just one...
Can you see my screen now? Yes, okay. So this is the general statistical report created by the tool. First, we have the composition of interest of the glass. And then we have, on this graph, the predicted values. On the Y axis, we have the predicted values of the viscosity calculated by the three algorithms and for the two methods, and on the X axis, we have the number of the enlargement for the second method. In red, we have the median of the predictions, which can be sufficient for a non-statistician user, but if we want to investigate the statistical details, we have a lot of information in this report to study the quality of each model. For example, we can check the values of the PRESS for the different models. Here it is for the multilinear regression BIC F model. Here we see that the PRESS values tell us that the prediction with method number one is a little bit better than with the second method, and we also see how the model behaves as the training set is enlarged. We also have different statistical values. For example, we have the R squared value for the different algorithms and for the different models. And here we have even more details on the models. For example, for the first method we can compare the theoretical and the actual design of experiments. We have the prediction formula for the different models. Also, we have some information on the estimates, and we have a lot of information for the second method as well. So at the end, we have a lot of statistical details and information that are very, very useful to the user. And here, at the end, we have the compositions of the most similar glasses in the database, for which we have an experimental value of the viscosity, so this is very, very useful also. Okay, so let's go back to the PowerPoint. So these are the results we obtained. The tool's predictive capability was evaluated by extracting 230 rows from the global database. In this table, we have the relative error of the viscosity prediction for different types of glass and for the global subset of data. Three quantiles are given. The median means that 50% of the predicted values have a relative error that is below the value indicated in the table, and we also have the 75% and 90% quantiles in this table. When we talk about glass viscosity, traditionally we consider that a prediction error around 70% is very good. So we can see that for the majority of the data the model capability is fine, and we were very happy with the results we obtained. As a comparison, here are again the results we obtained for the very simple SBN glass, with only three oxides, when we applied the models available in the literature. We can see that the values for the relative error of prediction were much higher and could vary a lot from one model to another. And again, this was for a very simple glass with only three oxides. In some cases we have errors that are more important, but if we look at the data in detail, here on this graph, with the predicted values on the x axis and the experimental values on the y axis, we see that the biggest errors of prediction are obtained for these two glasses coming from the commercial database SciGlass, from the same reference, which is a patent and for which the experimental error of the equipment is not mentioned.
And also, for all these compositions that have a high aluminum content, we think that crystallization is very likely to occur, and then we can't be totally sure that the experimental values were correct. Finally, we applied our methodology to predict another glass property, the glass transition temperature, which is an important property in glass technology. Here are the results we obtained, which are even better than for the viscosity. Here the overall relative error of prediction is below 5%, which is really good, because we know that this property can vary a lot, depending on the thermal history of the glass and depending on the experimental device. So here the tool's capabilities are very close to the experimental error, which is very nice. François Bergeret Okay, as a conclusion, one important feature of our approach is the dynamic subsampling of the global database: we extract the right information around the composition of interest. In addition, using JSL and JMP Pro, we have automated the machine learning models; generalized regression and neural networks perform very well. According to the CEA experts, accuracy is good, and the approach reveals some unexpected issues. We now plan to extend the models to a bigger database and also to work with Bradley Jones and maybe write a joint publication. Thank you.
Beatrice Blum, Senior Statistician, Procter & Gamble Service GmbH Phil Bowtell, Principal Statistician, Data and Modeling Sciences   With sensors now being economically available, P&G massively expands its use of sensors to develop new and better test methods. Sensors deliver discrete measures over a continuum like time or location often resulting in smooth curves. However, the metrics that we extract from these sensor data are blunt summary statistics like averages, sums and integrals. Those are believed to represent different consumer-relevant product features, but we struggle to establish robust mathematical links. Using historic approaches, a lot of information about the product performance that we measure along the way are not leveraged. We propose to apply Functional Data Analysis (FDA), a mathematical approach to spline fit any type of curves, to extract discriminating curve characteristics representing product features. Using case studies from Baby Care, we show how to turn sensor data into meaningful information. In addition, we compare FDA with PLS in SIMCA to understand when to use each method. We envision that matching these fits with consumer data will enable creation of a product portfolio landscape, empowering us to understand what optimal product performance, the so-called Golden Curve, looks like. Eventually, our goal is to design diapers, pads, razors and more against identified consumer-relevant Golden Curves by optimizing product composition.     Auto-generated transcript...   Speaker Transcript Beatrice Blum Hello, and thanks for joining Phil's and my Discovery presentation today with a glimpse of my fabulous 2021 lockdown hair style.   We will be talking about how we approached some sensor data, why the use of functional data analysis (FDA) and partial least squares   (PLS) in our pursuit to catch the golden curve. My name is Beatrice Blum and I'm a statistician in the data and modeling sciences department of Procter and Gamble, supporting baby and fem care R&D in Germany. My co author is Phil Bowtell from the UK. Phil, do you want to introduce yourself? Phil Bowtell Thank you very much, Bea. Hello, my name is Phil. I'm based in the UK and like Bea, I'm a statistician as part of the data and modeling sciences group. And I support a variety of technical sciences in Europe, including baby care with Bea. Thank you. Beatrice Blum So what we want to cover today is a quick introduction to the data that we have collected and how we're trying to figure out the meaning of the different curve shapes with respect to our consumer responses, Yield 1 and Yield 2.   We will pay particular attention to comparing two analysis approaches to these data (PLS and FDA) and try to understand when to use which.   Note that we assume some knowledge of PCA, PLS and FDA for this talk, but what you really only need to know is the general concepts and data...how data is organized.   But it's very likely that you will still be able to follow the course of this talk, even if you're not familiar with it.   So you may be aware that Procter and Gamble is developing and manufacturing diapers. To improve these diapers and their product performance in the eye of the consumer,   we try to capture and understand the important features of a diaper. Particular in some of our test methods, we apply fluid to the diaper in different locations and under different   protocols or conditions and measure K data curves, as seen here on the left, and P data curves, as seen on the right.   
We assume that these K data curves are somewhat linked to our consumer response called Yield 1, and that these P data curves are related to our consumer response, Yield 2.   So let's first look into the K data curves and analyze these or try to fit these with the help of Functional Data Explorer in JMP. With that, I'll switch to JMP.   So here is my data table in JMP.   It's a very limited data table in terms of columns. We have one column for the 10 products that we have been investigating, A to K.   For each of those products we have run three replicates in our method. I combined the two columns into an ID column and it's consisting of the sample name and the replicate number.   We collect the data, over time, continuous variable and our raw signal is called K raw.   So let's have a look what the K raw is looking like.   I just picked two products, in this case G and I, because their profiles seem to be quite different. What you can see, we have pieces of where the curve steps up jumps up and then it flattens down in a quite...quite smooth behavior.   The stepping up is no big issue for PLS, which was created in terms of trying to model spectral data, while for an FDA (functional data analysis)   that would expect smooth curves and also smooth derivatives, which are probably not given if the curve is just jumping up.   These jump ups are related to a sauce(?) that we apply to the diaper. Can also see the three replicates, so the method seems to be nicely reproducible.   Quite nice. However, we see a lot of noise in our raw data. It sinks again, oscillating down here and we assume that we will model a lot of noise and overfit   just because we have so much oscillation over here. So what we found it indicated to smoothen the curves prior to fitting, and that's what we see down here. We have smoothened the curves by the use of moving average super sample(?) with a window size of 20.   With that I'll go over and try to fit   functional   components to this. So I put my variables into the corresponding roles. Instead of the raw data, I use the smoothened K. I need my ID variable and my ID function. My X is the time over which we measure our K, and eventually what we want to achieve as linking these data to our Yield 1   continuous variable, and try to understand how our predicted curves are related to this consumer response. So I put this into the supplementary role.   Run it, get the original output from the FDE, and usually I just start by fitting B-splines because that's nice and easy and relatively fast.   Can see that this is only taking a couple of seconds, despite us having quite a couple of thousand lines. So we get a result. It doesn't really look bad...that bad from afar, however let's drill in a little.   As already mentioned, when talking about what functional data analysis was developed for, it is expecting smooth curves.   And the B-splines do actually just stitch together in this particular case cubic pieces of splines   and to get around corners like here, where there is no cubic...certainly no cubic curve but a real change in behavior and a turning point,   it has to go around and try to somewhat capture that behavior, but you can see that it's doing a really poor job. It's also not doing a good job in trying to represent these plateaus that we observed at the top.   So, despite being very fast, simple and, in most cases, a really good approach in this particular   context, it's probably not the best to go for B-splines. 
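The preprocessing just described can be reproduced in a few lines outside of JMP; the sketch below simulates a step-and-decay signal in place of a real K curve, smooths it with a moving average of window 20 as in the talk, and fits a single cubic smoothing spline, which, as noted above, struggles around the sharp step changes.

# Moving-average smoothing followed by a cubic smoothing spline fit.
import numpy as np
import pandas as pd
from scipy.interpolate import UnivariateSpline

t = np.linspace(0, 100, 2000)
rng = np.random.default_rng(3)
signal = np.where(t > 20, 5.0, 0.0) + np.where(t > 60, 4.0, 0.0)   # step up at each fluid application
signal = signal * np.exp(-0.01 * np.maximum(t - 20, 0))            # slow decay after the steps
raw = signal + rng.normal(0, 0.3, t.size)                          # measurement noise

smooth = pd.Series(raw).rolling(window=20, center=True, min_periods=1).mean()

# One cubic smoothing spline over the whole curve; s trades off fit vs. wiggliness,
# and a single smooth spline visibly rounds off the step changes.
spline = UnivariateSpline(t, smooth, k=3, s=len(t) * 0.05)
fitted = spline(t)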
Instead of that,   I read a little bit in Help and did a bit of research and found Okay, we should use P-splines if we have profile data, if we have something like spectral data. That's what JMP recommends, so we went for the P-splines.   And because I really saw that we have step changes here, the only way to attract...attract this is by using the step functions.   And since the P-splines take much longer time to fit, I prefit those and we will just have a look at what the results look like.   So, again we look at the actual and predicted lots here and see, oh they're doing a much better job in getting these turning...turning points and in   achieving the step upwards. They also somewhat captured the plateaus round here, but it's already assumed previously yeah, they also still do   capture quite a little bit of the noise. So this is not really smooth. It would be actually super nice to have maybe a B-spline fit this   degradation type style of downwards hill. Maybe just step P-spline in this area where it's really needed, but at least this is a lot better than our B-spline fits. So let's see what's happening.   So it did quite a good job in putting our different products on it, two dimensional score plot. We can clearly see how the replicates of different products group to each other.   There are the A's; we have the H's and so on. Some of them are not so good, so the I's are a little bit further distributed and overlapping with others.   However, we can see that the good reproducibility that we saw in the raw data seems to be playing out well here. We decided to go for four FPCs,   as seen here. And we can see that they're quite nicely predicting our curves.   But, eventually, our goal is obviously to see how our consumer response relate to different curves. So what JMP is doing here in the background, in the generalized regression, is fitting each of those four FPCs by the use of Yield 1.   And with the results from that, we can now see how changes in Yield 1 changes the shape of the curves; could clearly see an upward strand.   So   seems relatively easy to capture what's going on, so this is a really bad product, so it's certainly much up, lot of plateaued up here,   and this seems to be a lot more down. So we have found something where we think that may be close to a golden curve for our Yield 1.   However, when we are looking at the data that we really collected from the consumers, and now not on a continuous K, but we just put them in order on a categorical scale.   We have to see that these four products that came out almost identical from 0.31...   0.031 to 0.034. If we look at the curve shapes and how they change on the left, we will see they're quite different.   So it's not entirely in line; you could even say it's not at all in line with what we've seen in fitting the continuous Yield 1. So through the very one down here, and this would...they look so similar despite having quite a big difference already in consumer response.   And again, this one is also not so different with respect to what we've seen from the continuous one.   With that I return to my slide deck   So back to my slide deck. Here we can see how we fit the data that we extracted from the FDA fits to our Yield 1.   And you can actually see that this is a very, very good model. It's so good that we always had to doubt that this will hold true on the new data. We did the fit by use of auto validation and model averaging as promoted by Phil Ramsey and Tiffany Rao in the Discoveries America 2020.   
The R square with 97 and then R...Press R square cross validation R square of 90, it's just too good for us to believe it's true.   With that let's look at what Phil found when looking at the same data with PLS. Phil Bowtell So, as you say, we have this R squared of 97% with the press R squared of 90. All looks very nice. Let's just see how partial least squares compares with this.   So I've been looking at principal components analysis, partial least squares, which is a tool that we use when we have spectra or curves. It's commonly used because all our inputs are going to be highly correlated and   traditional regression techniques don't deal with that so well. And the first thing we noted was that, when we looked at the score plot that Bea had in the previous slide   on the demo, it looks almost exactly the same as you get in principal components analysis, so that's where we see some common links.   When I run the partial least squares, what I see is I get an R squared of 73%,   not quite as good as 97. And also, if you look at the observed against the predicted,   we do actually see what looks to be an okay fit, but then obviously Product B is having a bit of an impact and undue influence.   And in JMP and in SIMCA we've got cross validation measure Q squared, which is low at 33%. So this isn't really a good model.   This was done on the raw smooth data that we had. There are other transformations you can try, but really we weren't able to build a good model. It's certainly nothing that competes with the FDA.   However, one thing we do get from the model are coefficient estimates. We also get this quantity called VIP, and these in tandem give us an idea of which particular regions of the curves excite or tell us what's going on with the predictions. So if I just overlay   here the VIPs and the coefficients on the raw data plot, the green highlights areas where this is really having a big impact on the predictions, what's contributing towards the model.   The orange is medium, not so much. And the gray is low, and this is actually telling us that the first peak is really not having much of an impact whatsoever from prediction point of view.   So moving on, I'm looking at another set of data. This is the p data curves, and here we have these curves that have been collected.   Four conditions, maybe call them protocols or conditions, at three locations. We also have a fifth protocol or a fifth condition, but this is only taken at Location 1 and that's not plotted here.   And what we have is Location 1 on the left, Location 3 on the right and Condition 1 on top going down to Condition 4 at the bottom.   And one thing to note that these curves are quite similar. We do see some slight deviations.   But one question that was asked is, well, do we need all of these curves? Are they all needed? Or maybe we take a subset and use those to help us understand the data. So what I've done is taken all the products and sequentially plotted them. So I've got Location 1,   Condition 1, all the way up to Location 3, Condition 4 and plotted the data.   And we can see straight away that there are some common trends; we can also see some differences. So we all we always...we see that there are three products here that seemed to lie away from the others.   So we've got some product differentiation. If I look at the different conditions, I can see that, obviously, these products here are certainly changing as we move our change our conditions.   As I look at location, it doesn't seem to be a huge impact. 
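For those who want to try the same kind of analysis outside of SIMCA or JMP, here is a compact sketch of PLS on curve data with VIP scores computed from the fitted model. The X matrix and response are simulated placeholders standing in for the smoothed curves and Yield 1, and the VIP formula used is the standard textbook one, not necessarily the exact computation in SIMCA.

# PLS regression on curves plus Variable Importance in Projection (VIP) scores.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls):
    """Standard VIP computation from a fitted PLSRegression model."""
    t, w, q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
    p, a = w.shape
    ss = np.array([(t[:, i] ** 2).sum() * (q[0, i] ** 2) for i in range(a)])
    w_norm = w / np.linalg.norm(w, axis=0)
    return np.sqrt(p * (w_norm ** 2 @ ss) / ss.sum())

# X: products-by-timepoints matrix of curves, y: consumer response (simulated placeholders)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 400))
y = X[:, 100:150].mean(axis=1) + rng.normal(scale=0.1, size=30)

pls = PLSRegression(n_components=2).fit(X, y)
vip = vip_scores(pls)
important = np.where(vip > 1.0)[0]     # common rule of thumb: VIP > 1 is "contributing"
print("time points driving the prediction:", important[:10])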
Let's see if we can look into this in a little bit more detail.   So with any multivariate data, normally, the first thing you would do is just literally throw it into principal components analysis and see if anything comes out from that. So it's an exploratory data analysis tool.   And if I look at the score plots, I've taken all the data you've just seen, put it into the package, it's come up principal components, color coded by product.   And we can see straight away that products D, G and H seem to lie away from the rest of the products. We've got three products here, seven products over here.   And when we talked to the people that develop these products and make these products, it makes perfect sense. So it's good that the data are actually highlighting something we would expect to see.   I then highlight by the different locations, and I'm not really seeing a pattern here. I think you'd have to be quite adventurous to say there's something going on there.   However, when I color by the different conditions, I do see some pattern emerging. And if I look at the three products that we have here, D, G and H,   I can see, as you go from right to left, we're seeing a shift from Condition 1 to Condition 4, and likewise for the seven products here. Condition 5 is sat in the middle, and again, that's something we would expect because it's actually different measuring device.   So, from an exploratory point of view, we can see these differences. Let's see if we could look this from a more statistical points of view. And for the example I'm going to look at, I'm just going to focus on Location 3 and looking at the four conditions within Location 3.   To do this, I'm going to be using multiblock orthogonal component analysis, which is a bit of a mouthful, so it's just reduced to MOCA.   And, I'm also going to be looking at hierarchical modeling but I'm not going to be doing that...discussing that too much in the context of this talk, I'm going to be focused...focusing on the MOCA and these are two techniques that we find in the stats package SIMCA.   Now the idea here is that we look at blocks of data, and traditionally each block represents a different way of measuring some kind of chemical or some kind of product. So as an example we've got near infrared, infrared and raman spectroscopy.   And the two things that we aim to do with MOCA and with the hierarchical modeling is first of all, assess redundancy.   It might be that I just need near infrared and raman spectroscopy for the prediction and, in fact, if I know near infrared and I know the raman, I can actually predict the infrared or it's redundant.   So that's what going to be looking at, but one thing to note is when I talk about redundancy it doesn't automatically mean I can throw that particular block out.   Because on its own right, it may add to the prediction. So it's a balancing act between redundancy and predictability. As we've seen already, if I look at these charts here, it looks like there may be some redundancy.   So what's actually going on? Well let's think about this in terms of overlap. I've got Location 3 and I've got my four conditions. And to express overlap I've got a wonderful venn diagram and we can really start trying to understand what kind of information we've got.   The first point of information is globally joint information, which is where we have information that's common to all four conditions.   
Then I can have information that's common to only two or three of the conditions, the locally joint information highlighted in orange here. And finally, whatever's left over is the unique information; this is what the different conditions bring to the party in their own right. So what might we expect to see here? Well, the left-hand side would indicate a situation where the four conditions have quite large amounts of independent information, and we would probably think, no, there's not going to be any redundancy here; we have to keep all four. The image on the right, where we have a large amount of globally joint and locally joint information and not so much unique information, may indicate a situation where we do have redundancy, and it might be that we get away with looking at one or two of the conditions rather than all four. In terms of what SIMCA does, it does some modeling: we've got our four conditions and it has fitted four components, two of which are joint components and two of which are unique components. Overall, we explain a lot of the variability in the data; we can see from the numbers at the top that we're explaining nearly 100% of the variability, which is a good thing. The green bars show the globally joint information, and we can see these are really quite large; we've got a high number. That's telling us there's a large amount of overlap between these four different conditions. If I look at the locally joint information in orange, there is some, just between Conditions 2, 3 and 4, but it's not huge. And finally, we can look at the unique contribution that each of the conditions makes, and that's quite small. So here we are pretty much certain there's going to be some redundancy. We can also investigate where the uniqueness comes from in terms of products; the size of the bubble here indicates whether it's unique or not. If we have no uniqueness, we get very small bubbles, so, for example, Products I and F are very small. If we looked at these individually, we would expect to see mostly green bars here. If we look at Product H, we see a little bit more independence; it's telling us there's a little bit of independent information across the conditions. It's not a big bubble though, and in reality it looks like there's a lot of redundancy. So what was then done is we took all 13 possible condition and location combinations, ran the MOCA analysis and also the hierarchical model, and it's telling us that, ideally, we need the first condition at Locations 1 and 3, and the fifth condition at Location 1, which goes somewhat against what we've been saying. We were saying earlier on that the conditions differ a bit and the locations don't differ at all. Well, not quite; there's obviously something going on in terms of predictions. But we have been able to go from 13 different combinations down to four, with a fairly good model, an R squared of 85.5 and a Q squared of 69.3. The cross-validation measure isn't bad; it would be nice if these two were closer, but at least we are modeling our data here, in this case Yield 2, better than we have been able to in the past using this P data. So with that, I shall pass back to Bea. Beatrice Blum Thank you, Phil, for these super interesting insights. Now let's look at what FDA does with the P data and what we can get from that.
Similarly to what Phil showed us, we can see that location as a factor is not having as much of an impact as the four different conditions; the conditions vary the results much more. It is also quite interesting that when we look at the two best performing products over here, with a 74 and a 72, their corresponding prediction curves look quite different, perhaps even more different than what we saw in the K curves. So if we are going for the golden curve and trying to figure out what the best performing profiles really look like, this gives us a hard time, because it's quite difficult to find commonalities between these two curves. The question is really, are there several golden curves, or is an average of the curves that are already performing well what we have to go after? We can't answer that yet. We also can't answer from this whether there is any redundancy, or which location-condition combinations we really need to measure. However, when I again extract the summaries from the FDA and try to build a model to predict Yield 2, I again get a super good model, with an R square of 98 and a press R square of 97. Yes, we already know that this is too good to be true. On the other side, we did try modeling the same Yield 2 with other extracted values from our curves, as they were provided to us by the measurement department. Trying to model those extracted values to predict Yield 2 did not work out at all; the R squares we could achieve were not even close to the ones we get here. So we do think something is going on, and we have made quite a big step forward in understanding how we can model Yield 2. So let's wrap up what we found. We've seen that both methods result in very similar principal components and that they agree in terms of what commonalities they extract from the K and the P curves. FDA is probably a little easier to use, with little data prep. It allows us to predict curve shapes from Yield 1 and Yield 2, and that gives us some idea of what our golden curves may look like. It also indicates which practical measurement factors seem more important to keep. We are able to get really good models for our yields, but we profoundly question these models. At the moment we couldn't extract which location-condition combinations are most relevant to keep; that's something we would like to follow up on with JMP. PLS, on the other side, does give us very useful information about which sections of the curves are most informative in terms of discriminating our products. MOCA and hierarchical PLS additionally point us to the measurement protocols that we need to keep for capturing the most relevant information from our P curves. The PLS models to predict yield appear somewhat more reasonable in terms of goodness-of-fit metrics than the FDA models did. Our combined efforts helped us find clear patterns to differentiate our products. Both PLS and FDA enable us to extract essential features from the traces and to calculate good prediction models for our yields. We also learned which aspects of the curves and which protocols are most meaningful. We managed to model and predict our yields much better than we could in the past, and that's a huge step forward. This is a very good example where too many cooks did not spoil the broth.
Both methods agree up to a certain point, but where they differ, each also provides different additional information. The work is ongoing; we will expand from here. In particular, we will add new independent data to validate or improve the models. We still need to fine-tune which protocols we need to keep to give us the most relevant and least redundant information; that's something where we hope JMP will help by enabling that in the FDE platform. Our final goal is to understand which material compositions of diapers result in which curve shapes, and how those curve shapes relate to consumer yield. In our pursuit of the golden curve, we have made good progress and are excited to eventually fully capture it. With that, we thank you for your attention and are open to questions now.
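For readers who want to try the same kinds of fits on their own curve data, the two JMP platforms discussed above can also be launched from JSL. The sketch below is a minimal illustration, not the presenters' script: the column names (w1, w2, w3 standing in for the many curve columns, and Yield 2 for the response) are hypothetical stand-ins, and the launch roles shown should be verified against a script captured from your own point-and-click run.

Names Default To Here( 1 );
dt = Current Data Table();  // assumes the curve summary table is active

// Exploratory PCA on the (highly correlated) curve columns
Principal Components( Y( :w1, :w2, :w3 ) );

// PLS of the yield response on the same inputs; the platform reports
// R square, cross-validation statistics, and VIPs for each input
Partial Least Squares( Y( :Name( "Yield 2" ) ), X( :w1, :w2, :w3 ) );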
André Caron Zanezi, Six Sigma Black Belt, WEG Electric Equipment Danilo da Silva Toniato, Quality Engineer, WEG Electric Equipment   Quality assurance and customer needs frequently translate into rigorous reliability requirements. Improving product reliability challenges engineers in multiple ways, including understanding cause and effect relationships and developing tests that reproduce customer conditions and generate reliable data without exceeding the product launch deadline. Combining engineering expertise, historical data and lab resources, a design of experiments (DOE) was performed to quantify the product lifetime based on process, product and critical application variables. Performing several analyses using JMP tools, from the DOE platform to the Reliability and Survival modules, the team was able to describe the product lifetime as a function of its critical factors. As a result, an accelerated life test was established that is able to simulate years of product usage in just a few weeks, providing solid evidence of some specific failure modes. With its methods and procedures standardized, the test became a crucial requirement to verify and validate new technologies implemented at WEG motors, optimizing the development process and reducing time to market.   This poster provides information about how we used JMP to analyze data and develop an accelerated life test. The project followed a step-by-step approach. Project charter: understanding the primary and secondary objectives, a multidisciplinary team was formed to share information and knowledge about customer historical data, lab resources, motor reliability, cause and effect relationships, environmental application conditions and reliability data analysis. Historical data analysis: knowing and quantifying the risks of analyzing historical data, the team fitted life distributions to understand the Cycles to Failure (CTF) scale and shape parameters. Mainly, the shape parameter indicates the failure mode to be reproduced in the lab tests, according to the bathtub curve. DOE planning and analysis: in order to reproduce failures, the understanding of motor reliability was supported by cause and effect relationships provided by a Fault Tree Analysis (FTA). The FTA was the source of the critical variables combined into a designed experiment (DOE) to quantify how to accelerate product cycles to failure. Conclusions: the DOE provided a surface profiler indicating the best condition to accelerate product lifetime. The accelerated test also provided a shape parameter that, when compared with historical data, shows an overlap, meaning that the same historical field failures were reproduced under controlled conditions. Implementation: with an accelerated test, the development and innovation process becomes faster while providing important information about product reliability.     Auto-generated transcript...   Speaker Transcript André Zanezi So, hello, everyone. My name is Andre Zanezi, and I am a Six Sigma Black Belt at WEG. I'm here today at Discovery Summit to talk about the development of an accelerated lifetime test to demonstrate and quantify the reliability of washing machine motors. Every company, when developing new technologies and new solutions, faces challenges in improving the reliability of their products. We faced the same challenges.
The project was to analyze and quantify our historical reliability data and to develop a procedure, an accelerated life test in our internal labs, to reproduce our field failure modes. Basically, to do that, we developed our models step by step. As a first step, we took some historical data from our motors and, using the Reliability and Survival modules in JMP, we fit some life distributions for them. By fitting life distributions such as the Weibull distribution, we can understand our motors' reliability and lifetime. In JMP we can also fit different distributions for different failure modes, and we did this for the four main failure modes. We compared and analyzed them and came to understand our motors' lifetime and reliability. Doing that, we were able to quantify the scale and shape parameters, which basically tell us how many cycles were necessary to produce a failure and, according to the bathtub curve, which kind of failure mode we are facing. We also cross-checked with our internal validation KPIs, basically plotting survival curves and comparing them with the internal KPIs to verify that the probabilities and failure ranges were consistent and that our data was reliable. Understanding all these failure modes, we could then develop an internal accelerated test. To do that, we had to understand the physics and the environmental conditions our motors work in. We did this through a fault tree analysis, basically deploying and understanding the cause and effect relationships. From that, we could identify the most critical variables in these cause and effect relationships and again use JMP to design an experiment to quantify the effect of those variables on our response, cycles to failure. Basically, we were trying to reproduce field failures in our labs. We ran several tests and, as a result of our experiments, we could fit models to our data using Fit Model and understand the relationship between the environmental and motor variables and cycles to failure. Through the survival plots and the surface plot we could understand the relationship of those variables with cycles to failure and set a specific operating point to accelerate our motors' lifetime. Then, running batches of samples at that condition, we could fit lifetime distributions to the results of our internal accelerated life test. We were seeing failures, but at the end of these accelerated tests we had to ensure that we were causing the same failures as we had in the historical data. So we came back to the life distributions in the Reliability and Survival module and again fit Weibull distributions, now for the results of our accelerated lifetime tests.
We noted that the shape parameter, which according to the bathtub curve indicates the failure mode, could be crossed with our historical data, and crossing both pieces of information we see an overlap between the shape parameter of the internal test and the shape parameter of the historical data. It basically means that we are now reproducing the same failure modes in our accelerated life test. And that means we can develop products in a faster way, because every time we have a new technology or a new design, we can put it through this accelerated life test and quantify whether we are improving our motors' reliability. We can do this faster than before and develop products more quickly. We also did some technical cross-checks to prove that we are reproducing the same failures, in order to implement this test into the development process. So that was how we used JMP to provide a lot of information and build it into our internal test. It was made possible by really good teams. Please feel free to make contact and send an email if you have any questions. And that's the end.
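As a rough illustration of the kind of fit described in this talk, the Life Distribution platform can also be launched from JSL. The sketch below is an outline rather than WEG's actual script: the column names (Cycles to Failure and a 0/1 Censor indicator) are hypothetical, and only the basic launch roles are shown.

Names Default To Here( 1 );
dt = Current Data Table();  // assumes the cycles-to-failure table is active

// Fit life distributions (Weibull among them) to the failure data.
// The Weibull shape parameter is what maps onto the bathtub curve:
// shape < 1 suggests early-life failures, shape near 1 a roughly constant
// failure rate, and shape > 1 wear-out. Comparing the shape from the
// accelerated test with the shape from field data is the overlap check
// described above.
Life Distribution(
	Y( :Name( "Cycles to Failure" ) ),
	Censor( :Censor )
);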
Victor GUILLER, R&D Engineer, FUCHS Lubrifiant FRANCE   Functional data creates challenges: it generates a lot of measurements, sometimes with redundant information and/or high autocorrelation, sampling frequency may not be regular, and it can be difficult to analyze the information or pattern behind the data. One very common practice is to summarize the information through some points of interest in the curves; maximum/minimum value, mean or other points are commonly chosen. The study’s objective is to realize a mixture design for formulations containing up to three performance additives and analyze the results obtained from tribological equipment (friction coefficient vs. temperature). The first approach considered is to summarize the information through some values of interest: maximum friction coefficient and temperature at the maximum friction coefficient. This simple method enables us to find an optimal area for the formulation. When using the Functional Data Explorer, tribological curves are modelled through a splines mathematical model. The connection between the mixture and the FDOE Profilers enables us to explore the experimental space and predict the tribological response of any formulation. This new approach enables a holistic view of the relevant systems behavior, allowing for increased understanding of more complex interactions typically neglected by conventional evaluation. ERRATUM: At 05:30, the optimal mixture is composed of 70% of additive B and 30% of additive C (not additive A).                    At 17:40, we evaluate the influence of additive B (not additive A).     Auto-generated transcript...   Speaker Transcript Victor GUILLER Hello everyone, so the topic of today is to use Functional Data Explorer in the mixture design in order to predict tribological performance.   So the initial situation I will present you. So first the objective, we want to study and optimize the mixture of three performance additives   with different chemistries in order to find the mixture with the highest performance, so that means the lowest and most stable friction coefficient versus the temperature,   or the highest scuffing temperature. So the total amount of additive is fixed at 5% in oil in order to be able to spot differences between formulations   and to avoid too big differences in viscosity between the additives in formulation. So here we will talk about tribology,   which is the science and engineering of interacting surfaces in relative motions. It includes the study and application of the principles of friction, lubrication and wear.   So as a mixture design, we will do a simplex centroid design. So as you can see here with so with additive A, B and C,   we will use the dots here, the circle dots, for the...for doing the design of experiments is simplex centroid. And we will use the triangle points here   as validation point for the model and, if necessary, these points can be used for augmenting the design and so doing an augmented simplex centroid design.   In order to evaluate performances, we will use this tribometer here. It's a tribometer TE 77.   We will measure the friction coefficient versus the temperature, so it starts from from 40 degrees to 200 degrees   and with the contact points. So it's a ball on plate configuration under a certain normal load in an oscillating movement, as you can see on this figure. So the ball is isolated mechanically against the fixed lower plate and drive mechanisms run inside an oil bath.   
We have some challenges with this test, because for each test there are 4,140 data points recorded by the software. That creates particular challenges: we generate a lot of measurements, so we may have redundant information or high autocorrelation; we may also have irregular sampling times or sampling frequencies; and it can be quite difficult to analyze the information or the pattern behind all of these points. As an example, here is what we can describe as a sea of data, with all the experiments we made for this design of experiments and all the friction coefficient curves versus temperature. The big question is, how can we use these data to predict something? Should we try to use only specific points from these curves, or should we try to extract the relevant features from all the data, from all the curves? The first approach is the traditional one, where we only use some specific points. If we look at the friction coefficient curves versus temperature for the three additives alone, we can spot some interesting points. For example, if we look at Additive C, the friction curve in blue, we can spot a peak, a scuffing peak, so we can record the temperature at the scuffing peak and the maximum friction coefficient recorded at that peak. We can do the same for the other additives; for example, for Additive B we have a higher scuffing temperature and also a lower friction coefficient. We can also see that for some experiments, for example when we look at Additive A, we need to record several values: here we have a first scuffing peak with a quite low friction coefficient, and then a second scuffing peak with a higher friction coefficient. So we can use this information in the DOE, using the specific points in the model. We create a model with these specific points, so there are four responses, and we are able to identify an area of interest in this mixture design. Looking at the different responses, we have quite good models for all of them. And when we look at the mixture profiler at the end, we can see an area of interest at approximately 70% of Additive B and 30% of Additive A. In this area, we have the lowest friction coefficient and the highest scuffing temperature obtained. What about the model accuracy and predictive performance? If we look at the differences between the experimental results and the predictions from the model and the augmented model, we can see that for all of the experiments there is very little difference between the experimental results and the predictions. So in this case there is no need to include the validation points directly in the model, because the simplex centroid design is able to provide an accurate enough prediction for the temperature at the first scuffing peak. As a conclusion, this first approach enables us to identify an optimal formulation area for performance, where we reach the highest scuffing temperature with the lowest friction coefficient, and the formulation in the center of this area is composed of 70% of Additive B and 30% of Additive C.
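For reference, a simplex centroid design in three components consists of the three pure blends, the three 50/50 binary blends, and the overall centroid; the interior check blends serve as validation points. Below is a minimal JSL sketch of such a design table. It is only an illustration of the blend proportions (of the 5% additive package), not the script used to build the actual design.

Names Default To Here( 1 );
// Seven simplex centroid runs: vertices, binary midpoints, centroid (0.333 is approximately 1/3)
New Table( "Simplex Centroid Sketch",
	Add Rows( 7 ),
	New Column( "Additive A", Numeric, "Continuous",
		Set Values( [1, 0, 0, 0.5, 0.5, 0, 0.333] ) ),
	New Column( "Additive B", Numeric, "Continuous",
		Set Values( [0, 1, 0, 0.5, 0, 0.5, 0.333] ) ),
	New Column( "Additive C", Numeric, "Continuous",
		Set Values( [0, 0, 1, 0, 0.5, 0.5, 0.333] ) )
);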
When we do the same analysis approach with the Functional Data Explorer, we have a data table that looks like this, so we have here the ID corresponding to the   DOE experiment number.   Here, the proportion of Additive A, B and C and we record the temperature and we have the friction coefficient as a response here.   But first, what is Functional Data Explorer? It's a platform designed for data are that are functions, signal or series, and it can be used as an exploratory data analysis, so this is our case here,   or as a dimension-reduction technique.   In this case, we will...we will use the Functional Data Explorer in order to create a model and be able to predict friction curves depending on the temperature.   So, in order to do this we go on this data table on analyze, specialized modeling, Functional Data Explorer. We set the ID here as the ID function.   Then we have the temperature as our input and the friction coefficient as our output.   We have also information about the ratio of the additive, so we will save them in the supplementary variables, and we use the validation points from the mixture design, also as a validation here in our model.   And we click on OK, but what we can see first is that we have a lot of variability here at the beginning of the tests.   So first   what we have to do is to remove all the point here with very low temperature, because you won't have scuffing peak,   depending on the additives, but you have a lot of noise because it's initialization of the test. So in order to reach the first good temperature value, so around 40 degrees, you have to augment the temperature from room temperature to 40 degrees, and depending on the slope of this   temperature variation, it can introduce some noise. So first, on   the Functional Data Explorer, we will filter here on the temperature and we will only keep values that are above 45 degrees and then you can see, we will remove the noise at the beginning of the curves and we can set a model that will only explained variability depending on the additives.   So, then, we create our P-Spline model. So first we have to select how much knots we will have in the model. So in this case when we use 158 knots, we have the best   compromise in the modeling.   Looking at the diagnostic plots, we can see that we have some variation and higher residuals when we have higher friction coefficient, which is quite normal because it's unstable situation.   And the interesting part here is that...so JMP is able to provide me the model for all of the curves. For each curves it's composed of mean curve here and some function here   with coefficient linked to each of the function in order to express as accurate as possible all of the different friction curves.   And we also have the nice part that because we have introduced the variables, so Additive A, B and C,   we are able to change here the ratio of the additives in order to look.   So I will do it also here.   Just   opening the P-Spline already.   Done.   With 158 knots. Just to show you the interactivity between the rate...the additive ratio and the curve prediction from the model.   So we have done our filtering before, and so at the end here, I can move...I can have here, for example, more Additive A or decrease the Additive B and see how it impacts   my friction curve. The only problem of this profiler is that you don't have the constraint of the mixture design, so that means you can have more than 100% of the additive. 
So, for example here, I can have 100% of Additive A but also some Additive B and C. But still, it helps us understand the interaction between the additives and the friction coefficient outcomes. If we are interested in the model accuracy and predictive performance, we can have a look, for example, at this curve. You have the experimental data points in blue and the smooth curve in red from the P-spline model, and if we also look at our validation point, you can see that we have a pretty good match between the experimental results and the prediction from the P-spline model. Next, we want to be able to screen through this experimental space from the mixture profiler, but with a direct link showing us what the friction coefficient curves look like depending on where we are in the experimental space. So first we go into the DOE data table and create an ID column that will match the ID column in the Functional Data Explorer data table. Here each experiment from the DOE has the same number as the ID column in the Functional Data Explorer table. Then, on the DOE table, we right click on ID and select Link ID, and on the Functional Data Explorer table we do the same but this time choose Link Reference, referencing it to the previous data table, the one from the DOE. When we have done this, we can open the Fit Group script from the DOE table, the one with the mixture profiler at the end, and we also open the Functional Data Explorer analysis from the Functional Data Explorer table, the one I've already opened; it just takes a few seconds to open. I will put the mixture profiler on the left, and on the right I will put the Functional Data Explorer analysis. The last step is to link these two profilers, so I go to the red triangle, Factor Settings, and Link Profilers, and just make sure the setting is the same in both. OK, so now the two profilers are linked, which means that if I move my point here, and I will show the curves on the bigger screen, I get the prediction of the friction coefficient curves directly on the right. It's a very interactive tool for visually exploring the area, and in our optimal area you can see that the friction coefficient is quite stable and at the same value. So we can now use the mixture profiler to screen the experimental space and see what the predicted friction coefficient curves look like for any point in the experimental space. As a conclusion, it is possible to determine and predict all the friction coefficient curves in the experimental space, and we have a better understanding of the influence of the additives. As an example, starting from the middle here, if I move there, that means I'm increasing the level of Additive A. You can see on the right that we have a steeper slope and a higher scuffing peak as I move closer to Additive A only. If I start from the middle again and increase the level of Additive C, you can see that the slope at the end of the profiler decreases, but we have a very short scuffing peak here at around 80 degrees.
Starting from the middle and going toward Additive A, you can see that we still have a light slope at the end but no scuffing peak here. And when I'm in the area of interest, I have the good aspects of the last two additives: no peak around 80 degrees and a stable friction coefficient in this area. So that is the link between the mixture profiler and the functional design of experiments profiler. The advantages of the Functional Data Explorer approach: first, you analyze all the data points, so even if specific points have similar values, the curves may behave differently. Here, as an example, if you don't look at the friction coefficient values, you might say you have very similar results because you have the same scuffing temperature for the two experiments, but, as you can see, you obtain very different friction coefficients for the two experiments. The second point is that you are able to do an objective analysis, compared with a more subjective, domain-expert approach of selecting the right specific points and rules as responses for the DOE. If we take an example, we may ask ourselves what the correct values for the DOE response are. Should we consider this scuffing peak or this one? Is this scuffing peak too small to be considered? Should we consider this one? And is this scuffing area also interesting for us? It may be quite difficult in some situations to spot only specific values or specific points of interest. And, last but not least, as we have seen, it enables an interactive visualization and predictive modeling of the influence of the additives on the formulation performance. As benefits for FUCHS, this new approach allows an increased understanding of the complex interactions between the additives that may typically be neglected by conventional evaluation, and it gives us the possibility to build more precise predictive models. Thanks a lot for your attention.
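For those who want to script a similar analysis, a very rough JSL outline of the Functional Data Explorer launch described above is shown below. The role keywords are written to mirror the launch dialog (Y Output, X Input, ID Function, Z Supplementary), but they are assumptions rather than a captured script, as are the column names; the reliable way to get the exact syntax is to capture it from a point-and-click run using the enhanced log described in the following talk.

Names Default To Here( 1 );
dt = Current Data Table();  // assumes the stacked curve table (ID, Temperature, Friction Coefficient) is active

// Sketch only: verify these role names against a saved platform script
Functional Data Explorer(
	Y( :Name( "Friction Coefficient" ) ),  // output curves
	X( :Temperature ),                     // curve axis
	ID( :ID ),                             // one function per DOE run
	Z( :Name( "Additive A" ), :Name( "Additive B" ), :Name( "Additive C" ) )  // supplementary mixture ratios
);
// In the platform: filter to Temperature > 45 to drop start-up noise,
// fit a P-spline (158 knots in the talk), then use the profiler and
// Factor Settings > Link Profilers to tie it to the mixture profiler.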
Jordan Hiller, JMP Senior Systems Engineer, SAS Mia Stephens, JMP Principal Product Manager, SAS   For most data analysis tasks, a lot of time is spent up front importing data and preparing it for analysis. Because we often work with datasets that are regularly updated, automating our work using scripted, repeatable workflows can be a real time saver. There are three general sections in an automation script: data import, data curation, and analysis/reporting. While the tasks in the first and third sections are relatively straightforward – point and click to achieve the desired result, and capture the resulting script – data curation can be more challenging for those just starting out with scripting. In this talk we review common data preparation activities, discuss the JSL code necessary to automate the process, and demonstrate how you can use the new JMP 16 action recording and enhanced log to create a data curation script.     Auto-generated transcript...   Speaker Transcript Mia Stephens Welcome to JMP Discovery Summit. I am Mia Stephens, and I am a JMP product manager. I'm joined by Jordan Hiller, who is a systems engineer, and today we're going to talk about automating the data curation workflow. This is the abstract, just for reference; I'm not going to talk about it. We're going to break this talk into two parts. I'm going to kick it off by talking about the analytic workflow and about data curation, what we mean by data curation, and we're going to see how to identify potential data quality issues in your data. Then I'm going to turn it over to Jordan, and Jordan is going to talk about the need for reproducibility. He's going to share a cheat sheet for data curation and show how to curate your data in JMP using the action recorder and the enhanced log. So let's talk about the analytic workflow. It all starts with having some business problem that we're trying to solve. Of course we need to compile data, and you can compile data from a number of different sources and bring the data into JMP. And at the end, we need to be able to share results and communicate our findings with others. Sometimes this is a one-off project, but oftentimes we have an analysis that we're going to repeat. So a core question addressed by this talk is: can you easily reproduce your results? Can others reproduce your results? Or, if you have new or updated data, can you easily repeat your analysis, and particularly the data curation steps, on these new data? That's what we're addressing in this talk. But what exactly is data curation, and why do we need to be concerned about it? Data curation is all about ensuring that our data are useful in driving analytic discoveries. Fundamentally, we need to be able to solve the problems that we're trying to address. It's largely about data organization, data structure, and data cleanup. If you think about issues that we might encounter with data, they tend to fall into four general categories: incorrect formatting, incomplete data, missing data, and dirty or messy data. To help us talk about these issues, we're going to borrow some content from STIPS. If you're not familiar with STIPS, it is our free course, Statistical Thinking for Industrial Problem Solving. It is a course based on seven independent modules, and the second module is exploratory data analysis.
Because of the iterative and interactive nature of exploratory data analysis and data curation, the last lesson in this module is called Data Preparation for Analysis, and we're borrowing heavily from this lesson throughout this talk. Let's break down each one of these issues. Incorrect formatting: what do we mean by incorrect formatting? This is when your data are in the wrong form or the wrong format for analysis. It can apply to the data table as a whole. For example, you might have data stored in separate columns when you actually need the data stored in one column. Or your data might be in separate data tables and you need to concatenate, update, or join the tables together. It can relate to individual variables. For example, you might have the wrong modeling type, or you might have columns of dates that are not formatted as dates, so the analysis won't recognize them as date data. Formatting can also be cosmetic. For example, if you're dealing with a large data table, you might have many columns, column names that are not easily recognizable and that you might want to change, or lots of columns that you might want to group together to make the table more manageable. Your response column might be at the very end of the data table and you might want to move it up. Cosmetic issues won't necessarily get in the way of your analysis, but if you address some of them, you can make your analysis a little bit easier. Incomplete data is when you have a lack of data. This can be a lack of data on important variables; for example, you might not have captured data on variables that are fundamental to solving the problem. It can also be a lack of data on a combination of variables; for example, you might not have enough information to estimate an interaction. Or you might have a target variable that is unbalanced. Say you're studying defects and only 5% of your observations are defects; you might only have a very small subset of your data where the defect is present, and that may not be enough to understand the potential causes of defects. You might also simply not have a big enough sample size to get good estimates. Missing data is when you're missing values for variables, and this can take several different forms. If the missingness is not at random, this can cause a serious problem, because you might get biased estimates. If you're missing data completely at random, this might not be a problem if you're only missing a few observations, but if you're missing a lot of data, it can be problematic. Dirty and messy data is when you have issues with observations or with variables. You might have incorrect values, values that are simply wrong. You might have inconsistency; for example, typographical errors or people entering things differently. The values might be inaccurate; for example, you might have issues with your measurement system. There can be errors and typos, or the data might be obsolete. Obsolete data is when you have data on, for example, a facility or machine that is no longer in service.
The data might be outdated: you might have data going back over a two or three year period, but the process might have changed somewhere in that timeframe, which means those historical data might not be relevant to the current process as it stands today. Your data might be censored or truncated. You can have redundant columns, which are columns that contain essentially the same information, or you might have duplicated observations. So dirty or messy data can take a lot of different forms. How do you identify potential issues? A good starting point is to explore your data and, in fact, identifying issues leads you into data exploration and then analysis. As you start exploring your data, you start to identify things that might cause problems in your analysis. A nice starting point is to scan the data table for obvious issues. We're going to use an example throughout the rest of this talk called Components. This is an example from the STIPS course where a company producing small components has an issue with yield. The data were collected on 369 batches, 15 characteristics have been captured, and we want to use these data to help us understand potential root causes of low yield. If we start looking at the data table itself, there are some clues to the kinds of data quality issues we might have. A really nice starting point (added in JMP 15) is header graphs. I love header graphs. For a continuous variable they show you a histogram, so you can see the centering, shape, and spread of the distribution, and they also show you the range of the values. For categorical data, they show you a bar chart with values for the most populous bars. Let's take a look at some of these. I'll start with batch number. Batch number is showing a histogram, and it's actually uniform in shape. Batch number is probably an identifier, so right off the bat I can see that these data are probably coded incorrectly. Looking at the number scrapped, I can see that the distribution is highly skewed, and I can also see that the lowest value is -6, which makes me question the feasibility of a negative scrap number. Process is another one: I've got basically two bars, a histogram with only two values. As I'm looking at these header graphs, I can also look at the columns panel, and it's pretty easy to see that, for example, batch number, part number, and process are all coded as continuous. When you import data into JMP, if JMP sees numbers, it's automatically going to code those columns as numeric continuous, so these are things we might want to change. We can also look at the data itself. For example, when I look at humidity (and context is really important when you're looking at your data), humidity is something we would think of as continuous data, but I've got a couple of fields with N/A. If you have text in a numeric column, the column is going to be coded as nominal when you pull the data into JMP, so this is something we see right off the bat that we're going to need to fix. I can also look through the other columns. For example, for Supplier, I see that I'm missing some values; when you pull data into JMP, empty cells in categorical data show up as missing values.
I can see that there are some entries where we're not consistent in the way the data were entered. So I'm getting some serious clues about potential problems with my data. If I look at temperature, notice all the dots. Temperature is a continuous variable, and where I see dots, it's indicating missing values. Since temperature is something that's really important for my analysis, this might be problematic. A natural extension of this is to start exploring the data one variable at a time. One of my favorite tools when I'm first starting to look at data is the columns viewer. The columns viewer gives us numeric summaries for the variables we've selected, and if we're missing values, there's going to be an N Missing column. Here I can see that I'm missing 265 of the 369 values for temperature, so this is a serious gap if we think temperature is going to be important in the analysis. I can also see whether I've got some strange values. When I look at the mins and maxes for number scrapped and scrap rate, I've got negative values, and if that isn't feasible, then I've got an issue with the data or the data collection system. It's also pretty easy to see miscoding of variables. For example, facility and batch number, which should probably be coded as nominal, were reporting a mean and a standard deviation. A good way to think about this is that if it's not physically possible to have an average batch number or part number, then these should be changed to nominal variables instead of continuous. Distributions are the next place I go when I'm first getting familiar with my data. For continuous data, distributions allow you to understand the shape, centering, and spread of your data, and you can also see whether you've got unusual values. For categorical data, you can see how many levels you have. Take customer number, for example: if customer number is potentially important, I've got a lot of levels for it, and when I'm preparing the data I might want to combine these into four or five buckets, with an "other" category for those customers where I don't really have a lot of data. For humidity, we see the problem with having N/A in the column: we see a bar chart instead of a histogram. We can easily see what we were looking at in the data for supplier; for example, Cox Inc and Cox, Anderson spelled three different ways, Hersh spelled three different ways. For speed, notice that we've got a mounded distribution that goes from around 60 to 140, but at the very bottom we see a value or two pretty close to zero. It might have been a data entry error, but it's definitely something we'd want to investigate. An extension of this is to start looking at your data two variables at a time, for example using Graph Builder or scatterplots. When you look at variables two at a time, you can see patterns, and you can more easily see unusual patterns that cross more than one variable. For example, if I look at scrap rate and number scrapped, I see that I've got some bands. It might be that you have something in your data table that can explain this pattern; in this case, the banding is attributed to different batch sizes, so this purple band is where I have a batch size of 5,000. I have a lot more opportunity for scrap with a larger batch size than with a smaller batch size.
So that might make some sense, but I also see something that doesn't make sense. These two values down here in the negative range. So it's pretty easy to see these when I'm looking at data in two dimensions. I can add additional dimensionality to my graphs by using column switchers and data filters. This is also leading me into potential analysis, so I might be interested in understanding what are the x's that might be important, that might be related to scrap rate. And at the same time, look at data quality issues or potential issues. So for scrap rate, it looks like there's a positive relationship between pressure and scrap rate. doesn't look like there's too much of a relationship. Scrap rate versus temperature, this is pretty flat, so there's not much going on here. But notice speed. There's a negative relationship, but across the top I see those two values; the one value that I saw on histogram, but there's a second value that seems to stand out. So it could be that this value around 60 is simply an outlier, but it could be a valid point. I would probably question whether this point here down near 0 is valid or not. So we've looked at the data table. We've looked at data one variable at a time. We've looked at the data two variables at a time, and all of this fits right in with the data exploration and leads us into the analysis. There are more advanced tools that we might use (for example, explore outliers, explore missing) that are beyond the scope of this course, or this talk. And when you start analyzing your data, you'll likely identify additional issues. So, for example, if you've got a lot of categories of categorical variables and you try to fit an interaction in a regression model, you know JMP will give you a warning that you can't really do this. So as you start to analyze data ,this this whole process is iterative, and you'll identify potential issues throughout the process. A key is that you want to make note of issues that you encounter as you're looking at your data. And some of these can be corrected as you go along, so you can hide and exclude values, you can reshape you can reclean your data as you go along, but you might decide that you need to collect new data. You might want to conduct a DOE so that you have more confidence in the data itself. If you know that you're going to repeat this analysis or that somebody else will want to repeat this analysis, then you're going to want to make sure that you capture your steps that you're taking so that you have reproducibility. Someone else can reproduce your results, or you can you can repeat your analysis later. So this is where I'm going to turn it over to Jordan, and Jordan's going to Talk about reproducible data curation. Jordan Hiller Okay, thank you, Mia. Hello, I am Jordan Hiller. I am a systems engineer for JMP. Let's drill in a little bit and talk some more about reproducibility for your data curation. Mia introduced this idea very nicely, but let's give a few more details. The idea here is that we want to be able to easily re-perform all of our curation steps that we use to prepare our data for analysis, and there are three main benefits that I see to doing this. The first is efficiency. If you have to...if your data changes and you need to replay these curation steps on new data in the future, it's much more efficient to run it once with a one-click script than it is to go through all of your point-and-click activities over again. Accuracy is the second benefit. 
Point and click can be prone to error, and by making it a script, you ensure accurate reproduction. And lastly is documentation, and this is maybe underappreciated. If you have a script, it documents the steps that you took. It's a trail of breadcrumbs that you can revisit later when, inevitably, you have to revisit this project and remember, what is it that I did to prepare this data? Having that script is a big help. So today we're going to go through a case study. I'm going to show you how to generate one of these reproducible data curation scripts using only point and click. And the enabling technology is something new in JMP 16. It is the enhanced log and the action recording that's found in the enhanced log. So here's what we're going to do, we are going to perform our data curation activities as usual by point and click. As we do this, the script that we need, the computer code (it's called JSL code, JSL for JMP scripting language) it's going to be captured for us automatically in the new enhanced log. And then when we're done with our point-and-click curation, all we need to do is grab that code and save it out. We might want to make a few tweaks, a few modifications, just to make it a little bit stronger, but that part is optional. Okay, so this is a cheat sheet that you can use. This is some of the most common data cleaning activities and how to do them in JMP 16 in a way so as to leave yourself that trail of breadcrumbs, in a way so as to leave the JSL script in the enhanced log. So it covers things like operating on rows, operating on columns, ways to modify the data table, all of our data cleaning operations and and how to do it by point and click. So it's not an exhaustive list of everything that you might need to do for data cleaning, and it's not an exhaustive list of everything that's captured in the enhanced log either, but, but this is the most important stuff here at your fingertips. Alright, so let's go into our case study using that Components file that Mia introduced and make our data curation script in JMP 16 using the enhanced log. Here we are in JMP 16. I will note that this is the last version of the early adopter program for JMP 16, so this is pre release. However I'm sure this is going to be very, very similar to the to the actual release version of JMP 16. So, to get to the log, I'll show it to you here. This looks different if you're used to the log from previous versions of JMP. It's divided into these two panels, okay, a message panel at the top and a code panel at the bottom. We're going to spend some time here. I'll show you what this is like but let's just give you a quick preview, if I were to do some quick activities like importing a file and maybe deleting this column. You can see that those two steps that I did (the data import and deleting the column), they are listed up here in this message panel and the code, the JSL code that we need for reproducible data curation script, is is down here in in this bottom panel. Okay, so that it's really very exciting, the ability to just have this code and grab it whenever you need it just by pointing and clicking is is a tremendous benefit in JMP 16. So in JMP 16, this this new enhanced log view is on by default. If you want to go back to the old version of the log, that simple text log, you can do that here in the JMP preferences section. There's a new section for the log and you can switch back to the old text view of the log, if you prefer. 
The default when you install JMP 16 is the enhanced log and we will talk about some of these other features a little bit later on, further on in our case study. Alright, so I'm going to clear out the log for now from the red triangle. Clear the log and let's start with our case study. Let's import that Components data that Mia was sharing with you. We're going to start from this csv file. So I'm going to perform the simplest kind of import just by dragging it in. Oh, I had a...I had a version of it open already. I'm sorry, let me, let me start by closing the old version and clear the log one more time. Okay, simple import by dragging it into the JMP window. And now we have that file, Components A, with 369 batches, and let's now proceed with our data cleaning activities. I'll turn on the header graphs. And first thing we can see is that the facility column has just one entry, one value in it, FabTech, so there's no variation, nothing interesting here. I'm just going to delete it with a right click, delete the column. And again, that is captured as we go in the enhanced log. Okay, what else? Let's imagine that this scrap rate column at near the end of the table is really important to us and I'd like to see an earlier in the table. I'm going to move it to the fourth position by grabbing it in the columns panel and dragging it to right after customer number. There we go. Mia mentioned that this humidity column is incorrectly represented on import, chiefly due to those N/A alphabet characters that are causing it to come in as a character variable. So let's fix that. We are going to go into the column info with the right click here and change the data type from character to numeric, change the modeling type from nominal to continuous. Click OK. And let's just click over to the log here, and you can see, we have four steps now that have been captured and we'll keep going. Alright, what's next? We have several variables that need to be changed from continuous to nominal. Those are batch number, part number, and process. So with the three of those selected, I will right click and change from continuous to nominal. And those have all been corrected. And again, we can see that those three steps are recorded here in the log. All right, what else? Something else a little bit cosmetic, this column, Pressure. My engineers like to see that column name as PSI, so we'll change it just by selecting that column and typing PSI. Tab out of there to go to somewhere else. That's going to be captured in the log as well. The supplier. Mia showed us that there are some, you know, inconsistent spellings. Probably too many values in here. We need to correct the character values. When you have incorrect, inconsistent character values in a column, think of the recode tool. The recode tool is a really efficient way to address this. So with the right click on supplier, we will go to recode. And let's group these appropriately. I'm going to start with some red triangle options. Let's convert all of the values to title case, let's also trim that white space, so inconsistent spacing is corrected. That's already corrected a couple of problems. Let's correct everything else manually. I'm going to group together the Andersons. I'm going to group together the Coxes. Group the Hershes. Trutna and Worley are already correct with a single categories. And the last correction I'll make is things that are, you know, just listed as blank or missing, I'll give them an explicit missing label here. 
All right, and when we click recode, we've made those fixes into a new column called supplier 2. That just has 1, 2, 3, 4, 5, 6 categories corrected and collapsed. Good. Okay let's do a calculation. We're going to calculate yield here, using batch size and the number scrapped. Right. And yeah, I realize this is a little redundant. We already have scrap rate and yield is just one minus scrap rate, but just for sake of argument, we'll perform the calculation. So I want that yield column to get inserted right after number scrapped, so I'm going to highlight the number scrapped and then I'll go to the columns menu, choose new column. We're going to call this thing yield. And we're going to insert our new column after the selected column, after number scrapped, and let's give it a formula to calculate the yield. We need the number of good units. That's going to be batch size minus number scrapped. So that's the number of good units and we're going to divide that whole thing by the batch size. Number of good units divided by batch size, that's our yield calculation. And click OK. There's our new yield column. We can see that it's a one minus scrap rate. That's...that's good and let's ignore, for now, the fact that we have some yields that are greater than 100%. Okay we're nearly done. Just a few more changes. I've noticed that we have two processes, and they're, for now, just labeled process 1 and process 2. That's not very descriptive, not very helpful. Let's give them more descriptive labels. Process 1, we'll call production process; and process 2, we'll call experimental. So we'll do this with value labels, rather than recoding. I'll go into column info and we will assign value labels to one and two. One in this column is going to represent production. Add that. And two represents experimental. Add that. Click OK. Good. It shows one and two in the header graphs, but production and experimental here in the data table. All right, one final step before we save off our script. Let's say, for sake of argument, that I'm only interested in the data...I want to proceed with analysis only when vacuum is off. Right, so I'm going to subset the data and make a new data table that has only the rows where vacuum is off. I'll do that by right clicking one of these cells that has vacuum off and selecting matching cells. That selects the 313 rows where vacuum is off. And now we'll go to table subset, create a new data table, which we will name vac_off. Click okay. All right, and and that's our new data table with 313 rows only showing data where vacuum is off. So that's the end. We have done all of our data curation and now let's go back and revisit the log and learn a little bit more about what we have. Okay, so all of those steps, and plus a few more that I didn't intend to do, have been captured here in the log. Look over here, we have...every line is one of the steps that we perform. There's also some extraneous stuff, like at one point I cleared out the row selection. I didn't really need to...I don't really need to make that part of my script. Clearing the selected rows, so let's remove that. I'm just going to right click on it and clear that item to remove it. That's good. Okay, so messages up here, JSL code down here. I'd like to call your attention to the origin and the result. This is pretty nifty. Whenever we do a step, whenever we do an action by point and click, you know, there's there's something we do that action on and there's something that results. So that's the origin and the result. 
So, for instance, when we deleted the facility column... well, maybe that's a bad example; let's choose instead changing the column info for humidity. The origin, the thing that we did it on, was the Components A table, and we see that listed here as the data table. When I hover over it, it says Bring Components A to Front, so clicking on that brings us to the Components A table. Very nice. And the result is something that we did to the humidity column: we changed it to data type numeric and modeling type continuous. See that down here. So I can click here, go to the humidity column, and JMP selects the humidity column for us. Notice that the results are shown in green everywhere except this one last result, which is in blue. That's to help us keep track of our activities on different data tables. We did all of these activities on the Components A data table, and in our last activity we performed a subset on the Components A data table, where the result was a new data table called vac_off. And so vac_off is in blue. So we can use those colors to help keep track of things.

All right, the last helpful feature I want to show you here in the log: if you have a really long series of steps and you need to find just one, this filter box lets you find it. Let's say I want to find the subset. There it is. We found the subset data table, and I can get directly to the code that I need. Okay, so this is everything that we need. Our data curation steps were captured. All that we need to do to make a reproducible data curation script is go to the red triangle and save the script to a new script window. The import step, the delete column step, the moving-the-scrap-rate step: all of those steps are here in the script. We have the syntax coloring to help us read the script, and we have all of these helpful comments that tell us exactly what we were doing in each of those steps. So this is everything, this is all that we need, and I'm going to save it. I'll save it to my desktop as... let's call it import and curate components. That is our reproducible data curation script.

So if I were to go back to the JMP home window and close off everything except our new script, here's what we do if I need to replay those data curation steps. I just open the script file and run it by clicking the run script button. It opens the data, does all the cleaning, does the subsetting, and creates that new data table with 313 rows.

Let's imagine now that we need to replay this on new data. I have another version of the Components file. It's called Components B, and it has 50 more rows in it, so instead of 369 rows it has 419 rows. Imagine that we've run the process for another 50 batches and we have more data. So it's called Components B, and I want to run this script on Components B. But you'll notice that throughout the script it's calling Components A multiple times, so we'll just have to search and replace Components A and change it to Components B. Here we go. Edit, search. We will find Components A, replace with Components B, replace all. Fifteen occurrences have been replaced; you can see it here and here. And now we simply rerun the script, and there it is on the new version of the data. You can see it has more rows; in fact, the vac_off table that had 313 rows before has 358 now. All right, so that's a reproducible data curation script that can run against new data. Okay, so here is that cheat sheet once again.
This will be in the materials that we save with the talk, so you can get to it, and it tells you just how to point and click your way through data curation and leave yourself a nice replayable, reproducible data curation script. That script didn't require us to do any coding at all, but I'm going to give you a handful of tips, four tips, that you can use to enhance the scripts a little bit.

The first tip is to insert this line at the beginning of your scripts; it's a good thing to do for all your scripting. Just insert the line Names Default To Here. This is to prevent your script from interacting with other scripts; that's called a namespace collision, and you don't really have to understand what it does, just do it. It's good programming practice.

The second tip is to make sure that there are semicolons in between JSL expressions. The enhanced log is doing this for you automatically; it places that required semicolon in between every step. However, if you do any modification yourself, you're going to want to make sure that those semicolons are placed properly. So just a word to the wise.

The third tip is to add comments. Comments are a way for you to leave notes in the program without messing up the program; they're something the JSL interpreter will not evaluate. There are notes that action recording in the enhanced log has left for you, but you can modify them and add to them if you like. Here are the main points about comments. The typical format is two slashes, and everything that follows the slashes is a comment. You can do that at the beginning of a line or at the end of a line; the interpreter will run this X = 9 JSL expression, but then it will ignore everything after the slashes. If you have a longer comment, you can use the format that starts with /* and ends with */; that encloses a comment. Comments are useful for leaving notes for yourself, but they're also useful for debugging your JSL script. If you want to remove a line of code and make it not run, you can just preface it with those two slashes, and if you want to do that for a larger chunk of code, you can use the /* */ format. So it's good to know how to use comments.

The last tip I'm going to leave you with is to generalize data table references. Do you remember how we had to search and replace to make that script run on a new file name, Components B? We had to change 15 instances in the script. Wouldn't it be nice if we only had to change it once, instead of 15 times? You can make your scripts more robust by generalizing the data table references: instead of using the names, we'll use a JSL variable to hold those table names. Here's what I'm talking about; I'll show you an example. On the left is some code that was generated by action recording in the enhanced log. We're opening the Big Class data table, we're operating on the age column, changing it to a continuous modeling type, and then we are creating a new calculated column. The Big Class table name appears in three places: on the open, where we use it to perform the change on the age column, and over here where we create the new column. To make this more robust and generalized, you need to make three changes. In the first change over here, we are assigning a name. I chose BC; you can choose whatever you want. You'll see DT a lot.
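As an illustration of these tips, here is a small sketch of what the generalized version might look like with the Big Class sample table; the calculated-column formula is just a stand-in, since the talk doesn't spell it out, and the exact column messages may differ from what the enhanced log generates.

```
Names Default To Here( 1 );                    // tip 1: avoid namespace collisions

// tip 4: assign the opened table to a variable so its name appears only once
bc = Open( "$SAMPLE_DATA/Big Class.jmp" );     // change the path here and nowhere else

// use the variable to scope the column instead of Data Table( "Big Class" ):age
bc:age << Set Modeling Type( "Continuous" );   // tip 2: semicolons between expressions

/* tip 3: a block comment can hold a longer note;
   the interpreter ignores everything between these markers */
bc << New Column( "ratio", Numeric, Continuous, // stand-in calculated column
    Formula( :height / :weight )
);
```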
So BC is the name that we're going to refer to the Big Class data table by in the rest of the script. And so when we want to change the age column, BC:age means the age column in the BC data table. Down here, we're sending a message to the Big Class data table, and that's what the double arrow syntax means. We're generalizing that too, so that we use the new name to send that message to the data table. And now, if we need to run this script on a new data table that's named something other than Big Class, here's the only change we need to make: we need to change just one place in the script. We don't have to do search and replace.

Okay, so after those four tips, if you're ready to take your curation script to the next level, here are some next steps. You could add a file picker. It doesn't take much coding to change it so that when somebody runs the script, they can just navigate to the file they want it to run on instead of having to edit the script manually, so that's one nice idea. If you want to distribute the script to other users in your organization, you can wrap it up in a JMP add-in, and that way users can run the script just by choosing it from a menu inside JMP. Really handy. And lastly, if you need to run this curation script on a schedule in order to update a master JMP data table that you keep on the network somewhere, you can use the Task Scheduler in Windows or Automator in macOS to do that.

So, in summary, Mia talked about how to do your data curation by exploring the data and iterating to identify problems. If you automate those steps, you will gain the benefits of reproducibility, and those are efficiency, accuracy, and documenting your work. To do this in JMP 16, you just point and click as usual and your data curation steps are captured by the action recording that occurs in the enhanced log. And lastly, you can export and modify that JSL code from the enhanced log in order to create your reproducible data curation script. That concludes our talk. Thanks very much for your time and attention.
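As a footnote to the file-picker idea in those next steps, here is a minimal sketch of how the top of a curation script could prompt for the file; the window title, filter list, and cancel handling are illustrative assumptions.

```
Names Default To Here( 1 );

// let the user browse to the csv instead of hard-coding the path
path = Pick File(
    "Select the components csv file",   // window title
    "$DESKTOP",                         // starting folder (adjust as needed)
    {"CSV Files|csv", "All Files|*"}    // file filters
);

// stop quietly if the user cancels the dialog
If( path == "", Throw( "No file selected" ) );

dt = Open( path );                      // the rest of the curation steps follow
```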
Aurora Tiffany-Davis, JMP Senior Software Developer, SAS
Josh Markwordt, JMP Senior Software Developer, SAS
Annie Dudley Zangi, JMP Senior Research Statistician Developer, SAS

In this session, we will introduce an exciting feature new in JMP Live 16. You and your colleagues can now get notifications (onscreen and via email) about out-of-control processes. We will demonstrate control chart warnings from the perspective of a JMP Desktop user, a JMP Live content publisher, and a regular JMP Live user. We will point out which aspects of control chart warnings were available before version 16, and which aspects are new. Join us to learn: which JMP control chart warnings platforms are supported; how to control which posts produce notifications; how to control who gets notifications; how to pause notifications while you get a process back under control; and how to review (at a high level) the changes over time for a particular post.

Auto-generated transcript...

Speaker Transcript

Thank you for joining us today to learn about a new feature in JMP Live 16: Control Chart Warnings. You may be thinking: Control Chart Builder has been available in JMP Desktop since version 10, and control chart warnings have been available in JMP Desktop since version 10, so what's actually new here? What's new is that in JMP Live version 16 we now have a way to grab your attention if there's a control chart post that has warnings associated with it; in other words, if there's a process that might have a problem. We do this through the use of onscreen indicators as well as active notifications that can go out to users on screen and by email.

I'd like to now introduce just a few of the people who helped to develop this feature. I am Aurora Tiffany-Davis, a senior software developer on the JMP Live team, and during today's demonstrations I'll be showing you JMP Live from the perspective of a regular user. We also have with us today Josh Markwordt. Josh is a senior software developer on the JMP HTML5 team, and during today's demonstrations he's going to show you the perspective of a JMP Live content publisher. Finally, we have Annie Dudley Zangi. She is a senior research developer on the JMP statistics team, and she's going to be demonstrating the control chart features within JMP Desktop itself. Annie, would you like to get us started?

Thanks, Aurora. Yes, so I'm going to demo this by showing you how it works using a simulated data set based on a wine grape trial that happened in California. What we have here is 31 lots, several cultivars, and yield, brix sugar, and pH. So let's start with Control Chart Builder. First, I'm going to pull in the yield (that's in kilograms), and then I'll pull in the location as a subgroup variable. I don't care so much about the limits; what I am concerned about instead is whether or not we have any particular lots that are going out of control, going above the limits or below the limits. And I care about how each of the cultivars is doing, so I'll pull that into the phase role, which basically subsets all the different cultivars for us, so we can see that we have differences and kind of unique things going on with the different grapes. Next, I'm going to turn on the warnings.
And the easiest way to do that is to scroll down under the control panel and select warnings and then tests. We're going to turn on Test 1, one point beyond the limits, and then Test 5 as well. OK, I see no tests have failed. That's pretty good. And now you might recall we were looking at two other response variables, so I'm going to turn on the Column Switcher so we can look at all three of them; we can just flip through them using the Column Switcher. We started with yield. We'll take a look at sugar. All right, we can see that the Aglianico has very low sugar content, whereas the other four have a higher sugar content. And we can see the different pH levels for each of the five grape varieties. OK, well, we've got these 31 lots in. I think we're ready to publish it. Josh, would you like to show us how to send that up?

Thanks, Annie. So I have the same report up that Annie just showed you, and I'm ready to publish to JMP Live. The first thing I would need to do as a new publisher would be to set up a connection to JMP Live. I go to file, publish, and manage connections. You can see that I have a couple of connections already created, but I'm going to add a new one. First you need to give the connection a name, just to help keep track of multiple connections. The next thing you need is the URL of the server you're trying to connect to, including the port number. Finally, at the bottom of the dialog you can supply an API key, which says it's for scripting access only. You only need this if you are going to be interacting with the server using JSL, which we're going to do later in this demonstration, so I'm going to get my API key from JMP Live.

I'm logged in. I go to my avatar in the upper right-hand corner and select settings to see my user settings. At the top there is some information about my account, including the API key. I click generate new API key and copy it to my clipboard by clicking the copy button; then I can return to JMP, simply paste it in here, and click next. I authenticate to JMP Live, and I'm told that the connection was created successfully and can be saved. It is now present in my list of connections and ready to use for publishing. You only have to do that the very first time you set up the connection; the next time you publish, you can just use it.

So now I can go to file, publish, and publish to JMP Live, and select my connection from the dropdown at the top. Create a new post is selected by default, so I click next. This dialog looks very similar to what it did in 15.2, except now there's an additional checkbox here that says enable warnings. This is present for every warnings-capable report. If I hover over it, it says, "Selecting enable warnings will notify interested parties when this post has Control Chart warnings." I'll get back to who the interested parties are in a moment, but first I wanted to explain what warnings-capable reports are. In JMP 16, only the Control Chart Builder is warnings-capable and able to tell JMP Live about warnings that are present within it. There are plans to expand to other platforms in the future. A Control Chart Builder can be combined with other reports in a dashboard or tabs report, and it can be combined with the Column Switcher, as we're showing in this example.
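For reference, a chart like the one Annie built interactively can also be created with a short script. Here is a rough sketch that assumes column names matching the wine table; the warnings tests and the Column Switcher were added interactively in the demo and are not shown in this JSL.

```
Names Default To Here( 1 );

// the wine trial table is assumed to be the current data table
dt = Current Data Table();

// yield charted by lot location, with a separate phase for each cultivar
ccb = dt << Control Chart Builder(
    Variables(
        Subgroup( :Location ),
        Y( :Yield ),
        Phase( :Cultivar )
    )
);
```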
Some more complex scenarios could cause an otherwise warnings-capable report to not be able to share warnings, and this enable warnings checkbox would be gone. For example, the Column Switcher only works with a single Control Chart Builder; if you try to combine it with multiple control charts in a dashboard, that would no longer be warnings-capable in JMP 16.

So, back to who the interested parties are. I, as the publisher of the report, am an interested party, as well as the members of any group I publish to, if that group is warnings-enabled. So I am going to publish this report to the Wine Trials Group and leave enable warnings checked so that JMP will tell JMP Live about any warnings that are present. The report will come up in JMP Live, and the contents of the report look much like they did in 15 and 15.2. The points have tooltips, you're able to brush and select, and the Column Switcher is active, allowing you to explore the results in multiple columns. Now I'm going to hand it over to Aurora so she can show you some of the new features in JMP Live.

Yeah, thank you, Josh. So Josh just published a post to JMP Live that is warnings-enabled, but that doesn't actually have any warnings going on right now, so I can show you what that looks like from the perspective of a regular JMP Live user. And what it looks like is not a whole lot. There really isn't anything to draw my attention to Josh's post. There isn't any special icon showing up on his post, and I don't have any new notifications. If I open the post itself, open the post details, and scroll down, I will see that there's a new section that did not exist prior to JMP Live version 16, and that's the warnings section. This section is here because, by checking that enable warnings checkbox in JMP Desktop at publish time, the publisher is saying: I think that other JMP Live users are going to care whether or not there are warnings on my post. And so we have a warnings section here, but right now it just tells us a very reassuring message: there are no warnings present. If we scroll down further, we can see the system comment that JMP Live left in the comments stream at publish time, and again, this just tells us a nice reassuring message: this post has zero active control chart warnings. I'll pass it back to Annie now so that she can walk us through the next step of the grape trial.

Thanks, Aurora. So, as I said before, we're getting new data in. We had 31 locations before; now we have 32. The original study added some restricted-irrigation lots so that they could find out how the five different grapes responded to drier conditions. If we take a look at the control chart with these restricted values, we can see that the yield is lower in this new lot that was just added, and in fact, with the Tempranillo grape it is below the lower limit. We can take a look at the sugar to see how that responded, and we can see that the sugar actually went up for our new restricted-irrigation dry spot. The pH wasn't anything abnormal. So I think we need to update this. Josh, do you want to show us how?

Yes, so new in JMP 16 in JMP Live is the ability to update just the data of a report.
This is useful because you don't need to rerun the JSL or recreate the report in JMP and republish; you simply want to update the existing report with new data. This can be done directly from the JMP Live UI by selecting details, scrolling down to the data section, where you can view the data table that is associated with the report, and clicking manage to update it. Click on update data, select update next to the table you want to update, and click submit. You're returned to the report, you will see that it is regenerating, and the updated content shows the warnings that Annie mentioned. Now I'm going to hand it over to Aurora to demonstrate some of the other ways that JMP Live lets you know you have warnings.

Thank you, Josh. OK, so Josh has taken a post that was warnings-enabled, and now he's updated the data on it so that there actually are warnings, so I can show you what that looks like from the perspective of a regular JMP Live user. We can see now that his post looks a bit different than it did before. It has a new red icon on it that draws the eye, and when we hover on that icon it says there are control chart warnings in this post. What that's telling me, in a little bit more detail, is: first, I know that the publisher of this post cares about control chart warnings, because the publisher has chosen to turn on those tests within JMP Desktop. Second, I know the publisher thinks that other JMP Live users might care about control chart warnings on this post, because that publisher has chosen to enable that JMP Live feature. And third, of course, I know that there actually are control chart warnings on the post. I'll see this icon on any post like this, and I'll also see it on a folder if that folder has a post inside of it that fulfills all these same criteria.

If I click on this icon, I am taken to the warnings section of the post details, just like I showed you last time, only now there's more interesting stuff in this section. Now it tells me that there are control chart warnings and which columns those warnings are present on (yield and brix sugar), and it tells me some details about the warnings. But if I want more details, I can scroll down just a bit and click open log. That tells me a lot. It tells me, for every column: how many warnings there are; what that translates to in terms of warning rate; which tests the publisher actually decided to turn on in JMP Desktop; and also specifically which data points failed tests and which tests they failed. I can also copy this to my clipboard.

If I scroll down further to the comments stream, I can see a new system comment. It says the post was regenerated because the post content was updated, and when the post content was updated, there were control chart warnings on the following columns. So you can see here that this comments stream can serve as a kind of high-level history of what's been going on with the post. Right now I'll leave Josh a quick comment saying it looks like reduced irrigation had a big impact.

Now, the icon that I saw on the card would be seen by any JMP Live user, and any JMP Live user, if they opened the post details, would see these system comments and this warnings section. But not just any JMP Live user would get a new notification actively pushed to them; I do have that notification, though.
I can see it up here in my notifications tray, and I also have one sitting in my email inbox right now, and it's very detailed. The email contains all of the information that was present when we clicked open log just a moment ago. Now, why did I get this notification? I got it because I'm a member of the group that the post was published to, and furthermore, the administrator of that group has turned on this JMP Live warnings feature. They've enabled warnings for the group itself, and by doing that, the group admin was telling JMP Live: I think the members of my group are really going to care about control chart warnings, so much so that you should actively push notifications out to them if we get any new control chart warnings on the posts in this group. In other words, my group admin agrees with the publisher; they both want to draw my attention to these potential problems. Now I'll turn it back to Annie so she can walk us through the next part of the grape trial.

Thanks, Aurora. OK, so we last looked at adding the restricted-irrigation lot, and now we have a couple of new lots come in. Nothing special about those. Let's take a look at the graph. What do we see here? Well, we see the restricted irrigation, but nothing special with those. Let's see if anything happened with the sugar. No, we see the two new points at the end after the restricted irrigation, but nothing special there and not a whole lot new. But we do still need to update the graph and update it on the web. So Josh, do you want to show us how we can update it this time?

So I have already demonstrated how you could update the data through the JMP Live UI, but you can also do this through JSL. First, I'm going to declare a couple of variables, including the report ID. The report ID can just be found at the end of the URL after the last slash; it's a series of letters and numbers that identifies the report to replace. There are ways to retrieve the report ID through JSL, which I will show in a moment, but for now we're just going to save that. We're also going to open the updated data set that Annie just showed you so that we can provide it to JMP Live. So if I run these, it opens the data table.

The next thing we need to do is to create a connection to JMP Live. This will use the named connection that I created at the beginning of the demo, Discovery Demo Server, here. I use the New JMP Live command, which will create a JMP Live connection object. I provide an existing connection, and it can prompt if needed, but I've already authenticated. So if I run this, I get a new connection. As I mentioned at the beginning, you can use this connection to search for reports, as well as get a particular report object by ID. I'm going to use the variable that I pasted in to get the report we've been working on. From that result object you can get a scriptable live report that you can examine for a number of pieces of information. Here I grabbed the live report and got the ID, and you can see in the log that the ID I retrieved matches the one I pasted in. I also got the title, and the description is blank because we didn't provide one when we originally published.
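Pieced together from the commands Josh names in this demo, the connection and lookup steps look roughly like the sketch below; the report ID is a placeholder, and the argument and accessor forms are assumptions based on the narration, so check the Scripting Index entry for New JMP Live before relying on them.

```
Names Default To Here( 1 );

// connect using the named connection created earlier in the demo;
// "Connection" as the argument name is an assumption
live = New JMP Live( Connection( "Discovery Demo Server" ) );

// the report ID is the string at the end of the post's URL (placeholder value)
reportId = "xxxxxxxxxx";

// look up the post by ID and get a scriptable live report from the result;
// message and accessor names below follow the demo's wording and are assumptions
result = live << Get Report( ID( reportId ) );
liveReport = result << As Scriptable;
Show( liveReport << Get Title, liveReport << Get URL );
```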
And I also got the URL, the full URL, that I could use either to open the report through the script or for some other purpose, such as creating a larger report that links to it. In preparation for the next step, I'm just going to get the current date and time, which I'm going to use to decorate the title a bit, to prove that we've updated it through JSL. But the key command here is the Update Data command, which lets us update just the data of the report, just like I did through the JMP Live UI. It takes the ID, which here I'm going to retrieve from the live report object, and then takes the data command, to which you provide the new data table that you are uploading, as well as the name of the current data table that you want to replace. That update result object can also be queried to retrieve a number of pieces of information, like whether it was successful, the status you got back, and any error messages, which could be useful in a more automated setting to provide details as to why the publish of the new data failed. So I'm going to run this. And it said that it was successful.

If I bring up my report, I get a popup that says an updated version of this report is available. Now, I can choose to dismiss it and continue looking at the current content I have, but I'm going to say reload, and we see the new data points here without having to refresh the entire page. We go back to JMP. The last thing I want to do is show that other pieces of the report can also be manipulated through JSL. Here I'm simply going to give it a new title. I don't like the one that was provided by default, so I had declared this variable with a new title, and I'm going to append to that the date and time, to help distinguish when this update was done. I'm going to use the Set Title command to send that to the live report and then close my data table to wrap up. Run these and bring up the report; in a moment you'll see the title refresh, both here and in the details. Here it is with the date and time. Now I'm going to hand it back over to Aurora so she can show you more of what happened in JMP Live with this update.

Thank you, Josh. So I can see Josh has updated the title on his post. He has updated a post that was warnings-enabled and had warnings; he's updated it with new data, and the new data, just like the previous data, has control chart warnings. So I can show you what this kind of persistent-warning situation looks like to a regular JMP Live user. I can see here that the icon that draws the eye and says there are control chart warnings in the post is still present in a persistent-warning scenario. If I open the post, and I open the post details, I can see the warnings section; only now it tells me I have warnings on three columns: yield, brix sugar, and pH. If I scroll down to the comments stream, I can see that same notification about the warnings here in the comments stream. And I'd also like to point out that I have a new active notification pushed out to me. I have a new one here, and it's telling me that the new data, just like the old data, does have warnings associated with it. Now I'll turn it back to Annie and she can take us through the next step of the grape trial.

Thanks, Aurora. So last we talked, we were looking at lot #34.
Lots 33 and 34 were added, and now we've got one new lot come in. That's lot #35. Let's see how it looks. Oh my goodness, the yield is way out of control. This is just unbelievable, this is just remarkable. How does the sugar look? Well, the sugar looks about normal, like we would expect. The pH is also about where we would expect. This is something that's clearly going to involve some investigation, but we still need to report it. Josh, would you like to update the web?

So, we've demonstrated that you can update just the data of the report, which is useful when you want to keep the report contents the same and just update the data. But there's also the ability to replace the report, which existed before, and it's still useful if you want to update the contents of the report itself. I realize that in addition to updating the data, I don't really want to have this moving range chart at the bottom; it doesn't really make sense in this context, so I'm going to right click, say remove dispersion chart, and get rid of that. So now the report is ready to be replaced. I go to file, publish, publish to JMP Live, and it looks like it did before, except instead of selecting create a new post, I'm going to decide to replace an existing post and click next. New in JMP 16, we've updated this search window. My report is right at the top of the list, but you also have the ability to search by keyword and restrict the number of reports, if you've published a lot or this was a while ago and you have difficulty finding it. I'm going to pick the report I want to replace and click next. On this screen I get a summary of the existing picture and title. I'm going to update the title, just to draw attention to the fact that I replaced it, and give it a description.

This time I know something might be wrong with the yield, so while the report does have warnings, this time I'm going to decide to uncheck the enable warnings checkbox. Information about the warnings will still be sent to JMP Live and be available at a later time, but I don't want everyone to get notified about the warnings just yet. Click publish. And again, I'm told that my report has been updated and I can reload it. The new information for the title and description appears in the details. I'll hand it back to Aurora so she can show you what else has happened in JMP Live.

Thank you, Josh. So just to summarize again, Josh has taken a post that has control chart warnings in it, but this time when he republished it, he decided not to enable the JMP Live warnings feature. I'm going to show you what that looks like to a regular JMP Live user, because the content publisher has control over whether their control chart warnings are exposed on JMP Live in a way that's going to draw the attention of other users, and Josh decided that that attention really wouldn't be productive right now. So what does it look like to me? It really doesn't look like a whole lot. There is no icon on the card to draw my eye to it, and I don't have a new notification.
If I open the post, open the post details, and scroll down, that warnings section that I've shown you several times before isn't even present, because Josh has said: I don't think other JMP Live users really need to know about the state of the warnings on this post right now. Furthermore, if I scroll down to the comments stream, I can go back all the way to the beginning and see that when the post was published it did not have control chart warnings; then it was updated and it did; it was updated again and it still did. The most recent comment that I see says Josh Markwordt has republished the post, and it doesn't tell me anything one way or the other about control chart warnings. And again, that's because the publisher has control over whether these things are exposed to other JMP Live users. While I'm here, I'll leave a quick comment, because I see in the description that Josh wants us to look at the yield, and it looks very, very off to me, so I'm going to say: could this be a data entry error? Oops, that was my scroll mistake. I'll submit that and then turn it back over to Annie so that they can do some troubleshooting on this process.

Thanks, Aurora. So we went back and we talked with the data entry people, and it turns out they were entering pounds instead of kilograms. As you notice right here, we're in kilograms. So we updated the data, did a little division on it, and now the yield looks more like what we would expect. The sugar and the pH have been unaffected. Josh, would you like to show how to republish?

Yes, so we've shown several ways to update the data. I'm going to go back to the first way, updating it through the JMP Live UI. I'll click on details, scroll down to the data section again, and click manage. Update data. And when I click update, I'm going to select the fixed data that Annie just presented and submit. Go back to the report and see it regenerate. And, like we noted, the yield is back to looking normal. I'm going to leave a comment for Aurora to let her know that we fixed the units, then hand it back to her to show you what has changed in JMP Live. Aurora...

Thank you, Josh. So I can see his post here. I can open it, and right away, looking at the report itself, I can see that things look a lot better on the yield. So I'm curious about what that was. I'm going to scroll down here, and actually I can get there because I notice I have a new notification. What's that about? I click on it and I see that Josh has replied to my comment; that will take me directly to the post as well. And if I scroll down to those comments and look at that reply, I can see: OK, the units were in pounds instead of kilograms, and it's been fixed now. Fantastic. So it looks like the grape trial is back on track and we're making good progress.

I'd like to take a step back now and talk about the different kinds of JMP Live users that there are and how they interact with control chart warnings. We've talked a lot during these demonstrations about the power that Josh had as the content publisher. The content publisher has control over which tests are turned on or not in JMP Desktop, and the publisher also has control over whether or not to enable this JMP Live feature on the post.
But earlier, when I got a notification about control chart warnings, I mentioned that I got it because the post was published to a particular group, so I'd like to show you a little bit more about those groups. If I go to the groups page, I can see the Wine Trials Group that this post has been published to, and I can see that it is warnings-enabled. If I hover over that, it says control chart warning notifications will be sent to members of this group. Let's open that group up. You can see here as well that it's enabled, and because I actually happen to be the administrator of this group, I can change that. If you come over here to the overflow menu, which is these three dots, and click that, I have the option to disable warnings and stop sending these notifications out to my group members.

I can also change it back. If I change it from disabled to enabled, then I get a prompt that says send notifications now. JMP Live is telling me: OK, you've got a group; it's got some posts in it; because you didn't previously care about control chart warnings in this group, there could be posts in this group that already have warnings and none of your members know about it. So now that you do care about control chart warnings in this group, would you like me to go ahead and send out notifications to all of the members of the group about any control chart warnings that already exist on the posts in here? I'll say no for now, because we already know about this particular problem.

But what if I'm not a content publisher and I'm not a group administrator? I'm just a regular JMP Live user, and I'm getting notifications about other people's processes. As with any other kind of notification, I can opt out. I would do that by going up here and clicking on my notification bell icon, then clicking on the settings icon. If I scroll down, I'll see that there is a new type of notification called control chart warnings. I can toggle this on or off to say whether or not I want these notifications at all, and if I do, I can let JMP Live know with what frequency I want to receive emails about them. I think that Josh also has some closing thoughts for us, so I'll turn it over to him. Josh?

Thanks, Aurora. So we demonstrated the new control chart warnings in JMP Live 16 and how they let you notify interested parties about tests that generate warnings in Control Chart Builder. We've shown some new features in the JMP Live UI that draw attention to the warnings and give you details about what occurred, and settings to control the notifications and warnings from the perspective of both the publisher and group admins. We've also shown that there are several ways to update reports and get data into JMP Live 16. You can publish a report from the JMP desktop. You can update just the data, which is a new feature in JMP Live 16, through both the JMP Live UI and JSL. And you can also still republish a report from the JMP desktop to change its contents. I only briefly touched on the JSL capabilities in JMP Live 16, so if you're interested in more details, or in how to take this process and automate it, please see Brian Corcoran's talk on the JMP Community, The Morning Update: Creating an Automated Daily Report to Viewers Using Internet-Based Data.
It takes a control chart warnings example and shows how you might make this a daily process that publishes automatically. Please see our talk on the JMP Community and leave us feedback.

Finally, we wanted to say thank you. We are just a few members of several much larger teams that have worked on this feature. On the JMP desktop side, in statistics, Annie Dudley Zangi and Tonya Mauldin worked on Control Chart Builder. The JMP Live team, led by Eric Hill, contributed to both this feature and many of the other features that we got to show indirectly while giving this demo. The JMP Interactive HTML team, led by John Powell, created the content of the control chart reports in JMP Live. Our UX and design work is done by Stephanie Mencia, and our project manager is Daniel Valente. Thank you. Thank you, everyone. Thank you.
Joshua Lambert, Assistant Professor, University of Cincinnati

During the process of building a regression model, scientists are sometimes tasked with assessing the effects of one or more variables of interest. With an additive regression model, effects (e.g., treatment) are assumed to be equal across all possible subgroups (e.g., sex, race, age). Checking all possible interaction effects is either too time consuming or impossible with a desktop computer. A new JMP add-in implements an algorithm, the Feasible Solutions Algorithm (FSA), which is meant to explore subgroup-specific effects in large data sets by identifying two- and three-way interactions to add to multivariable regression models. This talk gives a short introduction to the FSA, explains how I moved my R package to JSL, and provides a tutorial on how to use the FSA JMP add-in to explore interactions in your own regression models.

Auto-generated transcript...

Speaker Transcript

Joshua W Lambert Hello, everyone. My name is Josh Lambert. I'm an assistant professor at the University of Cincinnati, and I've been a JMP user for about 10 years now. I'm excited to share with you some work I've been doing around exploring interactions in regression models, using an add-in I built around an algorithm I developed called the feasible solution algorithm. So, here are the contents that we're going to talk about today.
I'm going to start off with a little example, which will motivate our discussion as well as what I've been working on; then an overview of the problem and a potential solution. I'll then discuss how I implemented the solution in an R package, how I moved that over to a JMP add-in, and the process I took doing that, and I'll finish with some future endeavors and some things I learned along the way.

So let's motivate our discussion today with a little example. I'm going to call this Tom the Data Scientist. Meet Tom. He's a data scientist, and Tom does a lot of typical data science activities; specifically, Tom builds multivariable regression models in JMP. He mostly deals with tabular data that has many variables, and Tom realizes that these multivariable regression models lack complexity; specifically, they lack interaction terms and quadratic terms. That frustrates Tom, and Tom doesn't want to go right into machine learning. He wishes he had a way of exploring interaction effects in his regression models. He has interactions he'd like to test, but he would have to handcraft those into the fit model platform in JMP, and having to run all of those is going to take a lot of time. For instance, if Tom has 200 variables in his data set and he needs to look for all two-way interactions, that's 19,900 two-way interactions to check (200 choose 2, which is 200 x 199 / 2 = 19,900). That's a few too many. So Tom wonders: is there an algorithmic and data-driven way to explore interaction effects in regression models without needing to handcraft them all by hand and without needing to check all of these possible combinations?

So let's overview this problem in a slightly more mathematically and statistically rigorous way. What we have is a problem of volume and complexity. Volume and complexity are going to continue to grow at a really fast rate: we have more and more data becoming available to us all the time, and the complexity of these data is also growing. The problem with interactions is that there are just too many of them to check. Usually, a data scientist, statistician, or scientist is going to have to walk into the analysis with a known set of interactions to check and then proceed to check them. This is a problem for a number of reasons. You may not have a good idea as to what interactions may or may not exist, and you'd like to be able to explore them in your data without having to go to models such as random forests, or to principal component analysis for some sort of data reduction; you'd really like to be able to do this along the way as you build a regression model. The other problem with random forests and principal components, as well as other machine learning approaches, is that they are often difficult to interpret, and we'd like to keep the interpretability of regression models while we're exploring interactions.

So, typically, in our regression workflow, the statistician spends a lot of time building a parsimonious base model, as I'll call it, with the necessary variables and no interaction effects added. The statistician will spend much time, many resources, and a lot of care on this base model, thinking about whether it is interpretable and whether it makes good contextual sense.
The problem with these base models that don't include complexity is that they assume the effects they're estimating are consistent across all possible subgroups, like sex, race, and age, and that just typically isn't true. This lack of complexity really limits them compared to, say, a machine learning model, which does a really good job of modeling this complexity. So the problem can really be summarized in the following way: is there a way that we can sit neither all the way at traditional multivariable regression models nor all the way at machine learning models, but find a nice sweet spot in the middle, where we keep the nice interpretability of our regression models and add in an interaction or two that we found in this big data, so that we add interesting nuance and complexity to our models that improve predictive performance as well as good contextual sense?

If there is a way of doing that, what we'd really like to do is develop a tool for statisticians, data scientists, and investigators to explore the interactions after they build a base model. The base model happens first, and then the complexity exploration usually happens second. So we have some constraints and preferences if we were to develop a tool like this. We would like it to be based on traditional statistical models, linear and logistic regression, and we'd like the models to remain interpretable. We want to check fewer models if possible, and we'd prefer feasible over optimal. I'll get a little more into what that means later, but in essence it means that we would like there to be a plethora of solutions that are good, what we'll call feasible, rather than just one single one. We'd also like it to be flexible: it could be adapted to work with linear regression, logistic regression, Cox proportional hazards regression, Poisson regression, any type of regression that this framework or tool could be used for. And again, it's going to be a hybrid between traditional statistical methods and machine learning. The results that come out aren't necessarily going to be inferential, but they are going to be exploratory, and they'll motivate and influence what we spend our time on in the future.

So let's now talk about this potential solution, the feasible solution algorithm. The algorithm, sometimes called FSA, or that's what I like to call it, was first discussed in detail in a paper that I wrote in 2018. I'm going to summarize what the algorithm does here. The goal of the algorithm is to identify interactions of order m (so if m were 2, that would be a two-way interaction; if m were 3, a three-way interaction) with a feasible criterion value, that is, a value that is not necessarily the best, but one that might be considered semi-optimal in some way. To do this, we follow these steps. The first thing we're going to do is start off with a random interaction of order m; for our case, let's assume that to be a two-way interaction, so m is 2. We're going to consider all exchanges of one of the variables in the interaction for all the other variables. So, for instance, let's say we pick a random starting place.
We have five variables (this is just a small example to motivate the steps here), and let's say we randomly start at X3, X5, and our criterion, which is R squared for that random starting place, is .5. The way the algorithm works is that it considers exchanging one of the variables, X3 or X5, for any of the others. So our choices are X3 X1, X3 X2, X3 X4, X5 X1, X5 X2, and X5 X4. Notice that all of the possible swaps, as I'll call them, have at least one of the starting-place variables in them, okay? Then, for all of these, we'll fit those models and figure out the criterion for each. What we can see here is that for the X3 X5 model we have .5, and for X3 X1 the criterion is .4. With R squared, we obviously don't want to go to a worse place; we want to go to a better place, a higher R squared, so we would find, out of all the possible choices, the best place to go to. In this example, the best place to go to is swap number three, which is X3 X4. So we would move on to step three and make the best exchange from step two: in that case, we would move to X3, X4, and then we would return to step two and repeat until no improvements can be made. We repeat this process, moving to a place, starting there, and considering all the swaps, until eventually we can't make an improvement. We're going to call the place where we end up a feasible solution.

And we're going to repeat steps one through four to find other feasible solutions. We can do this over and over again, and this process, the feasible solution algorithm, isn't guaranteed to give you the optimal solution, although it can give you the optimal solution some of the time.

So let's talk about some of the byproducts of using this algorithm. These are outlined in a paper that Elliott, a colleague of mine, wrote in 2021, where she describes that feasible solutions are not guaranteed, as I just said, to be optimal for a chosen criterion. That is to say, all optimal solutions are feasible, but not all feasible solutions are optimal. Feasible solutions are a type of semi-optimal solution: they give you a good, feasible criterion value, but not necessarily the optimal one. The criteria of feasible solutions are typically very close to the optimal ones, though, so they tend to be pretty good; they just might not be as good as the best one. If you repeat the process, these four steps, you will get potentially many feasible solutions: if you run the algorithm 10 times, you might get four feasible solutions, or you might get 10. It depends on the data that you're using, as well as the variance and covariance of that data set, so it's a little bit undetermined walking in as to how many solutions you're going to get. That's why we usually encourage users to repeat the feasible solution algorithm many times, because that increases your chances of getting the optimal ones as well as the feasible ones, and makes sure you've adequately searched the space. We have another paper out that describes that (if anybody's interested, you're welcome to reach out to me), a theoretical paper about how many random starts you should do to have a reasonable probability of getting the optimal one.
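To make the swap step concrete, here is a small JSL sketch of one pass of the search for a two-way interaction among five variables; the criterion function is a random stand-in, whereas the actual algorithm fits the candidate regression model and scores it (with R squared, for example).

```
Names Default To Here( 1 );

// One pass of the swap step for a two-way interaction.
// evalCriterion is a stand-in: the real algorithm fits the model with the
// candidate interaction added and returns the criterion (e.g., R square).
evalCriterion = Function( {pair}, Random Uniform() );

vars = {"X1", "X2", "X3", "X4", "X5"};
current = {"X3", "X5"};               // random starting interaction
bestCrit = evalCriterion( current );
best = current;

improved = 1;
While( improved,
    improved = 0;
    // consider swapping either member of the current pair for any other variable
    For( k = 1, k <= 2, k++,
        keep = current[3 - k];        // the member we keep
        For( i = 1, i <= N Items( vars ), i++,
            If( (vars[i] != keep) & (vars[i] != current[k]),
                cand = Eval List( {keep, vars[i]} );
                crit = evalCriterion( cand );
                If( crit > bestCrit,  // maximizing, e.g., a higher R square
                    bestCrit = crit;
                    best = cand;
                    improved = 1;
                );
            );
        );
    );
    current = best;                   // make the best exchange, then repeat
);
Show( current, bestCrit );            // the pair we stop at is a feasible solution
```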
But, in essence, the way this works is that some of these interactions are more attractive; the beta space attracts them more than others. So what you end up getting some of the time is that, even though an interaction is not the optimal one, the data can often lead you to it through the feasible solution algorithm more often than to the optimal one, and that just has to do with how the data are correlated.

So let's talk about the R package and then how I moved that to the JMP add-in. The R package is called rFSA, and I know we might have some people who have used R quite extensively and some who haven't. R is a really nice programming language, and it's often taught in statistics programs, so it was the first place I went when I was working on my dissertation to write up this algorithm and to provide a tool that the community could use to identify and explore interactions in their own data sets. The rFSA package implements the feasible solution algorithm I just talked about for interactions in large data sets, and it supports the optimization of many different criteria, like R squared, adjusted R squared, AIC, interaction p-value, and so on. It also supports different modeling strategies, like linear models and generalized linear models, and can be easily adapted to work for other types of modeling, such as Cox proportional hazards models. And it gives multiple solutions, as it repeats the algorithm as many times as you specify.

I want to talk a little bit about my motivation for moving this package to JMP: the why, the when, and the how. So the first question is, why do this? The first thing is that it's fun to do. I like writing in different programming languages, and I had not had a lot of experience writing in the JMP Scripting Language, which is JMP's statistical programming language, so I wanted to learn it, and I thought it would be fun to take this R package that I spent the greater part of four years on and move it over to JMP, because I really like JMP. I think JMP is great; it's a really great tool for me and my data analysis pipeline. I usually start all my projects in JMP: I explore the data, I plot the data, I look at things, and that gives me a lot of good intuition about the data. And I found that a lot of my colleagues, specifically one of my advisors, primarily work in JMP, and he was always asking me for the FSA package in JMP. So I thought, why not give it to him? He surely deserves it. So I decided that would be a good tool and that, hopefully, other people would get some use out of it.

That leads me to my other point about why to do this: I've gotten a lot of great feedback about rFSA. I've been tracking how many people have downloaded my R package, and over 16,000 people have downloaded it since we put it out there in 2019. I've gotten countless emails from people all around the world, and I want the same thing to be accessible for people in JMP. I hope that through this add-in, people are able to find really cool interactions that change how they interpret data and how they understand it. And then you might ask, well, when was I going to do this?
This isn't exactly something that somebody's paying me to do. You know, JMP's not paying me to do it.   My current position, while I think that they would find it to be interesting, they're not exactly probably gung ho on me spending a bunch of extra time doing this.   But luckily, one thing I do have built into my position is some free time on Fridays. I try to leave Friday afternoons or Friday mornings open to just fun,   fun things that have to do with my job. And I call it Fun Friday Free Time, and that's what I decided to do for the last few months was take my Fun Friday Free Time and spend it building a JMP add-in.   You might ask, well, how was I going to do this? You know, I didn't know JMP scripting language. How was I going to go about doing this? So   the first thing is is, you know, I was going to learn it. So there's a lot of really great resources out there about how to learn JSL.   So there's JMP JSL code support that is within the actual JMP software itself. Just go to help, you can right...go right into   the scripting index that's there. There's countless things that are online   on how to understand JMP scripting language and where to get started. And then there's the community.jmp.com,   which is really great for getting started, where you can ask questions or view other questions that people have asked and borrow the code that they have posted up there publicly for being able...for people to be able to enjoy. And I did that a number of times here, and it really was great.   So I'm going to kind of go through each one of these a little bit more in depth, just so if you're interested in moving an R package over or writing your own algorithm or writing your own JSL code,   you'll have an idea as to where to get started. And then, when I'm done with this I'm going to go into my add-in and specifically what it does and how to use it.   So how do you learn JSL? Well, you can again...   you can learn it through a number of JMP's JSL resources that they have. So they have a scripting guide that's 864 pages, which is linked here,   that you can Google, or you might be able to get these slides after this is over, and be able to get these links. There's a scripting index within JMP, which you can just go to help and then scripting indexes. I put here for you guys, really easy to use and get started.   You can contact JMP. So I first thing I did is I had a contact at JMP. Her name is Ruth Hummel and I said, hey I want to do this, where do I start?   She gave me support and encouragement about the idea. She thought it was great and then she connected me with a JSL code expert,   whose name is Mark Bailey. And Mark was tremendous through this whole process. Mark helped me with just general support around the scripting language, reviewing my code, helping me write parts of it,   get started on part of it. And we took...we went back and forth about 22 different times through email over the last few months. And   the resources that JMP provided me, as far as direct employees who were willing to help me with my project   were tremendous. I mean, I couldn't have asked for anything better. I mean, I've never received this type of support when I was trying to create something for any other platform before. So kudos to JMP for   providing this, and they're the main reason why this exists; it's because they provided fantastic resources.   So community.jmp.com, you can ask questions there, you can get answers, you can borrow code, you can get certified there in   JMP scripting language. 
You can search the Community for anything you want, you can...it's just like a Google, but it's just for JMP.   And you can search anything you want, so you can type in JSL and JSL whatever you want to do, and it's probably somebody out there that's already posted about that. If there's not, you can add that to the discussion board.   And then there's borrowing code, and this is one of the things Mark passed on to me was, hey, borrow code. There's a lot of code out there on the Community website,   and people have shared it for a reason. So I...on the right here is actually something that I borrowed for my add-in that I developed. I wanted users to know where they were   in terms of running the algorithm and how much longer it was going to take or how much progress had already happened.   And so I didn't want to write my own progress bar in JMP, so I just borrowed one that was on the Community board and added it straight into my add-in, so yeah.   This person, Craige Hales, who is a retired staff of JMP, wrote this code and I borrowed it. So thanks, Craige; Thanks, Mark for recommending borrowing the code. It saved me a lot of time, so I really appreciate it and it made the add-in a lot better.   So now let's talk about the add-in finally. So I've talked to you about why this add-in is needed, right. Others have gotten use use out of the R package and why I think JMP users will benefit from it and I've talked to you about how I did it.   And now I want to actually show you how it works, which is hopefully the most fun part of this whole thing. So I moved this whole package, R package, over to JMP in a few months on my Fun Friday Free Time that I have, so that's just a few hours on Friday. And the JMP add-in   currently only works for linear and logistic regression models. I hope to be able to expand it to other models later, but right now, works for linear and logistic regression models. The other thing is that the add-in,   it doesn't have a lot of the fancy bells and whistles all the other built-in JMP modules have, you know. For instance, when you put a categorical variable into the response variable, the personality type doesn't automatically switch to logistic regression.   I haven't gotten around to that. There's a lot of other things I haven't gotten around to that I hope to improve with this package as people use it and either like it and give me feedback around it.   So it does lack some functionality. And the cool thing about the add-in manager is that you can just take your JSL code and there's an add-in...an add-in manager, JMP add in,   that allows you to create your own add-in from your JSL code, so it's really just a simple couple of button clicks, you can take your JSL code and turn it into an add-in that you can share with the whole JMP community.   And I've posted this on the community.jmp.com website for everybody to be able to go out there and access that add-in that I'm about ready to show you   and to access the data set that I'm going to use as well. But this will work with any of your data sets that you have, not just with my example.   So I'm going to give you a live tutorial really quick of this add-in, called...the add-in, I just called it exploring interactions, and   this is all going to be done though via the feasible solution algorithm that I talked about earlier. The example is a linear model   where I'm going to fix two variables, so that would be my base model. 
My base model will have a continuous response variable and two covariates that I want to adjust for. And then what I would like to do is consider second-order interactions, between any of the 10 variables I have in my data set, to add to that model. I'd like to do five random starts, and my criterion, which currently is the only criterion built into the JMP add-in, is to minimize the interaction's p-value. So each of these interactions that we check produces a p-value, and I want my solutions to be the ones that have very small p-values. So it's going to search the space based on the interaction's p-value and go to the places that have the best one, and our results are going to contain a lot of interactions with small p-values. And usually what I'll recommend after you do this type of procedure, the feasible solution algorithm, is that you follow it up by plotting the data, looking at those interactions in your model, and thinking critically about them in a contextual sense, because at the end of the day, we're exploring the data here. This is not inferential in any way. We're just using the signal in the data to point to what interactions may exist in these data. And again, the data and the add-in are posted on the Community website. Alright, so I'm going to stop, get out of this really quick, and pull up the JMP data set that I have. Hopefully you guys can see this. As you can see, just as a really quick overview, this is actually all a bunch of random data that I generated. There's no real structure to this at all. This isn't real data, just randomly generated data in JMP. It has 20 observations here. I've got 10 continuous explanatory variables. I have a continuous response variable, and I have also categorized the response variable as either being greater than zero or less than zero. So if you want to look at logistic regression results, you could do a logistic regression example with the data as well. So, once you get the add-in, you go to the website, you download the add-in, you just double click on it and it installs directly into JMP. It's really easy. And then, once you have your data set open, you go to Add-Ins and then Explore Interactions. And this will pop up here. So we have a couple of things. We'll see all of our variables over here on the left, just like you would with any other module in JMP. And so I'm going to pick my response variable, which is going to be Y, and then I'm going to fix... so this is where I'm constructing my base model. These are all the linear main effects that I'm adding to my model. So in this one, I've got two variables, X1 and X2, that I'm going to add here. Now one of the things you need to do is specify the modeling type, so there are two types in my add-in you can choose from. One is the standard least squares modeling type and one is the logistic regression one. And this isn't going to choose automatically, like I said earlier, based on what you've put in here, so you can totally put in a categorical response variable and it's not going to switch. You have to switch it yourself. That's one of those fancy bells and whistles, hopefully, I'll get to later. Get rid of that. Put this back in here.
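In script form, each candidate fit behind this dialog looks roughly like the sketch below. This is an illustration of the model structure only, not the add-in's internal code: the base model keeps the fixed main effects, one candidate interaction is appended, and the add-in then reads that interaction's p-value as its search criterion.

    // One candidate fit: the fixed base effects X1 and X2 plus a single candidate
    // interaction. The add-in repeats a fit like this for every swap the feasible
    // solution algorithm considers and keeps the candidates whose interaction
    // p-values are smallest. (Illustration only -- not the add-in's internal code.)
    dt = Current Data Table();               // the randomly generated example table
    dt << Fit Model(
        Y( :Y ),
        Effects( :X1, :X2, :X3 * :X5 ),      // :X3 * :X5 is one candidate interaction
        Personality( "Standard Least Squares" ),
        Emphasis( "Minimal Report" ),
        Run
    );
    // For the logistic example, the personality would be "Nominal Logistic" instead.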
Alright, so this is our setup, and this will be fitting the model y equals beta zero plus beta one X1 plus beta two X2. And now down here is where I put in the variables I want to look for interactions between, so I'm going to select all of X1 through X10 and add those over. Then I tell it how many times I want to run the algorithm. Here I'm going to do five times, for the sake of time, and the order of the interactions to consider, so I'm only going to do two. I recommend staying at three or below; it usually gets a lot harder to interpret interactions that are four-way or five-way, but the way that I've set this up, you could go that high if you wanted to. I haven't actually put a limit on it, but just know that the higher you go here, the more time it's going to take to run. So I'm going to hit OK. This is going to take a second. So you can see the code that I borrowed. You can see it came up here and it tells me I'm 20% of the way done, 40%. So these are solutions. Every time that progress bar ticks forward, a feasible solution is being found, and then what pops out at the end is a data table. So, you know, again, this isn't one of those features... I'd like there to be a selection box where you could select the interaction and then hit create model, and it would do all of that for you, just like JMP already does for a lot of different things, but I haven't gotten there yet. This just tells you what the solutions were. So it gives you a summary of what it did. It did five random starts. It tells you what the response variable was, the fixed variables, the interaction variables that it found, and the R squared and adjusted R squared for that model, the root mean squared error, and then the interaction p-value. Okay, so what we can see here is that (if I click on this it'll show me) there are only two solutions: X2 X3 was one interaction and X10 X7 was another. And so now what I can do, though I'm not going to do it now, is go and build this model, reproduce these results, and look into the results to see, okay, is this a good statistical interaction that I found here? You know, what do my leverage plots look like? What do all my different criterion plots look like? And what do my diagnostic plots look like? Those types of things. And then I can begin asking questions, like does this interaction make sense in terms of the parameter estimates and whatnot? It's exploratory, so it's going to give you potential solutions, but you have to follow that up with due diligence and actually think about the results you've gotten and ask questions of those who are contextual experts in the field. That's usually how we use this in the health sciences. We use this with health data, we produce interactions, and we then go to the physicians to ask whether it makes sense that this effect would not be consistent across age or sex or whatever. And they'll usually talk to us about that and think about it. The whole idea is that this will hopefully influence future studies, where we can power the studies to look for these interaction effects in a more well-powered way. Alright, so I'm going to go back now to my PowerPoint. So that was the tutorial. You can see it's pretty easy to use. That was a really small example.
You can ramp this up,   just know, you know, the bigger data sets you have, the longer that whole process that I just did is going to take. It's not going to be 10 seconds long; it's going to be,   you know, potentially, you know, 5-10 minutes long, depending on how big your data set was. I've done...used the algorithm, used the add in to do data sets with 100 variables,   explanitory variables, and it took about that amount of time, about about seven minutes, to run it to do about five or 10 random starts.   Things are faster in the R package than they are in the JMP add-in, but that's because, you know, again, this is my first stab at this and, hopefully, through iterations that will get even faster and better.   So let's talk about future improvement. So I want to allow users to be able to optimize or to   find feasible solutions based on any, you know, any criteria they want. So if they want to find feasible solutions based on optimizing R squared or optimizing AIC or optimizing the misclassification rate,   you know,in logistic regression models, I want to allow them to be able to do that. And it's really not that hard of a thing to do, I just need to go in and do it.   Automatically save or select modeling personalities based on variable selections. That's just simple. You put in a   binary logistic variable, you know, it's going to automatically select logistic regression for you. I don't think that's probably super hard, I could probably do that pretty quickly. I feel confident doing that having gone through this process of moving my R package over.   I want to improve the speed some. There's a lot of things I could do to improve the speed of this.   So, I'd like to be able to do that. A recall button would be nice. I don't have that now. That's one of my favorite features in JMP is being able to click that recall button.   And then I'd like to be able to streamline going from the results that you got from FSA to building a model. So   you can do that already, like when you do forward selection in JMP or Lasso or any of those things, you can just click make model and it brings all the variables over super easily.   I'd like to be able to do that with this feasible solution algorithm, just get those types of results and just click the make model and it automatically does everything for you   so that you can explore them, and, again, hopefully, you make good sense of those interactions that you found.   So some acknowledgments that I'd like to go through. So first is Mark Bailey. I couldn't have done it without you. Thank you so much for your support with the JSL code. Thank you to Ruth   for just getting me in contact with Mark, as well as just being a general, you know,   support person for this project and just providing good feedback. Anne and Larry, I just appreciate   your help with all things that had to do with JMP and with our past relationships and for encouraging me to do this Discovery Summit. It's been really great and I'm excited to meet more people in the JMP community   through it, so thank you for that. You can get in touch with me in a number of ways. You can send me an email at my university email, you can contact me on Twitter if that's the way you like to do things.   And I also am planning on posting this JMP add-in up on github, as well as the Community website, so please feel free to check either of those places for future updates of the add-in.   So yeah, thank you for having me and I'm excited to take your questions during our allotted time, thank you.
Thomas Walk, Large Plant Breeding Pipeline DB Manager, North Dakota State University Ana Heilman Morales, Large Plant Breeding Pipeline DB Manager, North Dakota State University Didier Murillo, Data Analyst, North Dakota State University Richard Horsley, Head of the Department of Plant Sciences, North Dakota State University   Crop breeders, often managing numerous experiments involving thousands of experimental breeding lines grown at multiple locations over many years, have developed valuable data management and analysis tools. Here, we report on more efficient crop evaluation with a suite of tools integrated into the JMP add-in dubbed AgQHub. This add-in provides an interface for users to first query MS SQL Server databases, and then calculate best linear unbiased predictors (BLUPs) of crop performance through the mixed model features of JMP. Then, to further assist in selection processes, users can sort and filter data within the add-in, with filtered data available for building reports in an interactive dashboard. Within the dashboard, users segregate selected crop genotypes into test and check categories. Separate variety release tables are automatically generated for each test line in head-to-head comparisons with selected check varieties. The dashboard also provides users the option to produce figures for quickly comparing results across tested lines and multiple traits. The tables and figures produced in the dashboard can be output to files that users can readily incorporate into variety release documentation. In short, AgQHub is a one-stop add-in that crop breeders can use to query databases, calculate BLUPs, and generate report tables and figures.     Auto-generated transcript...   Speaker Transcript Curt Hinrichs Alright, Tom Walk, with Anna and Didier, with their poster on AgQHub. Tom, take it away. Tom Walk Thank you so much, Curt, and thank you to the JMP community for inviting us to this presentation. We're so glad to be here to show you our work. Today we're going to talk about a tool we're building at North Dakota State University in the Department of Plant Sciences called AgQHub. It's primarily the work of our team, the plant breeding database management team, of myself and Anna and Didier, and Rich Horsley, who's the department chair of plant sciences. And what we've done is we're trying to help the plant breeders, who have this long-established cycle. You can imagine that if you want to improve crops, it's going to take a long process. If you want to do it right and consistently, you have to set up a lot of experiments. You're not just going to get lucky very often and choose the best crop. So what you have to do is set up a lot of crosses and a lot of trials with thousands of lines initially, and from those, you have to go through this decade-long cycle or more and make choices every year about which were the best lines to advance. And all of this would change with environmental conditions, so we have to get the right combination of genes with the environment. And you have to have the right analyses and experimental designs to do that, so this gets very complicated for a plant breeding team. And it's a long process to make any variety selections. And what we want to do is to make the selection processes easier every year. It'd be nice to shorten this whole process, but our more immediate goal is to make the process more efficient, the selections more efficient, at each stage.
And what we've done for that is we're developing this tool, AgQHub, and we're using it with our breeding programs. We have 10 breeding programs within plant sciences at NDSU and over 60 users. We've also incorporated two research extension centers with more variety trials and field sites. And this list is growing. We're trying to add more users and will probably add in more research extension centers. So what's nice about AgQHub, and the reason why we have these users, is the functionality in AgQHub. It allows you to connect to the database directly, and you can see data from decades' worth of experiments. And once you have that, you can do the analysis. You can look at experimental designs, you can view the histories of your varieties, you can look at distributions of data for individual experiments, you can calculate BLUPs and make those predicted values. And once you have those predicted values, you can get those head-to-head comparisons, and once you have those head-to-heads, then we can start building reports. And that's what we're going to get into: we want to be able to make reports by subsetting data and building the tables and visualizations that make it a lot easier for our users. And all of this is done within seconds, with a few clicks in AgQHub, saving up to weeks of time compared to the past, when users were using spreadsheets and workbooks. So just to give you an idea of a workflow in AgQHub, here's one cycle of generating data for reports. Users will go into AgQHub, select the database they want to use, the type of analysis or query they want to run, and the output they want to have. Then they'll click start, and after that, another window will open that prompts them for the parameters for the queries, such as the experiments they want to query, the years, and the traits or treatments that they want to look at. And after they select their parameters and click OK, the data will pop up in these data tables, and the data tables are within the AgQHub add-in, so all the data tables are compiled nicely within these tabs here in AgQHub. And then here are some of the newest features we have: users can select the varieties they're interested in, they can sort, they can do some filtering, and then they can make these filtered variety tables. And with these filtered variety tables, they can export those into their reports, into Excel or other documents. And once you're done with one, you can start over and move on to another set of experiments, and click cancel after you do as many analyses as you wish. We're still working on this; it's a work in progress and we get a lot of great ideas from our users. We're always expanding this to more users and research extension centers. That's been helpful for us in building this up. As we do that, we're looking to compile templates of the release tables used by the programs. With that we can build up output tables that make it easier for the users to produce head-to-head tables and variety release tables. And then we'd also like to make it easier by adding visualizations for making quicker variety comparisons. And Anna has some great ideas about that, with her experience as a plant breeder.
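The core of each cycle described above, a database query followed by a REML mixed-model fit whose random-effect predictions are the BLUPs, can be sketched in JSL roughly as follows. The DSN, SQL, table, and column names are placeholders and assumptions for illustration, not NDSU's actual schema.

    Names Default To Here( 1 );

    // 1) Query trial data from MS SQL Server (placeholder DSN, table, and columns).
    dt = Open Database(
        "DSN=PlantSciDB;",
        "SELECT entry, location, year, rep, yield FROM yield_trials",
        "Yield Trials"
    );

    // 2) Fit a REML mixed model; entries are treated as random, so their
    //    predictions are the BLUPs used for head-to-head comparisons.
    fit = dt << Fit Model(
        Y( :yield ),
        Effects( :location, :year ),
        Random Effects( :entry ),
        Personality( "Standard Least Squares" ),
        Method( "REML" ),
        Run
    );

From the resulting fit, the conditional (BLUP-based) predictions can be saved back to the table from the platform's save-columns options and fed into the head-to-head and release-table reports.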
Finally, we're always looking to make the interface more dynamic, with options that perhaps change as users click things. And with that, that's my talk, but I would like to start this short video to give you a better idea of what AgQHub does. And Curt, thank you for this opportunity again. With that, I do have one more thing I'm excited to show you, and I'm going to request the share screen. I want to show our users one more thing, and this is the newest thing in AgQHub. This is what we're excited about; this is the direction we're going. What we have here is, once the users make their selections and filtering and pick the varieties they want to make tables with, we're making dashboards that open up, and they can select, among their filtered varieties, which ones they want to be check lines. Like, for example, we want these historical varieties to be our check lines, and we want to compare our new test lines against those check lines. And we're going to select the different types of traits we want to look at: the traits we've seen in the field versus traits we measure in a laboratory, or traits such as disease traits. And then we click make tables and it will output these tables. I'm not going to do that right now, to save you a little time. But what it's going to do is output tables for each of these traits, for each of these varieties, and you can see it still needs some work. I need to put the names of the varieties in, but we're working on this, and I'm very excited, and I wanted to show you this before I end. That's the direction we're going, and we're going to build on this dashboard to keep making these tables and outputs for our users so they can put them into better formats, format them further as outputs to Excel and Word or whatnot, and build visualizations where we can do head-to-head comparisons, comparing how this variety does against that variety. So we're building up on these dashboards. We are very excited about this, and I'm so happy to show this and share it with the JMP community. And before I go, I want to thank everyone at North Dakota State University: Anna, Didier, Rich and myself, and all our other users. Thank you so much.
Danilo Toniato, Engineer, WEG Andre Caron Zanezi, Engineer, WEG   Faced with a vast and competitive field, appliance manufacturers are constantly seeking quality improvements, aiming to meet the needs of an increasingly demanding market. A reflection of this requirement can be observed in the progressively tighter specifications for the sound power level (NWS) emitted by the product. To meet and exceed customers' expectations, this work addresses the use of design of experiments (DOE) to analyze multiple electromechanical characteristics and their respective relation to motor noise across the frequency spectrum, specifically in washing machines. Using Graph Builder and DOE in JMP to better fit models for these analyses, it was possible to define the most significant factors for reducing noise without impacting motor manufacturing costs.     Auto-generated transcript...   Speaker Transcript
Danilo Toniato Hello everyone. My name is Danilo Toniato. I'm a mechanical engineer and PhD student, and today I will be presenting work that was performed aiming to improve the vibro-acoustic characteristics of [???] motors for the appliance line. It's undeniable that the market and consumers are increasingly demanding about products, especially when it comes to noise in the home appliance segment. After all, when we get home from an exhausting day at work, what we usually want is to have a quiet, peaceful night with our significant other. So, aiming to exceed our customers' expectations, we [???] started to mitigate the noise generated by our electric motors. Of course, our second objective is that no product cost is increased, and all of this with a DOE analysis to help us. So, shall we begin? First, we analyzed the magnitude spectrograms of reference motors and, according to the user experience during operation of these motors in the application, we listed the frequencies that are most critical, most uncomfortable, for the user's auditory sensation. We also established our working ranges, which in this case were mainly concentrated in the 480 Hz region. We identified, through the research literature, the main root cause and the main factors that could minimize such noise. With this information, a DOE, as you can see here, was developed so that we could quantify the influence of these variables, and so that we could define the levels at which our process can deliver a product with good capability regarding noise levels, without interfering with any other electromechanical characteristics, and without burdening our final consumer. Altogether, there were four continuous numerical factors in a block and [???] factors for our DOE. The results were very positive, and we were able to achieve the desired noise and vibration levels. The factors proved to be significant, but only at the frequencies of interest, as we expected. We achieved a reduction of approximately 60% in the vibration levels, as we can see here in the JMP variability plot. And here, on the left of the screen, we can see from the analysis of variance in the JMP Fit Model that the factors we chose for our DOE are only significant at the frequencies of interest, like 480 Hz, and are not relevant elsewhere. And the reduction in noise levels proved to be even more [???], as we can see here, and even more significant when we compare the surface plots. The first graph shows our initial results before the DOE analysis, and the second plot is the best condition we achieved in the DOE. The colored frequency line marks the frequencies that are critical for our consumers' experience. In conclusion, we were able to understand which of the factors listed in the research literature were the most significant.
Meaning, which combination of these factors and their respective levels would meet the market needs, optimizing our process, optimizing our product, and minimizing our production costs. This optimization that I was talking about was performed using the Profiler tool in Fit Model and was subsequently validated with the production of a larger-scale batch. Now, then, I would like to thank you for watching, and to thank our hard-working team and the sponsors; they were essential to the conclusion of this work. Thank you.
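For readers who want to script the general analysis pattern described above, a blocked model of several continuous factors fit in Fit Model, checked with a variability chart, and then optimized in the Profiler, a minimal hedged sketch follows. The factor and response names are invented for illustration and are not the authors' actual design or data.

    Names Default To Here( 1 );
    // Illustrative only: the data table is assumed to hold one measured vibration
    // response plus four coded continuous factors (A-D) and a Block column.
    dt = Current Data Table();

    // Fit the blocked four-factor model and screen for significant effects.
    fit = dt << Fit Model(
        Y( :Vibration ),
        Effects( :A, :B, :C, :D, :Block ),
        Personality( "Standard Least Squares" ),
        Emphasis( "Effect Screening" ),
        Run
    );

    // A variability chart of the response by block helps visualize the reduction
    // between motor configurations.
    dt << Variability Chart( Y( :Vibration ), X( :Block ) );

The Prediction Profiler can then be opened from the fit report's factor profiling options to pick the factor levels that minimize the response, as described in the talk.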
Isaac Himanga, Senior Analytics Engineer, ADM   Manufacturing processes are highly correlated systems that are difficult to optimize. This correlation provides an opportunity to identify outliers. Here, we demonstrate an entire workflow, which involves:   Obtaining data from the OSISoft historian. Modeling and optimizing with multivariate models that account for process constraints. Using the MDMCC platform. Scoring models online and writing abnormal conditions back to the historian. In practice, many processes are hard to optimize. Sometimes a system cannot be reduced to a set of independent variables that can be tested; in other cases, the process can become unstable under different conditions. To address these issues, we are using new features in JMP 16 to optimize for cost and quality in the latent variable space, while accounting for constraints on observed variables. In order to remain at those optimal conditions, we use the MDMCC platform to quickly identify deviations from normal.   Add-ins are used to minimize the time spent gathering data and scoring models: one to pull process data into JMP from the popular PI historian, one to quickly recreate calculated columns, and a third to score models online and send results to the historian, including contributions from the MDMCC.   Add-ins referenced are available on the JMP Community: Aveva/OSISoft PI Tools  Scripting Tools      Auto-generated transcript...   Speaker Transcript
Isaac Himanga My name is Isaac Himanga, and I'm going to demonstrate a workflow I use at ADM to optimize manufacturing processes using multivariate models. I'll start with pulling data and building a model, then finding realistic optimal conditions, identifying abnormal conditions, and finally scoring the model using current data. There's a lot of other information on the individual platforms here, so instead of discussing the details of each step, I'll only highlight a few commonly used features and instead try to show the whole workflow. I will say most analyses with the amount of detail this one has take a little longer than 45 minutes. So head over to the article for this talk in the JMP Community for more detail and information, including a journal with screenshots of steps I'll move through pretty quickly here. I'll start with a brief overview of ADM and the general workflow. Then I'll put this presentation aside to show you the process in JMP.
You'll see what it looks like to get and clean data, use the profiler to find optimal conditions, use the model driven multivariate control chart platform, and write a script to score new data against that model, and finally, I'll return to this presentation to briefly give one method to continuously score that model using JMP. First, a little about ADM. ADM's purpose is to unlock the power of nature to enrich the quality of life. We transform natural products into a complete portfolio of ingredients and flavors for foods and beverages, supplements, nutrition for pets and livestock, and more. And with an array of unparalleled capabilities across every part of the global food chain, we give our customers an edge in solving the global challenges of today and tomorrow. One of those capabilities is using data analytics to improve our manufacturing processes, including the method I'm about to talk about. I am part of the relatively new focused improvement and analytics center of excellence, and our growing team is invested in techniques, like this one, to help our 800 facilities, 300 food and feed processing locations, and the rest of our company around the world make better decisions using our data. Now, an overview of the workflow. The four steps I'll review today only represent part of the complete analysis. In the interest of time, I'm going to omit some things which I consider critical for every data set, like visualizing data, using validation, variable reduction, corroborating findings with other models, and aligning lab and process data. The four steps are getting data, building a model, scoring that model on demand in JMP, and then scoring the model continuously. JMP has tools to support each step, including an array of database connections, multivariate modeling tools like partial least squares, principal components, and the model driven multivariate control chart platform, and of course, the JMP Scripting Language, or JSL. Let's start with getting data. Despite the many database connections and scripting options available, we needed a quick way to pull data from our process historian, a popular product called PI, without writing queries or navigating table structures. A custom add-in was the answer for most data sets. This add-in was recently posted to the JMP Community. Two more add-ins assist in this process. One, generically called scripting tools, includes an option to quickly get the script to recreate calculated columns; when combined with the save script functionality built into most JMP platforms, analyses can be recreated quickly and scored on demand by a JMP user. The last add-in, called the JMP model engine, is also the newest. It uses a configuration file and information saved to the data table from the PI Tools add-in to get data. It then makes calculations using column formulas or any other JSL script and writes results back to the historian. In the interest of time, I'm going to move very quickly through this last step, but again, I encourage you to look for more details on the JMP Community using the link on the agenda slide of this presentation. Each of these add-ins was overhauled to remove sensitive information, and we're shifting users to the same code that's posted on the Community, so if you have ideas on how to improve them, feel free to collaborate with us on the JMP Community or over on GitHub. With that, let's open JMP.
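Before the demonstration, it may help to see that, under the hood, a PI pull is an ordinary database query. A rough sketch of the kind of call the PI Tools add-in wraps is below; the DSN, table, and tag names are placeholders and assumptions, not the add-in's generated query.

    Names Default To Here( 1 );
    // Placeholder DSN, table, and tag names -- not the add-in's generated SQL.
    sql = "SELECT tag, time, value, status FROM piinterp " ||
        "WHERE tag IN ('FLOW1.PV', 'FLOW2.PV') AND time >= '01-May-2021 12:00:00'";
    dt = Open Database( "DSN=PI_DAS;", sql, "Process Data" );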
Behind the scenes, the JMP PI Tools add-in uses a couple of different PI products to pull data, including SQL DAS, OLEDB and ODBC. The instructions to set this up can be found in the help menu for this platform. Today we're going to pull data for a set of PI tags from May 1, starting at noon, and then we're going to pull another value every day at noon until today. We're going to do this for a certain set of tags; those are listed here in this box. Notice we've actually included what's called a friendly tag, a name that's more descriptive than just the name in the PI historian, which will help us identify those columns later in the data table. There are little question marks around the add-in giving more information, including a description of the friendly tag format that we're going to use today. When I hit run query, it's going to open a data table that looks like this one. It's got a set of columns for all of the numerical values in that data table. It's got a set of columns for any string value, so if you had characters stored in the PI tag, it will pull those. And we can also see the status of each PI tag for each row. Also, in the column properties for that data table, if we open the column info, we're going to see the information that was used to pull that data point, including the PI tag, the call type, and the interval. For more complex queries, like an average or a maximum, we're going to see more information here. I will note that this is real data that's been rescaled, and the column names have been changed, but the analysis that we're going to see today should look and feel very similar to a real analysis using actual data from one of our facilities. Finally, I'll point out that there's a script here saved to the data table called PI data source, which is the same script that's shown in the add-in, and it contains the SQL that's also available here. Again, behind the scenes this uses SQL DAS, or PI DAS, in order to get that information from PI, and these are all the scripts that it's going to run to get that data. We're going to come back and use this again near the end of the talk today. Okay, now that we've got data, we need to clean that data up. We're going to use multivariate tools to do that, specifically principal component analysis. I'll pull all of the numerical values from that data table and put them into the Y columns list, and then, right away, we can see quite a few values that have particularly low scores for component one. If you look at the loadings for the different factors, we can see that component one includes high loadings for all of the different flows in this data set. So that tells me that all of these values over here on the left have low flows across the board for the whole system. Using some engineering knowledge, I can say that this represents downtime for that process, so I'm going to go ahead and hide and exclude all of these values. Now that we've done that, we'll recalculate the principal component analysis, so we'll use Redo to redo the analysis and then close the old one. And now we can see the loadings are perhaps a little bit more meaningful. Principal component three, for example, explains most of the variation in flow 2, and there's a little bit of separation here.
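In script form, that downtime-screening step looks roughly like the following sketch, assuming the pulled table is the current data table and using made-up column names. The score cutoff is something you would read off your own score plot, and the saved-score message mirrors the red-triangle command of the same name.

    dt = Current Data Table();               // the table returned by the PI Tools add-in
    pca = dt << Principal Components(
        Y( :Name( "Flow 1" ), :Name( "Flow 2" ), :Name( "Flow 3" ),
           :Name( "Flow 4" ), :Name( "Flow 5" ) )
    );
    pca << Save Principal Components( 3 );   // adds three score columns to the table

    // The first new column holds the component 1 scores; rows scoring far to the
    // left had low flows across the board (downtime), so hide and exclude them.
    scoreCol = Column( dt, N Cols( dt ) - 2 );
    For( i = 1, i <= N Rows( dt ), i++,
        If( scoreCol[i] < -4,                // cutoff read off the score plot by eye
            Hidden( Row State( i ) ) = 1;
            Excluded( Row State( i ) ) = 1;
        )
    );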
The first three components explain the majority of the variation, so I'm going to use those three components when looking for other outliers in this data set. To do that, I'll open the outlier analysis and change the number of components to three. Then we can see the T squared plot, and I can also open the normalized DModX plot, to see points that either have extreme values for one or more of the scores, or points that have a single column, a single cell, with a value that's unexpected or outside of the expected range based on all the other columns in the data set for that particular row. For now, we're just going to quickly clean this data set by excluding all of the T squared and DModX values that are above the upper control limit. One more thing that's commonly done when cleaning a data set is transforming columns, and I want to show a feature of the scripting tools add-in that makes this a little bit easier than trying to apply a transform through the New Formula Column menu. If I select three columns, or any number of columns, and go to the custom transformation option, which is again loaded as part of that scripting tools add-in, I can select a Savitzky-Golay transformation and hit OK, and it will add three columns to the end of the data table with a formula containing that transformation. I will note that the cleaning we did could have been done directly in PLS. I often use a PCA first, though. Okay, now that we've cleaned our data set, we need to actually build a model to try and predict our process conditions. Maybe another quick note about this data set: we have a flow target up here. Today, our goal is going to be to create a certain amount of this target flow using as little of flow one as possible, while also taking into account some constraints on these quality parameters. So because flow one is what we're primarily interested in, I'm going to switch over to a partial least squares model and use that target flow in the Y and all the other variables as X factors. I'll just accept all the defaults for now, and I'm going to jump ahead a little bit and right away use four factors. When I open those four factors, we'll see that the first two represent the variables that the plant normally changes in order to adjust how fast they run the process. So if they need to make more or less of this target flow, they often change factor one in order to achieve that target rate. Factor 2, on the other hand, relates primarily to these quality parameters, which are actually input quality parameters that we don't have control over. So it's not something that we can change. So even though factors three and four explain relatively small amounts of the variation of our target flow, and relatively small amounts of the variation of our factor one, those are the ones that we actually have control over, and so those are the ones that we're going to be able to use in order to optimize our process. So we've built a model that explains the variation in our data. In order to use that information, we need to save the predictions from this model to new columns in our data set that we're going to use in the prediction profiler in just a few minutes. We'll make use of a few new features that were added in JMP 16, allowing us to save predictions, both for the Y and the X variables, as X score formulas.
And when we open the profiler, I think it'll help to illustrate why that becomes important.   So we've saved all three of the predictions, the X predictions and the T squares, back to our data table. Those should be new columns at the bottom as functions of these X scores.   We can also take a quick look at the distance and T squared plots within the PLS platform, and we see that while there's a few variables that have pretty high DModXs or T squareds, there's nothing too extreme.   These scores are often saved or are always saved with variable names that can become confusing as you save more and more or go through multiple iterations of your model. So the scripting tools contains another function, called the rename columns,   which will allow you to select an example query for PLS. It has a pretty complicated regular expression pattern here,   but notice it outputs a standard set of names that are going to look the same for PLS, PCA and other platforms within the data table.   So in this case I'm actually going to copy a specific prefix, we'll put before all of those columns indicating that this is the model we're going to put online for our quality control and it's revision one of that model.   When I change names, we can see it's it's automatically changed the name for all of these columns in the data set.   So we've built a model explaining the variation in   these columns, but what we haven't done is our original goal of figuring out how to produce a certain amount of flow...of our target flow using as little of flow one as possible. To do that we're going to use the profiler. Notice when we open the profiler,   we can add all of these predicted values, so not the X scores themselves, but the predicted values and the T squared, to this prediction formula section.   And then, when it opens up, we'll see across the X axis, all of our scores in the model or our latent variables.   And we can see when we move one of those scores, it's going to automatically move all of these observed variables together   at the ratio of the loadings in each one of those components. So importantly, take a look at these flows three and four, they always move together. No matter which score we move,   this model understands that these two scores are related. Perhaps one is a flow meter that they have control over, and perhaps a second one   is a second flow meter on that same line, but regardless it doesn't look like the plant is able to move one without moving the other one in the same direction.   So the goal is to find values for each one of these scores that are going to optimize our process.   Before we can do that, we need to tell JMP what it is we're trying to optimize. We need to say that we have a certain flow rate we're trying to hit and certain quality parameters that we're trying to hit.   So we're going to start by telling it that we don't care about the values for most of these columns. So we'll select all of the columns that we saved,   we'll go to standardize attributes, and we're going to add a response limit of none for all of these different columns.   Then we'll go back and we'll adjust that response limit. It can be more descriptive for the columns that we do care about. For example, the flow target will go back to response limit and we'll indicate that we want to match a value of 140 for that column.   Similarly for quality one, we want to hit the value of   20.15.   For the flow one, we want to minimize that value.   
And finally, we need to make sure that the solution that we come to is going to be close to the normal operating range of the system, so we don't want to extrapolate outside the area where we built that PLS.   To do that, we'll use this T squared column, and we'll indicate that we want to minimize the T squared   such that it's below the upper control limit that was shown in that PLS platform. Here we can see the upper control limit was 9.32, so we'll use that as the minimum value here.   What we should see in the profiler now is every value below 9.32 is equally acceptable and, as you go above 9.32, it's going to be less and less desirable.   Under the graph menu I'll open the profiler once again, and once again take all of those predictions and T squared formulas and put them into the Y prediction formula.   And we still see the X scores across the bottom, but now we also see these desirability functions at the right.   And once again the desirability function for T squared starts to drop off above 9.32. The desirability is highest at low values of flow one and we've got targets for both the the target flow and the quality parameter.   Because we've defined all of these, we can now go to maximize desirability.   And that's going to try and find values for each one of those scores, and thus, values for all of the observed variables in our data set that are going to   achieve this target flow and achieve the targets that are in these that we defined earlier.   Notice it came close to hitting the full target, but it looks like it's a little bit low. It did achieve our 20.15   and it was within the T squared bound that we gave it. Most likely JMP thought that this tiny little decrease in desirability was less important than reducing flow, so we can fix that by control clicking in this desirability function and just changing the importance to 10.   Now, if we optimize this again, it should get a little bit closer to the flow target. Before we do that, though, I'm going to save these factor settings to this table at the bottom, so we can compare before and after. And then we'll go ahead and maximize desirability.   Looks like it's finished and we're still within bounds on our T squared.   It has achieved the 20.15 that we asked and it's certainly much closer to 140. So now we could once again save these factor settings to the data table.   And we should now have factors that we can give back to the manufacturing facility and say, hey, here are the parameters that we recommend.   The benefit of using a multivariate analysis like this, we talked about those flows three and four being related earlier,   using this method, we should be able to give the plant reasonable values that they're actually able to run at.   If you tell them to run a high value for three and a low value for flow four, they might look at you and say well that's just not possible.   These should be much more reasonable. Note that not all of these variables are necessarily independent variables that the plant can control. Some of those might be   outcomes, or they might be just related to other variables. In theory, if the plant changes all of the things that they do have control over, the other things should just fall into place.   So now we've optimized this model, the next step is often to make sure, or to verify, are we running normally. So once the optimal conditions are are put in there, it's good to use the same model to score new rows or new data and understand, is the process acting as we expect it to?   
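A rough JSL sketch of that optimization step is below, assuming the Response Limits properties (match 140, match 20.15, minimize flow one, keep T squared under its limit) were already set through Standardize Attributes as just described. The renamed column names are placeholders; if your JMP version scripts the optimizer differently, the same commands are available from the profiler's red-triangle menu.

    // Relaunch the profiler on the saved prediction and T-square formulas and let
    // the desirability machinery search the latent-variable space.
    prof = Current Data Table() << Profiler(
        Y(
            :Name( "QC Rev1 Pred Flow Target" ),
            :Name( "QC Rev1 Pred Quality 1" ),
            :Name( "QC Rev1 Pred Flow 1" ),
            :Name( "QC Rev1 TSquare" )
        )
    );
    prof << Desirability Functions( 1 );     // show the goals set in Response Limits
    prof << Maximize Desirability;           // honors Match Target / Minimize / the T-square cap

With a realistic optimum in hand, the remaining question is whether the process stays near it.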
To do that, we'll use these same prediction scores that we had in...saved from the PLS platform, but this time we're going to use those in the model driven multivariate control chart   under quality and process.   I'm going to use the time and put that in time ID, so instead of seeing row numbers, we're going to see an actual date on all of our charts and we'll put the X scores into the process.   Unfortunately, the range doesn't always work correctly on these charts, so if I just zoom in a little bit, we'll see that here are periods where we had a high T squared and that high T squared was mostly the result of flows two, three, four and one, so all flows in that...   are the high contributors to this T squared. If we click on that and hover the mouse over one of those bars, if I click on that point again, then we'll see a control chart for that individual value...individual   variable with the point that we selected highlighted on that chart.   And I'm going to hide this one, though, and not only look at T squared, we also can do the same thing for DModX or SPE, if you're still looking at SPE. Once again,   this doesn't always work out correctly.   So we'll zoom in on the   DModX values again.   So DModX is going to indicate points that have a high or low value for an individual column, compared to what was expected based on the other rows, the other data in that row.   Here we can see that this point is primarily an outlier due to flow five.   I do find the contribution proportion heat maps to be pretty useful to look and see patterns in old data when, for example, one variable might have been a high contributing factor or acting abnormal for a long period of time or for some some section of time.   So this is a chart that might...that we might want to look at every morning, for example, or periodically to see, is the process acting normally?   You come in, you want to open this up and see, is there anything that I should adjust right now in order to bring our process back under control?   So to do that, we want to recreate this whole analysis from pulling PI data to opening the MDMCC platform and have it all be available at a click of a button.   To do that, we're going to write a JSL script that has three steps. It's going to get the data from PI, we're going to add the calculated columns, and then open the model we have in the multivariate control chart platform.   getting data from PI. If we go back to the original data table that we...   that was opened after the PI tools add-in was run, we can see this PI data source script saved to that table. If we edit that script and just copy it, we can paste that into the new script window.   I'm just going to make one change. Instead of pulling data from May, we'll start pulling data from August 1 instead.   Now we need to add those calculated columns. So remember we...in the PLS platform we use the saved score...save as X score formulas option.   In order to recreate those, we can just select all of the columns in the data table and use the copy column script function that was added again in that scripting tools add-in.   Once we copy the column script, we go back into this new script that we created and we'll paste what are a bunch of new column formulas to again recreate all of those columns.   Finally, model driven multivariate control   chart has an ??? most other platforms, where you can save the script to the clipboard and you can paste that into the same script window.   
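As a quick aside on the two statistics these control charts monitor: for a PLS (or PCA) model with A components, the textbook definitions are usually written as below. JMP's exact scaling constants may differ slightly, so take these as orientation rather than the platform's documented formulas.

    T^2_i = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_{t_a}^2},
    \qquad
    \mathrm{DModX}_i = \frac{1}{s_0}\sqrt{\frac{\sum_{k=1}^{K} e_{ik}^2}{K - A}}

Here t_{ia} is the score of row i on component a, s_{t_a}^2 is the variance of those scores over the training rows, e_{ik} is the X-residual of row i on variable k, K is the number of X variables, and s_0 is the pooled residual standard deviation of the training data. A large T-squared flags a row that is far from the center of the model plane (the flows moving together, but unusually far), while a large DModX flags a row that breaks the correlation structure itself (one variable off on its own, like the flow five and quality three examples here).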
Now, if we imagine it's a new day and we have everything closed, and I want to look at how our process is doing, I would just run this same script. Note that I could start it with   a specific slash slash and then an exclamation point in order to run automatically when the script is opened.   When I hit run, it's going to open a data table that looks just the same as our original table. It's got all of the same columns.   It's added those calculated columns, so let's put these scores in and all the predictions, and it also opened the model driven multivariate control chart platform   where we can see that this recent value for DModX is actually pretty high, so the most recent data has some abnormal values, in this case, for quality three.   So again, quality three looks like it's not at the expected value, based on the other parameters. In this particular case, that might mean,   since quality three is an input parameter and quality one, two and three are often related, that might mean that quality three is a bad lab sample or it could mean that this is a new material that we haven't worked with before.   Okay, finally, let's talk about one method to run this model continuously. So this was recreated on demand,   where we wrote a JSL script to run this, but sometimes it's beneficial   to...or we found it beneficial to write these results back to our PI historian so that they can be used by   the operators at our facilities. So in the last couple of minutes, I want to introduce that add-in, it's called the model engine add-in,   which will quickly score models online. I should note that this should be used for advice only. It's not an acceptable method to provide critical information or for closed loop control.   For that you might consider exporting the formulas from the formula depot and using some other software package to score them.   As mentioned earlier, some of the predictions and information available in this model   has most has the most value at the moment it's happening. So knowing what caused yesterday's problem is great, but knowing what's happening right now   means making decisions with current model results and it allows some problems to be fixed before they become a big deal.   Of course there's many ways to score models, but the power of JMP scripting language, JSL,   provides a way to get predictions and anomaly information in front of operators at our manufacturing facilities using their existing suite of visualization and trending tools that they're already used to.   A pair of model engines, or computers running JMP with a specific add-in that started the task scheduler, are set up to periodically run all the models stored in a specific directory.   All the configuration is done via two types of configuration files, a single engine configuration file and one model config file for each model that's going to be scored.   Let's start with that model config file. Remember how the PI tools add-in saves source information to the data table?   Now that same information can be used to populate the model config file, which tells the model engine how to download a table with a single row containing all of the model inputs that it needs to calculate values from.   Later, the scripting tools add-in quickly save the scripts to recreate columns saved from the PLS and any other platform,   potentially including T squared and DModX contributions that can be saved from the model version control chart platform. 
These new column scripts are also saved in the model config file, or in a separate file in that same directory.   Finally, the engine config file defines how the engine communicates with the data source. Where the PI tools add-in uses OLE DB and SQL queries to get data, the model engine uses the PI Web API to read and write data directly to PI.   By defining a set of functions in the engine config file, this engine can communicate with many other data sources as well.   Notice that a set of heartbeat tags is defined, which allows the data source and other model engines to know the status of this engine.   Each model also has its own set of heartbeat tags, so if one machine stops scoring a particular model, the other engine will automatically take over.   Again, this model engine idea is not intended to be used for critical applications, but I have found that it allows us to move very quickly from exploratory analysis and development to an online prediction or quality control solution.   With that, thank you all for attending. Remember that more information on each add-in and the journal I used today are available in the JMP Community.
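The talk leaves the PI Web API plumbing at a high level. Purely as a sketch of the read-score-write loop (not the add-in's actual code), a minimal Python model engine could look like the following; the base URL, WebIds, and the score() formula are placeholders, and the endpoint paths follow the PI Web API Streams controller as we recall it, so verify them against your own PI Web API documentation.

    import requests

    # Conceptual read/score/write loop; advisory use only, never closed-loop control.
    BASE = "https://my-pi-server/piwebapi"   # hypothetical server
    INPUT_WEBID = "P0abc"                    # WebId of the input tag (placeholder)
    OUTPUT_WEBID = "P0def"                   # WebId of the prediction tag (placeholder)
    session = requests.Session()

    def read_latest(webid):
        # GET the most recent value recorded for a stream
        r = session.get(f"{BASE}/streams/{webid}/end")
        r.raise_for_status()
        return r.json()["Value"]

    def write_value(webid, value):
        # POST a new value back to the historian
        r = session.post(f"{BASE}/streams/{webid}/value", json={"Value": value})
        r.raise_for_status()

    def score(x):
        # stand-in for the prediction formula exported from JMP
        return 2.0 * x + 1.0

    if __name__ == "__main__":
        x = read_latest(INPUT_WEBID)
        write_value(OUTPUT_WEBID, score(x))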
Aishwarya Krishna Prasad, Student, Singapore Management University Ruiyun Yan, Student, Singapore Management University Linli Zhong, Student, Singapore Management University Prof Kam Tin Seong, Singapore Management University   There are several reasons for a flight to be delayed, such as air system issues, weather, airline delays, security issues, and so on. But interestingly, the most frequent reason for a flight delay is not about weather but about air system issues. The Federal Aviation Administration (FAA) usually considers a flight to be delayed when it is 15 minutes or more late in arriving or departing than its scheduled time. Flight delays are inconvenient for both airlines and customers. This paper employs dynamic time warping (DTW) techniques for 54 airports in the US. The study aims to cluster airports with similar delay patterns over time. In addition, the paper builds some explanatory models to explain the similarity between different airports or distances. In this analysis, we aim to use the time-series techniques to discover the similarity in the top 15% busiest American airports. This paper first filters the top 15% busiest American airports and calculates the departure delay rate for each airport and then uses DTW to cluster these airports based on departure similarities. Next, the similarities and differences between clusters are identified. This analysis will help inform passengers and airport officials about departure delays at 54 American airports from January to June 2020.      Auto-generated transcript...   Speaker Transcript ZHONG, LINLI _ Okay let's get started. Hi, everyone. This is the poster of time series data analysis of flight delay in the US airports from January 2020 to June 2020. We are students of Singapore Management University. I'm Linli. YAN Ruiyun I'm Ruiyun. Aishwarya KRISHNA PRASAD And I am Aishwarya Krishna Prasad. Now let's quickly dive in to the introduction of our project. Over to you Linli. ZHONG, LINLI _ Thank you, Ash.   In the left hand side, we can see that there is a line chart. This shows the annual passenger traffic at top 10 busiest US airport and...   in in the...from the graph, we can see that the number of the passengers in each airport experienced a sharp drop. This is because the passengers in airports showed the response to the spread of the COVID-19 in 2020.   And for our analysis, we would like to discover the delay similarity of top 15% of airports in America from features of the delay and geographic location.   time series, dynamic time wrapping, exploratory data analysis.   The time series and DTW are employed to find out the similarities between the clusters, based on the departure delays. EDA is used to draw the geographic map. Okay, let's go back to the data set.   Thank you, this is the data set.   Actually, our data set comes from the United States of Department of Transportation and from our   data preparation in the left hand side, this is the process of our data preparation. We firstly imported the csv file into JMP Pro 16.0.   And then we remove the columns and values which are not really useful for our analysis.   And after that, for the data transformation, we summarize the data for airports from different cities, and then we filter out the   top 54 airports, which is based on the total number of the fights in each airports and calculate the rate of the delay.   
And after the data preparation, we save this file as SAS format and we import the SAS format into the SAS Enterprise Miner 40.1 for our further analysis,   namely the DTW analysis and time series analysis. After DTW process JMP Pro 16.0 was used again by finding out the singularity of different clusters and draw geographic maps.   And this is the introduction for data set. Let's welcome my partner to introduce more about our analysis. Aishwarya KRISHNA PRASAD Thank you, Linli. Now let's dive into the time series and cluster analysis. So we did the time series and cluster analysis using the SAS Enterprise Miner.   So this graph is one of the outputs that we obtained using the DTW nodes in SAS Enterprise Miner. So in the X axis, you can see that, you know, it contains the months from January 2020 to June...to July 2020.   And in the y axis, we can observe that there's a percentage of delays in the flights that we have included in our data set.   Now we can see that there is a sharp spike in the early February and in the late June, which seems to be strongly correlated with around the holiday periods of USA.   But, in general, other than these two spikes...major spikes, we can also see a steady decrease in the number of flight delays in general.   We then performed a time series clustering based on hierarchical clustering and the constellation plot of the same can be observed over here, using SAS. And we chose that...we felt that the number of clusters (7) is the most optimal number of clusters for our analysis.   Now, these are the clusters that are formed by using the TS similarity node of the SAS Enterprise Miner, so let's just take a...quickly take   the instance of Cluster1. So in this Cluster1, it contains mostly the international airports in the US.   So some of these airports are the Denver international airport, the Kansas City international airport, the Washington international airport,   just to name a few. So the delay in these airports are pretty large, as you can see, and this can be attributed to, you know, because this is located in the city that is frequented by tourists.   So similarly, the remaining clusters are formed by this similar behavior of the delays that are experienced in the flights.   Now the clusters that were generated in the previous step was then fed into the JMP Pro, and using the Graph Builder functionality,   we were able to build these graphs. So this graph contains the causes of delays in each of the clusters. So in over here, we can clearly see   which causes of delay is more prominent in each cluster. So for example, as you can see for Cluster1,   the late aircraft delay, that is, the delay caused by the previous flight to the current flight is more prominent compared to the rest. And the same queue follows for the rest of the clusters.   But if you see this cluster, right, so although this visualization in SAS is pretty intuitive,   we felt like for a...for a data set with large number of points, or more number of airports in our case, it would be quite difficult to analyze. So I'm just calling upon my peer Ruiyun to present another approach to analyze the clusters. Over to you, Ruiyun. YAN Ruiyun Okay, geographic location is another part that we focused on. The clusters were formed in SAS then we used Graph Builder feature in JMP Pro 16.0 to generate this map to   show where the different airports are located by cluster. Obviously airports from western and middle US are only included in Cluster 1 and Cluster 3.   
And these two clusters show that cluster is not distributed in a specific region.   Cluster 2, Cluster 4, Cluster 5 and Cluster 6 demonstrate an aggregation of airports with specific region.   Airports from Cluster 2 are mainly concentrated in eastern United States, while the Cluster 5 and 6 are more likely contain the airports of some   tourist attractions, such as Houston, Phoenix, Baltimore, and Honolulu, which are the largest cities of Texas, Arizona, Maryland and Hawaii.   Even more to the point, Phoenix Sky Harbor International Airport is the backbone of national airlines and southwest airlines. That's   one of the key transportation hubs in the southwest America. In addition Cluster 7 is a particular case, as it just has one airport, San Juan airport from Puerto Rico.   We surmised that because of the special geographical location, any flight departing from San Juan airport has a long distance to travel.   And that's all about the geographical analysis and now my partner Aishwarya will give us a conclusion. Aishwarya KRISHNA PRASAD So in conclusion here, we tried exploiting the ease of usage of the DTW nodes in the SAS Enterprise Miner and also the sophisticated visualization and pre processing techniques in JMP Pro 16.0 to perform our time series analysis for our flight data.   So we performed the dynamic time clustering for 54 airports. And these airports were formed into seven clusters, based on the delay patterns during January and June 2020.   We observed that the carrier delay is mostly the main reason for delay in each cluster, while the late aircraft delay is not very far behind on being a major cause of delay in most of the clusters.   As part of the future work, one can include the COVID data points to improve this analysis further and also discuss the correlation between the delay and the cancellation rate of flights.   Thank you so much for listening to us. I hope you liked our presentation.
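The clustering in this poster was done with the DTW and TS Similarity nodes in SAS Enterprise Miner; for readers who want to see the mechanics behind the method, here is a minimal generic sketch in Python (our own function names and toy data, not the authors' pipeline) of pairwise DTW distances feeding a hierarchical clustering that is cut into seven groups.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def dtw_distance(a, b):
        # Classic dynamic-programming DTW between two delay-rate series.
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def cluster_airports(series, n_clusters=7):
        # series: dict of airport code -> 1-D array of delay rates over time
        codes = list(series)
        k = len(codes)
        # condensed pairwise DTW distance matrix for scipy's linkage()
        dist = [dtw_distance(series[codes[i]], series[codes[j]])
                for i in range(k) for j in range(i + 1, k)]
        tree = linkage(dist, method="average")
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        return dict(zip(codes, labels))

    # Example with made-up data for three airports:
    # clusters = cluster_airports({"ATL": np.r_[0.2, 0.3, 0.1],
    #                              "ORD": np.r_[0.25, 0.35, 0.15],
    #                              "JFK": np.r_[0.05, 0.6, 0.4]}, n_clusters=2)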
Zoe Toigo, Signal and Power Integrity Engineer, Microsoft Priya Pathmanathan, Senior Signal and Power Integrity Engineer, Microsoft Martin Rodriguez, Power Integrity Engineer, Microsoft Doug White, Principal Signal and Power Integrity Engineering Manager, Microsoft   The ever-increasing complexity of computer systems demands that electrical power be delivered efficiently to the chip. The design challenge of a power delivery network (PDN) is to provide stable, low-noise voltage through low-impedance paths, which influence overall system performance. Accurate models of a proposed PDN are necessary for initial system architecture decisions and continue to drive layout requirements as the physical design matures. One portion of the PDN design process involves creating a model of the chip's package in a 3D electromagnetic field-solver tool (HFSS). Complex S-parameter models from FEM (Finite Element Method) field solvers are often simplified to circuit element approximations. Previously, input parameters to a two-dimensional circuit approximation of the package were manually fitted until the circuit matched the 3D model. However, custom DOE and response surface fitting in JMP reduced the number of experimental simulations and the development time for model creation and correlation. The prediction profiler revealed the polynomial relationship between 12 factors and six responses. Desirability functions were utilized to determine the values of the factors required to obtain the desired responses. Using this data, predicted responses were correlated to circuit simulations.     Auto-generated transcript...   Speaker Transcript Zoe Toigo Hello, my name is Zoe Toigo. I'm a signal and power integrity engineer at Microsoft, and my project is titled Power Delivery Network Model Prediction and Correlation. The power delivery network for a computer chip consists of all the interconnects, from the voltage regulator module to the pads on the chip and the metallization on the die that locally distributes power and return current. Because it interacts with the whole system, its quality is vital to overall system performance. Design of the PDN is ongoing throughout the entire product design cycle. Early on, we can create models of proposed architectures and give feedback on how they would impact the system. And then, once the design is further refined, we begin an iterative cycle of working with hardware development to refine our models to match their performance and to also provide requirements for the next revisions of the physical design. Earlier this year I was working on modeling a portion of the power delivery network, the chip package. Because this was done in a finite element method (FEM) field solver, HFSS, small changes to the model take a long time to simulate, and so we created a 2D circuit approximation of this model. Because we weren't seeing good fit between the two different models, we turned to JMP to improve this process. We started by creating a 12-factor custom design of experiments, where our 12 factors were values of lumped elements in the circuit, such as resistors and capacitors. The table generated by the DOE was used to run batch simulations of the circuit, and then from each of those simulations we extracted values at six ports on the network, which became the six responses of the DOE. After all of that was finished, we fit the model using least squares, and for each of the responses we saw an R-square fit between 97 and 99 percent.
So we are confident using this model going forward to correlate the 3D and 2D packages. With the prediction profiler, we also applied desirability functions so that we could quickly get to the values of the circuit that would match our 3D HFSS model. For future use, the prediction profiler has the added benefit that it can be tweaked to show the dependencies between the factors and the responses for small changes. This work would not have been possible without the help of my team members, and some theoretical concepts are leveraged from Eric Bogatin's book on power delivery networks. Thanks so much.
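The analysis here used JMP's custom DOE, least squares fitting, and the prediction profiler; as a loose stand-in with synthetic numbers (every array shape and value below is invented, not taken from the project), the fit-a-quadratic-response-surface-and-check-R-square step looks roughly like this in Python:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    # Synthetic stand-in: 150 DOE runs of 12 coded lumped-element factors and
    # 6 responses (e.g., quantities extracted at six ports). Real values would
    # come from the batch circuit simulations.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(150, 12))
    Y = X @ rng.normal(size=(12, 6)) + 0.05 * rng.normal(size=(150, 6))

    quad = PolynomialFeatures(degree=2, include_bias=False)  # main effects, interactions, squares
    XQ = quad.fit_transform(X)
    rsm = LinearRegression().fit(XQ, Y)
    print("average R-square across the six responses:", round(rsm.score(XQ, Y), 3))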
Suling Lee, SMU, Singapore Management University   COVID-19 vaccines play a critical role in the attempt to assuage the global pandemic that is causing surges of infections and deaths globally. However, the unprecedent rate at which it was developed and administered raised doubts about its safety in the community. Data from the United States Vaccine Adverse Event Reporting System, VAERS, has the potential to help determine if the safety concerns of the vaccines are founded. As such, this paper uses the combination of both structured and unstructured variables from VAERS to model the adverse reactions to COVID-19 vaccines. The severity of the adverse reaction is first derived from the variables describing the vaccine recipient outcome following a reaction from the VAERS data sets. Next, unstructured data in the from of text describing symptoms, medical history, medication, and allergies are converted into a Document Term Matrix and these combined with the structured variables helps to build a model that predicts the severity of the adverse reaction. The explanatory model is built using JMP Pro 16 using Generalized Regression Models and Binary Document Term Matrix (DTM), with the model evaluation based on RSquare value of the validation set. The optimal model is a Generalised Regression model using the Lasso estimation method for Binary DTM. The key determinants contributing to the adverse reaction from the optimal model are number of symptoms, period between vaccination onset, how the vaccine are administered, age of patient, and symptoms related to cardiopulmonary illness.       Auto-generated transcript...   Speaker Transcript Peter Polito How are you doing today. Can you hear me. If you are speaking, I am unable to hear you. Hello. test test. Oh Hello Leo can you hear me. yeah sorry about that I think of the technical difficulties yeah. Peter Polito Oh no problem at all. Oh okay it's a nice Jimmy. Peter Polito Oh, where. Are you calling in from. Singapore. Peter Polito Singapore all right, how how late, is it there. About 909. Peter Polito yeah well. yeah kudos to you for. hanging on. so late thanks for making it. Possible no it's all right um yeah I hope I do a good one. Oh. No i've been like high stress about this Oh well, yeah i'm Okay, let me just put on a virtual background. Okay. Peter Polito And I gotta go through just a couple things on my end before we officially start. Okay. Peter Polito just give me a. minute here. yeah. Peter Polito I only bring in my checklist make sure I do everything correctly here. alright. So just to confirm. You are soothingly. yeah and your talk is titled a model for coven vaccine adverse reaction. Yes, is that correct yeah that's right. Peter Polito All right, and then just to make sure you understand this is being recorded for the use and jump discovery summit conference will be available publicly in the jump user community do you give permission for this recording and use. Yes, okay great. your microphone sounds good, I don't hear any background noises is your cell phone off and all that kind of thing anything that might make some random noises. hang on. i'll send it will find them yeah. Peter Polito Okay, and then. We need to check the can you go and share your screen and we'll go through and check the resolution and a few other things. Okay sure. Peter Polito Thank you. i'm sorry, is it Lisa ling or soothingly. My first name issuing nicely yeah. Peter Polito got it okay. um is it okay. 
Peter Polito I don't see it yet oh um you know pie covering it just a moment. That looks good. And if you go to the bottom of your screen does your taskbar actually I don't see your taskbar so we're good. Okay. Peter Polito And let me make sure. It are any programs that might create a pop up like outlook or Skype or any of those are those all closed down and quit. um yeah, I think, so I bet the checking on them. close my kitchen. Okay. yeah good. Peter Polito Okay, and then, are you going to be working just from a PowerPoint or you be showing jump as well. I was worried of the transiting, so I will be destiny it from PowerPoint. Peter Polito Okay, great, then I am going to mute and turn off my camera we are already recording so as soon as we. As soon as you see my picture go away go ahead and start and i'm not going to interrupt for any reason and we'll try and go through it's a 30 minute presentation so let's go through, and I won't even be here it'll be like you're talking to yourself. Okay, so that the Minutes right okay. Peter Polito Are you ready to go. yeah. Peter Polito Okay. All right, and it's just so you know when we actually. Have the discovery summit, if you realize tomorrow that you misspoke or you wanted to present something in a slightly different way. You can be live on the when your presentations going, and you can ask the person presenting a deposit and then you can say you know i'm about to say this, what I what I wanted to convey is that you can kind of like. edit in real time during the presentation so don't stress about getting every word perfectly just relax and and go through it and and i'm sure will be just fine. All right. yeah. Peter Polito All right, i'm gonna mute and turn off my camera and then you go ahead and begin okay. Thanks Peter. Hello, I'm Suling. I'm a master's student at Singapore Management University where I'm currently pursuing a course in data analytics at the School of Computing and Information Systems. So I'm actually here today to present an assignment that has been submitted for my master's in IT for Business program and, more importantly, I want to share my JMP journey so far. So I started using JMP this year and I really fell in love with it because of the ease of use and the range of statistical methods and the visualizations that I could do on it. So, as the beginner using JMP I'm really honored to be here presenting my report and do let me have your feedback, because I feel that I have to so much more to learn yeah. So the motivation for my paper was actually to look at the COVID 19 vaccines, so we know how important they are but at the unprecedented rate at which it was developed and administered has raised some doubts in the community regarding its safety. So we are using data from the United States vaccine adverse event reporting system, yes. So we are using data from there because we find that there's a potential to help determine if the safety concerns on the vaccine are founded. So this paper makes use of both the structured as well as the unstructured data from VAERS to model, the adverse reaction of COVID-19 vaccines. So what is VAERS? So the Center of Disease Control and Prevention and the US FDA have had this system, and it is actually a adverse event system where it collects data. But generally what we see is that VAERS data that cannot be used to determine causal links to adverse events because the link between the adverse event and the vaccination is not established. 
So what we actually see here is that you have people who are reporting, but there, they are people who are reporting the events, but then there is no full of action that is to confirm that these symptoms and events that are reported, are there any link to the vaccine. So why do we still want to use this data? So firstly the data is available and public domain. The data is up-to-date and, more importantly, not all adverse events are likely to be captured during the clinical trials due to low frequency. So usually for clinical trials, they include only the healthy individuals. So special populations, like those with chronic illnesses or pregnant women, these are limited so the they know that VAERS is an important source for vaccine safety. So for more information regarding it, you can look up this link over here. Yeah so the data set used for this study comes from tree data tables that extracted from VAERS. The first one is the VAERS data. It mainly contains information about patients profile and the outcome of adverse events, so what I have here is a little clip from JMP where we have here the symptoms text and you can see that this is just one report based on one person, one one patient. Okay, and the data is quite dirty. There is a lot of useful information in the narrative text, but you can see that there are spelling errors, typos, excessively long or even like a very brief statements. So the next two data sets that we have is the VAERS of vaccination data, as well as the symptoms. So one contains information regarding the vaccine, the other one is extracted from the symptoms text that we can see. Okay so given this accessibility, actually VAERS data has been mined quite a lot by the by quite a number of researchers, but, as you can see that the data is actually very challenging to use as the quality of the report varies. And there's also something that might not be genuine. So review of the power, which shows that some form of manual screening is usually employed to extract the required information. However, this is also quite labor intensive and quite difficult, so this paper aims to showcase the methods to extract the key information using text analysis techniques in JMP and try to do an explanatory model to explain the most important variables involved in this event. So what we did is that for each of these data tables over here we clean them individually and then join them using the VAERS ID. What we did based on the patient outcomes was to derive something that's called a severity rating, I'll talk a little bit about this a bit later. So once the tables are joined there are four narrative texts. One on the allergies, medication, medical history and symptoms. And then we will use text analysis techniques to extract the vectors for the top terms that...will explain the severity rating for each of the text data. And join them in the existing spreadsheet data structured variables on the data set. Okay, and then all this is compiled together and then you put it into model building. Okay so what is this the severity rating all about? It's based on the patient's outcome, the VAERS data has 12 variables that describes the status of the patient. And then, based on this, we have extracted the variables and try to make sense of it, so we came away four levels of severity and then we call this the severity rating. So next we will talk a little bit about how we use JMP Pro Text Explorer platform for text analysis and we start off with the data cleaning. 
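Before the text-cleaning walkthrough that follows, here is a compact sketch of the join-and-derive step just described: merging the three VAERS extracts on their shared ID and turning the outcome flags into a four-level severity rating. The file names and column names follow the public VAERS data dictionary as we recall it, and the exact mapping to levels is this paper's own derivation, so treat the snippet as illustrative rather than the author's code.

    import pandas as pd

    data     = pd.read_csv("2021VAERSDATA.csv", encoding="latin-1", low_memory=False)
    vax      = pd.read_csv("2021VAERSVAX.csv", encoding="latin-1", low_memory=False)
    symptoms = pd.read_csv("2021VAERSSYMPTOMS.csv", encoding="latin-1", low_memory=False)

    merged = data.merge(vax, on="VAERS_ID").merge(symptoms, on="VAERS_ID")

    def severity(row):
        # assumed outcome flags; the four levels are one possible grouping
        if row.get("DIED") == "Y" or row.get("L_THREAT") == "Y":
            return 4          # most severe outcomes
        if row.get("HOSPITAL") == "Y" or row.get("DISABLE") == "Y":
            return 3
        if row.get("ER_ED_VISIT") == "Y" or row.get("OFC_VISIT") == "Y":
            return 2
        return 1              # no outcome flags reported

    merged["SEVERITY"] = merged.apply(severity, axis=1)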
So what we wanted to do was to really extract out the significant terms from the text data. And augment them to the structured variables to build your model. So as you can see, actually, the text data is quite quite messy so what we did was first of all, decide between using the term frequency, what kind of term frequency to use, and then the binary term frequency was selected, as the data shows that there's a significant advantage, of considering, of using it. So next a little bit about the cleaning that came in. So the the text data was first organized using the JMP Pro Text Explorer and we used a useful feature that is in there to add phrases and automatically identify the terms so what you can see it's like terms like white blood cells are kept as a phrase instead of being pulled into white blood and cells, which will not make much sense. And a few other methods as well to use. So one is the standard for combining which stemmed the words based on the word endings and then we also thought to sort the list alphabetically in order to recode like misspelled words or typo errors or what's that similar. Yeah and then the next thing was to use the very handy function to recode all the similar items together yeah. So the next thing we do after cleaning out the text was to look at the was the look at the workflow actually. The workflow is useful for stop word exclusion and to see the effect of the target variable on the terms. So what I did over here was to visualize the most frequent terms by the size and color it based on the severity rating. So you can see that the lighter colors belong to the less severe cases and darker ones are the other most severe ones, and you can like pick up, then the words is quite small. And it really shows that the common symptoms are not serious but we picked up terms by the cerebral vascular incident pulmonary embolism and things like that. So these are related to the most severe adverse event. The next thing we use the term selection, so the term selection is new feature JMP Pro 16 which, which was quite timely. So it is integrates the generalized regression model into text analysis platform so following from the text analysis platform, you can just select this where term selection is. And then, it allows the identification of key terms that are important to the response variable. So our response variable is the severity rating. So why use the generalized regression model? So it is widely used for non normally distributed or highly correlated variables. Where the data are independent of each other and show no other correlation. So this method over here is useful for us because it fits our our data set. And each role that we have inside our data table is a patient and all those are independent of each other. So, and then the most important thing is also that the generalized regression model allows for variable selection, so that is what we want to do because we want to pick up the variables with the highest influence on the response variable. Yeah so a little bit more detail about this regression model there's a few options that we can use over here that's the elastic net, as well as the lasso. So over here are the different thing about these is the lasso tends to select one term from a group of correlated factors, whereas the elastic network net will select a group of terms. So generally, I think that elastic net is used, and then over here that's our choice of the binary term frequency came in. 
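The curation itself (phrases, stemming, recoding) happens interactively inside Text Explorer. As a rough open-source stand-in for the binary document-term matrix that comes out of it, and continuing from the earlier merge sketch, one could write the following; the column name and settings are assumptions, and the n-gram range is only a crude proxy for the phrase handling described above.

    from sklearn.feature_extraction.text import CountVectorizer

    texts = merged["SYMPTOM_TEXT"].fillna("")        # narrative text, assumed column name
    vectorizer = CountVectorizer(binary=True,        # binary term frequency
                                 stop_words="english",
                                 ngram_range=(1, 3),  # rough proxy for phrases like "white blood cells"
                                 min_df=50)           # drop very rare terms
    dtm = vectorizer.fit_transform(texts)             # sparse 0/1 document-term matrix
    terms = vectorizer.get_feature_names_out()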
Okay, so this is the result of the term selection, so you can see that over here that shows you the overview of the (???) and then generalized (???) but more interestingly when you started by the coefficient you can see that these are the top positive coefficients So these are top factors that plays the biggest role in terms of our response variable. And this one over here are the symptoms that plays least role when sort that according to the coefficient. So looking at the results, you can see that cardiac arrest and COVID-19 pneumonia, cerebral vascular accidents just all the terms that affects the response variable. So we can see that terms of more serious nature are related to the heart and lungs okay as versus the more low frequency ones right, which are very, very mild symptoms, really. Okay, so we repeat this whole process for all of the other for all of the other text variables, so we have gone to the the example for symptoms, so there's also the allergies, the medical history as well as the medications that are used, so what we did later was to save the document term matrix. Okay, which is basically the DTM is saves a column to the data table for each time. So you can see over here an example, you mainly a lot of zeros because it's a very sparse matrix so one will indicate the presence of let's take this column over here one will will indicate the presence of (???).. So we save it and repeat the process for the other text analysis. And then, once we have all these terms saved up we moved on to modeling. So therefore modeling was to build in kind of like a validation column so over here, we went to predictive modeling and make validation column. So over here we selected the choices so put it as validation set up 55%. And the whole thing over here was to identify the important variables with severity as a response variable So all in all we have seven structure variables and 55 that were derived from document term matrix and a total of about 31,000 rules. And what we see is that there's an imbalance there because of a severity rating you get an unbalanced data set. So because of this, we done our model evaluation on comparing the R split and the AIC values. So. We use the fit model in JMP so and choosing the generalized regression model again. And we can see that these are the results here so separate models you think the group of generalize linear models, using the penalized regression techniques were prepared. And then we try to fit based on the various characteristics over here, these are all the other other the penalized estimation methods. So of all the models, we can see that the lasso method has the lowest sorry has the highest Rsquare value, and there are other values that quite close as well, so we are going to take a closer look at them. So comparing the maximum likelihood model, as well as the lasso model based on the ROC curve, you can see that actually both of the ROC curves are quite similar. And however the ROC curve for the maximum likelihood model shows that it has the highest severity. Sorry, ROC curve for the maximum likelihood model shows that the ROC value is higher for highest severity rating, and you can see that it's only a slight difference here between both of these. And in general as as you go down the severity rating the area actually do increase and one of the reason is because our data set is very unbalanced. 
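The fits above were done with JMP Pro's Generalized Regression personality and its validation column. A crude stand-in, shown purely to illustrate the lasso-style variable selection on the binary DTM plus a couple of structured columns, is an L1-penalized logistic regression; the column names, the roughly 55% validation split, and the penalty strength are all assumptions, and the class imbalance discussed here would deserve more care in practice.

    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    structured = merged[["AGE_YRS", "NUMDAYS"]].fillna(0).to_numpy()  # assumed columns
    X = hstack([csr_matrix(structured), dtm]).tocsr()
    y = merged["SEVERITY"].to_numpy()

    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.55, random_state=1)
    lasso = LogisticRegression(penalty="l1", C=0.5, solver="liblinear", max_iter=2000)
    lasso.fit(X_tr, y_tr)

    # terms whose coefficients survive the penalty for at least one severity level
    names = np.array(["AGE_YRS", "NUMDAYS"] + list(terms))
    kept = names[np.any(lasso.coef_ != 0, axis=0)]
    print(len(kept), "variables retained; validation accuracy:", lasso.score(X_va, y_va))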
So the severity rating of four, which is the highest level, the most of your level is only about 5% of the total data values okay so overall this actually very little difference between both of these so we choose the one with a slightly larger area. Okay, so our next be turned on to the effects test. So into the report, you can choose to see the effects test, so the effects test is the hypothesis test of the null hypothesis that the variable has no effect on the rest. There was this very nice explanation of the effects test on the JMP Community, I think it was contributed by Mark Bailey. So he talks a bit about how the effects test is actually based on the Type III sum of squares for ANOVA. So we can see that the effects test is very suitable over here because of our data set so it actually tests every term in the model in terms of every term in it. So the main effects are tested in a lack of the interactions between the items and in the light of the other terms, in the light of the other main effects as well. So what do you want to use here is that we see over here is that the effects test is useful for our purpose, as it is for model reduction. And, and it allows us to draw inference of the long list of significant variables. We look at the probability at ChiSquare (???) lowest ChiSquare value taking a cut off alpha value of 0.1. We have a number of independent variable so that's quite a long list of them, and most of them, as you look through most of them actually related to the cardiopulmonary illnesses. So some of them are the effected ones like the number of symptoms, the number of days between the vaccination onset. (???) is the more in which the vaccine centers that by each and then you can see that the rest of them are related somehow another to. cardiopulmonary illnesses, there are some strange ones that I don't come from a medical background, so I don't really understand it either, but you can see that deafness is one of them, so there are some strange results that we can see over here, but in general that's, the picture is that, in terms of the top variables in terms listing variables. Okay besides this, right, what is really interesting is look at the model evaluation so even though what we're doing is to build an explanatory model, I went to look into the predictive model as well because JMP has very nicely put report over there for me to look at the parameter estimates so So I use the Profiler to try to understand the parameter estimates and you can see over here that the values shown are really, really small, so this is the value that you get immediately when you open up the Profiler, so the values here are the average of each variable and you can see that each of these variables Of the each of these values here actually very small, so it means that there's very little effect on the severity based on these coefficients. Based on these as a coefficient of the predicted variables. So what you can see that based on this study over here, you can tell that actually (???) symptoms and its effect has very, very little effect on severity and this is kind of like. kind of a within expectation to see that most of these symptoms and effects, because we are looking at the general picture of the vaccine, we can see that most of these symptoms - medical history, allergies - have very little effect actually on the outcome of the of the vaccine. Okay so. yeah. So a little bit of a conclusion, a few statements as a concluding statement. 
Several decisions were made in the grouping and classification of variables. And although these variables were made to the best of our understanding, especially in the way in which we came up with a severity rating, We perhaps need an expert familiar with vaccines studies or clinical trials to be consulted as to whether or not the severity rating is sufficient to to score the adverse events outcomes. And based on the model building of structured and unstructured data we have identified key factors that varies with the severity in a reaction to a COVID-19 vaccination. However, we're still not the effect of these key variables on the response variable severity is very small, so this is seen by looking at the variables. And then, finally, the document term matrix based on the binary ratings, the binary term frequency was found to be the most effective in representing the weights other terms in the document. And the generalized linear model with the lasso penalized regression technique produced the optimal model. So I hope you enjoyed the very short presentation and do let me know if you have any questions or any feedback, thank you very much. Peter Polito great job. was very. Oh no. I just realized that i'm you know mistake we wanted this life oh gosh. Peter Polito So that this is the exact situation where well you're. So at the actual discovery summit there's going to be a presenter. And so they're going to reach out to you, ahead of time and you just say, I have a mistake on one of my slides, and so in this part comes up it'll just pop he or she will pause it. And then you can share the slide and talk about it and then go right back to the video so don't worry about it at all. This happened quite a bit during last year's discovery summit is not a problem at all. Okay okay. Well gosh oh James. Peter Polito Would you like to fix it and redo it would that make you feel better. I don't know. If it's actually this one here, because this is the wrong box, it will take me a while to actually fix it because I need to retype it over yeah. Peter Polito yeah then it didn't know don't worry about it it'll be a real easy fix and you can do it in real time. Okay okay. Okay yeah. Thanks for sitting, through it, though. Peter Polito yeah no problem is great, I really. Okay. Peter Polito All right, any other questions or comments or anything. yeah i've got one is that um so what's up a link, where I can upload all of my slides and my people and things like that, but there was a mixup with my. email, and I think from tanya about that that when tanya replied me right, I think she missed out on that link, so I thought that the link will be embedded inside one of the recording but I don't suppose you guys got it right. Peter Polito I don't have it, but I will reach out to tanya and asked her to reach out to you directly. To help remedy that. Okay sure thanks very much I think that makes up with my email yeah. Peter Polito Thanks very much. No problem alright well have a have a good night and good rest. yeah you have a good day. Peter Polito Thank you. So much bye bye.  
Amanjot Kaur, Statistician, Perrigo Rob Lievense, Senior Systems Engineer, JMP   Formulation scientists put forth significant hours of work attempting to find an extended release formulation that matches drug release targets. For generic (ANDA) products, the primary objective is to match critical quality attributes (including dissolution over time) to the reference listed drug (RLD). This paper illustrates how functional DOE is an extremely robust and easy-to-use technique for optimizing to a target dissolution profile rapidly with fewer resources. Dissolution data is collected over time during development and in-depth analyses are required to understand the effect of the formulation/process variables on the product performance. Regression models for specific time points (e.g., 1, 2, 6, and 12 hours) have been typically used to correlate the release responses with the formulation/process variables; however, such analyses violate the assumption of independence for the responses. There is great need for robust statistical tools used to determine the levels of inputs needed to get the closest profile of the developmental product to the reference drug. Functional DOE in JMP for model dissolution is a new and critical tool to use within a drug development program.     Auto-generated transcript...   Speaker Transcript Bill Worley or.   Is.   It is.   Thank you so much for tuning into today to watch Rob and I's Discovery presentation. Today we are discussing the topic, a new resolve to dissolve,   which is modeling drug resolution data using functional DOE. First of all let me introduce myself. My name is Amanjot Kaur. I am statistician with Perrigo Company.   And my co author, who's joining me here, is Rob Lievense. First of all, we'll talk about...I'll give you a little bit introduction about the topic that we are discussing today,   followed by introduction to the dissolution testing and why it is important. And then we'll discuss and compare the previous methods that we used   for finding the best input to match the target dissolution profile and compared it with functional DOE that is now available in JMP Pro.   Like I mentioned, I work as a statistician with Perrigo, which is a pharmaceutical company and very, very good...has a large share...market share of over the counter drugs and generic drugs.   We regularly deal with the solid dosage form, which are developed into new products.   Formulation scientists in these pharmaceutical companies, they put in a lot...a lot of time and effort in attempting to find candidate formulation to match target dissolution profile of an extended release   tablets. These days extended release tablets are really...they are more...   more in demand, as compared to immediate release, as you can see in this slide, that if you're taking an immediate release tablet you're in 24 hours, you will be taking   eight to 10 tablets, and as compared to extended release tablets, you will be just taking two tablets in 24 hours. So the extended release tablets are preferred over immediate   release tablets. When I say target dissolution profile that can be a currently marketed drug known as orallly or reference listed drug, or it can be a batch that is used for clinical study, known as a bio batch as well.   And the data that is collected during the formulation development of generic products, it is all submitted to FDA in NDA (abbreviated for new drug application), which is really common in my work.   
The main primary product objective during these this formulation development is to match all the critical quality attributes, including dissolution over time to the ???.   So let's take a minute and see...learn little bit about dissolution testing and why it is important.   When we take any medication what happens in the human body is that solid dosage form, those will release the active drug ingredient, and it will process...and will process the drug out of the body in a given rate. This is shown in the clinical studies.   That a peak results when the maximum amount of drug is present in the blood, how long that those are sustained and the eventual decline of the drug in the bloodstream as it is excreted out of the body.   The laboratory methods that are utilized to monitor the quality of the product, they do not have the same mechanism, but they try to replicate as much as   to the human body. And there are multiple techniques that are used, however, all of...all them typically involve the release of the active drug ingredient   in media, and that is measured as a percentage of the total dose. In the extended release formulas, they utilize formulation scientists, they utilize materials and   processing methods to assume that a specific amount of drug is released quickly enough to be effective, with the slow release required over the time to maintain the drug level against the rate of the excretion.   Now that we know about dissolution testing, now we can take a look at the methods that we used previously to analyze this dissolution data.   Similarity profile...similarity of profiles that is...that was done by graphing average results of candidate batches to the product. We usually use two methods. The first one is F2 similarity criteria,   which is used to compile the sum of square differences of percent released in media of multiple...for multiple time points.   Scientists typically they rely upon their principles and experience to create trial batches that will hopefully be similar to the target, so it's hit and trial method.   In this method a value of 50, or higher than 50, that is desirable to indicate that batches...the batches are, at most, plus or minus 10% different from the target profile at the same time points.   The second approach that we use is more advanced approach that came about through utilization of quality by design in pharmaceutical industry. That is multivariate least squares model that comes from designed experiments. So,   as you may know, least squares methods create an equation for how each input...input influences the dissolution output   for designated time points. So for extended release formulas we typically look at one hour, two hours, six hours and then 12 hour time release.   The prediction profiler that is available in JMP that provides the functionality to determine the input settings to obtain the comparison of the best results for all the solution time points.   So now question is, if we have these capabilities, why we are looking at the new approach?   The problem with these methods that we have today is that F2 similarity trials and multivariate least squares, both methods they treat the   time points of the dissolution profile as independent outputs, and we know that the release of those at one hour that will affect results at the later time points as well.   
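For reference, the f2 similarity criterion mentioned above is the standard regulatory formula (quoted here in its usual FDA/EMA guidance form, not from the slides):

    f_2 = 50 \cdot \log_{10}\!\left( \left[ 1 + \frac{1}{n}\sum_{t=1}^{n} (R_t - T_t)^2 \right]^{-1/2} \times 100 \right)

where R_t and T_t are the mean percent dissolved of the reference and test products at time point t and n is the number of time points; f2 of 50 or higher corresponds to an average difference of no more than about 10 percent at each time point.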
That's why we need a new approach, and, secondly, the functional DOE, it will treat all the time points as dependent time points, and it is an extremely robust and easy-to-use technique to optimize our target dissolution profile rapidly just with few resources.   Let me just quickly show you in one example of development project using multivariate least squares regression method. So, as you can see here in this data table, this is...this was a DOE created for one project, and we have 12   batches (12 batches?) 12 batches in here. The first, the main compression force, polymer A and B there, these are the three input factors, and   different time points at 60 minutes, 120 minutes, 240 and 360. We have all the time points here. And, if we look at the least squares fit here, you can see, our main effects and interaction, they all are pretty significant and if you just scroll down to the   to the end of the report, you'll see a prediction profiler will...where it will give you...where we have all the setting....we have already set goals, what we want and we can maximize our   desirability and it will give you a setting showing you that this is the desired setting that you need to get your desired profile or a match to the target,   if you want to say. So this is what we get in least square fit. My former colleague, Rob Lievense, will show you functional DOE, which we believe is a way...much better way to optimize the formulation or process.   Thanks AJ. You did a really good job of explaining all the work that we did changing to a quality by design culture and getting to the multivariate models.   Now we're ready for the next step. I'm Rob Lievense and I am a senior systems engineer at JMP but I used to be in the pharmaceutical industry for over 10 years and wrote a book on QbD and using JMP.   So I want to show you this topic of functional data exploration, specifically functional data using a DOE.   This works really well for dissolution data.   For functional data analysis, we need to have the table in stacked form, that works best. So I have a minutes column, I have the dissolve for the six samples at each time point, I have the batch and I have my process inputs.   Now what I'm going to want to do is take a look at this first   as dissolution by minutes. So one of the things I can see is, here's my goal, here's all the things that I'm trying with my experimentation. It really becomes obvious that   these are dependant curves. Whatever happens in early time points has influence on later time points, so it really is silly to be able to try to model this by just pulling in the time points and treating them as independent.   We can utilize more of the data that comes from the apparatus in this way.   This helps us develop the most robust function; the more pulls we have, the better it's going to be.   So I'm going to run my functional data exploration here.   We have   the amount dissolved.   We have to put in what we have across X, if it's not in order...row order, but I have minutes, I'm going to throw it in there.   Batch is our experimental ID.   And then these inputs that change as part of our DOE, we're going to put in there as supplementary variables.   What JMP does is it looks at the summary data. We can see that the average function is this kind of release over time, which makes total sense to me.   I also can see that I have a lot of variability, kind of in that   60 to 120 minute time frame, which is fairly common.   
And I have some ability to clean up my data, but I happen to know this data is pretty solid, so I'm not going to mess around with that. What I do need to do is tell JMP which is my target, and my target is my reference listed drop.   Now I'm ready to run a model.   There are various models available, but I've used b-splines with a lot of dissolution data, and it seems to work very, very well.   What JMP is going to do is it's going to find the absolute best statistical fit.   This doesn't make any sense to my data, I know that my concentration of drug in media   grows over time. It never dropped, so having these inflection points within the sections that I picked this function apart make no sense. All these areas are knots, and this is how we break apart a complex function into some pieces, to be able to get a better idea of how to model it.   Well, I can fix this. I know that cubic and quadratic just make no sense, and I happen to know that six knots is going to work quite well, so I'm going to toss that in there. I can put as many as I want.   Now JMP still gives me those nine knots. I need to have some subject matter expertise here. I think I can do this in six. I can see I don't gain a whole lot of model fit by going beyond six.   But I do want enough saturation in this lower area, because this is where dose dumping might occur. This is where I'm really interested in determining   if I'm having any kind of efficacious amount in the bloodstream. So I'm going to set that update.   Now that one's not so great. I'm going to try again.   Alright, so I get a very reasonable fit for this setup and I've got my points really where I want them. And I take a look at that, that makes a lot of sense.   Now what JMP has done mathematically is it's seeing for this average function, there's an early high rate of increase, that's 83% of the explanation of the shape of this curve, what changes the shape of this curve, if you will.   And I also see there's about a 15% influence of a dip. And I can tell you this is likely due to the polymers; I have fast and slow acting polymers so that makes total sense.   And then we have another one that's maybe a very deep dive.   Now we can play around with this if we want, but I'm just going to leave this with the three Eigen functions.   Now this is my functional data analysis, the prediction profile we get is expressed in terms of the functional principal components, which can be somewhat difficult to interpret. We're going to move forward and we're going to launch the functional DOE.   We do that, we can see that our inputs are now the inputs to the process, so we can see how changes to these inputs have an effect on our dissolution.   But what we want to do is, we want to find the best settings. So what we can do is go into the optimization and ask JMP to maximize. And what JMP is going to do   is it's going to find the absolute minimum of the integrated error from the target RLD. That's going to be the absolute closest prediction to our target.   We can see that we have about 1,800 compression force, about 12% Polymer A and about 4% Polymer B.   And as we move these to different points, we can see   what the difference is from target, so this one has about 1.54 for 60 minutes.   And at 120, it drops to negative 1.2.   At 240, we get to negative .7. So it gives you an idea of how far off we are, regardless of where we are on the curve.   It's time for a head to head comparison.   
Since we have more points in our functional DOE, we're going to use this profiler to simulate, because I don't have the ability to make batches, but this is going to be as close an estimation as we can get. In the simulator, we can adjust to allow for what we see happening in the press controllers, as far as the amount of variation we see in main compression force, and we can adjust for some variation in Polymer A and Polymer B. Once we've done that, we can run five runs with the FDOE optimum settings and five runs with the least squares optimum settings, and see how they compare. The simulations allowed us to compare what is likely to happen when we run some confirmation runs. And the thing that we see is that the settings shown for the least squares model, which assumes independence, are really not the settings we need. We have some bias; we don't get a curve that is as close to the target as we possibly could get. When we use the FDOE, our optimum runs are very, very close to the curve, and you can see that our main compression and our polymers are quite a bit different between those two results. Thank you, Rob, for explaining that. So there are some considerations that we need to keep in mind when we're using these methods. First of all, a measurement plan must be established with the analytical or laboratory team to ensure that there are enough early pulls of media to create a realistic early profile. And when we say early profile, it's before 90 minutes or so. Secondly, the accuracy and precision of the apparatus must be established to know the lower limit, as very small amounts may not be measured accurately. And third, the variation of the results within the time points must be known, because high variability (more than about 10% RSD) may require some other methods. So, now that we have established this method, for the next steps we would like to establish acceptance criteria. We found that the model error for the functional DOE seems to be greater at the earlier time points. That may be due to the low percent dissolved and the rapid rate of increase, which creates the high variability. And the amount of model error is critical for the establishment of acceptance criteria. The cumulative contribution of the FPCs is likely too low for practical use, and the integrated error from target might provide evidence for acceptance. And last, creating a sum of squares for the difference from target at important time points could allow for F2 similarity to be used for acceptance. However, some more work is needed to explore this concept. Well, thank you so much for joining us, and we hope that this talk and this approach will be useful in your work. We're going to be hanging out for live questions, but we're very interested in your feedback on this method, especially any ideas on how to establish acceptance criteria.
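For reference, the F2 similarity factor mentioned above is conventionally computed from the reference and test percent-dissolved values at n shared time points as

\[
f_2 \;=\; 50\,\log_{10}\!\left(\left[1 + \frac{1}{n}\sum_{t=1}^{n}\bigl(R_t - T_t\bigr)^2\right]^{-1/2}\times 100\right),
\]

where \(R_t\) and \(T_t\) are the reference and test mean percent dissolved at time point \(t\); values of roughly 50 to 100 are typically taken to indicate similar profiles.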
Laura Castro-Schilo, JMP Sr. Research Statistician Developer, SAS   The structural equation modeling (SEM) framework enables analysts to study associations between observed and unobserved (i.e., latent) variables. Many applications of SEM use cross-sectional data. However, this framework provides great flexibility for modeling longitudinal data too. In this presentation, we describe latent growth curve modeling (LGCM) as a flexible tool for characterizing trajectories of growth over time. After a brief review of basic SEM concepts, we show how means are incorporated into the analysis to test theories of growth. We illustrate LGCM by fitting models to data on individuals' reports of Anxiety and Health Complaints during the beginning of the COVID-19 pandemic. The analyses show that Resilience predicts unique patterns of change in the trajectories of Anxiety and Health Complaints.     Auto-generated transcript...   Speaker Transcript Laura Castro-Schilo, JMP Hi, everyone. I'm Laura Castro-Schilo, and today we're talking about modeling trajectories with structural equation models. We're going to start this presentation by first answering the question of why we would use SEM for longitudinal data analysis. Then we'll jump into a very brief, elevator version of an introduction to SEM. If this is your first exposure to SEM, I strongly encourage you to look for some of our previous presentations at Discovery Summits that are recorded and available for you to watch, so that you can get a better understanding of the foundations of SEM. But even without that, hopefully this brief version will set you up to understand the material that we're going to talk about today. In that introduction, we're going to focus on how we model means in structural equation models. We're going to see that means allow us to extend traditional SEM into a longitudinal framework. Modeling those means will have implications for how our path diagrams look, and we'll also see how those diagrams map onto the equations in our models. We're going to focus specifically on latent growth curve models, even though we can fit a number of different longitudinal models in SEM. Then we'll use a real data example to show how we model trajectories of anxiety and health complaints during the pandemic. And at the end we're going to wrap it up with a brief summary, and I'll give you some references in case you're interested in pursuing longitudinal modeling and want to learn more about this topic.
Singer and Willett are two professors from Harvard's Graduate School of Education, and I think they said it best when they claimed in a popular textbook of theirs that SEM's flexibility can dramatically extend your analytic reach. Indeed, this is probably the most important reason why you might want to use SEM for longitudinal data analysis. Now, specifically when we're talking about flexibility, we're referring to the fact that you can fit a number of different longitudinal models in SEM that can be quantified in terms of fit and compared empirically, so that you can be sure that you're characterizing your longitudinal trajectories in the best possible way. There are a number of different models that we can fit; you can see them listed there, including things like repeated measures ANOVA, which can make some pretty strong assumptions about the data. SEM allows us to relax some of those assumptions and actually test empirically whether those assumptions are tenable. SEM is also really flexible when it comes to extending univariate models into a multivariate context. So if you're interested in looking at how changes in one process influence or are associated with changes in another process, SEM is going to make that very easy and intuitive. Now, we know SEM has a number of nice features, and all of those apply in the longitudinal context as well: things like the ability to account for measurement error explicitly, to model unobserved trajectories by using latent variables, and also to use cutting-edge estimation algorithms for when we have missing data, which actually happens pretty often with longitudinal designs. Another interesting feature is that it allows us to incorporate our knowledge of the process that we're studying. So we'll see that prior knowledge about what we expect the functional form in our data to be can be mapped onto our models in a very straightforward way. But there are also reasons why we should not use SEM for longitudinal analysis. I think, most importantly, the structure of the data is what might limit us the most. In SEM we're going to be required to have measurements that are taken at the same time points across all of our samples. So say, for example, we're looking at anxiety and we have three repeated measures over time. The structure of the data has to be like what I'm showing you here, where we might have anxiety at one occasion, and that's represented as one column, one variable in our data table, and then we have anxiety at a second time point and at a third time point. What this means is that everybody's assessment at the first time point has to have taken place at the same time, and that's not always the case. So there are going to be other techniques that are more appropriate if, in fact, your data are not time structured. We also have to acknowledge the assumption of multivariate normality. SEM might be a little robust to this assumption, but we still need to be very careful with it. And it's also a large-sample technique. So in that data table I just showed you, we really want to have substantially more rows than we have columns in the data, and this might not always be the case.
So just as a reminder if you haven't been exposed to SEM, and also a nice brief intro: in SEM, one of the most useful tools is the path diagram, which is simply a graphical representation of our statistical models. If we know how diagrams are drawn, then it'll be much easier for us to use them to specify our models and also to interpret other structural equation models. These are the elements that form a path diagram, and you can see here that squares or rectangles are used exclusively to denote manifest or observed variables in our diagrams. That's in contrast to unobserved variables, which are always represented with circles or ovals. Now, arrows in path diagrams represent parameters in the model. Double-headed arrows are always going to be used for variances or covariances, and one-headed arrows represent regressions or loadings. In the context of longitudinal data, there's another symbol that is really important, and that is the triangle. The triangle represents a constant, and it's used in the same way that you use a constant in regression analysis, meaning that if you regress a variable on a constant, you're going to obtain its mean. So we model means, and we put constraints on the mean structure of our data, by having a constant in our models. So let's take a look at a simple regression example. If you wanted to fit a simple regression in SEM, this would be the path diagram that we would draw. You can see X and Y are observed variables, we have X predicting Y with that one-headed arrow, and both X and Y have variances. In the case of Y, because it's an outcome, that's a residual variance. We also have to add the regression of Y on the constant if we want to make sure that we get an estimate for the intercept of that regression. So here, this arrow would represent the intercept of Y, and notice that we also have to regress X on that constant in order to acknowledge the fact that X has a mean. Now we can use some labels, so that we can be very explicit about which parameters these arrows represent, and then we can trace the arrows in the path diagram in order to understand the equations that are implied by that diagram. So let's focus first on Y. We can trace all of the arrows that are pointing to Y in order to obtain that simple regression equation: Y is equal to tau one times one (which is just the constant, so we don't have to write the one down) plus beta one times X, plus the residual of Y. Now we also can do the same for X, because in SEM all of the variables that are in our models need to have some sort of equation associated with them. Here we want to make sure that we acknowledge the fact that X has a mean, so we regress it on the constant, and it also has a variance. So again, those path diagrams are a way to depict the system of equations in our models. It's very important to understand that those diagrams also have important implications for the structure that they impose on the variance-covariance matrix of the data and on the mean vector. And I think it's easiest to explain that concept by actually changing the model that we're specifying here. Rather than having a regression model, what I'm going to do is fix all of those edges to zero.
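Written out, with the exact symbols chosen here only for illustration, tracing that simple regression diagram gives

\[
Y_i = \tau_1 \cdot 1 + \beta_1 X_i + \varepsilon_i, \qquad
X_i = \tau_2 \cdot 1 + \delta_i,
\]

with \(\mathrm{Var}(\varepsilon_i) = \psi_Y\) (the residual variance of Y) and \(\mathrm{Var}(\delta_i) = \psi_X\) (the variance of X), so the one-headed arrows carry \(\tau_1\), \(\tau_2\), and \(\beta_1\), while the double-headed arrows carry \(\psi_Y\) and \(\psi_X\).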
So all of these effects, I'm just going to fix them to zero, which is the same as just erasing the arrows from the diagram altogether, and now you can see how the equations for X and Y have changed. This is a very interesting model. It's simple, but it actually has a lot of constraints, because it implies that X and Y each have a variance but that their covariance is exactly zero; there's nothing linking these two nodes. It also implies that the means for both X and Y are exactly zero, because we're not regressing either of them on the constant in order to acknowledge that they have a non-zero mean. So now, if we really want to fit this model to some sample data, that means we have some sample statistics from our data. And the way that estimation works in SEM is that we're going to try to get estimates for our parameters that match the sample statistics as closely as possible, while still retaining the constraints that the model imposes on the data. In this particular example, if we actually estimate this model, we would see that we are able to capture the variances of X and Y perfectly, but the constraints that say the covariance is zero and the means are zero will still remain. And so the way in which we figure out whether our models fit the data well is in fact by comparing this model-implied covariance and mean structure to the actual sample statistics; we can look at the difference between those and obtain our residuals. These residuals can be further quantified in order to obtain metrics that allow us to figure out whether our models fit well or not. Okay, so that's our intro to SEM, and these are going to be the concepts that we use throughout the presentation in order to understand how we model trajectories with SEM. Now, what better way to start talking about trajectories than to imagine some data that actually have some trajectories. So I want you to think for a second: how anxious are you about the pandemic? Imagine that question had been asked of you early in 2020, when the pandemic first started. Perhaps a group of researchers approached you, asked this question, then came back a month later and asked you the same question again, and maybe came back a couple of months later and asked about your anxiety once more. So we might obtain these data from a sample of individuals, and the data would be structured in the way that is presented here, where each of those time points would be a different variable in the data. Now let's imagine that we're interested in looking at some of the trajectories from that sample, and we want to plot them so that we can start thinking about how we would describe these trajectories. So let's take three individuals. This is going to be a fabricated example just to illustrate some concepts, but imagine that the first individual gives us the exact same score of three at each of the time points that we asked this question. And maybe in this example anxiety ranges from zero to five, where five means you're more anxious about the pandemic. So the trajectory of this person is perfectly flat; it's a very simple trajectory. And maybe for individuals two and three we get the exact same pattern of responses.
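To make that concrete (again with illustrative symbols), the constrained model just described implies the mean vector and covariance matrix

\[
\mu(\theta) = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad
\Sigma(\theta) = \begin{pmatrix} \psi_X & 0 \\ 0 & \psi_Y \end{pmatrix},
\]

and the residuals used to judge fit are simply the differences between these model-implied moments and the sample mean vector and sample covariance matrix, \(\bar{x} - \mu(\hat\theta)\) and \(S - \Sigma(\hat\theta)\).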
And so, if this were real and we had to describe these trajectories to an audience, it would actually be really easy to do, because we could just say there's zero variability in the trajectories of individuals, and describing a flat line would do the rest. We can use the equation of a line to say anxiety at each time point takes on these values, and we would clarify that the intercept for this line is equal to three and the slope is zero, so we really have just described that flat line. That would be really easy to do, but of course this is a very unrealistic pattern of data, so we're not expecting to observe this in the real world. So let's imagine a different set of trajectories where there's actually some variability in how people are changing. In this case, we could still find an average trajectory, a line of best fit through these data. But if we only used the equation of that line to describe the data, we would really be missing the full picture. It would not do a very good job of showing that some individuals, like number one, are increasing, whereas individual three is decreasing. So instead we have to add a little more complexity to the equation we saw earlier, in order to account for the variability in the intercept and the slope. Again, if we had to describe this to an audience, one thing we can do in this equation is add a subindex i to represent the fact that anxiety for each individual at each time point can take on a different value. Notice that the intercept and the slope in the equation also have that i, indicating that we can have variability in the intercept and the slope, and we can still use the average trajectory to describe the average line, such that the intercept can still be three and the slope zero. But notice that we add these additional factors that capture the variability of the intercept and the slope; specifically, these are the values for each individual, expressed as deviations from the average trajectory. And we'll see that we're going to have to make some assumptions about those factors in terms of their distribution, which should be normal with a mean of zero and an unrestricted covariance matrix. But even these trajectories are quite unrealistic, because I'm showing you perfectly straight lines, and when we get real data it's never going to look that perfect. Indeed, these three trajectories are much more likely to look like this, where even if we are assuming that there is an underlying, unobserved linear trajectory, those are not the trajectories we observe. In other words, we have to acknowledge that any data you observe at any given time point is going to have some error. So we capture that error in our equation, and we'll make some assumptions about that error being normally distributed. Again, the idea is that we have these unobserved, error-free trajectories, and that's not what we really get when we observe the individual assessments in our data. So our equation is going to describe that average trajectory, and it's also going to describe the individual trajectories as departures from the average line.
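Putting those pieces together (with symbols chosen here only for illustration), the linear growth equation being described is

\[
\text{Anxiety}_{ti} = \bigl(\beta_0 + b_{0i}\bigr) + \bigl(\beta_1 + b_{1i}\bigr)\,\text{time}_t + \varepsilon_{ti},
\qquad
\begin{pmatrix} b_{0i} \\ b_{1i} \end{pmatrix} \sim N\!\left(\mathbf{0},\;
\begin{pmatrix} \psi_{00} & \psi_{01} \\ \psi_{01} & \psi_{11} \end{pmatrix}\right),
\qquad
\varepsilon_{ti} \sim N(0, \sigma^2),
\]

where \(\beta_0\) and \(\beta_1\) describe the average trajectory and \(b_{0i}\), \(b_{1i}\) are each individual's deviations from that average intercept and slope.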
Alright, so everything that we have described so far is actually what's known as a linear latent growth curve model in SEM. And if this looks like a mixed effects or random coefficients model, if you're familiar with those, it's because it is actually very, very similar. Now, we only have three time points here, so this is a very simple linear growth curve, but we can still have more complex models that incorporate some nonlinearities if, in fact, we have more time points that allow us to capture those nonlinearities. We can do that with polynomials, and there are other ways to capture nonlinearities in growth curve models as well. Today, though, we're going to keep it very simple and stick to linear models. All right, now I want to bring it all together by showing you how the equations of that linear latent growth curve model can be mapped to a path diagram that can be used to fit our structural equation models. We're first going to start with the simplest equations here, the equations for the intercept and the slope. Remember that the intercept and slope represent unobserved values, unobserved growth factors, so we're going to use latent variables, these ovals, to represent them in our path diagram. Notice that the intercept is equal to a mean plus that variance factor, and that is why we regress the intercept on the constant in order to obtain its mean, and we also have this double-headed arrow in order to represent the variability in the intercept. We do the same for the slope. Notice that we also have a double-headed arrow linking the intercept to the slope, and that is to represent the covariance, the assumption we make over here. It just means that we're going to acknowledge that individuals who perhaps start higher on a given process might have an association with how they change over time, and that is what this covariance allows us to estimate. Now, ultimately, what we're modeling is our observed data, our observed measurements of anxiety, and so here is the full path diagram that would characterize the linear growth curve. I'm going to focus on one anxiety time point first, that first time point. Again, using the idea of tracing the path diagram, we can see how anxiety at time one is equal to one times the intercept, which is right here in this equation, plus zero times the slope, so this part just falls out, plus that error term. In other words, what we're saying is that anxiety at time one is simply going to be the intercept of that individual plus some error. Then we can do the same, tracing the path diagram, to see the equation for anxiety at the second time point. You can see that it's, once again, the intercept plus one times the slope, so that's basically the initial value of that person, the intercept, plus some amount of change. And then the third occasion, by tracing this, implies that we have a starting point, which is the intercept, plus now two times the slope. So notice how, for these latent variables, the factor loadings are fixed to known values, and we are fixing those values to something that forces these trajectories to take a linear shape.
So here the factor loadings of that slope are basically the way in which time is coded in the data, and this is the reason why everybody in SEM actually needs to have the same time point for a measurement: everyone that has a value of anxiety at time one is going to have that same time code, which is embedded in the way in which we fix these factor loadings. Alright, so this particular specification can actually work perfectly fine if we have, for example, yearly assessments of anxiety. But notice what I'm emphasizing here is that there's equal spacing between the time points, and that's important because, in order for this to really be a linear growth curve, there needs to be equal spacing. Obviously these could be weekly assessments, or assessments that are taken every month, and that's fine; this is going to work out great. Now, it could be that you don't have equal spacing, and that can also be handled fine in SEM, as long as everybody has the assessment at the same time point. So here's an example where there's one month of spacing between the first measure of anxiety and the second one, but then from the second to the third there were two months. What we have to do is fix the last slope loading to three instead of two; notice we jump from one to three, and that's what assures us that we still have a linear trajectory here. Alright, so it's time for the demo, and what I want to share with you are some data that come from the COVID-19 Psychological Research Consortium. It's a group of universities that got together and wanted to start collecting longitudinal data to understand the extent of the damage that the pandemic is having on people's mental health and even their physical health. And so we have three waves of data. These are from a subsample of the UK, and just like I showed you in that previous slide, the repeated measures are in fact from March 2020, then a month later in April, and then two months later in June. We're going to be looking at repeated measures of anxiety. The anxiety scores could range from zero to 100, where 100 means higher anxiety. And then we're also going to look at health complaints over time. Those could range from zero to 28, where a higher score represents more health complaints. And we're going to look at one time-invariant variable, which is resilience; this one was assessed at the beginning, in March 2020. Okay, so let's take a look at the data. I have the data right here, and notice we have a unique identifier for each of our individuals, so each row represents a person. There's some missing data there that we're not going to worry about right now. But notice we have some demographic variables, and then further to the right here we have our data on anxiety, and those are the repeated measures that we're going to focus on first. Now, I do want to say that initially you would want to plot your data with some nice longitudinal graphs, but we're going to skip straight into the modeling because I want to make sure we have time to show you how to use the SEM platform for these models. So I'm going to go to Analyze, Multivariate Methods, Structural Equation Models.
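Before the demo continues, the unequal-spacing specification just described can be written in matrix form (symbols illustrative) as

\[
\begin{pmatrix} y_{1i} \\ y_{2i} \\ y_{3i} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 3 \end{pmatrix}
\begin{pmatrix} \eta_{\text{Int},i} \\ \eta_{\text{Slope},i} \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \\ \varepsilon_{3i} \end{pmatrix},
\]

with slope loadings of 0, 1, 2 being the equal-spacing default and 0, 1, 3 reflecting the one-month and then two-month gaps between March, April, and June.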
And I'm going to use those three anxiety variables, click on Model Variables, and then OK in order to launch the platform. Notice that, as a default, we already see a path diagram drawn here on the canvas, and we can make changes to that diagram in a number of ways. I usually use the lists on the left, the From and To lists, where we can select the nodes in the diagram and link them with one-headed or two-headed arrows. I can show you here: by selecting them, we can make some changes. I can click Reset here on the action buttons to get us back to that initial model, and we can also add latent variables by selecting our observed variables in the To list and then adding a latent variable with that plus button. Now, the nice thing for us today (and I'm sorry about my dog barking in the background; we probably have some mail being delivered) is that we have this really useful model shortcut menu. If we click on it, we're going to see that there's a longitudinal analysis menu with a lot of different options for growth curves. So let's start with the intercept-only latent growth curve. Here the model that's being specified for us is one where each of our anxiety measures is only specified to load onto an intercept factor. So this is one of those models where there's only a flat line, but we have a variance on the intercept, acknowledging that individuals have flat lines but could have different intercepts. Now, we don't know if this model is going to fit the data well. In many instances it won't, because it's a no-growth model; nevertheless, it's actually quite useful to fit this model as a baseline so that we can compare our other models against it. We do label the model No Growth as a default when you use that shortcut. So I'm going to click Run, and very quickly we can see the output here. There are two fit indices that are really important for SEM; these are over here. The CFI is something that we want to have as close as possible to one, and you can see this is pretty low; usually you want .9 or higher, at the least. And the RMSEA we want to be at most .1; we really want it to be as close to zero as possible. This one is very high, so, not surprisingly, it's a poor-fitting model, and we're not even going to look at the estimates from it, because we know it doesn't fit very well. But we're going to leave it there, because it's a good baseline to compare against. Going back to the model shortcuts, we can look at the linear growth curve model. When I click that, I automatically get that slope factor added, and notice that the factor loadings are there; as a default, we fix them to zero, one, and two. Now, the way this shortcut works is that it assumes that your repeated measures are in the platform in ascending order. That's really important, because if they're not, these factor loadings are not going to be fixed to the proper values.
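For reference, these two fit indices are commonly defined (one standard formulation, not quoted from the talk) as

\[
\text{CFI} = 1 - \frac{\max(\chi^2_{M} - df_{M},\,0)}{\max(\chi^2_{M} - df_{M},\ \chi^2_{B} - df_{B},\,0)},
\qquad
\text{RMSEA} = \sqrt{\frac{\max(\chi^2_{M} - df_{M},\,0)}{df_{M}\,(N-1)}},
\]

where \(M\) is the fitted model, \(B\) is a baseline (independence-type) model, and \(N\) is the sample size; hence the rules of thumb of CFI near 1 (at least about .9) and RMSEA of at most about .1.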
In fact, here you can see that June is fixed to two, but I know that there are two months between April and June, so I'm going to come in here and make the change by selecting this loading, clicking on Fix To, and fixing it to three, because I know that's what I need to really have that linear growth curve. And that's it; we're ready to fit the model, so I'm going to click Run. Notice what a great improvement in the fit indices we have. The CFI is nearly perfect and the RMSEA is definitely less than .1, so this is a very good-fitting model, and we can now look at the parameter estimates to try to understand the trajectories of anxiety. The first thing we can see is the means of the intercept and the slope. They are statistically significant, and they tell us the overall trajectory in the data: on average, individuals in March started with an intercept of about 67 units, and over time, on average, they're decreasing by about five and a half units every month. Because of the way that the slope factor loadings are coded, we know that this estimate represents the amount of change from one month to the next. Some of the most interesting estimates in this model are the variabilities of the intercept and the slope. Notice they're also substantial in this model, which basically means that, yes, we have that average trajectory, but not everybody follows it. Some individuals can be increasing, while others are decreasing and others might be staying flat. So a natural question at this point is: what are the factors that help us distinguish between those different patterns of change? That is a question that is really easy to tackle in this framework, and we're going to do it by bringing in factors that predict the intercept and slope. On the red triangle menu, I can click on Add Manifest Variables, and let's take a look at resilience as a predictor. I'm going to click OK, and by default resilience has a variance and a mean, and that's okay, because I want to acknowledge it has a non-zero mean and variance. But I want it to be a predictor, so I'm going to select it in the From list, and I'm going to select intercept and slope in the To list. We're going to add a one-headed arrow to link them together and get the regression estimates, so we can understand whether resilience explains differences in how people are changing. So I'm just going to click Run here, and we see that this is, in fact, a very good-fitting model. And it has some really interesting results, because it shows that the estimate of resilience predicting the intercept, that initial value of anxiety, is in fact statistically significant and negative. It can be interpreted like any regression coefficient, meaning that for every unit increase in resilience, this is how much we should expect the intercept of anxiety to change. So the more resilient you were in March, the more likely you are to have a lower score for your anxiety intercept in March. That's really interesting, but resilience in this model does not seem to have an effect on how you're changing over time. Okay, that's really interesting, but I really want to get to the idea of fitting multivariate models in SEM, so let's go back to the data.
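As a quick sanity check on those numbers (using the approximate estimates quoted above and the 0, 1, 3 slope loadings), the model-implied average anxiety scores are roughly

\[
\hat{E}[\text{Anxiety}_{\text{March}}] \approx 67, \qquad
\hat{E}[\text{Anxiety}_{\text{April}}] \approx 67 - 1(5.5) = 61.5, \qquad
\hat{E}[\text{Anxiety}_{\text{June}}] \approx 67 - 3(5.5) = 50.5.
\]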
And I've already specified ahead of time, and saved as a script, a linear univariate model of health complaints over time. So we have an intercept and we have a slope, and I fit this model; you can see it fits very well too, so we can look individually at both anxiety and health complaints over time. It is oftentimes a good idea to start by looking at the univariate models first. As a reminder, health complaints could range from zero to 28, and we can see that the average trajectory, according to the means here, is described by an overall intercept of about four, with increases over time of about .3 units. In this case, there seems to be significant variability in the intercept but not for the slope, so people are generally changing in the same way: overall, individuals seem to be increasing by .3 units every month in their health complaints. Okay, so now let's use this red triangle menu, and once again we're going to click Add Manifest Variables, but what we're going to add are all three repeated measures of anxiety. I'm going to click OK, and as a default we get the means and variances of anxiety, but I don't want the means of anxiety to be freely estimated. What I really want is for the means to be structured through the intercept and slope factors. So I have to select those edges and remove them, so that instead what I'm going to start building interactively here is a linear growth curve that looks just like this one, but for anxiety. I'm going to start by selecting all three measures here, and I'm going to name this latent variable Intercept of Anxiety and click plus. Now there's the intercept factor, but notice that as a default we fix the first loading to one for any latent variable. Because we want this to take on the meaning of an intercept, we actually want to fix these other two loadings to one as well, so I'm going to click here and fix those to one. Now we have to add the slope, so I select all three of them, I'm going to say Slope of Anxiety, and I click plus. Now that slope is over here. Again, as a default, we fix this first loading to one, but I know that I want to code this in a way that the first factor loading is zero, so I'm simply going to select that factor loading and click delete to get rid of it, because that's the same as fixing it to zero, and then I'm going to fix this loading to one. And that last loading needs to be three, in order to have that linear growth. Now we're almost done. Remember that the most interesting question we'll be able to answer in this bivariate model is about the association of growth factors across processes. So we're going to select all of these nodes in the From and To lists and link them with double-headed arrows. Those are going to represent the covariances across all of these factors, and the last thing we need is to add the means of the intercept and slope for anxiety. So we're going to click over here, and that's it; we're ready to fit our bivariate model. I'm going to click Run. Notice it runs very quickly. The model fits really, really well, and these mean estimates, once again, describe the trajectories for each of the two processes. I'm going to hide them for now, so that we can interpret some of the other estimates with a little more ease. I think there are some really interesting findings here.
You can see these values are in a covariance matrix, so we could actually change this to show the standardized estimates, just so that we can interpret these covariances in a correlation metric. What's really interesting is to see that there are positive, significant associations between the intercept, that is, the baseline starting values of individuals in their health complaints, and how they're changing in their anxiety over time. In other words, the higher your intercept, your initial value of health complaints, the more likely you are to have higher rates of change in anxiety. We also see a positive association between the baseline values of health complaints and anxiety. And there's another positive association here that's really interesting, because this is a positive association between the rates of change. So the more you're changing in health complaints, the more likely you are to be changing in your anxiety; if you're increasing in one, you're increasing in the other, so that's really insightful. Again, we can still come back and add a little more complexity by trying to understand the different patterns of change in this model, so we can go to Add Manifest Variables and look at how resilience impacts all of those growth factors. I simply add it as a predictor here very quickly. The models do start to get a little cluttered, so we're going to have to move things around to make them look a little better, but this is ready to run. It runs very quickly, it fits really well, and we could hide some of these edges, like the means and even the covariances, for now, just so that it's easier to interpret these regression effects. You can see that resilience has a negative association with both health complaints and anxiety at the first occasion. In other words, the more resilient you are in March, the more likely you are to have lower values in health complaints and in anxiety, so that's really cool. We also see here that, for the rates of change, the prediction of the rate of change in anxiety is not significant, but it is significant for health complaints. This line really should be solid, because you can see that there is a significant negative association between resilience and the rate of change in health complaints, such that the more resilient you are, the more likely you are to be decreasing in health complaints over time. That's really interesting, especially when you tie a well-being or mental health aspect, like resilience, to something more physical, like health complaints. Alright, so we're running out of time, but the very last thing I want to show you here, just because I really want to show you the extent to which SEM is flexible and can answer all sorts of interesting questions, is a model I fit that is a bit more complex, where I'm looking at three different predictors of all of those growth factors. I also brought in measures of loneliness and depression in June, at the last occasion. What I did here, again, is I left this with all the edges, just so that you could really see the full specification of the model. But I can hide some of the edges, just to make it easier to understand what's happening. What I did is I added loneliness and depression, and I'm trying to understand how the patterns of growth predict those outcomes. So here you see those regressions.
And we're also adding some interesting predictors, like the individual's age and the number of children in the household, in addition to resilience, as we saw before. I could spend a long time unpacking all of the interesting results that are here. Solid lines represent significant effects, so you can see that your patterns of growth in health complaints significantly predict depression at that last month, in June. I find that fascinating, and you can also see how resilience in this case has a number of different significant effects on how people are changing over time. Here is an interesting effect, where for every unit increase in resilience, we expect the rate of change in health complaints to decrease by .02 units. It's a small effect, but it's still significant, so it's really interesting. And there are a number of things that you could explore just by looking at the output options. At the very bottom here, I included the R squares for all of our outcomes, and you can see we're not explaining that much variance in the intercept and slope factors, so that means there's still a lot more that we can learn by bringing additional predictors into this model. Okay, so let's go back to our slides; I want to make sure that we summarize all the great things that we can achieve with these models. Growth curve models allow us to understand the overall trajectory and individual trajectories of change over time. They allow us to identify key predictors that distinguish between different patterns of change in the data, and they allow us to examine the effects that those growth factors have on outcomes. And when it comes to multivariate models, it's really nice to see how changes in one process can be associated with changes in a different process. Now, it's important to remember that in our illustration the data were observational, so we cannot make causal inferences, and also that we were using manifest variables for anxiety, even though anxiety is an unobservable construct. So just be aware that, if we had experimental data, we could make causal inferences, and we could also have specified latent variables for anxiety, so that we had more precision in our anxiety scores. Alright, so even though we cannot make causal inferences, I think it's fair to say that resilience appears to be a key ingredient for well-being, and I want to make sure that this is the take-home message today, because as the months continue to pass during this pandemic, we all need to find ways to foster our resilience, so that we can deal with whatever comes as well as we can. And with that, I want to make sure that you have some references in case you want to learn more about longitudinal modeling, and I thank you for your time.
Ryan Parker, Sr Research Statistician Developer, JMP   In JMP Pro 16, the Direct Functional PCA (DFPCA) modeling option has been added to the Functional Data Explorer (FDE) to provide a way to perform functional PCA without first fitting basis function models. This approach not only makes larger functional data more tractable, but it also provides a more hands-off approach to analyzing functional data. This presentation details how DFPCA works and presents examples that highlight how and when to use DFPCA to analyze functional data.     Auto-generated transcript...   Speaker Transcript
Ryan Parker, JMP Well, thank you for coming today. My name is Ryan Parker. I'm a senior research statistician developer at JMP, and I'm here to talk about direct functional PCA. It's a new way of analyzing functional data. I just want to acknowledge that our chief data scientist, Chris Gotwalt, has played a major role in not only the development of FDE but also this new tool, as well as our test team of Rajneesh, whose work usually goes unnoticed but is a big part of why we are where we are today. I'm not going to assume that you have even used FDE, or that you even know what functional data is, so I'll start off at the beginning to make sure we're all on the same page. Functional data: really, we think of it as anything that has an input X. In this case we have some temperature data; our input is the week, so these are measured every week of the year, and our output Y is this temperature. You could sample at a finer resolution, maybe every day or every hour, but the general idea is that you haven't completely sampled the whole function; you have to work with some sampling of it. And really, in all the cases that we care about, although you can use one function, we also want to think about having multiple functions. So we've got multiple weather stations where, every week, we've captured the temperature. Although we have every week filled in here, we don't necessarily have to with FDE; you can have some missing sample points, or the functions can be sampled at different time locations. Your input also doesn't have to be time in the traditional sense. Maybe the input is temperature and you've got a clarity measurement as your output; we support that. It's really this mapping of an input to an output. Or maybe you're in a spectral setting where your input is the wavelength and your output is the intensity. In the example I'll show to illustrate direct functional PCA, we have multiple streams of data, multiple functions, so not only do we have different functions, we've got different outputs: a charge, piston force, voltage, and a vacuum, and we want to bring all of these together and analyze them. These types of data helped motivate the development of Functional Data Explorer.
And there are two primary questions that we usually use it to answer. The first is in a functional design of experiments setting. In this setting, we have factors that we want to try to relate to our response function. In this case, we're really interested in how we get this response function to be shaped a certain way. So we can link up these factors to the function; here, we wanted our function to remain in the green specification area for as long as we can, and FDE can help do that. The other common case is what we call functional machine learning, and in this case we want to think of our functions more as inputs to something. So in our process, maybe we have a final result or a final yield, in the case of this fermentation process data. We want to summarize the shapes of these functions and use them as inputs to a predictive model to help figure out what is going to give me the best result. And really, the big game that you try to play here is functional PCA, and what functional PCA does is summarize our data. So I have here a really simple example where it's probably pretty easy to see that the functions all have different slopes, and it may be a little harder to see that the means are a little different for each one. The goal is to use this simple case to motivate how we can summarize these functions and then expand to more complex situations. When we do this decomposition, we'll get orthogonal eigenfunctions that are going to explain as much of the function-to-function variation as possible. Once we have those, we can use them to extract summaries from all these functions that we can use in predictive modeling. So we take some really complex shape (in this case it's not super complex) and we're going to summarize it down to a couple of data points from the 10 or so that we have here. This is what an eigenfunction looks like for these data. We pulled out two. The first explains around 77% of the variation and gives the most weight to the very beginning and to the very end; this is, as you might expect, quantifying the slope of these data. The second eigenfunction gives equal weight over the whole input, and that's quantifying the difference in the mean. So now we want to take these eigenfunctions, use them with our data, and get a quick summary that we can use to explain the differences between these functions. Taking Function 2 as an example, we multiply this function times the first eigenfunction, and we're going to get a function that has a lot of negative numbers. So if you think about taking the integral of this, adding up all those negative numbers, you're going to get a fairly large negative number. So we can see how this first component is really capturing the differences. Number 10 was a large positive number, and we can see back in our data that it also had that kind of large positive slope. It's a similar idea for the second component, where we have higher versus lower overall averages. And once we have these two things, we can then go back to our original function. Adding in an overall mean, we just take the first functional principal component score, looking here at Function 1.
We multiply it by the first eigenfunction, and then we add in the second functional principal component score multiplied by the second eigenfunction, and that's going to recreate this first one. In a similar process, using the scores with the second function allows us to reconstruct or approximate those functions. So you can see where, okay, if we are able to build models for these FPC scores, we can understand how DOE factors change them, in which case that's changing the shape of our output function, as in that first scenario we looked at. So let's get into why direct functional PCA; again, what motivated us to do this? If you have used FDE before, the modeling options we have are considered basis function models. In these cases, they're really smoothing the data first: we fit the model, and we get a smooth function that we can operate with. But part of the problem is that we may have a lot of things that we have to tune. In the case of B-splines, we have to pick the best degree of polynomial to use in the spline, and how many knots we should have. We give you some defaults for that, but we also allow you to change the locations of those knots. This is great and works for a lot of cases, but as you get larger sample sizes or more complex functions, this can take a lot of time, and it may be intractable to really tune all the locations of these knots. The way FDE works now is we fit the model and perform FPCA on the coefficients of those models, and there's this nice relationship where our eigenfunctions are in the same form as the model of our data. So there are a lot of nice things that come with it, but whether it be costly computation or models that just don't fit well, we needed another approach. The previous approach smooths the data first; now we're thinking, okay, let's just take the data as they are and operate directly on that, and then from there let's smooth the eigenfunctions that we get. This isn't a fair apples-to-apples comparison, but with B-splines compared to direct FPCA, you'll tend to notice that the eigenfunctions you get are a little smoother with direct FPCA, and that's by nature of the way we wanted it: to be a little smoother. Is this little artifact here really that important? The direct FPCA says maybe it isn't. I think it's really captured in the last eigenfunction, where it doesn't explain a lot of the variation; we're getting some weird bumps here. An expert maybe ought to analyze this and say, no, that's actually real, but most likely we shouldn't be giving a lot of weight to this eigenfunction, and probably not using it. The algorithm we use to fit the direct functional PCA model is similar in spirit to the Rice and Silverman method, mainly in that it's an iterative process. First, I should mention that the data need to be on a regular grid. If you do not have your data on a regular grid, we will interpolate it directly for you. We also have some data processing step options, called Reduce, that you can apply to finely control the grid that we operate on. In our procedure, we'll take one eigenfunction; we can just ask for the first component.
fit a smoothing model to it, and then ask for the next one. Once we've smoothed the next one, we make some adjustments so that we get the orthogonality properties we want. The idea is that we've taken a problem where we had to smooth a lot of different functions just to get models to work with, and now we focus our effort on smoothing the eigenfunctions one at a time. There are far fewer of those, which makes this technique much faster for large data sets than the existing solutions in JMP. The example I'm going to go over is in a manufacturing process, and just to give you an idea of the speed-up: in the best-case scenario, where you already knew the exact P-spline model you wanted to use for these data, it would take about three times as long to fit those models as it takes the out-of-the-box direct FPCA solution. In practice you don't know what those models are going to be, so you spend time fitting multiple models, and now you've taken far longer than direct FPCA would. This example looks specifically at a step where we're bonding glass to a wafer. There's a vacuum surrounding the process, there are some tools, it's all sitting on a chuck, the process runs, and unfortunately about 10% of the wafers get destroyed. This happens in the middle of the process, and you don't find out until weeks later that they were destroyed. We have sensors collecting data through this process, and our goal is to use that data to identify wafers we can get rid of early, or at least a subset we can get rid of, so we don't spend any more money on them. So now I'll go to JMP. I have a journal, and all these sources will be available on the Community page, but I'll open up this data set and launch Functional Data Explorer from Specialized Modeling under the Analyze menu. Let's go through each of these columns. We have the wafer ID, which just groups our different functions. The condition, whether it was good or bad, we want to keep and use later, so we'll put it in as a supplementary variable; that tells FDE that when we save things later, it should bring that column along. Then there are the charge, flow, piston force, vacuum, and voltage streams from this process. After launching FDE, we can scroll through and see, as I showed earlier, all the different types of data we have. It may well be that no single model fits every one; the same model that fits charge maybe doesn't fit flow as well. But direct FPCA looks at each of them individually, so it handles that for us. Before I go to that, I want to show the Reduce option I mentioned. There are three tabs here: you can put the data directly on an evenly spaced grid, you can bin the observations, or you can remove every nth point to thin it out. These data are already on a grid, so we don't really have to do anything, but we could say, all right, let's do this, and by default it gives you half of the original data set.
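As an aside, the core direct FPCA recipe just described (put the functions on a common grid, center them, extract eigenfunctions, then smooth the eigenfunctions) can be sketched outside JMP in a few lines. This is a plain NumPy/SciPy illustration under those assumptions, with an arbitrary moving-average smoother and component count standing in for JMP's own fitting; it is not JMP's algorithm or its defaults:

```python
# Rough sketch of direct functional PCA on gridded data (illustration only, not JMP's algorithm).
import numpy as np
from scipy.ndimage import uniform_filter1d

def direct_fpca(Y, n_components=2, smooth_window=5):
    """Y: (n_functions x n_gridpoints) array of responses on a common, evenly spaced grid."""
    mean_fn = Y.mean(axis=0)                                  # overall mean function mu(t)
    Yc = Y - mean_fn                                          # center each function
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)         # decomposition via SVD
    phi = Vt[:n_components]                                   # raw (unsmoothed) eigenfunctions phi_k(t)
    phi = uniform_filter1d(phi, size=smooth_window, axis=1)   # smooth each eigenfunction
    q, _ = np.linalg.qr(phi.T)                                # re-orthonormalize after smoothing
    phi = q.T
    scores = Yc @ phi.T                                       # FPC scores c_ik, one row per function
    explained = s[:n_components] ** 2 / (s ** 2).sum()        # variance explained by the raw components
    return mean_fn, phi, scores, explained
```

With the wafer example, Y would be one sensor stream, say charge, stacked across wafers once the Reduce step has put everything on the same grid.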
Back in JMP: now you've taken it down and the shapes are still fairly the same. In this case we don't have to do it, it's still fast, but if your data are either not already on a grid or you just have a lot of it, Reduce lets you keep the key features, so it's really something worth exploring. Since we have multiple functions, I'll go to this FDE Group option and launch direct functional PCA. This takes a few seconds, because it's fitting a model for each one. Here's charge: the fit works reasonably well, and we have diagnostics available. We have a model selection option that lets you change the number of FPC scores; it has identified four as being best for this particular case, and for most of the others it actually picks just one. If you have used FDE before, you'll see these familiar functional summaries, but there's nothing else: there's no prior model. Functional PCA is the model, and we're focused entirely on that instead of the other things we previously had to tune. Then we have our score plots and profilers. Scrolling through these, we can see that we're seemingly fitting these fairly well. Piston force, and as I said, a lot of these end up picking just one score; it's saying that one score explains almost all the variation in that case. Okay. Good. And voltage, the last one. If we go back to the group option, we can save the summaries for all of these functions. So fairly quickly we've used FDE to load our data, take all of those functions, and summarize them down into just a few numbers each. The ones we are most interested in are the FPC scores, because they summarize the variation in the shapes. But there's still information in the things people used before FDE: the mean, the standard deviation, and other summaries. Those still have value, so by default we bring them along too. We also bring along every script you had in your original data table, and we add these profiler scripts so you can launch them and see what changing the FPC scores means, which helps build some insight. What we want to do now is predict that condition, good or bad. I'm going to use Generalized Regression, just because I think it's a really good method for not only fitting these models but also interpreting them. But you could really use anything: you could fit a neural network or any other model. Once you've saved this table, you're free to use the rest of JMP however you feel you can model it best. So I'll take all of these summaries and do a factorial to degree two. We're trying to predict the final condition, we'll use the validation data set, and we're targeting whether it was bad or not. This is probably the longest computing section of the demo: we'll do a Lasso by default using this validation column, and it takes around 5 to 10 seconds.
You can always stop it early if you want to, but we're giving it quite a good set of features and looking at interactions between them to try to figure out the best way to predict this condition. It's just building the report now, and once you get this model you'll also see the terms it decided didn't matter. I'm personally a fan of looking at the parameters on a centered and scaled basis, to help understand differences in magnitude. Some of these, like this one FPC score, are really not informative, whereas FPC 2 and 3 for the charge are more helpful, so you can sort them and see. And as I described, things like simple summaries of means or minimums or standard deviations are giving information, but we also see there's definitely a lot of information in the summaries of the shapes of these functions, the FPC scores. On the other side, this flow standard deviation seems to be very important, and another reason I like this is that if you are in charge of this process and you have control over the flow standard deviation, maybe there's a part of the process you can actually improve before you ever build a model to discard wafers. So let's say we've done that and this is our model; we want to build a heuristic, good or bad: at what point do I just want to go ahead and discard a wafer? Let's save the columns for the prediction formula, and this gets saved back to that summaries table. GenReg gives us probability bad, probability good, and most likely condition. We could just stop here and say, okay, if the probability is above .5, let's look closer at it, or if it's above, I don't know, .75, maybe it's not worth continuing due to the cost of the rest of the process. You might pick that probability based on the real-world implications. Or we could let a partition model help us figure that out, so now we'll use this probability as a factor. I could eyeball it by hand, but let's have a model help me figure out how to group up these conditions. We'll take this and maintain our validation data set so we're not double dipping. Now all of our blues are the bads and the reds are the goods, and thankfully, in general, most of them are good. Let's do a split: if the probability of being bad is less than .1, it's very likely good; most of our bads are in the greater-than-.1 group, which makes sense. Split again, and now a lot of them are in the region where the probability is over .25. We'll split one more time, and at least for the training data it fits pretty well: all of the ones above .6 are bad. You really don't expect that to hold in reality, and the validation R square does highlight that, like most models, you do better on the training set than on the validation data set. But it gives us something concrete: I kind of felt like .75 was good, but this is saying maybe we really need to focus on the ones that are .6 or higher. And so that's the practical payoff of the model.
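As another aside, the downstream workflow just shown (penalized regression on the saved FPC scores and summaries, then a simple probability cutoff) could be sketched outside JMP roughly like this. The data below are a made-up stand-in, and a cross-validated L1 logistic regression plus a shallow decision tree stand in for the Generalized Regression Lasso and the partition step from the demo:

```python
# Sketch: predict bad wafers from FPC scores and summaries, then pick a discard threshold.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: rows = wafers, columns = FPC scores and simple summaries (means, std devs, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (rng.random(400) < 0.1).astype(int)            # about 10% bad, as in the example

# L1-penalized logistic regression plays the role of the Lasso fit in Generalized Regression.
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear").fit(X, y)
p_bad = model.predict_proba(X)[:, 1]               # predicted probability of "bad"

# A shallow tree on that probability mimics the partition step used to pick a cutoff.
tree = DecisionTreeClassifier(max_depth=2).fit(p_bad.reshape(-1, 1), y)
print(tree.tree_.threshold[tree.tree_.feature == 0])   # candidate probability cutoffs
```

In practice you would hold out a validation set, as the demo does with the validation column, rather than scoring the training data.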
Now, these were simulated data, but in the real-world case this was very helpful, and things like interpreting whether we can improve our process were also helpful, not only here but in other sample data sets that we have. Let's go back to the slides. Okay, so I went through and summarized what functional PCA is, what motivated this new direct FPCA approach, and showed you how to use it in an example where we're trying to discard parts early in a manufacturing process. Some final tips: the fast computing makes this great for large data sets. In some ways you can just start there and ask what direct FPCA thinks; it's so fast that you don't have to fiddle with model controls, so if you have large data it's a great place to start. But it's not perfect. I showed some diagnostics, and you can see when it's not fitting well; like any model, just because it was fast doesn't mean it was good. So make sure you're identifying possible issues, because maybe you need a different approach. We have other basis function models we're working on for very particular types of data where even this approach doesn't do as well as those very specific models. And remember that the data must be on a regular grid; try to use Reduce to help you control that grid. If things seem too slow or the results don't really make sense, maybe what we do by default isn't as good for your data as what you can do yourself. Thank you so much for coming, and I'll answer any questions if anyone has anything.
Rich Newman, Statistician, Intel Don Kent, Data Analytics and Machine Learning Manager, Intel   We have a set of responses that follow some continuous, unknown distribution with responses that are most likely not independent. We want to determine the simultaneous 95% upper or lower bound for each response. As an example, we may want the lower y1 and y2 and upper y3 and y4 bounds such that 95% of the data is simultaneously above y1 and y2 and below y3 and y4. Finding the 95% bound for each response leads to inaccurate coverage. The solution: a method to calculate the simultaneous 95% upper or lower bound for each response using nearest neighbor principles by writing a JMP script to perform the calculations.     Auto-generated transcript...   Speaker Transcript rich n Hello, my name is Rich Newman, and I'm a statistician at Intel. Today I'll be presenting on a JMP script that determines a simultaneous 95% bound using a K-nearest neighbor approach. This presentation is co-authored by Don Kent, who's also at Intel, and both of us are located in Rio Rancho, New Mexico. I'd like to start today's presentation by motivating the problem; from there, I'll share some possible solutions and ultimately land on the solution we went with. Along the way, I'll provide some graphs to help further illustrate the points, and then finally I'll show the JMP add-in that we use to solve the problem and some screenshots illustrating the script. The motivating problem: we're designing a device, and we need to know the worst-case set of four resistance and four capacitance values that we see. Worst case for us is defined as the low resistance/high capacitance and the high resistance/low capacitance combinations. So for clarity, we have eight variables, and we may need the simultaneous bounds of the four low resistance values and the four high capacitance values so we can use them to help us design the device. Worst case may be defined as 95% confidence or 99% confidence, and ultimately that's up to the user. To illustrate this problem, let's use just one resistance and one capacitance value. I have resistance on the X axis and capacitance on the Y axis, and we want to know two things: that yellow star, the worst low resistance/high capacitance combination, and that purple pentagon, the high resistance/low capacitance combination. With respect to our problem, I want to point out that we recognize these eight responses are not independent; there are some correlations among them. Furthermore, each of these responses may or may not follow a normal distribution or the multivariate normal distribution. We ask these types of questions frequently. In other words, we do not want to solve this once, be done with it, and never deal with it again; we get asked these questions often, so we really need a robust solution that's easy to use. In our case, we're very fortunate that we tend to have relatively large data sets, at least 400 points, and typically 1,000 points. And for us, a practical solution works as long as it has some statistical methodology behind it. So if I go back to that previous graph, it's not like I'm going to throw a dart at the graph and say that wherever it lands is going to be our worst-case bound; you want a little more meat behind it. But I do want to point out that we don't necessarily have a fixed definition of worst case, whether it's 99% or 95%.
And we just know it's better to be a little bit conservative, to make sure we're designing a device that's really going to work and not have any issues in the future. Okay, I want to share a completely made-up example, just to illustrate that this type of problem can happen in any industry. Imagine we made adjustable desks for the classroom, and we want our desk to work for 99% of the population. To do that, we need two things: a person's height and a person's weight. Now, JMP comes with some sample data, and one of those data sets is called Big Class. Big Class has some students in it along with their heights and weights, so we can use that data set to help us determine the height and weight bounds that capture 99% of the population. If I look at this graph here on the bottom right, each point represents one student's height and weight combination. Okay, going back to our problem, our current approach, which we believe can be improved, is to independently find 95% or 99% prediction bounds for resistance and capacitance. In this example, where I'm just looking at one resistance and one capacitance, we would find two separate bounds. As an example, we would find the 95% prediction bound for the resistance, which is 4.52 and 4.97, designated by the darker blue lines. Then we would find the 95% prediction bounds for capacitance, which are 15.6 and 16.5, designated by the greenish-blue lines, and then we find the combinations that give us that yellow star and purple pentagon, which are our worst case. Now we have some concerns with this approach. The first concern is around the Type I error rate. When I find a 95% bound for resistance 1 and a 95% bound for capacitance 1, keep in mind I have eight variables, so overall I know my confidence level is not 95%, and what it actually is will depend on the correlation among the variables. We can get over this hurdle by making an alpha adjustment, but there's another hurdle that is a bigger concern for us. What if we were interested in the high/high combination, designated by this yellow circle? In this particular example, you can see we don't have any data near this worst-case bound, so if we were to use it, it would be extremely conservative. When we go to design our device, there's a cost and a time element associated with it, so we want to be a little bit conservative, but not so conservative that we would use this yellow circle when we really have no data points around it. Okay, there are some alternative approaches that are easily done in JMP that we considered as solutions to this problem, and the first one is density ellipses. This is found in the Fit Y by X platform: if I hit the red triangle on Bivariate Fit, I can choose Density Ellipse, and in this case I chose a 95% ellipse and get that red ellipse on the graph. When JMP provides this density ellipse, if you look at the bottom right-hand corner of the presentation, it gives the mean, standard deviation, and correlation of the two variables. What JMP does not provide is the equation of the ellipse. Now, this is a hurdle we can get over; we just have to do some math to solve it. But the bigger hurdle is what happens when you have more than two variables.
In this case, JMP doesn't have an easy option for us to solve the problem. We could do pairwise ellipses, just two variables at a time, but we're going to have the same alpha problem, and it's going to be pretty difficult to pick out which points to use as our worst-case bound. There's also one other minor concern with this approach: if we were interested in that high/high corner, that yellow circle, which point on the ellipse do we choose? Again, I think that's a hurdle we can get over, but with more than two variables it's a hurdle that's pretty tricky, and we're not sure it is easily solved. All right, there's another approach that's easily done in JMP, and that's principal components. What principal components does is create new variables, which JMP will label Prin 1 and Prin 2, such that the new variables are orthogonal to each other, and we can use that orthogonality to help us find our worst-case bounds. This is found in the Multivariate Methods platform: go to Analyze, Multivariate Methods, Principal Components, and we can ask for these principal components. The concern with the principal components approach is that the math gets quite difficult when there are more than two variables. Furthermore, in theory, principal components tries to reduce dimensionality. In other words, if I had eight variables that I wanted to find these worst-case simultaneous bounds on, JMP may come back and say, okay, we found three principal components that really explain what's going on. In that case we have three equations and eight unknowns, which puts us in a difficult place to solve the problem, and for that reason we wouldn't use the principal components approach. So where does that leave us? Our goal is to find the simultaneous worst-case bound, that low/high, high/low, low/low, or high/high corner combination. We'd like to use JMP to help us solve the problem. It has to be able to handle three or more variables. Each variable may or may not be normal, and we expect some correlations. The good news is we tend to have relatively large data sets. We want to make sure that if we ask for a corner, there's data around it, so we're not stuck in a situation with no data. And again, an easy, practical solution may be sufficient. Okay, what I want to do now is explain a concept and then show you how that concept is used to solve our problem. There's a concept called the K-nearest neighbors approach, and the idea is that you have a point, you find the distance from that point to every other point, and then you determine the point's k nearest neighbors, the k points with the shortest distances. To make sense of this, let's look at an example. Let's focus on the point highlighted in dark red; its coordinates are 4.51 and 15.66. If I take the point in blue, whose coordinates are 4.67 and 15.54, I can find the distance between those two points. The idea of nearest neighbors is that for the point in red, I can find the distance from it to every other point in the data set, sort the distances from smallest to largest, and then pluck off the ones that I want.
So, for example, if I wanted to know the two nearest neighbors (k=2) to the point in red, I can see they're the two points in pink; those are the closest to the red one. Okay, what I want to do now is talk through the solution at a high level, and then slow down and walk through the steps. Our solution is based on having a large data set. We first find the median +/-3 standard deviations for each variable (and I want to point out that you could also use the mean), and in doing so we define what we call our targeted corners, or desired corners, based on the low or high end of each variable. Then we find the distance from each point to that targeted corner and sort the distances from smallest to largest. In our case, for our needs (and I'll explain this a little more in an upcoming slide), we take the k nearest neighbors to the targeted corner; in general, you can collect the k neighbors that represent your desired confidence. Then we take the average of those k neighbors, and that becomes our solution. So here's the idea. We start with our two variables and find the median +/-3 standard deviations (we could also use the mean). We call those our targeted corners: if I'm interested in the low/high corner and the high/low corner, the yellow star and the purple pentagon, those start as what we call our defined targeted corners. The next step is to take all the data points and find the distance from every point in the data set to those targeted corners. Once we have those distances, we sort them from smallest to largest. Then, in this particular example, we find the k nearest neighbors closest to the targeted corners. Just as an illustration, you can see the five points in yellow are the five points closest to the yellow star, and the five points in purple are the ones closest to the purple pentagon. Then we take the average of those five points, respectively, and these black ellipses represent what we would use as our worst-case values, which we then use to help us design our device. Now I want to show you those points relative to the density ellipse and relative to the targeted corners; the yellow star and the purple pentagon are what our original method gave. The density ellipses aren't bad, and they do a little better, but what's really nice about this particular solution is that we have data points near it, exactly as it was designed. Furthermore, it's not too conservative for us, so we don't have to pay extra cost when we design the device, and we didn't have to worry about the distributions of the data; the correlations are really not a concern in how we solve the problem. All right, let's say you were interested in the high/high and low/low bounds, the light blue pentagon and the darker blue star. This method works there as well, and what you see in the black ellipses, our solution, is that again we have data points nearby. For us this is a wonderful approach because, especially relative to our current approach, it's not too conservative. It may be a little conservative, but not as conservative as the pentagon and the star.
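As a rough illustration of the procedure just described (a plain Python sketch, not the authors' JSL add-in), the whole nearest-neighbor corner calculation fits in a few lines; the k = 5 neighbors and the 3-standard-deviation multiplier mirror the example above:

```python
# Sketch of the nearest-neighbor worst-case bound (illustration of the idea, not the JMP add-in).
import numpy as np

def knn_corner_bound(data, directions, k=5, multiplier=3.0):
    """data: (n_points x n_vars) array; directions: +1 for a 'high' corner, -1 for a 'low' corner."""
    directions = np.asarray(directions)
    med = np.median(data, axis=0)
    sd = data.std(axis=0, ddof=1)
    corner = med + directions * multiplier * sd          # targeted (desired) corner
    dist = np.sqrt(((data - corner) ** 2).sum(axis=1))   # distance from every point to that corner
    nearest = np.argsort(dist)[:k]                       # row indices of the k nearest neighbors
    bound = data[nearest].mean(axis=0)                   # average of the neighbors = worst-case values
    z = (bound - med) / sd                               # neighbor Z-score: how many SDs from the median
    return corner, nearest, bound, z

# Example: low resistance / high capacitance corner for two columns [R, C]:
# corner, rows, bound, z = knn_corner_bound(data, directions=[-1, +1], k=5)
```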
Okay, earlier I made some comments about how we approach it, and there are some choices, so I want to discuss the choice of k and whether we should average. To me, k may be based on your confidence level, your sample size, and your philosophy, and let me explain. As an example, let's say I had 1,000 data points and I wanted to be 95% confident. In that case, I can take the 25th closest distance for each of the two corners: that's 2.5% out toward one corner and 2.5% out toward the other, and together I've captured my 95% confidence. So I could just take the 25th closest distance and be done; that's one approach. I could also take the 23rd, 24th, 25th, 26th, and 27th distances and average those five values; that's another approach. So there are a couple of different ways to handle it. For our particular needs, again, we have very large sample sizes, we want to be a little conservative, and we're not driven by 95% or 99% confidence. So just for illustration purposes, the orange circles on the graph on the right may represent, as an example, the 95% confidence bound, whether that's the average of five points or that 25th closest distance. Instead of using that approach, we would actually take the average of the first 25 points, and in doing so we end up with the black ellipses; you can see they move out, making the bound a little more conservative. We do that by design, to be a little more conservative. Again, it's a choice, and for us what's nice is that it's not as conservative as those desired corners from our current approach, so we get some conservatism in there without being grossly conservative. All right, so what are the pros and cons of this approach? The positives are that we do not need to know the distribution of the variables, we can easily handle some correlation or dependence among variables, we can easily handle multiple variables (especially more than two), we know there's data close to the solution (partly thanks to the large data size), and we can build a script and an add-in in JMP to easily perform the calculations. The negative is that it does require a decent-sized data set, because if you want a 99% confidence level, for example, a really high confidence level, you need lots of data. All right, so this is what our add-in looks like, this user interface. You have the possible variables in the upper left-hand corner. On the high side you enter, for example, that we want the high values of the resistances; that's in the green highlighting. In the purple highlighting we can add values on the low side, so in this example we'd say we want the low combination of the capacitances. The next thing you have to do is enter the confidence that you want. We have a recall button, which is a nice convenience for people, and we have our team logo, which makes the item look nice and professional. Once you run this add-in, it triggers the scripts, and the output is a JMP table. In this JMP table, the first thing I want to point out, in this green highlighting, is that for all eight variables we're getting the median, the standard deviation, and whether it was the low side or high side we were interested in.
And so from there we build the desired corner: the median minus three standard deviations for the low side and the median plus three standard deviations for the high side. Again, in purple now, this is our targeted corner. The next thing we do is find the distances from all the points to that desired corner, and then, in this example, we find the five points that are closest to it; that's the neighbor values column, a vector of five values. The neighbor indices I'll explain a little more on the next slide. Then, in blue, we take the average of those five nearest neighbors, and that becomes our solution to the problem. So that column in blue holds the worst-case values that we would use to help us design the device. We also have a column called Neighbor Z-score. It takes that neighbor average, our solution, and works backwards to see how many standard deviations away it is from the median. The reason we do that is that our original approach was to take roughly the median plus or minus three standard deviations, and what we're finding is that to get what we want, we can actually use a much smaller multiplier. So this just helps us see how conservative, or overly conservative, our current method is; it's not used in any calculations, other than helping us understand. All right, I mentioned the neighbor indices. In the upper right I highlighted in purple the 109 and 126. Those correspond to rows of the data. When you run this add-in, you get your JMP table, it tells you which five rows represent the five nearest neighbors, and it also selects them in your original data set. What's nice about that is that it makes it easy to color code: earlier I showed you the example with the five yellow and five purple points, and it's easy to change those colors right after running the add-in. All right, so this is what the output looks like for our eight variables. You can see the green points represent the low resistance/high capacitance values and the red points represent the high resistance/low capacitance values. I just want to point out, in the bottom right part of this graph that I've highlighted in purple, that the green and red points are not the most extreme points for any given variable, and that's the simultaneous aspect of this problem. It's really solving the problem across all eight variables, so for some variables the solution may be extreme and for others it may not be, and that's fine with us. In doing so, it's helping us understand the relationships between our variables, and this again is a graphical display of what our final solution would be, our worst-case values. Okay, going back to that Big Class data set just to illustrate the add-in (and I want to recognize that this is a small data set, used just for illustration purposes), I can run the script and ask for the high/high side, and I can run it and ask for the low/low side. In doing so, again, like before, I get the median and the standard deviation and use them to find the bound, which is my desired corner; then I find the five neighbors that are closest to that bound.
And you can see in green, those are the actual values, and they're plotted on the graph in yellow and purple. The neighbor indices refer to the rows, and that's what allows me to color code my data quickly. The neighbor average is our yellow star and our purple pentagon, and that is our solution to the problem. Again, we compute the Z-score just so we have an internal idea of how conservative this method is. Okay, at this point I just want to highlight some points of our script. When we built the script, as you can see in the upper right, we built this panel box. In light blue is where we ask the user to input the variables to be found on the high side and the variables to be found on the low side. In the copperish color is where we get the input for the percentile, and that's actually a number. And you can also see, on lines 222 and 223, where we're building a recall function. All right, one of the things we like to do is data quality checks: we want this to be as mistake-proof and error-proof as we possibly can. So, for example, you have to input the confidence level as a number, and then we check that it's between zero and 100, or the user gets an error message telling them to change their input. Likewise, at the bottom, we need something in the low list or the high list in order to run, so we have checks to make sure that something has been input. All right, the next thing we have is something called the sumcols function. What the sumcols function does is loop through the columns that were passed in and return a dictionary, or associative array, with the information that's important to us. If I look at the bottom right, in orange it collects information on the data type, how many data points there were, the median, and the standard deviation. In purple it does the calculation to get our desired or targeted corner, which we call our bound. And in blue, that AA is our associative array, which stores all the information we need for the next step. All right, from here we start to calculate those distances. You can see in the blue highlight, that's the low side, and right under it is the high side: we find the distance from each point to that targeted corner. We have to do the math shown earlier, squaring and then taking the square root, but in essence we're just finding the distance from each point to the three-sigma corner. Then at the bottom we go through the process of sorting, ordering those distances, and plucking off the ones we need. And I want to point out the orange highlighting on lines six, seven, and eight. Whenever we write our scripts, we do our best to include comments. Sometimes the person working on a script gets pulled to something else and someone else has to finish it, so it's really nice to have these comments so someone else can take over and understand what's going on.
Furthermore, these comments are nice even if you're the only one working on the script: if you have to go back to it years later, you remember what each step is doing. So we do our best to put in comments. All right, finally, we run through the low and the high side and return some values in a dictionary. You can see in blue that we pluck off the nearest neighbors; that's the vector that came out in the output. Then we take the average of them, and that's our solution; that's happening in orange. And, as an example, in purple is where we find our Z-score. So once we have all those distances and have sorted them, we pluck off the information we need to build the table that is our output. All right, so putting all this together: our motivating problem is that we wanted to find these simultaneous worst-case bounds to help us design a device, and our current solution is too conservative; it costs us money and time. When we go to solve this problem, we know the data may or may not follow the multivariate normal distribution and that our data are not independent, and the frequency of these questions really requires a simple solution, preferably in JMP. So our solution was to build a JMP add-in that's easy to use, uses the k-nearest neighbors concept, produces output that is easy to understand, and helps us quickly build graphs that we can color code to show others. All right, thank you very much.
Wenzhao Yang, Statistician, Dow Chemical Company   EPDM is a synthetic rubber widely used in applications such as transportation, infrastructure, sports, leisure, and appliance. Dow, as a leading manufacturer of EPDM, continuously innovates in the development of EPDM products and applications to achieve superior properties, including color stability property in automotive weatherstrip. In this Dow case study, the color stability properties of different EPDM rubbers were repeatedly measured over time (repeated measures). The objective of this study is to develop fundamental understanding of EPDM weatherstrip discoloration mechanism and validate hypotheses on EPDM microstructure factors. Efficient DOE strategy and proper statistical models are developed for cause and effect conclusion. We analyzed the data using two methods: linear regression and random coefficient regression. Linear regression completely pools the data by assuming a common variance for all samples across time. Random coefficient regression incorporates the sample-specific effects and provides more inference in variability between samples over time. We identified significant structure effects for color stability property by comparing different methods. In this poster, we demonstrate the power of DOE and statistical modeling for research and fundamental study.     Auto-generated transcript...   Speaker Transcript   Hello everyone, my name is Wenzhao Yang and I'm a statistician at Dow. Today I'm going to talk about statistical DOE and modeling development for repeated measures in rubber research. Before we talk about the methods, I want to give a little background for this talk. EPDM is a synthetic rubber widely used in applications such as transportation, infrastructure, sports, leisure, and appliance. Dow is a leading manufacturer of EPDM, and we continuously innovate and develop our EPDM products to achieve superior end-use properties in the applications just mentioned. This talk is focused on the EPDM-based automotive weatherstrip application, where one of the key performance metrics is called the color stability property. Color stability is measured repeatedly over time on the same experimental unit, which is defined as repeated measures in statistics, and time dependency may exist between repeated color measures, which is known as autocorrelation. A new technical development in our work is a Monte Carlo simulation-based DOE strategy for repeated measures that assesses the statistical power of detecting active effects prior to data collection. Let's move on to the objective and the methods. The objective in this application is to develop a fundamental understanding of the EPDM weatherstrip discoloration mechanism and validate hypotheses on EPDM polymer microstructure factors for the color stability property. The color stability test experiment follows industrial rubber manufacturers' standards, as shown in Figure 1. We start with a list of synthesized EPDM polymers with different microstructures. Then we blend them with the other formulation ingredients, held consistent, under the same process conditions. After compounding, curing, and sample preparation, samples are aged in a weathering chamber. Delta E, calculated from the LAB color measurements, is the critical performance metric for the color stability property: it quantifies the difference between the initial color and the color at different aging times of a cured sample.
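For reference, the standard CIE color-difference formula (presumably the Delta E used here) compares the L, a, b readings at aging time $t$ with the initial readings:

$$
\Delta E(t) = \sqrt{\big(L(t)-L_0\big)^2 + \big(a(t)-a_0\big)^2 + \big(b(t)-b_0\big)^2}
$$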
We developed a D-optimal DOE to select a representative subset from our available polymers. We used Monte Carlo simulation to evaluate the number of repeated measures needed to obtain 80% statistical power for detecting the main and interaction effects. The collected data have unequal time intervals among the repeated color measurements. Therefore, we developed a random coefficient model, RCM, which incorporates the sample-specific effects and provides more inference on the variability between samples over time. We also compared the RCM with a linear regression model, which completely pools the data by assuming a common variance for all samples across time. With the methods described, here are the key results of this work. Figure 2 shows the Monte Carlo simulation-based power analysis results for the main and interaction effects under different scenarios. If we expect a medium autocorrelation level between adjacent time points, about .5, the number of repeated measures should be at least nine per cured sample. If the autocorrelation between adjacent time points is very high, about .9, the statistical power drops significantly for most of the effects. Since we selected relatively large time intervals between repeated color measures for all DOE samples, we assume the repeated measures will have a medium level of autocorrelation; therefore, nine repeated color measures per cured sample were collected for this DOE. A general RCM is shown in Figure 3, where we have sample-specific random intercept and slope effects in addition to the main and interaction effects of the EPDM structural factors from the linear regression model. The random effects covariance parameter estimates table shows that there are significant differences in the starting Delta E and in the rate of change of Delta E among the cured samples. This indicates that it is really important to account for variability between samples over time. The profiler for the RCM shows that the confidence intervals around the prediction lines for our input factors are relatively narrow compared to the scale of Delta E in our data. Figure 4 shows that treating the data as independent in a least squares model can severely inflate the degrees of freedom, as shown in the top graph, so we would be overconfident about the significance of the model effects compared to the RCM, where we account for the time dependence in the data. The model prediction plot and the residual plot in Figure 5 show that the RCM has a good model fit and meets our model assumptions. Our conclusion for this work: we identified dominant microstructure factors and a significant interaction between two microstructure factors, suggesting an alternative polymer design. We developed a fundamental understanding of the EPDM weatherstrip discoloration mechanism and demonstrated the power of statistical DOE and modeling using JMP to support the development of new EPDM rubbers with superior color stability.
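For reference, a random coefficient model of the kind described above can be written roughly as follows (a sketch of the structure, not the fitted model; the exact fixed-effect terms depend on the DOE factors):

$$
\Delta E_{ij} = \big(\beta_0 + b_{0i}\big) + \big(\beta_1 + b_{1i}\big)\,t_{ij} + \mathbf{x}_i^{\top}\boldsymbol{\gamma} + \varepsilon_{ij},
\qquad
(b_{0i}, b_{1i})^{\top} \sim N(\mathbf{0}, \mathbf{G}), \quad \varepsilon_{ij} \sim N(0, \sigma^2)
$$

Here $\mathbf{x}_i$ holds the EPDM microstructure main effects and interactions for cured sample $i$, and $\mathbf{G}$ captures the sample-to-sample variation in starting Delta E (random intercept) and in its rate of change (random slope).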
Stan Young, CEO, CGStat Warren Kindzierski, Epidemiologist, University of Alberta Paul Fogel, Consultant, Paris   Researchers produce thousands of studies each year where multiple studies addressing the same or similar questions are evaluated together in a meta-analysis. There is a need to understand the reliability of these studies and the underlying studies. Our idea is to look at the reliability of the individual studies, as well as the statistical methods for combining individual study information, usually a risk ratio and its confidence limits. We have now examined about 100 meta-analysis studies, plus complete or random samples of the underlying individual studies. We have developed JMP add-ins and scripts and made them available to facilitate the evaluation of the reliability of the meta-analysis studies, p-value plots (two add-ins), Fisher’s combining of p-values (one script). In short, the meta-analysis studies are not statistically reliable. Using multiple examples, our presentation shares our results that support the observation that well over half of claims made in the literature are unlikely to replicate.     Auto-generated transcript...   Speaker Transcript Stan Young I'm going to present today JMP add-ins and scripts for the evaluation of multiple studies. Or you can call this how to catch p-hackers, people cheating with statistics, or why most science claims are wrong. I'm first going to describe the puzzle parts and how they fit together. Most science claims actually fail to replicate. I'm contending, and my co-authors are contending, that this is due to p-hacking and that it's a major problem. We're going to use meta-analysis and P-value plots to catch them, so this is how to catch a crook. The JMP add-ins and scripts compute P-values from risk ratios and confidence limits, which come from a meta-analysis, and then Fisher's combining of P-values, an old technique which is similar to meta-analysis technology. And then we're going to present a P-value plot, and we have a small script that will clean that up and make it into a presentable picture. Well, we see a bunny in the sky. There are lots of clouds in the sky, and if you sit out on a nice day and look around, you can probably find a bunny. So this is a random event; the bunnies are not actually in the sky, of course.
Gelman and Loken published a small article called "The Statistical Crisis in Science," saying that people are using statistics incorrectly, and that's what we'll talk about today. Let's run an epidemiology experiment. We have 10-sided dice, red, white, and blue, and they become the digits of a P-value (.046 in this particular case). Let's actually watch this happen in front of our eyes. Now we have a P-value. It's random. You did it yourself. It's so much fun, why don't we do it one more time? Red, white, and blue. Now, if you do that 60 times, you can fill in the table here. This is a simulation that I did with 10-sided dice, and you can see the P-values in four columns and 15 rows, so that's 60 P-values. My smallest P-value is .004. With 60 P-values you can work out the probabilities: about 95% of the time (1 - .95^60 is roughly .95), you will have at least one P-value less than .05. Here's one done by my daughter; she had three P-values less than .05, circled here. And running an epidemiology experiment is so easy that even my wife Pat can do it, and she has three potential papers here that she could write based on rolling dice and spinning a good story. P-values have an expected value attached to them. On the right, we have P-values of .004, .016, and .012, and attached to each P-value is a normal deviate; you can see that my .004 would have a normal deviate of 2.9, and so forth. On the left, we have the sample size, the expected smallest P-value, and its deviation. So if we had 400 questions that we were looking at, the expected P-value would be .00487 and the deviate would be 2.968. It's this deviation that is carried from the base experiments into the calculations of a meta-analysis, and we'll see that as we proceed. How many claims in epidemiology are true? I published a paper in 2011 where I took claims that had been made in observational studies, and for each claim I found a randomized clinical trial that looked at exactly the same question. In the 12 studies, there were 52 claims that could be made and tested. If you look at the column under positive, zero of those 52 claims replicated in the correct direction: the epidemiologists had gone 0 for 52 in terms of their claims. There were actually five claims that were statistically significant, but they were in the direction opposite of what had been claimed in the observational studies. We're going to look at a crazy question: does cereal determine human gender? If you eat breakfast cereal, are you more likely to have a boy baby? Well, that's what the first paper said. This paper was published in the Royal Society B, which is their premier biology journal. These three co-authors made the claim that if you eat breakfast cereal in and around the time of conception, you're more likely to have a boy baby. Two of my cohorts and I looked at this and asked for the data. We got the data and then published a counter to the first paper saying that cereal-induced gender selection is most likely a multiple-testing false positive. There were food questionnaires at two time points, one just before conception and one right around the time of conception, and there were 131 foods in each of those questionnaires, making a total of 262 statistical tests. If you compute P-values for those tests, rank order them from smallest to largest, and plot them against the integers (that's the rank at the bottom), you see what looks like a pretty good 45-degree line.
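Why a 45-degree line? If the null hypothesis is true for every test, the p-values behave like independent uniform draws on (0, 1), and the $i$-th smallest of $n$ such p-values has expectation

$$
\mathbb{E}\big[p_{(i)}\big] = \frac{i}{n+1},
$$

so the ordered p-values plotted against their rank fall near a straight line.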
So we're looking at a uniform distribution. Now, their claim came from the lower left of that plot: they said, well, here's a small P-value, and the small P-value says eating breakfast cereal will lead to more boy babies. Pretty clearly a statistical false positive. P-value plots: we're going to use P-value plots a lot. On the left, we see a P-value plot for a likely true null hypothesis, long-term exercise training in the elderly and mortality risk. On the right, we have smoking and lung cancer, and we see a whole raft of P-values tracking pretty close to zero all the way across the page and a few stragglers up on the right. So the right-hand picture is evidence for a real effect, and the left-hand picture supports the null hypothesis of no effect. Let's talk about meta-analysis, because I am going to use it during the course of this lecture. On the left, we see a funnel with lots of papers dropping into it. The epidemiologist, or whoever is doing the meta-analysis, picks what they think are high-quality papers and uses those for further analysis. On the right, we see the evidence hierarchy: a meta-analysis over many studies is considered high-level information, and down at the bottom are expert opinion and so forth. The higher you go up the pyramid, people contend, the better the evidence. We're going to look at two example meta-analysis papers. The first paper is by Orellano; it looks at air pollutants, nitric oxide, ozone, small particles and so forth, and asthma. It gathered data from all over the world, and it was sponsored by the WHO, so this is high-quality funding, a high-quality paper. It's a meta-analysis, and we're going to see if its claims hold up. The bottom one was really funny: patterns of red and processed meat consumption and risk for, basically, lung cancer and heart attacks. There's been a lot in the nutrition literature saying that you really shouldn't be eating red meat; we're going to see if that makes sense. Let's go back and look at the puzzle parts and see how they fit together. We know from the open literature that most claims fail to replicate; this is called a crisis in science. P-hacking is a problem: P-hacking is running lots of tests, trying this and trying that, and then, when you find a P-value less than .05, you can write a paper. We're going to use meta-analysis and P-value plots to catch the people that are basically P-hacking, and I call P-hacking sort of cheating with statistics; others have described it a little differently. We're going to be using JMP add-ins written by Paul Fogel that allow us to start with a meta-analysis paper and quickly and easily produce a P-value plot. We're also going to describe Fisher's combining of P-values, and we have a small script that will take the two-way plot that comes out of JMP and clean it up so that the left and right margins are more attractive. Well, here's long-term exposure to air pollution causing asthma, so they say. On the left, we have what's called a tree diagram; it looks sort of like a Christmas tree. The mean values are given as the dots, and the confidence limits are the whiskers going out on either side. On the left are the names of the papers that were selected by these authors, and on the right are the risk ratios and the confidence limits.
Now you can often just scrape the risk ratios and confidence limits off and drop them into a JMP data set, and we see the JMP data set on the right. The P-value plot add-in written by Paul Fogel then converts those risk ratios and confidence limits into P-values. Here we see that it's been done: the confidence limits give you the standard error; with the risk ratio and the standard error, you can compute a Z statistic; and from the Z statistic you get a P-value. On the right, we see a P-value plot coming out, just as it does with a couple of clicks in JMP. Here again, on the left we have the rough P-value plot, and on the right, after using a small script, we add the zero line, add the dotted line for .05, and clean up and expand the numbers on the X and Y axes. So we can now look at and judge all the studies that were in that meta-analysis, but we see a rather strange thing: there are a few P-values under .05 and then a lot of P-values going up, so we have an ambiguous situation. Some of the P-values look like the claim is correct, and others look like simply random findings. Let's take a look at Fisher's combining of P-values. On the left, we have the formula: you take each P-value, take the natural log, sum them up, and multiply by -2. That gives you a Chi-square, which you look up in a table. For this meta-analysis, we have the P-values; under each P-value, we have -2 ln(p), and you see that the few small P-values add substantially to the summation. If you think about it a little, this summation, like all sums, is subject to outliers shifting the balance considerably, so the few small P-values (.0033, .0073) add dramatically to the Chi-square summation. In fact, the summation is not robust: one extreme outlier can tip the whole equation. Keep in mind that P-hacking can lead to small P-values, and also that scientists quite often won't even publish a study that comes out non-significant, while if they find something significant they typically will. So there's publication bias: the literature is littered with small P-values, and that's the tip of the iceberg; under the iceberg are all the publications that could have happened but were not statistically significant. We're now going to look at an air pollution study. Mustafic, in 2012, published a paper in JAMA looking at the six typical air pollutants: carbon monoxide, nitrous oxide, small particles (that's PM2.5), sulfur dioxide, etc. If you look at these P-value plots, all of them essentially look like hockey sticks: there are a number of P-values less than .05, but then a substantial number of P-values go up along a 45-degree line, indicating no effect. We're contending that we're looking at a mixture of significant studies and non-significant studies, and we're further contending that the significant studies largely come from P-hacking. There are other ways that can arise, but we think P-hacking is the thing.
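The two calculations just described, backing a p-value out of a risk ratio and its 95% confidence limits and then Fisher's combining of p-values, can be sketched outside JMP like this (a plain Python illustration of the arithmetic, not the add-in or the script themselves; the example values at the end are made up):

```python
# Sketch: risk ratio + 95% CI -> two-sided p-value, then Fisher's chi-square combination.
import numpy as np
from scipy import stats

def p_from_rr(rr, lcl, ucl):
    """Back out a two-sided p-value from a risk ratio and its 95% confidence limits."""
    se = (np.log(ucl) - np.log(lcl)) / (2 * 1.96)   # CI width on the log scale gives the standard error
    z = np.log(rr) / se                              # Z statistic for log(RR) against 0
    return 2 * stats.norm.sf(abs(z))                 # two-sided p-value

def fisher_combined(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k degrees of freedom under the null."""
    p_values = np.asarray(p_values)
    chi2 = -2 * np.log(p_values).sum()
    return chi2, stats.chi2.sf(chi2, 2 * len(p_values))

# Made-up example:
# p = [p_from_rr(1.20, 1.05, 1.37), p_from_rr(0.98, 0.90, 1.07)]
# chi2, combined_p = fisher_combined(p)
```

As the talk points out, the sum is not robust: a single very small p-value contributes a large -2 ln(p) term and can dominate the total.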
We counted the number of outcomes, predictors, and covariates, and if you multiply that all out, the median number, which we call the search space, is 12,288. That means the authors, in the median, had roughly 12,000 opportunities to get a P-value of less than .05. They had substantial leeway to get a statistically significant result. (There's a short sketch of this kind of counting just after this passage.) We're going to look at the same simple counting for the nutrition studies. Here we have 15 studies, listed by the first authors on the papers, and across the top you see outcomes, predictors, covariates, tests, models, and search space. Take Dixon, for example: if they looked at three health outcomes, those are the outcomes, and if they had 51 foods in their study, those are the predictors. If they use covariates as well, that adds substantially to the search space. You can see that Dixon, theoretically, had 20 million possible analyses that he could have done, and if you look down the search space column, you can see substantial numbers of possible analyses for the other studies too. The nutrition studies were done with cohorts. A cohort is a group of people that is assembled, measured and questioned initially, followed over time, and then examined for health effects. Each of these cohorts has a named data set, and once a cohort is assembled, it can be used to ask, you know, a zillion questions. If the P-value for one of those questions comes out significant and the authors feel like they can write a paper, quite often they do. So, in the last two columns, we see the numbers of papers that arose; we used Google Scholar and actually checked it out. These are the numbers of papers that have appeared in the literature associated with each of these cohorts. I'll add that we've looked at a lot of these papers, and in none of them is there any adjustment for multiple testing or multiple modeling. Nutritional epidemiology and environmental epidemiology are the ones we're talking about here. Nutritional epidemiology uses a questionnaire, the food frequency questionnaire (FFQ). An FFQ can cover some number of foods; initially they started off with 61 foods, and I've seen FFQ studies with 800 foods. I wouldn't want to be the one to fill out that questionnaire. Given here are the numbers of papers that use FFQs over time. The technique was invented in the mid-1980s, and since 1985 there have been 74,000 FFQ papers; based on our looking at them, none of these papers adjusted for multiple testing, and all of them had substantially large statistical search spaces. For environmental epidemiology, I simply did a Google search for the words "air pollution" in the title of the paper, and we see that over time there have been 28,000 to 29,000 papers written about air pollution. So far as I know, based on a lot of looking, none of these papers adjust for multiple testing or multiple modeling, and essentially all of them have very large search spaces. Meta analysis usually goes under the name "systematic review and meta analysis," and starting in 2005, journals have used that term in the title of the paper. So starting in 2005, I asked whether the words "systematic review" and "meta analysis" appeared in the title, and you can see that it started off low.
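On the search-space counting mentioned above: the talk doesn't spell out the exact multiplication, so the rule in the sketch below is an assumption on my part, namely outcomes × predictors × 2^covariates, on the idea that each covariate can be either in or out of a model. With 3 outcomes, 51 predictors, and 17 covariates (the covariate count here is purely illustrative), that rule lands near the 20 million figure quoted for Dixon, but the counting the authors actually used may differ.

```python
# Hypothetical counting rule for the analysis "search space" -- an assumption,
# not a quote from the papers discussed in the talk.
def search_space(outcomes: int, predictors: int, covariates: int) -> int:
    """Count possible analyses: every outcome-predictor pair, times every
    subset of covariates that could be included in the model (2**covariates)."""
    return outcomes * predictors * 2 ** covariates

# Illustrative numbers loosely matching the Dixon example in the talk
print(search_space(outcomes=3, predictors=51, covariates=17))  # about 20 million
```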
There were about 1,500 papers in that first five-year period, and by 2021 there have been a total of 27,000 papers, so this is a cottage industry. These papers can be turned out relatively easily; a team, often in China, of five to 15 people can turn out one meta analysis per week, and their pay is rated on how many papers they publish, and so forth. Roughly half of these studies are observational and half come from randomized clinical trials. All the ones that we've looked at so far, particularly the observational studies, have this hockey-stick look: some P-values that are small and a bunch of P-values that look completely random. I will say that essentially all of these studies are funded by your tax dollars, or somebody's tax dollars; they are very lavishly funded by the public purse, is one way to say it. Many claims have no statistical support. The base papers do not correct for multiple testing and multiple modeling, and the base papers have large analysis search spaces. We've seen examples from environmental epidemiology and nutritional epidemiology, and based on the evidence, I would say most of these claims are unreliable. I say, and others have said too, that we have a science and statistics disaster, and with the use of meta analysis and P-value plots, these claims can be checked and either verified as true or not. Here we have four situations of smog. The upper left is London 1952; the right is Los Angeles 1948. The lower left is Singapore, and I don't remember the exact year for that; and then we have Beijing. Those last two are recent. In the case of the London fog, statistical analysis of daily deaths indicates that upwards of 4,000 deaths occurred in a three- or four-day period in London in 1952. That instigated the interest by epidemiologists in what was the killer. In the other three pictures, there was no reported increase in deaths during those time periods. The difference is that in London they were burning coal for heating and everything else, there was a temperature inversion, and the contention is that acid in the air was carried by particles into the lower lungs, and susceptible individuals, usually the old and the weak, died in increased numbers. But that is not happening now around the world. There is all kinds of pollution around the world, and we don't see spikes in death rates associated with it. Scams? Ha ha, I love scams. Are these scams? Air pollution kills. Any claim from an FFQ study, for example, that if you drink coffee, you're more likely to have pancreatic cancer; that's one of the claims that's probably not true. Natural levels of ozone, do they kill? I think that's not true either. Environmental estrogens. Any claim coming from a meta analysis using observational studies. Keep in mind, the evidence in the public literature is that 80% of science claims, usually coming from universities, fail to replicate when tested rigorously, for example in randomized clinical trials. Emotion and scare: the whole aim of practical politics is to keep the population menaced and scared, and hence clamorous to be led to safety, by menacing it with an endless series of hobgoblins, all of them imaginary. Flim flam: deception, confidence games involving skillful persuasion or clever manipulation of the victim. So H.L. Mencken in the 1930s said that practical politics is largely scare politics.
We finish up with the authors of this talk. I'm Stan Young, and I can be reached at genetree@bellsouth.net. Warren Kindzierski is a Canadian epidemiologist, and he's been working with me very closely for the last couple of years on papers based on the things we've seen here today. Paul Fogel is a very interesting statistician who lives in Paris, and he was responsible for writing the add-ins and scripts that we use. I will say that the scripts will allow you to go from a meta analysis to an evaluation in probably a little less than an hour. I'd like to recommend the National Association of Scholars report, Shifting Sands Report 1; this URL will get you there. The report is long and involved, but the statistics are simple, and it talks about air quality and health effects. And with that, I'll also make the following offer: if someone watching this wants to try out what's going on, we will give you the scripts and add-ins for JMP, and I will even help you look at and interpret your particular analysis of a meta analysis. So with that, I'll stop, and I'm prepared to answer any questions. Thank you.