Biodiversity loss is a global challenge. Reliable data on the numbers and distribution of species are urgently needed to stem the species hemorrhage. But getting those data is hard. Endangered species are elusive and cryptic, and current monitoring techniques are often expensive and unreliable. Frederick Kistner is a founding member of the WildTrack Specialist Group. Together with colleagues Larissa Slaney from Heriot-Watt University, and Zoe Jewell and Sky Alibhai from WildTrack, he is developing a method to identify four overlapping species of otter that exist in southeast Asia: the Eurasian otter (Lutra lutra), the short-clawed otter (Aonyx cinereus), the smooth-coated otter (Lutrogale perspicillata), and the hairy-nosed otter (Lutra sumatrana). Morphometric features of otter footprints are extracted using a customized JMP script, implemented in the Footprint Identification Technology (FIT) add-in. The data are partitioned using the JMP validation column creator. Morphometric features are used as input variables to predict the otter species as the target variable. The automatic model selection feature in JMP then identifies the best model. Initial findings comparing prints from Asian short-clawed otters with Eurasian otters yielded 100% classification accuracy on the training set (50%) as well as on the test set (50%).

Hi, everyone. Thank you for joining us. The title of our talk today is It's Otterly Confusing! Short-Clawed, Hairy-Nosed, Smooth-Coated, or Eurasian? Just Ask JMP! These four species occur in the same habitats in Asia and Eurasia, and the question is, how do you tell them apart? Using our FIT technique, the footprint identification technique developed by WildTrack, we can actually tell them apart. The presentation will be given by four people: Fred Kistner, who is at the Karlsruhe Institute of Technology in Germany and also a member of the WildTrack Specialist Group; Larissa Slaney, who is a PhD candidate at Heriot-Watt University on the FIT Cheetahs research project and also a member of the WildTrack Specialist Group; and Zoe Jewell and myself, I'm Sky Alibhai, faculty at Duke University, founders of WildTrack, developers of the footprint identification technique, and also members of the WildTrack Specialist Group. I'm going to give a very, very brief introduction, and then I'll hand over to our next speaker. Apart from the otter species, WildTrack works on an extensive number of endangered species in different parts of the world, ranging from the Amur tiger in China to the black rhino in Africa to the jaguar in Brazil, all of them utilizing, in one form or another, our footprint identification technology. Now what does footprint identification technology actually do? One of the things about the footprint identification technology, FIT, is that it works as an add-in in JMP. It's designed to classify species or subspecies by using metrics from the footprints, classify sex, classify age class, and even classify individuals. Those are all elements that are required to understand the population dynamics of any endangered species, the essential foundation elements. As for the conservation applications of the footprint identification technique, the baseline data on numbers and distribution inform data-driven scientific conservation strategies, monitoring of trade in endangered species, and human/animal conflict mitigation, and all of these, in a way, will be shown in how otter conservation works.
Now I'll hand you over to Larissa Slaney, who will start the process of deconfusion. Thank you. Right. Thank you very much, Sky, for this great introduction. Thank you to JMP for allowing us to present our research here. We're very pleased about that. Thank you so much to all of you for being here and showing this interest in our research. Now, before we're going to jump, pun intended, into explaining our data analysis with JMP, I would like to give you some background to this research. We think it is really important to look at this in context, because it's not just about what JMP can do, but how this is applied in the real world and how we scientists can use it. It gives us the opportunity to make changes for the better, basically. Our research looks at footprint analysis, as Sky just said. We are looking at the footprints of Asian otter species. Here you can see four different Asian otter species. They're all classified as vulnerable or critically endangered by the IUCN, and their home ranges overlap, so they are so-called sympatric otter species. On the top left, you can see the smooth-coated otter. Top right, you can see the Asian short- or small-clawed otter; bottom left, the hairy-nosed otter; and at the bottom right, you can see the Eurasian otter. Now, to be able to monitor the different species, we need to collect data, which is difficult with such very elusive species. You hardly ever see them in the wild, but you do see their footprints. Looking at these footprints for each species, when you look at them here, they look very similar, don't they? It's quite tricky to tell them apart. Therefore, we set ourselves the task of finding out whether there is a way for FIT to distinguish between the footprints of these four otter species in a scientific and reliable way. Now, we do have an added problem here, because the front footprints of two of the species have a size overlap, and the hind footprint of another species is morphologically very similar, and also size-wise quite similar, to the front foot of another species. We've got a multi-class classification problem here. Here you can see a map which shows the distribution ranges of the different otter species. The blue area here, that's the Eurasian otter. The red area is the smooth-coated otter. The yellow down here is the small-clawed otter, and the pink over here is the hairy-nosed otter. Although it looks like a large area, it's actually just a few dotted islands there. What is really interesting about this map, though, is that it shows you where their home ranges overlap. There are six areas where at least two, if not even three, of the species overlap. For conservationists, it's really important to find out where the different species live, to what extent, how large the populations are, and to find out as much as possible about the different populations so that we have a good idea of how endangered they are. Now, why is otter conservation important? Well, first of all, otters are classed as keystone species, and that means that they have an effect on their environment disproportionate to their abundance. Just a few individuals can have quite a big impact. They play a really important part in the food chain and contribute to the environmental equilibrium. They're also seen as umbrella species, which means that they confer protection to a large number of other species. Basically, if something happens to the otters, that will have an impact on other species as well.
They're also an indicator species, so they actually indicate the health of their environment. They will not live in polluted waters or in polluted wetlands. Otters returning to an area is always a good sign, because that means water quality and wetland health is improving. Now, threats to otters. There are lots of different threats. Pollution is the one we've already mentioned just now. But another problem is the human-wildlife conflict, and habitat loss. With that also comes loss of prey. An increasing problem, especially in Asia, is the illegal wildlife trade. They are particularly after the fur, for the fur trade, and also after pets. Baby otters are taken out of the wild and used as pets, which is not good for otter conservation at all. Now, how do you approach a conservation project like this? Well, first of all, you need to think about how you want to monitor the population. What do you want to look at? Do you want to look at species distribution? Yes, almost always. Do you want to look at individual ID? Do you want to find out what the sex ratio within a population is? The next thing you need to decide is, do you want to use invasive methods that potentially stress or even harm the animals? Or do you want to use non-invasive technologies or methods to monitor the species, which will not stress or harm the animals? We at WildTrack focus on non-invasive ways to monitor species. In this particular project, we are completely focusing on footprint identification. Now, once you have made those decisions, you obviously need to train people to help you with the data collection, because you can't be everywhere and you can't go everywhere. During times of pandemics, it's even more difficult. So you need to train your team both in person as well as remotely, and that has been a bit of a challenge. Then you need to get all the data in, and the training and the data collection can happen in situ, which means in the field, or ex situ, which means in zoos and other conservation organizations. Once you get the data in, that is when you start the data analysis. In our case, that's when we start using JMP. Other typical issues for any conservation project are funding, of course, and also trying to get conservation policies and management strategies for conservation improved. That's basically our end goal. We are collecting all the data and analyzing all the data so that at the end of the day, we can give that information to governments or other organizations, and they can make an informed decision and make better conservation policies. Let me just go back one more time. On the left-hand side here, you can see one of our lovely zookeepers collecting footprints for us. On the right-hand side, you can see a footprint image that was sent to us from the wild. That's a mystery footprint. We were asked if we could please find out which species left that footprint behind. That's really what we want: that researchers start to send us footprints and we can help them find out which species lives in their area. Fred will, hopefully, later on help us reveal which species this footprint belongs to. Now, we've asked ourselves three research questions, and at the moment we are still focusing on one. This is an ongoing project. At the moment, we are focusing on species classification. Can FIT, the footprint identification technology, identify or distinguish between the four different species of otter we are looking at?
When we've got enough data, and enough data in particular where we definitely know the individuals, we will look at individual classification and also at sex classification. But that's going to be a bit further down the road. So far we have teamed up with nine zoos and otter conservation organizations. We've been training them to collect footprints following our FIT protocol. This has been, again, during COVID, quite challenging. I've not been able to see everybody in person, so some people I've had to train remotely, but they've all been absolutely fantastic, our zoos and zookeepers, and have really risen to this challenge and have started to send in a lot of images, as you can see here on the left. It's still a much smaller sample size overall than we want to have. As I said, it's an ongoing project, but it is enough to give us the ability to now share some preliminary results with you, so we can draw some conclusions. We've included three otter species in this so far. We've only just started to get hairy-nosed otter prints. There's only one hairy-nosed otter in captivity in the whole wide world. His or her, I'm not sure, prints are just starting to come in, and we will update the results with this fourth species at a later time. But for now, we're going to look at three otter species. Yes, so I think it's time to have a closer look at how we do the data analysis, over to Fred. Thank you, Larissa, and let's jump straight into action. Like Sky mentioned previously, FIT has been developed for a wide number of species. When it's fully developed and goes into production, it's an add-in in JMP. Today, I'm going to demonstrate some parts of the data analysis and some parts of the development before it goes into production. In general, I just wanted to give you a little bit of background on how this development is usually done. Our input data is collected with very little and very simple equipment. That's one of the main advantages of FIT, that it can be widely applied with very little equipment. You only need a smartphone and a ruler. If you want to develop FIT models for certain species, you start with an image database that is usually collected from known individuals, as Larissa mentioned. We therefore cooperate with zoos and other wildlife centers. These images are then processed within JMP to extract geometric profiles that capture a lot of measurements, angles, and distances. This data can then be used to develop FIT models. The general output is that you want to look at species, sex, and individuals. If you're able to identify individuals, you want to draw conclusions about population size. Once you have developed the method, you definitely want to test it on unknown individuals. Again, you look at images and get a prediction from the models. Advantages of FIT: it's based on biometrics and it's non-invasive. It's a standardized and cost-effective way to monitor elusive wildlife that cannot be monitored by direct observation. It can be implemented for almost any species that leaves a footprint. It can be combined with other non-invasive methods, and cross-validated models generally have a high accuracy. How to build these models is something that I would like to demo. What I'm going to demo today is technically looking at different footprints. On the top left, you see a hind foot of an Asian small-clawed otter.
On the top right, you see a left front of a smooth-coated otter; on the bottom left, a left front of a Eurasian otter; and on the right-hand side, you see a right footprint from an unknown otter from Nepal. What we are going to do today is process these images. Then I'm going to show you how to quickly develop a classification model within JMP. Then we'll see what the predictions of these quickly developed models are going to be. It all starts with image analysis. That's a script-based implementation within FIT. In the first step, you usually adjust the size of an image so that the footprint is clearly visible and the dominant part of the frame. In order to be replicable, it's important that footprints are aligned following defined rotation points. For otters, these are rotation points below the second and the fourth toe. Then you set a defined set of landmark points. Again, this is species-specific; for otters, I've chosen 11 landmarks. They're in the center. Sorry, I forgot one step. Of course, you need to define a scale first. Here we've got 10 centimeters; this is up here. You can add some additional information. Just to keep it simple, I will name this one Asian short-clawed otter. Then you set the 11 landmark points. You could, for instance, use the crosshair function if you want to make this as precise as possible, obviously, but for time reasons, I'll just quickly run through them. After setting the 11 landmarks, you derive additional points, which are helper points that are also used to extract biometric information. Once you've done that, you just start a new table or append a row. I'll just quickly run through three more images. Again, you need to resize them. Now, with this image, you can see it's upside down. What I like about JMP is that the image window can actually do some image pre-processing. Now, it's a right front. Sorry, I need to flip this one more time. You can do some image processing within JMP, so you don't have to change between software. That's something that I really like, that I can do all my work within one software rather than switching between several packages. Again, I set 11 landmark points. This time, I'll just go over them quick and dirty and hope that the prediction will still be accurate enough. That's a Eurasian otter. Again, I go to append, just two more times, one for the smooth-coated otter. Again, I will set the 11 landmarks, and what the landmarks are used for, I'll show you in a second. Derive points, append row. One last time, the mystery footprint that Larissa mentioned, which was sent to us from a project in Nepal that is, to my knowledge, doing some otter monitoring there. One of the species has not been seen there for at least 30 years. 1, 2, 3, 4, 5, 6. What's different in here is that you have a different scale. That scale factor is something that I need to adjust within here. Again, I'll quickly click through the images. This is normally done a little bit more tediously, but for the demo's sake, I'll try to click through them quickly. And this is an unknown species. This was the smooth-coated one. What you end up with is a big data table. These are points for evaluating the quality of the landmark placement, which I won't go into here. You get X and Y coordinates for each landmark. These X and Y coordinates are used to calculate a large number of measurements. There are more than 100 distances derived, some angles, and some areas. There's quite a lot of information extracted out of a single footprint.
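To give a rough idea of what gets computed from those landmarks (the actual measurement set is extracted by the FIT add-in's JSL script inside JMP, so this is only an illustrative sketch with invented coordinates), distances, angles, and areas can all be derived from the landmark X/Y coordinates along these lines:

```python
import numpy as np
from itertools import combinations

# Hypothetical example: 11 landmark points (x, y) in centimetres for one footprint,
# already scaled with the 10 cm reference and aligned on the two rotation points.
# (The coordinates below are invented purely for illustration.)
landmarks = np.array([
    [2.1, 5.3], [2.8, 6.1], [3.6, 6.4], [4.4, 6.0], [5.0, 5.2],
    [2.5, 4.0], [3.2, 4.4], [4.0, 4.5], [4.6, 4.1], [3.5, 2.8], [3.5, 1.5],
])

# All pairwise distances between landmarks (55 for 11 points; adding derived
# helper points is what pushes the count past 100 in the real FIT table).
distances = {
    (i, j): float(np.linalg.norm(landmarks[i] - landmarks[j]))
    for i, j in combinations(range(len(landmarks)), 2)
}

def angle_deg(a, b, c):
    """Angle at landmark b formed by the segments b->a and b->c, in degrees."""
    v1, v2 = landmarks[a] - landmarks[b], landmarks[c] - landmarks[b]
    cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))

def triangle_area(a, b, c):
    """Area spanned by three landmarks (shoelace formula)."""
    (x1, y1), (x2, y2), (x3, y3) = landmarks[[a, b, c]]
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0

print(len(distances), round(angle_deg(0, 2, 4), 1), round(triangle_area(0, 4, 10), 2))
```

The point is simply that each processed image is reduced to one row of such geometric measurements, which is exactly the table the modeling steps below work from.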
If you repeat the step that I've just shown several times, you'll end up with a data table like this. This is the data table that I'm going to demo the prediction model on. If you have a look at what we have here, if you look at the distribution of the species, our target variable, you see that I have 405 processed images of Eurasian otters, 278 of Asian short-clawed otters, and 127 of smooth-coated otters. The groups are not perfectly equally distributed, but at least each group has quite a significant sample size, which will hopefully work for modeling. Whenever you want to do any sort of supervised modeling, it's a good idea to split your data into training and test data. This can be done very easily in JMP: you can use Make Validation Column within the predictive modeling platform. What I've done is randomly split my data into 80 percent training data and 20 percent test data, on which I will test the models that we're going to build and see how they perform. All right, so I've previously done this. What I'll do now is select my training data, which is 648 rows, and I will just have a look at it in a data view. This is 648 observations. I'll quickly save this as my training set. Again, if you have a look at the distribution, you can see that we have 100 smooth-coated otter prints, 324 Eurasian otter prints, and 223 Asian short-clawed (ASC) prints. It's the same distribution percentages as in the previous data set. In the next step, I will skip a big part of predictive modeling, and that is variable selection. I assume that I have no prior knowledge, so I just add all variables that are available. I have no idea which model is going to work best. What I'm going to do here is use the Model Screening platform, which compares several different machine learning models that are implemented in JMP and compares their performance on this specific task. Again, my target variable is the species. This is the one that I would like to predict. In total, I have 209 measurements extracted from my footprint data, and these are my X variables. These are all factors that can potentially be used. What you see down here is that you can choose the methods that you would like to run, and you can basically choose from all the prediction methods. But for argument's sake and for runtime, I'll only run methods that I know will run through quickly. You can again make it reproducible by setting a random seed. What's also good about the Model Screening platform is that you can add an internal validation step. We already split our data into training and test sets. But in order to have an internal validation of the models that I'm going to develop, I'll add k-fold cross-validation. I'll just put a tick in here so that the models are evaluated using the k-fold cross-validation method. Okay, I just click Run. I get a summary overview. I see that four of my models have been evaluated. You get R-square values, you get performance metrics for those models, and you can see which one is the best. You just click Select Dominant. You can look into the training or into the validation set, which will also give you a misclassification rate. You can see that the misclassification rate for the Bootstrap Forest was quite impressive: almost 95 percent of the data was correctly classified in the validation set.
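For readers who want to see the shape of this workflow outside JMP, here is a minimal scikit-learn sketch of the same idea: a stratified 80/20 split followed by a cross-validated comparison of a few classifier families, and then one final check on the held-out test data. The arrays, class proportions, and model list are placeholders (the real analysis uses the 209 FIT measurements and JMP's Model Screening platform and Formula Depot), so treat it as an illustration of the screening logic, not of the actual results:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder data standing in for the real table: 810 processed prints
# (405 + 278 + 127) described by 209 FIT measurements.
rng = np.random.default_rng(1234)
X = rng.normal(size=(810, 209))
y = rng.choice(["Eurasian", "ASC", "Smooth-coated"], size=810, p=[0.50, 0.34, 0.16])

# 80/20 stratified split, mirroring the validation column made in JMP.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1234
)

# Rough analog of the Model Screening step: compare several classifier families
# with internal 5-fold cross-validation on the training data only.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Bootstrap Forest (random forest)": RandomForestClassifier(random_state=0),
    "Discriminant Analysis": LinearDiscriminantAnalysis(),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    acc = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    print(f"{name}: CV misclassification rate {1 - acc.mean():.3f}")

# Analog of the Formula Depot / Model Comparison step: refit on all training
# data and score once on the held-out 20 percent.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test-set misclassification rate {1 - model.score(X_test, y_test):.3f}")
```

Whichever tool you use, the screening step just produces comparable cross-validated error estimates for several candidate methods before you commit to any of them.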
For argument's sake, let's say that the Bootstrap Forest, the Decision Tree, and the Discriminant Analysis were the three best models, and I'm not sure which one is the best. I can just run those three models as selected. They'll pop up in their respective platforms. What I can do is save the prediction formula into my Formula Depot. I'll do this for the Decision Tree model, so I can close this. I'll do this for the Discriminant Analysis, which is done here. Last but not least, I can do this for the Bootstrap Forest model here. What I have now is... some of these I can close, sorry. I have three models in my Formula Depot. I now want to evaluate how these models perform on new data. Again, I go into my initial data table and I select my test set, so all the rows that have been randomly chosen as the validation set. I'll just quickly save this as the test set. In the next step, I can technically open my Formula Depot and run all those models on the new data table. I will run them on the test set. I want to run all three of the models. I can do a model comparison where I run all three models on the test set. That's actually what I wanted to show. You'll see down here — I hope you can see this; I'll try to zoom in a little bit — that the misclassification rates of these models were actually quite low. The test set is new data that the models have not seen, and there was a misclassification rate for the Decision Tree model of 16 percent, while the Discriminant Analysis and the Bootstrap Forest only had a misclassification rate of 12 percent. You have the highest R-square value for the Bootstrap Forest, and there are several other metrics that you may want to look at. For us, the most important metric is whether the prediction is right or wrong. I usually look a lot at the misclassification rate. But obviously, all the other metrics can be used to evaluate and generate good models. Last but not least, we want to have a look at how these models actually perform on the footprints that were just processed. Again, we go back into our... now, where is it? ...into our Formula Depot, and we'll run this on the four images that we previously processed. If we open this data table, you can have a look at the predictions. What can we see here? I'll make it a little bit more obvious. This was the original data, and if I just look at the species for the first model, everything was predicted correctly. The first model was the Decision Tree, which predicted ASC for ASC, Eurasian otter for Eurasian otter, and smooth-coated for smooth-coated, and the same with the Discriminant Analysis and the same with the Bootstrap Forest. All three models predicted the right species for the images that we processed, and all models were also consistent in their prediction when it comes to the Eurasian otter. I'm doing my PhD mainly on the Eurasian otter, and this was also what I would have guessed as an expert on this particular species' footprints. So it's quite consistent and it works really well. Obviously, you can add more steps and fine-tune your classification rate even more if you look more into feature selection. But I think what I wanted to show you with this demo is how easily the question of what species it is can be answered using JMP. I hope this was quite intuitive. Let me draw some conclusions from that, some results. What I really like about JMP is that it's a great all-in-one solution. I can extract biometric data from images within the FIT add-in. I don't have to switch software to do the data analysis. My extracted biometric data can be analyzed directly. And not only analyzed in a descriptive way — I can build classification models.
There are state-of-the-art machine learning models implemented in it, so it's an all-in-one solution. There are obviously other ways to do this as well, but I just like the practicality. For our particular research question, the otter species classification models at this stage, with the amount of data we have, have performed very well. They're able to predict the species of single unknown otter footprints with a high classification rate. For instance, the neural net, which I did not run in this example because the runtime is a little bit longer, had a misclassification rate of only 10 percent on the same test set. That would increase the classification accuracy by another two percent. If you dive in deeper, you can increase this a little bit more. One thing that, in our experience from working with footprints in the field, increases the classification accuracy by quite a lot is to work with trails rather than with single footprints. An animal, depending on where you are and depending on the substrate, doesn't only leave a single footprint. Without going too much into detail, a footprint can be quite a complex matter, because it varies a lot with the substrate the animal is moving over, with the speed of the animal, and with the gait. There can be quite a bit of noise and background variation. If you work with multiple footprints and take an average of the predictions instead of using single footprints, you can address this variation quite a bit better. This becomes especially important when you want to look at individual identification. Last but not least, JMP is also great for gaining insights into the importance of variables. There are several ways within JMP to look into which variables are actually contributing the most to the predictions. You can look at tree-based methods, where you look at the classification trees and where the splits are made. You can look into column contributions, again for tree-based methods, if you work with Bootstrap Forest, or XGBoost, or something like that. You can see which columns are actually chosen, and how many times. You could look into the prediction profiler if you use the normal modeling platforms within JMP; you can technically have a look at how the prediction of your model is going to change if you change certain values. Or you can do something like a Discriminant Analysis, where you just follow the F-ratios of how variables are selected. This will then, again, give you a lot of insight, because it really depends on your question. Do you want to have a prediction from an external advisor of what species you're looking at? Or do you want to give a guide for working in the field, of which measurements are worth looking into when you're in the field and want to make the classification on the spot, just based on expert knowledge? Yeah, so that's basically it from my side. I would like to thank all the contributing zoos and wildlife parks who allowed us to work with their beautiful animals and shared data with us. Especially, I would like to thank Grace Yoxon from the IOSF, the International Otter Survival Fund, who got us into contact with many of them; KHYS, the Karlsruhe House of Young Scientists, who actually funded this study; and most importantly, Joseph Morgan from JMP, who has been very helpful when it comes to modeling and FIT. He has given me advice more than once when I've run into JSL scripting issues. Yeah, that's pretty much it from my side. Thank you very much for listening so far, everyone.
I just really want to round up here. I think that Sky, Larissa, and Fred have outlined beautifully the challenges we're facing in identifying these different species of otter around the world and the way in which JMP can help us classify them and bring some clarity to this picture: where are they, and how many are there? I'd like to just quickly talk about what's next. How are we building on this? One of the things we're doing is building artificial intelligence into this picture, so that it will allow us to filter and sort a much greater volume of data as it comes in. As both Fred and Larissa have said, we need more data. Artificial intelligence has that potential, which we've already explored in some early training, test, and field data, as you can see at the bottom of this screen. We're getting reasonably good accuracy in our initial trials with AI. We think that it will never give us quite the resolution that JMP will give us. What we're aiming to do is have an AI platform which will be easy for citizen scientists to feed data into. We'll have JMP as the top-level classifier on that platform by integrating it. But the key here really is that this cryptic ground evidence left behind by otters and all the other species is there for us to decode, if we can find a way to do it. It really is transformative in conservation to have a cheap and quick technique to know where these endangered species are. We're very optimistic that using a baseline AI classifier with JMP as a final classifier, we'll be able to make that technique not only deliver the data we need, but integrate people all over the world as citizen scientists to be part of that. We're really grateful to the JMP community for supporting us through this whole journey. We're constantly making new strides. We know that there's interest in this community, and we hope that you will join us when our new mobile app comes out, to even start collecting data yourselves and pushing this forward to where we want it to be, which is a classifier for all endangered species all over the world. Here's to lots and lots of points on the map, and thank you all for listening.
SPC and control charting is a common procedure in industry. Normally, you are controlling and observing a single measure over time. These data are displayed with ±3s limits around the mean on a chart. However, when kinetic curves or other time-dependent behaviors are a matter of quality and consistency of a process, they are much more difficult to display in an SPC chart. These curves are often displayed within their maximal and minimal specification for each time point, which makes the off-spec curves visible. But how can "off-spec" curves be defined while they stay within these max-min limits? The first method to try might be principal component analysis (PCA). If the runtime stamps are always at the same intervals, it's easy to achieve results. However, if they vary, interpolation of the Y values onto the same time stamps is needed, which complicates data preparation. With the Functional Data Explorer in JMP Pro, it becomes very convenient to display the different curves as principal components in a T²-control chart. This presentation shows how we used this tool for quality control of a pressure leakage test and how we made it simple for the practitioner to use.

Okay. Hello. Thanks for the nice introduction. Today I want to present, together with Stefan Vinzens from LRE Medical, work we've done in the last year on control charting kinetic and other time-dependent measurements. So why is that important to present? In case you have time-dependent curves and the curves themselves are important for the quality of your product, it's often very hard to define any kind of specification. In our case here, these curves were evaluated by specialists. Every measurement was sent to a specialist, who looked at the curve and said, "Yes, okay" or "not okay." This is pretty time-consuming and cost-intensive and so on. And on top of that, it's even worse when the person is sick or on vacation. Also, it's a person-dependent thing. It often happens that the person has different moods or different obligations or different priorities and so on. So let's say the assessment of the curve may vary a little bit. It's kind of a reproducibility problem, and we wanted to stop that. That was the reason why we started with this. So, for example, here you see a selection from hundreds of these curves we measured. You see it's a pressure-holding measurement: we have pressure versus time in seconds, and here you see a bunch of curves. The green ones, you see, are labeled "true." True means accepted; they are good. And "false" means rejected; they are not good, so those are bad products. What you see here is that the green ones are relatively tight together. Here in this highlighted screenshot, you see it better: they're pretty close together. Then we have a selection of red ones, most of which are clearly apart from the green ones, but there are also two which are more or less in the same regime. As you see, though, they have completely different shapes: there are some with edges and other S-shaped curves and so forth. So if we would make just a simple ±three-sigma limit around the good ones, for example, as in the upper case here, we would also include the non-good red ones here from the lower picture. A simple ±three-sigma limit approach around this series of curves would not be effective. So we need something that captures the position, which is basically what is done here.
But we also want something that takes care of the shape of these functions, of these groups. How do we analyze the shapes and positions? There are actually two approaches. One has been known for a long time already: principal component analysis. That's the first one. And in more recent times, JMP came up with the Functional Data Explorer, which also gives us the possibility to do the same. But as we have seen, it is not really the same; it is different. So let's start with the old approach, principal component analysis. For doing so, you need to transform the long table, which I show you in the next picture here on the left side: a long table where you have columns with the part number, for example, and with the test date, but also, and this is the important part, the runtime and the pressure, and this for each value. And you see here, this is in seconds, so we measured every few milliseconds. As you also see, it is hardly ever the case that for the next part it's exactly the same series of numbers here, exactly the same time points for the pressure values. And this is actually needed when you want to do a principal component analysis, because you have to transform this long table into a wide table where you have one row per part, and then, for each time slot, the data points. So what we need to do first of all is bring all these runtimes onto the same scale. Here, we have done that by just setting the longest time as 100% and the shortest as zero, and then every number is transferred from milliseconds or seconds onto this percent scale. Then we interpolated, because the values were still not on the same time slots. We had to bring all the Y values onto the same time points, which means we needed to do a kind of interpolation to get them all onto the same grid. And when we have that, we created a new column with the standardized relative times — you'll see them in the example on the next slide, so 0%, 1%, and so on — and then we transposed them into this wide form. From there on, we could do the principal component analysis, save the principal components, calculate the T² values from them, and build the control charts from these T² values. So here you see the slots 0%, 1%, 2%, and so on, and here are the pressure values. In a parallel plot, it looks like this. We have the different slots here on the X-axis, and the pressure is still the same numbers as before. You see that the curves look pretty much the same as before, but now we have about 100 data points, and before we had thousands. So the density of the points is a little less, but still the curves and the shapes and the positions and everything are the same as before. And we now take this wide table, and on all these different time slots, we do a principal component analysis. Then we get this score plot for it. What you already see here is that in the middle part, we have all the green dots — the green ones, as you see here, are the accepted ones, the good parts — surrounded by the red dots, the rejected ones, the false ones. And as you see in this example, we can stay with two principal components, because the first principal component already covers 98% of the variation and the second one 2%, and the other ones you can neglect. So I think it's good enough to save just the first two principal components. And then from these two principal components, we calculated the T² values. This means the T² is principal component one squared plus principal component two squared; that is then the T² for this data point.
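For reference, here is a compact sketch of that whole pipeline in Python with synthetic curves (the curve shapes, counts, and noise level are made up; the real analysis was done in JMP). It normalizes each part's runtime to a 0–100% scale, interpolates onto common slots, runs a PCA, and forms one T² value per part. Note that if the saved principal component scores are standardized to unit variance, the T² reduces to the simple sum of squared scores described in the talk:

```python
import numpy as np

# Hypothetical long-format data: for each part, arrays of (runtime, pressure)
# sampled at irregular, part-specific times (all values here are synthetic).
rng = np.random.default_rng(1)
parts = {}
for part in range(30):
    t = np.sort(rng.uniform(0.0, 60.0, size=2000))               # seconds, irregular
    p = 5.0 * np.exp(-t / 40.0) + rng.normal(0.0, 0.02, t.size)  # placeholder pressure curve
    parts[part] = (t, p)

# Step 1: rescale every runtime to a common 0-100 % scale and interpolate the
# pressure onto the same 101 relative time slots (the long-to-wide transformation).
grid = np.linspace(0.0, 100.0, 101)
wide = np.vstack([
    np.interp(grid, 100.0 * (t - t.min()) / (t.max() - t.min()), p)
    for t, p in parts.values()
])                                                               # shape: (n_parts, 101)

# Step 2: PCA on the wide table via SVD, keeping the first two principal components.
centered = wide - wide.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt[:2].T                                     # PC1 and PC2 score per part
explained = S[:2] ** 2 / np.sum(S ** 2)

# Step 3: Hotelling T² per part from the two scores (equal to PC1² + PC2² when the
# scores are first standardized to unit variance), ready for a control chart.
t_square = np.sum(scores ** 2 / scores.var(axis=0, ddof=1), axis=1)
print(explained.round(3), t_square[:5].round(2))
```

In the talk, the control limits for this T² series are then computed from the accepted (green) runs only, as described next.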
So for every data point, we calculated this T² and brought it onto a control chart. Here, you see the control limits calculated only from the good ones — from the green dots, from the good runs, from the good curves. You see it down here, from the moving range. So this control limit represents the normal, natural variation that we expect from the good ones, and all the non-good ones, the red ones, should be outside of this regime. And you see that this mostly works, but not for these two points here, part number 14 and part number 15. They're inside the control limits, but that is not what we want to see. So first of all, we have to understand what these two are. If you have a look at the next picture and highlight just these two, 14 and 15, then we see, "Ah, okay." These are the ones here which are really within the regime of the green ones. The other ones, which are further apart from it, are easily — not excluded, but let's say distinguished — with this control chart. But these parts were not. So what we learned from this is that with the principal component approach, as we have done it here and in former times, there is no information about the shape of the curve. It's more information about the position, where it is on this Y scale. So we need something else which takes care of the shape of the curve. So we tested the FDE, the Functional Data Explorer, and the first piece of good news is that you can just take the long table as is. There's no need for data transformation or bringing the different time points onto the same slot number or whatever. You just take the raw data as they are. Perhaps you will exclude some real outliers, where the machine has given wrong numbers or whatever, but all the other stuff you can just take as is. And perhaps sometimes you want to do a transformation or so, but it's not really necessary; in this case, we have just taken the raw data as they are. So, starting an FDE on this, we see here again our pressure-time curves, as we've seen them before. But now you see these blue verticals here, and these blue verticals represent the number of knots. So what is a knot? Here we are fitting a spline curve. A spline curve is, let's say, like a ruler that you make a bit more flexible so you can bend it onto all these different curves — let's say, to adjust it the best way to the data. And the more knots this ruler has, the more flexible it is, and the better you can adjust it to all these different points. Here, we used 20 knots — those are the 20 verticals here — and with the Bayesian information criterion, you see that of all the different functions, we get the smallest BIC. And just to have a visual check on this, you see on the lower left all the data points separated out for all the different curves, with a red line on top representing the spline curve which we fitted. It doesn't matter if the curve is pretty straightforward like here, or has more edges, or whatever — it's perfectly aligned. So just optically, it looks very good. And you can also check that with the diagnostic plots. For example, here for this spline fit, the predicted values are displayed versus the actual ones, and you see they are perfectly aligned around this 45-degree line. And also the residuals — the ones left and right here from this prediction line — are pretty small, so there's not a lot of error left that is not represented by this spline. So the fit looks really, really excellent.
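To make the knot idea concrete, here is a small illustrative sketch (synthetic data, and scipy instead of the Functional Data Explorer, so only an analogy to what JMP Pro does internally) of fitting a cubic spline with 20 interior knots to one irregularly sampled pressure curve and checking the residuals:

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

# One synthetic pressure-versus-time curve with irregular sampling (no common
# time grid is needed here, unlike the PCA route).
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0.0, 60.0, size=2000))
p = 5.0 * np.exp(-t / 40.0) + rng.normal(0.0, 0.02, t.size)

# Cubic B-spline with 20 interior knots, loosely mirroring the 20-knot basis
# chosen via BIC in the Functional Data Explorer.
knots = np.linspace(t.min(), t.max(), 22)[1:-1]
spline = LSQUnivariateSpline(t, p, knots, k=3)

# Diagnostics analogous to the actual-by-predicted plot: residuals should be small.
residuals = p - spline(t)
print(f"RMSE of the spline fit: {np.sqrt(np.mean(residuals ** 2)):.4f}")
```

The basis selection via BIC and the subsequent functional principal components are handled inside JMP Pro; the sketch only mirrors the per-curve fitting step.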
So we have now taken the parameters for this split — sorry, for the spline; well, I'm completely confused today — and did a principal component analysis on these. We separated the eigenvalues, which are the weighting factors, and the eigenfunctions for each of these curves. Then we can display them on a score plot — here, the eigenvalues for each of these curves. And you see here, again with the BIC criterion, how many functional principal components you should optimally use, and you see the minimum is at two. So we stayed with two again, as before. And now we have this score plot, and it looks pretty much the same as the one before. We have here in the center part the green curves, which are good, surrounded by the red ones, which were rejected by the specialist. Now we saved all these functional principal components and built a T² plot on them. And here again, the same picture for the T². If we take all the data points to calculate the control limits, then besides these two, every part is in control. But this is actually not what we want to have. We want to understand what the normal variation of the good ones is, to differentiate them from the variation of the non-good ones. So down here, on the second plot, we calculated the control limits only from the good ones. And then you see, okay, nearly all red curves are outside — they're out of control. But there are two: number 14 and number two here are again directly on the borderline, while all the others are clearly separated. So let's have a look at which curves number two and number 14 are. And then we see here: ah, okay. They are really different from the red ones, that's clear, but also kind of different from the majority of the green ones. Number two, which is defined as being true, is the one with a steeper slope compared to all the other green ones, and number 14 is this guy here, which is more or less at the upper end of the regime of the greens. So obviously, these little differences in shape are not strong enough to really show up in this statistical calculation. But we could really separate this guy here, with the strong edge, from the others. And perhaps this one — if the expert had had a look at it a second time, perhaps he would have rejected it, because the slope is too steep or whatever. So these are borderline cases, I would say, both with the statistical approach and, I guess, with the manual approach. But we could clearly detect number 15, which is the one here in the middle, as non-normal behavior with this tool. So you see, the FDE also has some limitations. But overall, in addition to the standard PCA approach, it comes up with this shape information of the curves. And this is also part of distinguishing, in a control chart, whether a curve or a measurement is varying normally or not.
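For reference, the control limits "calculated only from the good ones" are, as I read the description, the standard individuals-and-moving-range limits applied to the T² values of the accepted runs (standard SPC formulas, not numbers from this data set):

```latex
\overline{MR} = \frac{1}{m-1}\sum_{i=2}^{m}\left|T^2_i - T^2_{i-1}\right|,
\qquad
\mathrm{UCL} = \overline{T^2} + 3\,\frac{\overline{MR}}{1.128}
             \approx \overline{T^2} + 2.66\,\overline{MR}
```

Here the mean T² and the average moving range are taken over the m accepted (green) runs only, so the limits describe the natural variation of the good parts; the rejected runs are then judged against these limits and should fall above the UCL.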
To conclude: the standard PCA approach combined with the T² analysis or control chart is good for detecting the position of a curve and distinguishing it from curves which are not in this regime. But as we have seen, it is lacking if the shape of the curve is of importance. With the FDE approach, again combined with the T² analysis, first of all, you need much, much less data preparation. On top of that, it's good for checking the position, as above, but because it also carries shape information about the curves, it includes that in the good/non-good assessment as well. In our work, we stopped at this point, but in principle, you could build a kind of automation: if you use these curves, or these principal components, together with the information on good and bad, you could automate the good/bad decision, because then you would have a model behind it which predicts what you will see. But we haven't done that. We stopped here at the T² chart, to understand the variation and what is varying more than the normal variation. Thank you very much for your attention. And if you have questions, I'm open to answer them now. Thank you.
Repeated k-fold cross-validation is commonly used to evaluate the performance of predictive models. The problem is, how do you know when a difference in performance is sufficiently large to declare one model better than another? Typically, null hypothesis significance testing (NHST) is used to determine whether the differences between predictive models are "significant," although the usefulness of NHST has been debated extensively in the statistics literature in recent years. In this paper, we discuss problems associated with NHST and present an alternative known as confidence curves, which has been developed as a new JMP add-in that operates directly on the results generated from JMP Pro's Model Screening platform.

Hello, my name is Bryan Fricke. I'm a newly minted product manager at JMP, focusing on the JMP user experience. Previously, I was a software developer working on exporting reports to standalone HTML files, JMP Live, and JMP Public. In this presentation, I'm going to talk about using confidence curves as an alternative to null hypothesis significance testing in the context of predictive model screening. Additional material on this subject can be found on the JMP Community website and in the paper associated with this presentation. Dr. Russ Wolfinger, who is a distinguished research fellow at JMP, is a co-author, and I would like to thank him for his contributions. The Model Screening platform, introduced in JMP Pro 16, allows you to evaluate the performance of multiple predictive models using cross-validation. To show you how the Model Screening platform works, I'm going to use the Diabetes data table, which is available in the JMP Sample Data Library. I'll choose Model Screening from the Analyze > Predictive Modeling menu. JMP responds by displaying the Model Screening dialog. The first three columns in the data table represent disease progression in continuous, binary, and ordinal forms. I'll use the continuous column named Y as the response variable. I'll use all the columns from Age to Glucose in the X, Factor role. I'll type 1-2-3-4 in the Set Random Seed input box for reproducibility. I'll select the check box next to K Fold Cross Validation and leave K set to five. I'll type three into the input box next to Repeated K Fold. In the Method list, I'll unselect Neural, and now I'll select OK. JMP responds by training and validating models for each of the selected methods using their default parameter settings and cross-validation. After completing the training and validating process, JMP displays the results in a new window. For each modeling method, the Model Screening platform provides performance measures in the form of point estimates for the coefficient of determination, which is also known as R squared; the root average squared error; and the standard deviation of the root average squared error. Now I'll click Select Dominant. JMP responds by highlighting the method that performs best across the performance measures.
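As an aside for readers who want to reproduce the flavor of this outside JMP: scikit-learn ships essentially the same Efron et al. diabetes data as JMP's Diabetes.jmp sample table (442 rows, ten predictors, continuous response), so a rough analog of the screening step looks like the sketch below. This is not the Model Screening platform itself, the model list is my own choice, and an ordinary linear model stands in for Fit Stepwise:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# 442 rows, 10 predictors (Age ... Glucose), continuous response Y.
X, y = load_diabetes(return_X_y=True)

# 3 x 5-fold repeated cross-validation with a fixed seed, mirroring the
# Model Screening settings used in the demo (K = 5, 3 repeats, seed 1234).
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1234)

models = {
    "Linear model (stand-in for Fit Stepwise)": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Bootstrap Forest (random forest)": RandomForestRegressor(random_state=0),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"{name}: mean R^2 {r2.mean():.3f} over {r2.size} folds (SD {r2.std(ddof=1):.3f})")
```

Either way, what you end up with is a table of cross-validated point estimates per method, which leads straight to the question of how to judge the differences between them.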
What's missing here is a graphic that shows you the size of the difference between the dominant method and the other methods, along with a visualization of the uncertainty associated with those differences. But why not just show P values indicating whether the differences are significant? Shouldn't a decision about whether one model is superior to another be based on significance? First, since a P value provides a probability based on a standardized difference, a P value by itself loses information about the raw difference, so a significant difference doesn't imply a meaningful difference. But is that really a problem? I mean, isn't it pointless to be concerned with the size of a difference between two models before significance testing is used to determine whether the difference is real? The problem with that line of thinking is that it's power, or one minus beta, that determines our ability to correctly reject a null hypothesis. Authors such as Jacob Cohen and Frank Schmidt have suggested that typical studies have power to detect differences in the range of 0.4 to 0.6. Let's suppose we have a difference where the power to detect a true difference is 0.5 at an alpha value of 0.05. That suggests we would detect the true difference on average 50 percent of the time. In that case, significance testing would identify real differences no better than flipping an unbiased coin. If all other things are equal, Type One and Type Two errors are equivalent. But significance tests that use an alpha value of 0.05 often implicitly assume Type Two errors are preferable to Type One errors, particularly if the power is as low as 0.5. A common suggestion to address these and other issues with significance testing is to show the point estimate along with confidence intervals. One objection to doing so is that a point estimate along with a 95 percent confidence interval is effectively the same thing as significance testing. Even if we assume that is true, a point estimate and confidence interval still put the magnitude of the difference and the range of the uncertainty front and center, whereas a lone P value conceals them both. Various authors, including Cohen and Schmidt, have recommended replacing significance testing with point estimates and confidence intervals. Even so, the recommendation to use confidence intervals raises the question, "Which ones do we show?" Showing only the 95 percent confidence interval would likely encourage you to interpret it as another form of significance testing. The solution provided by confidence curves is to literally show all confidence intervals up to an arbitrarily high confidence level. So how do you show confidence curves in JMP? To conveniently create confidence curves in JMP, install the Confidence Curves add-in by visiting the JMP Community homepage. Type "confidence curves" into the search input field. Click on the first entry that appears. Now click the download icon next to the Confidence Curves add-in file, and then click on the downloaded file. JMP responds by asking if I want to install the add-in. You would click "Install." However, I'll click "Cancel," as I already have the add-in installed. How do you use the add-in? First, to generate confidence curves for this report, select "Save Results Table" from the top red triangle menu located in the Model Screening report window. JMP responds by creating a new table containing, among others, the following columns: Trial, which contains the identifiers for the three sets of cross-validation results; Fold, which contains the identifiers for the five distinct sets of subsamples used for validation in each trial; Method, which contains the methods used to create the models; and N, which contains the number of data points used in the validation folds.
Note that the Trial column will be missing if the number of repeats is exactly one, in which case the Trial column is neither created nor needed. Save for that exception, these columns are essential for the Confidence Curves add-in to function properly. In addition to these columns, you need one column that provides the metric used to compare methods. I'll be using R squared as the metric of interest in this presentation. Once you have the Model Screening results table, click "Add-Ins" on JMP's main menu bar and then select "Confidence Curves." The logic that follows would be better placed in a wizard, and I hope to add that functionality in a future release of the add-in. As it is, the first dialog that appears requests that you select the name of the table that was generated when you chose "Save Results Table" from the Model Screening report's red triangle menu. The name of the table in this case is Model Screening Statistics Validation Set. Next, a dialog is displayed that requests the name of the method that will serve as the baseline from which all the other performance metrics are measured. I suggest starting with the method that was selected when you clicked Select Dominant in the Model Screening report window, which in this case is Fit Stepwise. Finally, a dialog is displayed that requests that you select the metric to be compared between the various methods. As mentioned earlier, I'll use R squared as the metric for comparison. JMP responds by creating a confidence curve table that contains P values and corresponding confidence levels for the mean metric difference between the chosen baseline method and each of the other methods. More specifically, the generated table has columns for the following: Model, in which each row contains the name of the modeling method whose performance is evaluated relative to the baseline method; P Value, in which each row contains the probability associated with a performance difference at least as extreme as the value shown in the Difference in R Square column; Confidence Interval, in which each row contains the confidence level we have that the true mean is contained in the associated interval; and finally, Difference in R Square, in which each row is the maximum or minimum of the expected difference in R squared associated with the confidence level shown in the Confidence Interval column. From this table, confidence curves are created and shown in a Graph Builder graph. What are confidence curves? To clarify the key attributes of a confidence curve, I'll hide all but the Support Vector Machines confidence curve by clicking on Support Vector Machines in the local data filter. By default, a confidence curve only shows the lines that connect the extremes of each confidence interval. To see the points, select Show Control Panel from the red triangle menu located next to the text that reads Graph Builder in the title bar. Now I'll Shift-click the Points icon. JMP responds by displaying the end points of the confidence intervals that make up the confidence curve. Now I will zoom in and examine a point. If you hover the mouse pointer over any of these points, a hover label shows the P value, confidence interval, difference in the size of the metric, and the method used to generate the model being compared to the reference. Now I will turn off the points by Shift-clicking the Points icon and clicking the Done button.
Even though the individual points are no longer shown, you can still view the associated hover labels by placing the mouse pointer over the confidence curve. The point estimate for the mean difference in performance between Support Vector Machines and Fit Stepwise is shown at the 0 percent confidence level; it is the mean value of the differences computed using cross-validation. The confidence curve plots the extent of each confidence interval from the generated table between the zero and the 99.99 percent confidence level, which is an arbitrarily high value. Along the left Y axis, the P values associated with the confidence intervals are shown. Along the right Y axis, the confidence level associated with each confidence interval is shown. The Y axis uses a log scale, so that more resolution is shown at higher confidence levels. By default, two reference lines are plotted alongside a confidence curve. The vertical line represents the traditional null hypothesis of no difference in effect. Note that you can change the vertical line position, and thereby the implicit null hypothesis, in the X axis settings. The horizontal line passes through the conventional 95 percent confidence interval. As with the vertical reference line, you can change the horizontal line position, and thereby the implicit level of significance, by changing the Y axis settings. If a confidence curve crosses the vertical line above the horizontal line, you cannot reject the null hypothesis using significance testing. For example, we cannot reject the null hypothesis for Support Vector Machines. On the other hand, if a confidence curve crosses the vertical line below the horizontal line, you can reject the null hypothesis. For example, we can reject the null hypothesis for Boosted Tree. How are confidence curves computed? The current implementation of confidence curves assumes the differences are computed using R-times repeated K-fold cross-validation. The extent of each confidence interval is computed using what is known as a variance-corrected resampled t-test. Note that authors Claude Nadeau and Yoshua Bengio observed that a corrected resampled t-test is typically used in cases where training sets are five or ten times larger than validation sets. For more details, please see the paper associated with this presentation.
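As a rough sketch of that computation (my reading of the Nadeau–Bengio correction; the add-in's exact implementation may differ, and the per-fold differences below are made up): the corrected variance of the mean difference is the sample variance of the k·r per-fold differences multiplied by (1/(k·r) + n_test/n_train), and each confidence level's interval endpoints come from the t distribution with k·r − 1 degrees of freedom.

```python
import numpy as np
from scipy import stats

def confidence_curve(diffs, n_train, n_test, levels=None):
    """Confidence-interval endpoints for a mean performance difference estimated
    with r-times repeated k-fold cross-validation, using the corrected variance
    (1/(k*r) + n_test/n_train) * var(diffs) attributed to Nadeau and Bengio.

    diffs: the k*r per-fold metric differences (method minus reference).
    """
    diffs = np.asarray(diffs, dtype=float)
    m = diffs.size                                          # k * r folds in total
    mean = diffs.mean()
    se = np.sqrt((1.0 / m + n_test / n_train) * diffs.var(ddof=1))
    if levels is None:
        levels = np.concatenate([np.linspace(0.0, 0.99, 100), [0.999, 0.9999]])
    t_crit = stats.t.ppf(0.5 + levels / 2.0, df=m - 1)
    lower, upper = mean - t_crit * se, mean + t_crit * se
    p_value = 2.0 * stats.t.sf(abs(mean) / se, df=m - 1)    # two-sided, null of zero
    return mean, p_value, levels, lower, upper

# Made-up example: 15 per-fold R-squared differences from 3 x 5-fold CV on a
# 442-row table, so roughly 354 training and 88 validation rows per fold.
rng = np.random.default_rng(0)
diffs = rng.normal(loc=-0.02, scale=0.04, size=15)
mean, p, levels, lower, upper = confidence_curve(diffs, n_train=354, n_test=88)
print(f"mean difference {mean:.3f}, two-sided p-value {p:.3f}")
```

Plotting the lower and upper endpoints against the confidence levels reproduces the funnel shape of a confidence curve: the point estimate sits at the 0 percent level and the intervals widen toward the arbitrarily high 99.99 percent level.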
In fact, with this data set, it turns out that we can be about 81 percent confident that Fit Stepwise is at least as good as, if not better than, every method other than Generalized Regression Lasso. Now let's consider the relationship between confidence curves. If two or more confidence curves significantly overlap and the mean difference of each is not meaningfully different from the other, the data suggest each method performs about the same as the other with respect to the reference model. For example, we can see that on average, the Support Vector Machines model performs less than 0.5 percent better than Bootstrap Forest, which is arguably not a meaningful difference. And the confidence intervals do not overlap until about the 4 percent confidence level, which suggests these values would be expected if both methods really do have about the same difference in performance with respect to the reference. If the average difference in performance is about the same for two confidence curves but the confidence intervals don't overlap much, the data suggest the models perform about the same as each other with respect to the reference model; however, in this case we are confident of a non-meaningful difference. This particular case is rarer than the others, and I don't have an example to show with this data set. On the other hand, if the average difference in performance between a pair of confidence curves is meaningfully different and the confidence curves have little overlap, the data suggest the models perform differently from one another with respect to the reference. For example, the Generalized Regression Lasso model predicts about 13.8 percent more of the variation in the response than does the Decision Tree model. Moreover, the confidence curves don't overlap until about the 99.9 percent confidence level, which suggests these results would be quite unusual if the methods actually performed about the same with respect to the reference. Finally, if the average difference in performance between a pair of confidence curves is meaningfully different and the curves have considerable overlap, the data suggest that while the methods perform differently from one another with respect to the reference, it wouldn't be surprising if the differences are spurious. For example, we can see that on average, Support Vector Machines predicted about 1.4 percent more of the variance in the response than did K Nearest Neighbors. However, the confidence intervals begin to overlap at about the 17 percent confidence level, which suggests it wouldn't be surprising if the difference in performance between each method and the reference is actually smaller than suggested by the point estimates. Simultaneously, it wouldn't be surprising if the actual difference is larger than measured, or if the direction of the difference is actually reversed. In other words, the difference in performance is uncertain. Note that it isn't possible to assess the variability in performance between two models relative to one another when the differences are relative to a third model. To compare the variability in performance between two methods relative to one another, one of the two methods must be the reference method from which the differences are measured. But what about multiple comparisons? Don't we need to adjust the P values to control the familywise Type One error rate?
In his paper about confidence curves, Daniel Berrar suggests that adjustments are needed in confirmatory studies, where a goal is prespecified, but not in exploratory studies. This suggests using unadjusted P values for multiple confidence curves in an exploratory fashion, and only a single confidence curve, generated from different data, to confirm your finding of a significant difference between two methods when using significance testing. That said, please keep in mind the dangers of cherry picking and p-hacking when conducting exploratory studies. In summary, the Model Screening platform introduced in JMP Pro 16 provides a means to simultaneously compare the performance of predictive models created using different methodologies. JMP has a long-standing goal to provide a graph with every statistic, and confidence curves help to fill that gap for the Model Screening platform. You might naturally expect to use significance testing to differentiate between the performance of the various methods being compared. However, P values have come under increased scrutiny in recent years for obscuring the size of performance differences. In addition, P values are often misinterpreted as the probability that the null hypothesis is false. Instead, a P value is the probability of observing a difference as or more extreme than the one observed, assuming the null hypothesis is true. The probability of correctly rejecting the null hypothesis when it is false is determined by power, or one minus beta. I've argued that it is not uncommon to have only a 50 percent chance of correctly rejecting the null hypothesis with an alpha value of 0.05. As an alternative, a confidence interval could be shown instead of a lone P value. However, the question would be left open as to which confidence level to show. Confidence curves address these concerns by showing all confidence intervals up to an arbitrarily high level of confidence. The mean difference in performance is clearly visible at the zero percent confidence level, and that acts as a point estimate. All other things being equal, Type One and Type Two errors are treated equivalently, so confidence curves don't embed a bias towards trading Type One errors for Type Two. Even so, by default, a vertical line is shown in the confidence curve graph for the standard null hypothesis of no difference. In addition, a horizontal line is shown that delineates the 95 percent confidence interval, which readily affords a typical significance testing analysis if desired. The defaults for these lines are easily modified if a different null hypothesis or confidence level is desired. Even so, given the rather broad and sometimes emphatic suggestion to replace significance testing with point estimates and confidence intervals, it may be best to view a confidence curve as a point estimate along with a nearly comprehensive view of its associated uncertainty. If you have feedback about the Confidence Curves add-in, please leave a comment on the JMP Community site, and don't forget to vote for this presentation if you found it interesting and/or useful. Thank you for watching this presentation, and I hope you have a great day.
As one of the global technology leaders in the wafer industry, Siltronic AG already has strong analytics capabilities. In this presentation, I share my experience in establishing the use of JMP as the standard in my company, as well as my current view on JMP usage and my future roadmap. I hope to discuss and exchange experiences with other users.   JMP has been used at Siltronic for many years now; I have personally used it for more than 10 years. As we serve the quality-focused semiconductor industry, we need to have good tools and establish advanced methods to be successful. At Siltronic, several other tools are also used for data analytics, but many of them require advanced skills. As a result, they are not always accessible by many of the process engineers, most of whom have experience with Excel and basic statistical knowledge. Simply establishing JMP as the standard for statistical analysis does not by itself guarantee that Siltronic realizes the full potential of data analytics. However, many features in JMP help accelerate learning and speed up the deployment of advanced data analytics to improve processes.       Hello everyone, thanks for joining us today. I really appreciate having the opportunity to tell you my story: how to enhance data analytics skills by advocating JMP in a company. So this is the outline. After introducing my company, Siltronic, and myself, Georg Raming, I will talk about data analytics at Siltronic, how I started with JMP and what my path was, what my first target was, my second target, and what our current approach is. So, about Siltronic. We have four world-class production sites: in the United States; in Europe, in Germany, at Burghausen and Freiberg; and in Asia, in Singapore. We have around 4,000 employees with global scale and reach, and profound knowledge in silicon technologies going back more than 50 years. So this is the history of Siltronic. The first silicon wafers were developed in 1962, and the first 200 millimeter wafer in 1984. Meanwhile, we have founded further sites in Portland, United States, and Freiberg in Germany. The first 300 millimeter wafer was developed in 1990, and the first 300 millimeter production started in Freiberg in 2004. And currently, as written here, we are developing a new fab in Singapore (2021). So this is the electronics value chain. Starting from the raw material, ultra-pure silicon, worth about $1.2 billion, semiconductor silicon wafers are worth around tenfold that, about $11.2 billion. The semiconductors themselves are worth much more again, and the electronics are worth around $1,650 billion. The high demand for these products drives our business with silicon semiconductor wafers. This is where our sites are, in more detail. We have a 200 millimeter fab in Portland, United States. In Burghausen and Freiberg we have 300 millimeter wafer fabs and small-diameter fabs, as well as crystal pulling. And in Singapore we have a 200 millimeter wafer fab, a 300 millimeter wafer fab, and 300 millimeter crystal pulling. The Singapore fabs are among the world's newest and largest, and the central R&D hub is in Burghausen, Germany. This is how silicon wafers are produced. Starting from the raw material, ultra-pure silicon, we have two methods for growing single crystals: Czochralski pulling and Float Zone pulling. After growing the ingot, the mechanical preparation takes place, like ingot grinding, multi-wire slicing, and edge rounding.
And then come the wafer steps, like laser marking, lapping, cleaning, etching, and polishing, and for a part of the products, epitaxy. Our product portfolio is mostly 300 millimeter wafers made with the CZ (Czochralski) process for memory, logic, and analog, and a smaller part is 200 millimeter and 125 millimeter, pulling Czochralski ingots and Float Zone ingots. There we have applications like logic, analog, discretes, image sensors, power, optoelectronics, and IGBTs, and special products like highly doped wafers as well. So our key requirements on the ingot side are purity, homogeneity, mechanical stability, oxygen content, and more like this. On the wafer side, we have flatness, uniformity, edge flatness, surface cleanliness, and the like. And to make the requirements a little more impressive and understandable: what does a purity of one part per trillion mean? It is not more than three to four dissolved sugar cubes in a lake like the Chiemsee in Bavaria, Germany. And flatness of a wafer means 20 nanometers in height on a wafer, like a flat leaf on the surface of the Chiemsee. Now about me. I'm an electrical engineer with a PhD in simulation of electrothermal processes. I also have some statistical background, like a Six Sigma Black Belt, and my task is the development of silicon single crystal growth processes at Siltronic in Burghausen. I also have many years of experience in data science-like tasks, mainly building my own working environment and the environment for my group and others, and I'm responsible for the JMP software at Siltronic, for more than 200 users. Data analytics at Siltronic. We have data science professionals, and they provide services to all. If we as engineers need some reports, those are mainly static, and the definition of new reports takes some time and is not as flexible as we would need, so we often have to do it ourselves. The professionals are using server technologies like Cognos Analytics, Python, and others, but we are lucky to have most of our data in databases. On the other side, JMP is the standard statistics tool for everyone; Excel is used additionally. With JMP there are always some teething problems, because some activation energy is needed to get, let's say, new things working with JMP. But JMP allows full-scale data analytics for everyone: data acquisition, manipulation, data exploration and visualization, advanced statistics, modeling, DOE, and more. How did I make my start with JMP? I have been working at Siltronic since 2001, and as far as I remember, I have always been looking for a good, general, full-scale tool. Around 2009, after already using JMP for some years, I was attracted by the nice explorative possibilities of Graph Builder in JMP, but I did not feel comfortable with the data table due to a lack of understanding, and I felt the data-in procedure was complicated, because I had my tools in Excel dragging data from the database and I had to push it through to JMP. What then really gave me a boost was understanding how to directly import data from the database into JMP. And after I understood how to do that, I decided to use JMP as my standard tool for data analytics. I very much appreciate the ability to store the queries in the JMP data table, to have documentation of where the data comes from, and the ability to update. It is also nice that JMP saves graphs and other evaluations as scripts, and I use that a lot. My first vision: I was alone.
So I only saw myself, but I decided to become an expert in JMP. I wanted to know every button in JMP, but this isn't possible, as I learned later. I did not see the others, only my own environment, and I didn't yet have the idea of collaborating internally. External collaboration is always difficult due to the confidentiality of data. I use data tables with queries a lot, like the one shown here, with the nice scripts to update data from the database, and the JMP table to work on, like here for the famous Big Class data table. I started to explore the features of JMP, but the requirements of my work by far did not cover JMP's full range. So I started to learn in the community, on the web, and also to explore colleagues' use cases, just out of interest. Later I saw that deploying many of these features was also beneficial to my own work. Meanwhile, I like the JMP Starter window a lot; it shows the dynamic range of the software, and I use it a lot in training to show what is possible and to give an overview. My second vision came when I recognized the others. I felt that I could support others in using advanced data analytics, and I started some activities, like a JMP workshop. It was one show for all, so I invited all interested colleagues, but I got only very few people presenting. It's difficult to get people involved in that, and the skill levels were too different to make the show efficient. It is even difficult to get representative data on the skill levels of the participants. We also offered some special one-to-one support, and this worked well for a few people; it was important for me and the other trainers to learn what the colleagues' requirements are, what they really need. Additionally, we offer basic training, and this turned out to be the most important and effective measure, also for getting into contact with new staff and other people. There was also a nice story about how to get others involved as trainers: we tried to encourage recently hired staff, because they are eager to learn, they have available time and resources and good communication skills, and it was quite successful to encourage these people. Last but not least, the involvement of management is important to establish visible collaboration and to justify the effort that is put into this. And all this is not a self-seller; a driver is needed. My current vision is more about establishing a network and creating something like a snowball effect, because with a growing number of users it's no longer possible for one person to address all the people using JMP. So the workload has to be distributed, and more communication lines are needed. My current target is to make sure everyone knows a JMP expert, to offer easy access to JMP knowledge internally without the previously mentioned know-how problem, to increase the usage and knowledge of JMP, and to make the whole story visible, including to management, and included in the procedures. That's why we built up a communication structure like the one shown here. At the top, there is a JMP component owner, which we have for each site. The component owner is responsible for the technical things, software topics, knowledge, training, and so on. Then we have the power users, who are in good contact with each other and with the component owner. These are the people that should be known by every user; every user should know a power user in his or her department whom they can reach out to when any questions occur.
And there are other current measures, where we also get good support from the JMP team, so thanks to JMP for that. Beginners' training I already mentioned; this is really the most important measure and also the easiest to establish. It's a network for free, and you get high visibility. We also included STIPS by JMP in our training program, and this is excellent for learning statistics and JMP. With Martin Demel from JMP, we installed a jour fixe every month, and there we have very good discussions, and more and more people are encouraged to participate in this meeting. It works very well. We included the courses in our internal training system. We also installed a ToolBox. This is a JMP script that collects all the files in a folder structure and makes data and analyses accessible to all users. And there are other measures, of course, depending on the company, like special workshop courses with other focuses, such as the SQL database language, the infrastructure of the data, and statistics in general. My summary is that learning and implementing JMP in a company takes time. It does not come for free, and it needs a lot of personal engagement. But it's worth doing; it will enhance the data analytics skills in a company. You need management support, last but not least to pay for the licenses, but also for other things, like making the effort more visible and keeping it running. And all the solutions, of course, depend on the company and on the people. I felt it was a good idea to start some projects small and to see how they developed. It's important to build and enhance networks around this and to evaluate the interactions, what is happening with the people using JMP and how they interact, and, if necessary, to rethink the strategy. It's worth doing, and it will pay off after a short time through enhanced evaluation possibilities and better decisions. And last but not least, you can see it in the ease of use of JMP, resulting in fun. Okay. Thank you for listening. I'm finished with my presentation, and I would be happy to answer questions if there are any from you, or to hear how others approach this topic. Thanks.
This presentation demonstrates how consumer research methods, choice design modeling and reliability analysis platforms in JMP 16 were used to help high school students optimize their utility and satisfaction in purchasing the right laptop for school usage, pick the right school courses, maximize their exam testing performance, and optimize their time spent on high school STEM projects (with a STEAMS approach). For example, Choice Design and Model platforms were used to conduct survey analysis of laptop purchasing preferences.  This analysis was supplemented with Reliability forecast modeling (Life Distribution) and back-of-envelope calculations in JMP to provide greater context for optimal decision-making in purchasing the best laptop for use. Subsequent phases of the project included MaxDiff Design and Model platforms, which were used to conduct survey analysis of the popularity and difficulty of school courses. The Item Analysis platform was later used to study exam question profiles to help students and instructors assess exam difficulty. The Latent Class Analysis platform was used to study multiple choice exams. Finally, using Explore Patterns/Explore Outliers, potentially unusual patterns in responses among examinees were detected, thus uncovering possible evidence of exam cheating. In this phase of the project work presented, we demonstrate how a modern choice design can be used to optimize survey methodology that avoids sampling bias, and we show how to use the Choice Modeling platform to appropriately analyze survey data.       Hi. Thanks everyone for joining us. The title of this presentation is Choice Design and Max Difference Analysis in Optimizing a High School Laptop Purchase in the Context of a STEM Project or STEAMS Project. My name is Patrick Giuliano. I am a co-author and co-presenter for this presentation. And of course, this is JMP Discovery Summit Europe 2022, and I'm happy to be presenting today. So before I get into our project definition or project charter, I just wanted to mention some general context for this project. This project has a STEAMS orientation, which is basically a STEM framework, but with the addition of a focus on practical AI and statistics, and through the lens of JMP. All right. So the opportunity statement for us here is that every year, students in grade nine at Stanford Online High School need to take core courses and do a series of projects. In fact, there are many projects per year, as many as 150, and many of them require the collection of survey data. In the context of survey data collection, JMP has a powerful Choice Design and Choice modeling platform, as well as Max Difference design capabilities, and those can be used both to optimize survey methodology and to analyze survey data. Within this particular use case, we're going to use JMP 16's Choice Design and Choice modeling platforms to study consumer research, and specifically to assist with the optimal choice of a laptop for a student. We're also going to take this a step further, look at some reliability questions, and do some calculations to look at the opportunity costs associated with purchasing a warranty at different stages of ownership. All right. So here's a quick orientation to our STEM diagram. We had ten respondents in the context of our example, our sample data set. In fact, this is a JMP sample data set, which I will provide on the user community and which is also available in the JMP Sample Data directory.
There are ten respondents in this survey. As you can see here in the lower left-hand corner of the slide, there are four attributes that change within a choice set. There are two profiles per choice set, eight choice sets per survey, one survey total to be distributed, and, of course, ten responses, as we indicated. So in terms of the technology associated with the laptop, we're looking at four key attributes: hard disk drive space, processor speed, battery life, and the computer cost. And we can see a picture of the design, as well as the different choice sets that are paired by number with the different attributes in the columns. And then, quite nicely, we see a probability profiler, which really just shows the opportunity space: how changes across the four parameters on the right result in a change in the probability, or likelihood, of purchase. Okay. So let's provide an orientation to the science and the statistics with respect to consumer research. We're going to think about collecting information, and we want that information to somehow reflect how customers use their particular products or services in general. We want to have some understanding of how satisfied customers are with their purchase, what features they might desire, and what insights can be used to improve the problem statement they're working on in the context of the purchase they're making. In this project, we use JMP's Consumer Research menu, and specifically, we're going to focus on the Choice Design platform. Okay, so here are some nice graphics that highlight the consumer research process. Like many processes, we can see that it's very iterative and cyclical. It's focused on strategy development, decision making, improvement, and solving challenges. But as we'll see later in the analysis, there are specific modeling considerations that Choice Design takes into account to make our modeling procedure a little bit simpler and more effective for these particular types of problems. All right, so going back to our overview of the study. The voice of the customer really speaks to how the manufacturer decides to construct the design in our case. And so we have two sets of profiles, again, that will be administered to ten respondents. The goal is to understand how laptop purchasers view the advantages of a collection of these four attributes. All right. So our particular use case here is going to be the Dell Latitude 5400 Chromebook. This is a very common budget laptop that would be considered appropriate for student usage. Okay, so here's an overview of the Choice Design modeling platform. We can see that the four parameters are specified under the Attributes section: HDD size, processor speed, battery life, and sale price, and the high and low levels of the design are specified over at the right under Attribute Levels. You can see that by generating this design, JMP gives us a preview of the design in the Evaluate Design platform, with our choice sets, our hard disk size, our speed, our battery life, and our corresponding price for each of the choice sets, where each choice set represents a choice between one computer possessing a certain set of features and another. So as I mentioned before, we're going to bring in the sample data and actually point to the specific location where the data is located. We're going to go ahead and put that in the community for our users to practice with this data.
Okay. Here is the display of the model specification window for Choice Design. After we've generated the design, we have to fit the design. We can see that the structure of the design is ten respondents whose profiles are paired, so we're going to set the data format to one table, stacked. Our selected data table populates into the Laptop Results data table. Our response is related to the probability of purchase, which is related to price. We put subject into Subject ID, choice set into Choice Set ID, and the respondent into Grouping. We cast our four attributes into the X role here. And then we have subject effects, or we can have subject effects, which are optional and which we didn't consider in this particular context. We also have an option for missing value imputation, which is nice. One of the things you'll notice here, though, if you're familiar with JMP, is that the Run Model option is here, but none of the other options found in many other modeling platforms. In JMP, we have what's called personality selection; in this particular context, we're limited to a specific model under a specific framework only. Why is that? Well, from our perspective, this is likely because this modeling strategy is very specific to consumer research, and we would really only want to consider main effects between choices, because those effects reflect the choice modeling structure that we're implementing: we pick from either one set of features or the other. Okay. So principally, what mathematical or statistical model is underpinning this type of procedure? It's really a logistic regression type of model, because our Y response is a probability. Our response can be thought of as whether we're going to make a purchase or not, and therefore as the likelihood of making a purchase. What we see in the summary of the model output is that a negative estimate on a term in the model indicates that the probability of purchasing is lower at that particular attribute level. We can see from the summary, the parameter estimates, and the ranking in the effects summary that buyers generally prefer a larger hard drive size, faster speed, longer battery life, and a cheaper laptop. And we can see that speed and price are really the most significant predictors of the probability of purchase. We spoke a little bit earlier about why there isn't an interaction term. We think that it's really because, in the context of this research problem, it's not a practical consideration. It's certainly not likely due to a lack of degrees of freedom, because in this particular data set we had over 100 observations and we're only fitting four terms. Okay. So another thing to notice here, I think, that's not immediately obvious on the slide, is a note at the bottom of the parameter estimates that says "converged in the gradient." That really speaks to the fact that this model estimation procedure, this likelihood-based modeling procedure, is iterative. It's not necessarily deterministic, and it involves iterating to find an optimal solution.
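To make that concrete, here is a minimal sketch, outside JMP, of the kind of likelihood such a paired-choice (conditional logit) model maximizes. The function and variable names are hypothetical, and this is not JMP's implementation, just the standard textbook form of the model the talk alludes to.

```python
import numpy as np
from scipy.optimize import minimize

def fit_paired_choice_model(x_chosen, x_rejected):
    """Main-effects conditional logit for paired choice sets.

    x_chosen, x_rejected : (n_tasks, n_attributes) arrays holding the coded
    attribute levels of the selected and non-selected profile in each task.
    Returns the estimated utility coefficients (part-worths). This is an
    illustrative stand-in for the likelihood-based, iterative estimation the
    platform performs, not JMP's own code.
    """
    diff = np.asarray(x_chosen, float) - np.asarray(x_rejected, float)

    def neg_log_likelihood(beta):
        # P(choose A over B) = exp(U_A) / (exp(U_A) + exp(U_B))
        #                    = logistic((x_A - x_B) @ beta)
        z = diff @ beta
        return np.logaddexp(0.0, -z).sum()  # sum of -log logistic(z), numerically stable

    start = np.zeros(diff.shape[1])
    result = minimize(neg_log_likelihood, start, method="BFGS")
    return result.x

# Hypothetical usage with effect-coded columns for HDD size, speed,
# battery life, and price (one row per answered choice task):
# part_worths = fit_paired_choice_model(x_chosen, x_rejected)
```

Iterating from a zero starting vector with a quasi-Newton optimizer mirrors the "converged in the gradient" message: the routine stops when the gradient of the log-likelihood is essentially zero.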
Okay. So let's take a look at the effect marginals analysis in the context of this report. We can see that the report shows the marginal probability for each of the four attributes at their different levels. What I'm highlighting here is that the marginal probabilities that are the most different from each other indicate where there's the most differentiation in terms of making the purchase decision. So clearly, price and computer processing speed are the most important in terms of really driving that probability decision. As an example, you can see that 71% of buyers may choose an 80 gigabyte over a 40 gigabyte hard drive size, as indicated by the marginal probability of .7129, and 68% of buyers may choose, for example, a $1,000 price over a $1,200 or $1,500 price, as indicated by the marginal probability of .6843. So we can look within each of these marginal probability panels to see what the preference would be on the basis of a given factor, like price or speed, and also look across the effect marginals panels, specifically at the differences in marginal probability or marginal utility, to see which factors are most differentiating in terms of driving the purchase decision. Okay. So the next thing I'm going to talk about is the utility profiler. In addition to the response probability, or likelihood of purchase, we also have something called utility. Utility is really something like probability, but it's defined differently. The utility profiler report shows, in effect, a measure of the buyer's satisfaction in a particular scenario. In the context of utility, a higher utility indicates higher happiness, if you will, and a lower utility, below zero, for example, indicates relative unhappiness. And so we can see from this profiler that the utility is increased, or, if you will, maximized, when purchasers spend the least amount and have the longest battery life, the highest processor speed, and the largest disk size, which is completely intuitive in this context. We can think about utility very much like we think about desirability in the context of traditional experimental design, where we want to maximize the utility function in order to maximize buyer satisfaction. And as I mentioned here, there is a relationship between probability and utility. Mathematically, what is it? Well, we don't get into that here, but it is articulated in JMP's documentation, and it is something that I'm going to be thinking about as part of the dialogue for this talk when it's archived in the user community, so look forward to that. Okay. So now, let's look at the probability profiler. The probability profiler is similar to the utility profiler I discussed on the prior slide, but it's, of course, a little bit different. So what is it, practically? Well, with the profiler set at these particular settings of X, the response probability is 12%. The way we can interpret this is to say that 12% of buyers would consider spending $1,500 to get a laptop with a 40 gigabyte disk, a 1.5 GHz processor, and a four-hour battery life. Okay. So the way we like to think about this is that for any special condition where you want to know the probability at a specific set of factor levels, the probability might be more useful than, for example, utility, which describes a measure of the buyer's overall satisfaction. And you'll notice clearly that the profiler is limited to just two levels. Again, this goes back to the nature of choice design and consumer research: we really want to hone in on the buyer's interest by giving them a successive series of dichotomous choices to choose between, like how, when we're at the optometrist, the optometrist does the lens flipping and says, "Is A better or B better?" A or B, and then you go on to the next one, and then she asks for a similar selection between A or B.
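For reference, in the usual logit formulation of a choice model (stated here as a general textbook relationship, not a quotation of JMP's documentation), utility and choice probability are linked by a softmax:

$$
P(\text{choose } i) \;=\; \frac{e^{U_i}}{\sum_{j \in C} e^{U_j}},
\qquad
U_i \;=\; \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4},
$$

so with two profiles A and B in a choice set, $P(A) = 1/(1 + e^{-(U_A - U_B)})$. That is why a higher utility maps monotonically to a higher purchase probability, and why maximizing utility and maximizing the probability of purchase point to the same attribute settings.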
Okay, so the next part of the project is really around the warranty consideration. I spoke about this a little bit at the beginning, so let's go into it in a little more depth before we conclude. Suppose the optimal choice for the consumer was a laptop sold at a price of $1,000, and suppose that the consumer purchased an extended warranty protection plan: one of two years' duration would be $102, and one of three years would be $141. One of the key questions for the consumer is, after they buy the laptop, do they buy the warranty or the extended warranty as part of the purchase, right after the laptop; do they wait; or do they just not buy the warranty at all? And I think it's pretty obvious to everyone that if you buy a higher-priced laptop, your warranty is going to be higher priced more often than not. So this is also a consideration in terms of making the final purchase decision that we didn't necessarily incorporate in this hypothetical experiment. Okay. So this slide just goes into a little more detail about the warranty. There's a lot of information here and I'll let our audience take a look at it later, but a lot of it comes from a typical computer manufacturer's website, like Dell's. And like anything, there's a lot of language that is specific to an original limited warranty and the different terms and conditions associated with the warranty. Some things that I think I should highlight are the idea of onsite services: the fact that more and more now, we have services where a service provider will come to you and do the repair or exchange of the product at your location. There are also the customer carry-in and mail-in services, very popular now. Even within the context of Amazon, mail-in services have been around for a long time, and customer carry-in is now a part of Amazon's ability to provide service, in the context of going to, for example, a Whole Foods Market for a return. And then, of course, product exchange. But these onsite service models and mail-in service models are very interesting and much more common today. The other thing worth highlighting is that limited warranty services depend on how you purchase your original warranty policy: if warranty services are limited, those services are stipulated in the original warranty policy. Okay. So let's talk a little bit about JMP's reliability and forecasting. In the context of this particular analysis, we're going to use the Life Distribution platform in JMP to do some reliability and forecast calculations. What we're going to try to do is predict future failures of components, or of the computer as a system, to help us get a better sense of whether we should purchase a warranty, and at what time in the life of the product. We can think of this analysis in terms of what we call reliability repair cost: we want to compare the original warranty to the extended warranty protection. And of course, if the failure rate is too high, then we won't want to purchase warranty protection at all. And if the failure rate is very low, then there would also be no need for us to purchase any warranty, because we would have a product that lasted for a long, long time.
You can think of each product, or each computer in this case, as having its own unique reliability model, and that's something to think about in this context. So if you've made a particular purchase in this hypothetical scenario, you can construct a reliability model on the basis of this particular computer. Okay. So this slide shows a nice visual representation of a reliability lifecycle, if you will, or failure rate over time for a particular product. We basically have three phases: a startup and commissioning phase, which is like a burn-in phase for a product; then a normal operation phase; and then an end-of-life phase. These phases can be referred to as, first, the running-in or burn-in phase, like the infant mortality phase in the context of survival analysis for clinical studies; the normal active operation phase, which can be thought of as the phase where random failures may happen; and the end-of-life phase, which is really the wear-out period, the period before the product completely wears out and fails. Corresponding to these periods, we can consider a general range of limited warranty over time, and then a transition point somewhere, where that limited warranty becomes an extended warranty protection policy, so where we go from a short-term warranty, maybe a year, to a long-term warranty of two or three years. So like we said, if the failure rate is very low and the product is expected to last more than two years, then maybe you don't need a warranty policy, because you may plan to replace a product like this every two years anyway; it's just the opportunity cost of price versus the benefit of new technology. The important thing to think about, again, in the context of making the initial purchase, is looking at the startup and commissioning phase of the product. If we purchase an original warranty, which is what we commonly do, it usually covers maybe a year of service, and it's part of the initial purchase of the product. Okay. So now what we're going to do is switch over to a different data set. This is also sample data, which we will point you to on the community. We're going to assume that we have a database related to the Dell laptop. It lists the return months, the quantity returned, and the sold month over on the right. And so what we can do is graph this information. The Xs here are, in essence, the inputs for the failure rate, the reliability model, and how many parts we shipped; that's what these three variables speak to. And based on this probability, we can weigh whether to purchase the warranty policy or not against the repair cost. How do we choose a warranty policy based on the failure rate or return rate? We can use JMP's reliability forecasting capabilities to do this. So here's a picture of the model that we used to fit this data. We have probability on the Y axis, with time in months on the X axis. What JMP does is apply multiple models; we fit all the available models, at least those that produced valid, non-zero estimates, and what we can see here is the ranking of potential models for this reliability data. Weibull is the top choice here based on the AICc, BIC, and -2 Log Likelihood ranking, so we went ahead with the Weibull for our subsequent analysis.
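As an aside, for readers who want to reproduce this kind of distribution ranking outside JMP, a rough sketch follows. It treats the return months as exact failure times (real warranty data are typically censored, which the Life Distribution platform handles properly) and compares a few candidate distributions by AICc; the function name and the candidate list are illustrative.

```python
import numpy as np
from scipy import stats

def rank_life_distributions(failure_times):
    """Fit a few candidate life distributions by maximum likelihood and rank
    them by AICc, loosely mirroring the model ranking described in the talk.
    Failure times are treated as exact; censoring is ignored for simplicity."""
    t = np.asarray(failure_times, dtype=float)
    n = t.size
    candidates = {
        "Weibull": stats.weibull_min,
        "Lognormal": stats.lognorm,
        "Exponential": stats.expon,
    }
    ranking = []
    for name, dist in candidates.items():
        params = dist.fit(t, floc=0)              # location fixed at zero
        loglik = dist.logpdf(t, *params).sum()
        k = len(params) - 1                       # location was not estimated
        aicc = -2 * loglik + 2 * k + 2 * k * (k + 1) / (n - k - 1)
        ranking.append((aicc, name, params))
    return sorted(ranking)                        # smallest AICc first

# For the Weibull, params come back as (beta, 0, alpha); beta > 1 suggests
# wear-out, beta near 1 a roughly constant failure rate, and beta < 1
# early-life (infant mortality) failures.
```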
Okay. So in the upper left-hand corner of this slide, what we see is the actual Weibull failure probability model fit, with its parameter estimates for beta and alpha, the shape and scale. Then on the lower right-hand side of the slide, we estimated the probability of failure at specific months, again using JMP's reliability tools. So how do we tie this analysis into something practical? Well, beta is a very important parameter for the Weibull distribution. A beta less than one might indicate a product that really doesn't survive phase one, that initial burn-in phase that we showed in the bathtub curve; in that case, we wouldn't want to buy an extended warranty at all. A beta approximately equal to one would indicate a product that's in the middle of the curve, in that steady-state period, in which case we wouldn't necessarily want to buy a warranty either, because we have a very reliable product and we wouldn't want to invest money in a warranty we don't expect to use. Only when beta is greater than one, and you can see in this case it's noticeably greater than one, do we really want to consider purchasing a warranty. In this particular example, 1.6 being higher than 1.5 probably indicates that the product is entering that wear-out period, the third phase of the bathtub curve. So if we look here at, say, 35 or 36 months, our failure probability is maybe 20%. That's how we would look at this, right? You go over, you look at the time, and you read off the failure probability in the first two columns of the estimated probability output there. In this particular example, it's not completely clear whether we should purchase a warranty or not, and we likely need more information. If we had seen a beta in the three to four range, that would probably suggest we would want to purchase an extended warranty, because we would anticipate that wear-out would be inevitable. In the next few slides, we're just going to derive a simple decision model to go with this reliability analysis, in the spirit of the choice analysis, the choice modeling methodology, that we applied at the beginning. The consumer decision model has to consider a number of factors: the survival probability at each month and the failure probability at each month, which are, in effect, two views of the same thing; the market value of the laptop each month between one year and three years, since, of course, price depreciation happens; and the monetary loss, if not purchasing the extended warranty protection, should a repair be needed. Then we want to compare that monetary loss to the expense of purchasing the warranty, so it's like a cost-benefit analysis, or a risk analysis, that we're making. So we show here, on this slide, the months after purchase, and then the survival probability at a particular month, the survival probability of the prior month, and then the conditional failure probability at that month. We just use simple conditional probability to calculate that column, the one I should have indicated with the two, all the way over on the right. What we're doing here is using the previous Weibull estimate at each month to calculate the conditional failure probability at the subsequent month. All we're really doing is using the Lag function to generate that column: we take the difference between the survival probability at each month, let's say month 13, and its prior month, month 12, then 14 and 13, and so on, and divide by the survival probability of the prior month. This is conditional probability.
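To make the conditional-probability step explicit, the standard formula (using the month numbers from the example) is:

$$
P(\text{fail in month } t \mid \text{survived through month } t-1) \;=\; \frac{S(t-1) - S(t)}{S(t-1)},
$$

where $S(t)$ is the fitted Weibull survival probability at month $t$. The month-13 entry, for instance, uses $S(12)$ and $S(13)$, and the lagged column $S(t-1)$ is exactly what the Lag function supplies.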
Okay. So let's talk about the market laptop value, which is really the second thing we discussed on this extended warranty protection slide. We can create a simple linear model to describe the decline of the value over time, and that's, in fact, what we did. For the slope of that model, we used four points, and the slope indicates the percent drop every month on average. You can see the slope is about 1.3%, so we expect about a 1.3% drop in value per month on average. Note that we only really care about the decline from 12 months onward, because the first twelve months are typically covered under the original warranty. Okay. So how do we model the cost of not purchasing the extended protection? Well, we can compare to a two-year warranty policy, with failure at two years as the worst case. If we look at the extended warranty protection plan at two years versus three years, $102 versus $141, we get a delta of around $40. Similarly, the cost of not purchasing the two-year protection plan is $48 at two years versus $95 at three years, so that difference is on the order of $40 to $50 as well. What this shows us is that there's maybe a $50 gap between the warranty plan and the estimated cost, and that may be attributed to the services and other fixed costs. But really, to make the best decision about whether to purchase a warranty or not, we want to consider the cost of not buying a warranty in this framework together with the magnitude of beta. Okay. So this is nearly the end of our analysis, and I wanted to just highlight one other thing here. We can use the forecast capability to show us how to determine the return rate and what resources we need, and a lot of that depends on the performance of the service, how good the service is. If we have too many returns and we don't forecast, we may not have enough technicians to do the work. So this is the type of analysis where we're considering the producer; it is from the standpoint of the service provider and the producer, whereas in the prior analysis we were considering everything from the perspective of the purchaser or the consumer. So this is really a producer cost model: if they don't purchase the warranty, what's the labor cost and what's the material cost, that is, the labor cost to handle all the repairs and the material cost to replace parts for repair? We can see that there's a slight upward-sloping trend on the long-term repair forecast, and that trend really tells us what the value proposition is. As a manufacturer, you may be making revenue in the beginning, but you may lose money in the long run if you're doing significant repair work, or, as I said before, you may not have the capacity to do the repair work that you're obligated to do because of the reliability problems with the product.
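One plausible back-of-the-envelope formulation of the consumer-side comparison described above, offered as our reconstruction rather than the exact calculation on the slide (the prices and the roughly 1.3% monthly depreciation are the example values from the talk), is sketched below.

```python
def expected_loss_without_warranty(surv, price0=1000.0, monthly_drop=0.013,
                                   start_month=13, end_month=24):
    """Rough expected out-of-pocket loss if no extended protection is bought.

    surv : mapping from month t to the fitted Weibull survival probability S(t),
           including months start_month - 1 through end_month.
    For each month after the original 12-month warranty, the loss from a
    failure is approximated by the linearly depreciated market value of the
    laptop, weighted by the probability of failing during that month.
    """
    loss = 0.0
    for t in range(start_month, end_month + 1):
        p_fail_in_month = surv[t - 1] - surv[t]           # P(fail during month t)
        market_value = price0 * (1.0 - monthly_drop * t)  # simple linear decline
        loss += p_fail_in_month * market_value
    return loss

# Hypothetical comparison: if this expected loss over months 13-24 is well
# below the $102 two-year plan (or, over months 13-36, the $141 three-year
# plan), skipping the extended protection looks reasonable; if it is
# comparable or larger, the plan may pay for itself.
```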
Okay, so just to conclude, I wanted to share the overall key learnings of our project. We used the STEM, or STEAMS, framework to break this project up into a number of different elements and apply an interdisciplinary approach. We used Choice Design to help consider survey design methodology as well as the analysis of survey data. We also augmented our design with a reliability performance model to qualify our purchase and whether or not it was a good purchase. Of course, the project, in the context of the co-authors, including Mason Chen, was very useful for motivating high school students at SOHS, and teachers, to learn new methods. The final thing is that one thing we could consider in the future is increasing the number of levels to choose from, which would bring our model into more of a traditional modeling framework, a framework that's more like a least squares regression model or another popular modeling framework that looks at continuous data. And so in closing, I just wanted to highlight our references and our statistical details; we're definitely going to provide those to you. Thank you very much for your time, and I look forward to any questions.
CLECIM® Laser Welding Machines Finding the optimal parameters for laser welding of steel plates with JMP Stéphane GEORGES R&D and Data Science Project Manager, Dept. of Technology and Innovation Clecim SAS., 41 route de Feurs, CS 50099, 42600 Savigneux Cedex, France Purpose – To be considered good, a weld bead must meet two criteria: it must be free of defects (such as spatter, humping, underfill, holes, etc.) and resistant (assessed by means of an Erichsen-type cupping test). The search for the optimal parameters for laser welding steel plates is already extremely demanding due to this double constraint. But if, on top of that, you consider the productivity of the processing line and the quality of the incoming material, then the task becomes a challenge! Approach – And it is precisely this challenge that was overcome with the use of JMP. To achieve this result, many steps were implemented, all of them requiring the use of a JMP platform or feature: Base material strength analysis, qualification of the two plates to be welded [Graph Builder, Map shape, ANOVA, Dashboard] Synthesis of the visual observations, production of the weld defects map, which determines a study area of irregular shape where the weld seam is flawless [Graph Builder, multiple-picture hover labels] Weld strength analysis and optimization on a non-homogeneous material and on the defined defect-free zone, given as a set of candidate points [Custom Design, split plot, covariates, uncontrolled factor, Fit Model, Prediction and Contour profilers] Findings and Value – For the given material, the objective was achieved, since all the steps allowed us to propose and validate a set point with maximum productivity and a good weld, both defect-free and resistant, JMP being pliable and able to adapt to all the constraints of the process and the material.   Key words: JMP, laser welding, design of experiments, DoE, covariates, LW21M, LW21H   Hello everyone. Thank you for attending this experience-sharing session on JMP. My name is Stéphane Georges, I'm an R&D project manager at Clecim, and I'm very keen on data science. Today, we'll talk about the design of experiments methodology, and more specifically about a case we encountered when trying to find the optimal parameters for a laser welding process. During this presentation, I will show you how we used various JMP platforms, and how JMP adapted to the reality of the field by taking into account a very irregular study area and a very imperfect study material. Without further delay, I will start my presentation by telling you who we are and what we do. Clecim is an engineering and production company making equipment for the steel industry. We are located in Montbrison, France, in the surroundings of Lyon. The history of Clecim is not new: five years ago we celebrated our 100th anniversary. The area of the site is about 12 soccer fields, and 230 employees work there, mainly managers and technicians. As we like statistics when working with JMP, here is a first one: our population is composed of about 80 percent men and 20 percent women. Concerning Clecim's activities, our first activity is studies and consulting for our flat steel producer customers. We supply individual machines, or we supply complete production lines such as pickling lines, annealing lines, galvanizing lines, painting lines, and so on. We also have a services activity for the supply of spare parts, for expert missions, maintenance, and so on.
I put on this picture a typical layout of a galvanizing line, to give you an idea of such a processing line. This one is dedicated to the automotive market, and the length of such an equipment is about half a kilometer, so a very large industrial plant. When I talked previously about machines, I had in mind rolling equipment such as rolling mills, plate levellers, automated strip surface inspection systems, and even laser welding machines. It is on this last piece of equipment that we are going to focus right now. I will now talk to you about the autogenous laser welding process; autogenous means without filler wire. I will also talk about the parameters and factors that govern this process. But first of all, I would like to introduce our machine, the subject of our study. On the left part of the slide, you can see our welding machine, or in fact not only our welding machine but its containment; the machine is inside the containment for safety reasons, because we are using a laser. The dimension of the door here gives you an idea of this welding machine at scale one: it is a huge industrial machine, our welding machine. On the right part of this presentation, you can see a partial view of the inside of this welding machine, where you can see the clamps and the [inaudible 00:04:37] of the machine. Inside, you see the two portions of the strip, the head and the tail of the strip, that will first be cut, also with the laser, and finally brought together in order to be welded. I will now talk to you about our targets and constraints. Of course, our objective, our target, is to have a good weld. To do that, we need to achieve two objectives. The first one is to have a weld seam which is defect-free. Here on this slide, I put an example of such a weld; this is picture number one. You can see on this picture that the weld seam is quite nice, without any defect. When I talk about defects, here is a list of the typical defects that we encounter when trying to weld with a laser. Typically, we could have some spatters; this is picture number two, a top view. In such a case, this is molten material which is ejected from the top of the weld. We could also have a chain of pearls; this is picture number three, this time a bottom view, and these are droplets at the bottom of the weld seam. We could also have other defects, such as humpings, underfillings, or even holes; this is picture number four here. Typically, this is the case when we have a very low travel speed and a very high power density: instead of welding, we are drilling into the material and we create holes. Of course, this is what we absolutely want to avoid; otherwise, we will decrease the resistance of our weld. This is a transition to the second objective, because we want not only the weld seam to be perfect, but we also want it to be resistant. This is evaluated via an Erichsen-type cupping test, which I will describe a little bit later. Our target is to have a strength as close as possible to the one of the base material. I will now talk to you about the laser welding parameters, the factors governing the process. On the left part of this slide, I put a very schematic view of the process. In gray at the bottom, you can see the two pieces of material that we want to weld together, which may or may not be of the same nature and thickness. In yellow is the laser welding head, which is connected, in blue, to its laser source.
To give you an idea of the kind of power that we use for such an application, imagine that a laser pointer, typically used for a presentation, has a power of just one milliwatt. The laser source we use here is 12 million times more powerful than such a small device, on the order of 12 kilowatts. This is just to explain that we have a very large amount of power available, which is what we need to cut our material and also to weld it. On the right part of this presentation, I put the typical process parameters, which are, first of all, the laser power, the travel speed of the welding carriage, the focusing distance, the gap between the plates, the thermal treatment that we can apply afterwards, and so on. But in fact, for simplicity reasons, in the rest of this presentation we will focus only on the two main ones, which are the laser power and the travel speed. We will also consider that the materials are identical and of the same thickness. You will see that just with these two parameters we will have enough to do. Okay, so the picture is set. We have two targets: one is to have a weld seam free of defects, and we also want it to be resistant. We are now going to focus on our case study towards a good weld. Our first target is a weld which is defect-free, so we are going to search for what is called the weldability lobe. To do that, we need to get some data, and to get these data we will use the so-called power jump procedure. In that case, nothing to do with JMP, even if JMP is a powerful software; this is just how the procedure is called. The picture at the bottom gives you an example of such a procedure. At a fixed speed, we perform 11 successive power jumps; in that case, we switch from two kilowatts to eight kilowatts to three kilowatts, and so on. The target is to reduce the number of welds we have to do: in just one weld, we will have 11 samples and 11 observations to make. Afterwards, we visually examine the upper part of our bead and the lower part of our bead for each slot of this sample. All of these data are collected into JMP, and we use the Graph Builder platform in order to display this map. This is what I'm going to show you right now. I go to my JMP journal here. We have four steps to follow, and this is our first step: building the weldability lobe. I will open my table. I collected all the data in this table. I have all my parameters here, the laser power and the welding speed. In these columns, I inserted my visual observations: Is there any penetration? Yes/No. Do we have material loss? Yes/No. Humping? Yes/No, and so on. At the end of this file, as you can see, I also added two additional columns of expression vector type, where I have inserted the pictures of all my observations. As you can see here also, I requested to have this information displayed in the hover label. Now we are ready to open our Graph Builder. Here I can launch it, okay? All the data have been collected in this map. You can see that on the X axis I put my welding speed, and on the Y axis I have my laser power. I have associated a color or a shape with each defect. Also, we can have a combination of color and shape, which is a convenient way because we can overlay four different types of defects at the same time for each point. This is what I'm going to show you. For instance, if I take this blue point here, according to the legend, we have top spatters. This is exactly what my pictures show you.
Here, this is the picture of the upper part of my weld; here is a picture of the lower part of my weld. Here we can see that we effectively have top spatters, whereas the bottom part is defect-free. For instance, if I take another one, if I take this purple one here. So purple, this is the association of the blue defect and the red defect. We have top spatters and bottom spatters, which is effectively what we can see here on that picture. This is a convenient way to check that I have effectively made no mistake and to see the magnitude of the defect. This feature using the pictures is also very convenient because, for instance, if I take this point here and I pin the pictures, and if I take this additional point here and I also pin the pictures, we can see that, for constant laser power, I can compare the pictures and see what the effects are when varying the welding speed. In that case, when we increase our welding speed, we can see that the width of our weld seam decreases both on the top and on the bottom. This is a convenient way, let's say, to dig into the understanding of our process. I will close that. I'll come back to my case study. We are interested in the good weld area. This is the area that I'm going to highlight here. This is the black area. Okay, like that. Okay, this is our area of interest. Now, what we want to do is, let's say, to investigate the behavior of our weld seam from the resistance point of view in that particular area. Of course, we want to do it with a minimum number of tests, and we will perform a design of experiments on this very irregular study area. Okay, so I go back to my presentation. But before entering into the conception of the design of experiments, we have another interesting topic to address with JMP, because we first have to study the strength of our base material. We have to evaluate the base strength of the material via an Erichsen-type cupping test. Why are we doing that? We have three targets, three objectives. The first one is to establish a reference from the strength point of view. In that way, we will be able to compare the resistance of our base material with the resistance of our weld seam. This is the first point. The second point is to be able to compare the two pieces of material that our customer sends us. We want to be sure that these two pieces have the same behavior. For that, we have to ensure that they can be considered comparable. The last point is that we also want to check that the plates are homogeneous from the resistance point of view and that they do not present any resistance profile in their width or in their length. We do our Erichsen-type cupping tests on the base material. This is what is highlighted here in the first three pictures. We do not have any weld; we are just performing this Erichsen-type cupping test. For simplicity reasons, I will call this procedure the ball test in the remaining part of the presentation. These three ball tests, we do them at three different positions on the material: one located at the center of our sample, one located on what we call the drive side of the machine, and one located on the operator side of the machine. For one sample, we do three tests. What is a ball test? In fact, this is explained in the pictures located at the bottom of this slide. We simply take a ball made of titanium, we press it from the bottom, and we register the deformation of the material and the breakage force.
We do that for our two plates, and we register all of that in JMP and analyze the results in JMP. We will use the Distribution platform and the Fit Y by X platform. This is what I'm going to show you right now. I go back to my JMP journal. This is our second step, analyzing the base material. I will open my file. Here, I put all my data: my plate ID, plate number 1 and 2. For each plate, I do that twice. For each sample, we perform the measurements at three different locations, and here are the recorded values, the recorded strengths. We will analyze all of that, and I store everything into a dashboard. First of all, it is interesting to see our results visually. I will focus, first of all, on this custom map shape. In blue, you have the data for the first plate, where I have my first sample and second sample. For each sample, I have my three ball tests, one located on the operator side, one located in the center, and one located here on the drive side. We have the same for the second piece of material. What we can see here is that the resistance spans the following range, from nine to 10.4 tons. Nothing really particular to see, except maybe that here on the operator side we have, on the same side, the extreme values. Here is the lowest value and here is the maximum value, so maybe that will be something to look at, but we will come to this a little bit later. Our target is to perform an ANOVA in order to see if our two plates can be considered comparable. But before doing an ANOVA, we need to ensure that our data follow a normal distribution and that our variances can be considered equal, so this is what we are going to do right now. Here are the distributions for plate number 1 and for plate number 2. Okay, I know that I do not have a lot of data, but we will consider that we have enough to perform the tests. We will look at the two Anderson-Darling statistics here. What the p-values tell us is that we cannot reject the hypothesis that the data are normally distributed, so that's good. Then, concerning the variances, we use another platform, but we go directly to the end: we perform the variance analysis here, and we will look at the F test, and the F test tells us that we can consider our variances as equal. As our data are normally distributed and our variances are equal, we can safely apply our ANOVA. This is what is mentioned here. On the top part, you have the drawing. Here are the associated data. I will not focus on the data. We will just have a look at the pictures. Here, what we can see is that the extremities of the two diamonds overlap. We cannot conclude that our two plates are different, so this is good. We therefore reach the conclusion that our two plates are equivalent. We can now aggregate all the data. This is what is done here in the distribution. We put all our data together, and finally, we have a global resistance of our plate of 9.76 tons plus or minus 0.17 at two standard deviations. This is the first point, and we will use this information a little bit later. Another interesting thing, and this is what is mentioned here, is that we can also perform the ANOVA taking into account the position. And, as we have previously observed, we can see that the variation on the operator side is a little bit higher compared to the drive side and the center of our plate. We want to understand a little bit why such things happen. To do that, I will come back to my presentation, and we will have a look at the plate we are studying at the moment.
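As a rough equivalent of the checks just described (normality, equal variances, then ANOVA on the two plates), here is a minimal sketch using scipy; the cupping-test values are invented, and the Anderson-Darling call returns the statistic and critical values rather than the p-value shown in JMP.

```python
# Minimal sketch of the base-material checks: normality, equal variances, one-way ANOVA.
# The strength values (in tons) are made up for illustration.
import numpy as np
from scipy import stats

plate1 = np.array([9.6, 9.8, 9.7, 10.0, 9.5, 9.9])   # 2 samples x 3 positions, invented
plate2 = np.array([9.7, 9.9, 9.6, 9.8, 9.8, 10.1])

# Normality: Anderson-Darling statistic per plate.
print(stats.anderson(plate1, dist="norm"))
print(stats.anderson(plate2, dist="norm"))

# Equality of variances: two-sided F test built from the sample variances.
f = plate1.var(ddof=1) / plate2.var(ddof=1)
dfn, dfd = len(plate1) - 1, len(plate2) - 1
p_var = 2 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
print(f"F = {f:.2f}, p = {p_var:.3f}")

# One-way ANOVA: can the two plates be considered equivalent in mean strength?
print(stats.f_oneway(plate1, plate2))
```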
Here is the appearance of our sheet metal. What we can see is that on the drive side here, our plate is nearly flat, I would say. But on the contrary, on the operator side, we clearly see that the plate has some waves. I do not know exactly what the history of this material is, but we can clearly imagine that there was trouble at the rolling mill or the plate leveller and that a higher force was applied on the operator side, leading to this kind of periodic modification of the resistance. This is a new constraint, because we have to take this new information into account in our design of experiments. I will sum up all the information we have before building our design of experiments. First of all, I remind you that we have a very irregular study area. This is the black area that is shown here in the drawing. The traditional way to deal with such things in JMP would be to fill in linear constraints. But here, due to the shape of this area, it's a little bit difficult. Instead, we prefer to use the candidate points technique, which is also called covariates. I remind you also that we have an inhomogeneous plate, and to deal with this phenomenon, we will have to introduce a few parameters into our design of experiments. First of all, we have to take into account the strength variation in the width. To do that, we will introduce a categorical parameter, a 3-level categorical parameter, and the three levels are drive side, center, and operator side. In order to deal with the periodic variation of the resistance along the length of the plate, well, this is a little bit difficult, because, in fact, we do not control this parameter. We have to undergo this variation. For that, we will introduce the weld position from the head of the plate, in millimeters, and we will introduce it as an uncontrolled parameter. Finally, this is not finished. This is what you can see in the last picture at the bottom. This is typically a picture of a weld, with the three ball tests located above it. Well, these ball tests are not independent. They belong to the same treatment. They belong to the same weld. They are at the same weld position. We are in the presence of a split-plot design, where we have hard- and easy-to-change parameters. This is a lot of constraints we have to take into account. Now, I will show you how to do that with JMP, as sketched below. I can go back to my JMP journal, so this is our third step. But I will show you from the beginning, and I will go back to this step. I come back to the file I had previously. I will select here all the rows with a good weld. I will also select my laser power column and my welding speed column. In the Tables menu here, I will extract a subset of this table. I will extract the selected rows and the selected columns here. Okay, I will build my subset of candidate points. But here is the tricky thing, because, in fact, as we also have a split-plot design, I need to tell JMP that it will have the possibility to select each point three times. I will multiply this number of points by three. There are probably a lot of ways to do that. In my case, I will just create three columns, one called drive side, one called center, one called operator side, and I will just stack these three columns. Here I have created my set of candidate points. With the Graph Builder, I will check that everything is okay. On the Y-axis, I will put the laser power. On the X-axis, I will put the welding speed.
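A minimal sketch of this candidate-set construction outside of JMP, assuming the flaws-map table has been exported with hypothetical column names good_weld, laser_power_kW and welding_speed_m_min; the cross join plays the role of the three stacked side columns.

```python
# Minimal sketch of building the candidate points, with hypothetical file and column names.
import pandas as pd

flaws = pd.read_csv("welding_flaws_map.csv")          # hypothetical export of the JMP table

# Keep only the defect-free (P, V) pairs of the weldability lobe.
good = flaws.loc[flaws["good_weld"] == 1, ["laser_power_kW", "welding_speed_m_min"]]

# Replicate each candidate point once per side, so the design can pick a point up to
# three times (drive side, center, operator side) -- the "stack three columns" trick.
sides = pd.DataFrame({"side": ["DS", "C", "OS"]})
candidates = good.merge(sides, how="cross")

print(len(good), "good-weld points ->", len(candidates), "candidate rows")
candidates.to_csv("candidate_points.csv", index=False)
```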
Here, we can recognize the shape of our irregular study area. I will also put the label here into the Color section. Each of my points has been multiplied by three. This is exactly what I wanted. I can use this as a starting point. This is my set of candidate points. I could have added additional points. I could have refined all of that a little bit. I could have added some points here, for instance, and so on. But to be honest, the discretization steps are fine for me. I will keep all the points like that. I will now open my design of experiments menu and go to my Custom Design platform. In the Responses, I want to measure the strength of the material. In the Factors, I will add my first parameters, laser power and welding speed, as Covariate. I select Covariate, and I select laser power and welding speed. Automatically, JMP fills in the lowest and highest values for these two parameters. I will also add my 3-level categorical parameter. I call it side, and my levels are drive side, center, and operator side. I do not forget to also add my uncontrolled parameter. This is the position of the weld. This is an uncontrolled parameter. I do not know the limits, so I put nothing in these boxes. I also do not forget to change this here in order to take into account the split-plot feature. I mention here that my laser power and welding speed are hard to change compared to the side. Everything is correct here, so we can run. Concerning the constraints, I do not need this area, because I already took my constraints into account when selecting my covariates, so I do not need anything here. Concerning the model, I immediately choose RSM. But in that RSM, I will suppress the interaction of the laser power with the position and the interaction of the position with the welding speed, because there is clearly no interaction at all. I will immediately click on the Make Design button because it will take a moment. Here we can see that JMP proposes that I perform eight tests with 24 measurements. This is fine for me. I keep these default parameters. I make my table here. Okay. This is the design concerning the laser power, welding speed, and side. I have the columns where I will record the position of my weld and where I will record the strength, so everything is okay. I will now visualize the points in the Graph Builder. I put, once again, the laser power on the Y-axis and the welding speed on the X-axis. Perfect, and we recognize the curve. I will also put the side here into the Color area. I will add some details. Okay, here we can see that these are the points that have been selected by JMP within the framework of our candidate set. For each point, I have to perform three measurements, on the drive side, center, and operator side, and JMP wants me to perform this test twice. Okay, so this is what we have done. I will now show you the results. I go back to my JMP journal here. This is our last step, the results analysis. I will open the associated table. Here, this is the same table as previously. I have recorded the positions. I have recorded the strength in absolute value, in tons. I have inserted two extra columns here. This is our reference strength, the strength of our base material that we have previously determined, 9.76 tons; in fact, I use it in order to create this extra column, which is the strength in percent compared to the base material. It is this last column that we will take into account in our optimizations. I store my analysis into a dashboard here. Here are the results.
As a reminder, I put the irregular area on the right part. Here are the experimental points that we have performed. I have colored the points using the strength in percent. I use the Fit Model platform in order to create my model. Step by step, I have suppressed the non-significant interactions of our parameters using the p-values here. Finally, I have a model with an explicative power R² of 96 percent, which is quite good because, in fact, it means that only four percent of the whole variation escapes our prediction power. Concerning the collinearities, if I look at our VIFs, our variance inflation factors, all of them are below three. We can now be confident in our model and we can use it in prediction. We can go to the prediction profiler here. First of all, I will focus on this part, on the interaction of the position with the side. What you can see here is that, if I move the position of the weld, it seems that the resistance is not sensitive to the position on the drive side and in the center. But on the contrary, on the operator side, the resistance is clearly influenced by the position. This is exactly what we have seen on our plate. This is a modeling of our waves. Here we do not see this kind of shape, but this can be easily explained. Our welds are not so long; each sample is only six centimeters. We do not consume a lot of material, so we do not follow the whole wave; we go from the bottom to the top of it only. We are quite happy, because we were able to analyze and to model this behavior correctly. This is interesting because we can now have access to the pure effect of the laser power and welding speed. This is what is mentioned here. For this particular material, we can conclude that we can increase the resistance by decreasing the laser power or by increasing the welding speed. Now what we have to do is to determine an optimal point using all the information we have. To do that, we prefer to use the contour profiler. This is what I have mentioned here. In this contour profiler, once again, I put the welding speed on the X-axis and the laser power on the Y-axis. I have reproduced my area, my irregular area where I am defect-free. To do that, I have simply implemented a script. Here is a list of points, and I just asked JMP to use these points and to draw a polygon. I have my black area where I am defect-free. On that drawing, I have also inserted the iso-resistance curves. Here in red, you can see the values of this iso-resistance. We can see that, here, we go from 50 percent to nearly 100 percent. Before doing the optimization, I will add another constraint, because this is not enough. From the productivity point of view, we want, of course, to go as fast as possible, and we want to have the highest welding speed. Using all that information, we have selected the point at six kilowatts and 11 meters per minute. This is the point that is located here. Why? Because this point is located in the black area, where the weld seam is free of defects. We can see, using our model, that we expect this point to have a resistance of at least 90 percent of the resistance of the base material. What is also interesting is that at this point we have enough safety margin around it. Of course, we have tested this point, and this is the result that I'm going to show you right now. I go back to my presentation, which is located here. This is the result of the optimal point we have chosen.
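As an illustration of this lobe-plus-iso-resistance overlay, here is a minimal sketch in Python. The quadratic strength model and the lobe vertices are invented stand-ins for the fitted JMP model and the real weldability lobe, and the last lines simply pick the fastest in-lobe grid point above 90 percent.

```python
# Minimal sketch: overlay iso-resistance curves of an assumed quadratic strength model
# with a weldability-lobe polygon, then pick the fastest feasible point above 90 %.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.patches import Polygon

def strength_pct(P, V):
    # Hypothetical RSM fit: strength (% of base material) vs power P (kW) and speed V (m/min).
    return 95 - 4.0 * (P - 6) + 1.0 * (V - 10) - 0.3 * (P - 6) ** 2 - 0.1 * (V - 10) ** 2

lobe = np.array([[5, 4], [9, 5], [14, 7], [14, 10], [8, 9], [5, 6]])  # (V, P) vertices, made up

V, P = np.meshgrid(np.linspace(3, 18, 200), np.linspace(2, 12, 200))
S = strength_pct(P, V)

fig, ax = plt.subplots()
cs = ax.contour(V, P, S, levels=[50, 60, 70, 80, 90, 95], colors="red")
ax.clabel(cs, fmt="%d %%")
ax.add_patch(Polygon(lobe, closed=True, fill=False, edgecolor="black", linewidth=2))

# Keep only grid points inside the lobe with strength >= 90 %, then take the fastest one.
inside = Path(lobe).contains_points(np.column_stack([V.ravel(), P.ravel()])).reshape(V.shape)
ok = inside & (S >= 90)
i = np.argmax(np.where(ok, V, -np.inf))
print("candidate preset:", V.ravel()[i], "m/min,", P.ravel()[i], "kW")

ax.set_xlabel("Welding speed (m/min)")
ax.set_ylabel("Laser power (kW)")
plt.show()
```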
First of all, on this slide, you can see on the left part the upper weld bead pictures and on the right side the lower weld bead pictures. You can see that the weld seams are free of defects. We have no spatters, no droplets, no chains of pearls, no holes, et cetera. This is what we wanted. From a resistance point of view, first of all, if we focus on the pictures, we can see that it is the material that breaks and not the weld seam that opens, so this is the first good point. Concerning the resistance, we can see that each of them is higher than 90 percent. This is exactly what we wanted, so let's say that we have achieved our target. This is the end of this presentation. It is now time to conclude. First of all, going back to my JMP journal, I would like to mention that with this presentation you have the possibility to download an article that will be located on the website. Here you have a full article explaining the whole case study. I have added additional material. If you are interested in knowing more, please feel free to download it. As a conclusion, I would like to mention that if some of you are interested in knowing more about covariates, there are two available sources. The first one is an article that you can find on the JMP user community entitled "What is a covariate in design of experiments?" Also, from the same author, you have a webinar entitled "Handling Covariates Effectively when Designing Experiments." To conclude, I put here a quote from Mark Twain that humorously tells us that "Facts are stubborn things, but statistics are pliable." Inspired by Mark Twain, I would like to say that facts are certainly stubborn things, meaning complex, but don't panic, because, in fact, JMP can easily adapt to the reality of the field. In my case, it was able to adapt to a very irregular study area and also to a very imperfect study material. This is the end of my presentation. I will now answer your questions, so please feel free to ask. If we run out of time, I mention here my contact information, so please feel free to contact me. I'm now waiting for your questions. Thanks a lot. Bye bye.

1. Background and introduction
Steel strip manufacturing constantly reinvents itself by proposing new metallurgical concepts, which requires tackling the technical limitations of production systems. As a provider of mechatronic solutions for steel plate processing, Clecim SAS recently expanded its laser welder line with a next-generation machine capable of cutting and welding heavy plate using a 12 kW laser source. Addressing the usual drawbacks in maintenance, operation and safety of current welding systems based on mechanical cutting and CO2 laser welding, the newly developed LW21H (Heavy) welding machine benefits from a smarter approach by processing thicker strips up to 9 mm with solid-state laser cutting and welding. This new generation of welders, heir to Clecim SAS' 20 years of experience in welding and in particular its little sister, the LW21M (Medium), pushes back the current limits of performance and the technological drawbacks observed in solutions for thicker materials. It is materialized by a 1:1 scale pilot designed, manufactured and tested in Clecim SAS workshops [Figure 1].

Figure 1a – An external view of the containment of the heavy laser welder at the Montbrison workshop. The size of the door gives an idea of the dimensions of this industrial welder.
Figure 1b – A partial view of the inner part of the machine.
Head and tail of the two plates will be cut by laser technology and then welded together.

In 2019, the laser cutting process was extensively studied and the use of machine learning techniques allowed for the conception of a model capable of delivering robust cutting presets across the thickness range. Today, the focus is on the laser welding process and the acquisition of high-quality data that will soon allow the creation of a welding model, the final step towards a completely automated machine. To achieve a good weld, two criteria must be met: a weld seam free of defects (such as spatters, droplets, etc.) and a good strength. To reach this result on a given material, several steps have been followed:
- The determination of the welding flaws map and the weldability lobe, the area where the weld seam is defect-free
- The determination of the base material strength, to ensure that the pieces of material are identical and homogeneous
- The analysis and modeling, via a DoE, of the weld seam strength on the previously determined weldability lobe, which usually has a highly irregular shape

Let's now dive into the details of these exciting steps, all of them requiring the use of a JMP platform.

Notation
M - Material type
H - Material thickness
P - Laser power
V - Travel speed of the welding carriage
F - Focusing distance
G - Gap between the plates
T - Thermal treatment

2. Laser welding process and factors
The welding process is made of 3 parts: the two plates to be welded, which can be of the same nature and thickness or not, the laser welding head mounted onto a travelling carriage, and its 12 kW laser source. To give an idea of the delivered power, a classic laser pointer used for a presentation typically has a power of 1 mW. In comparison, the laser source used by Clecim SAS to cut and weld the pieces of material is 12 million times more powerful.

Generally speaking, the influencing parameters of laser welding belong to two categories, namely those related to the material to be welded itself, such as its nature M and thickness H, and those related to the process, such as the laser power used P, the speed of the welding carriage V, the focusing distance F, the heat treatment T or the spacing between the sheets G. To a lesser degree, other parameters are involved, such as the inclination of the laser welding head, the type of shielding gas, its pressure, etc. Within the framework of this paper, only the laser power P and the travel speed of the welding carriage V will be considered. The materials to be welded will be identical and of the same thickness.

To put it in a nutshell, for the given pieces of material (M, H), two factors (P, V) have to be optimized with the goal of getting a flawless and resistant weld seam.

3. Weldability lobe
Figure 2 – Welding flaws map – JMP Graph Builder is used to view the weld defect map. The major defect areas can easily be recognized: partial penetration (yellow), holes (orange), spatters (blue, purple), chain of pearls (horizontal stripes), defect-free area (black).
Pictures of the top and bottom weld seam are displayed in the tooltip area when moving the mouse over them.

The first step of the experimental approach consists in performing tests in order to build a map of welding defects and thus determine the weldability zone, i.e. the defect-free zone. Depending on the thickness of the material, the number of tests to perform can quickly become large. Indeed, the goal is to test all the pairs (P, V) and to visually observe the quality of the weld bead to know whether the combination (P, V) generates a defect or not. In order to drastically reduce the number of tests and to save time, the so-called "power jumps" procedure is used. In a single trial, at fixed speed, 11 power jumps, from 2 to 12 kW in 1 kW steps, are carried out, giving the possibility to perform 11 tests in one. Regarding the welding speed, steps of 2 m/min were used from 3 to 18 m/min. In the end, the upper and lower parts of 88 weld seams were visually inspected and qualified. The results were stored in a JMP table and evaluated using the Graph Builder platform [Figure 2]. The welding speed V is shown on the x-axis and the laser power P on the y-axis. For a given speed, we find the 11 visual observations corresponding to the 11 power jumps of the test protocol. Thanks to the association of a color and a shape in a single marker, it is possible to represent four welding flaws at the same time and hence to visualize the major defect areas. For each pair (P, V), pictures of the top and bottom weld seam have also been taken and stored in two expression/vector columns so that they appear simultaneously in the tooltip area. By moving the mouse over the points, the pictures are displayed. This functionality makes it easy to compare the influence of a factor change on the weld bead facies and thus to progressively enter into the understanding of the laser welding process.

4. Base material strength analysis
Before going further in the analysis of the welds, it is necessary to evaluate the strength of the base material, and this for 3 reasons:
- The first reason is to establish a reference strength so that we can make comparisons.
- The second one is to make sure that the 2 plates sent to us by our customer are comparable.
- The third one is to make sure that the plates are homogeneous and that they do not have any resistance profile in their width, for instance.

To do that, Erichsen-type cupping tests are performed on plates without any welds. Stamping is done via a ball and the breakage resistance is automatically recorded. The protocol provides for three measurements across the width of the plate. Positions are respectively DS (drive side of the welding machine), C (center) and OS (operator side). The various results are stored in a JMP table and summarized in a dashboard [Figure 3].

Figure 3 – Base material strength analysis – The dashboard is composed of various JMP platforms: Graph Builder, Distributions and ANOVA. The custom map shape[1] of the Graph Builder displays the two samples corresponding to each of the two plates and the position of the various cupping tests, colored by strength.
In the ANOVA, the overlap of the two diamond tips shows that the plates can be considered identical. The chart on the right shows that the strength variance is higher on the operator side (OS). Once aggregated, the data in the bottom distribution show an average strength of 9.76±0.18 tons (at 2σ).

In summary, the two plates to be welded can be considered identical, but further investigation is needed to understand why the strength variance is higher on the operator side.

Figure 4 – Appearance of the plate – The plates present a relatively flat aspect on the drive side and waves on the operator side. The history of the plates is unknown, but there must have been a rolling or planishing issue with a higher force applied on the operator side, which created this appearance and a periodic modification of the strength.

To understand the differences in resistance on the operator side, it is necessary to pay attention to the visual aspect of the plate [Figure 4]. Due to potential force variations during its treatment, the plates are inhomogeneous in terms of strength in their width and length.

5. Weld bead strength analysis
The construction of the test plan requires taking into account all the various constraints, 4 in number:
- The first constraint is related to the irregularly shaped region[4-Ch.5] of the weldability lobe. The traditional way to handle it would be to delimit the study area using multiple linear constraints. Although possible, it is the technique of the candidate points, also called covariates[2,3,4-Ch.9] in JMP, that has been chosen for simplicity's sake.
- The second one, due to the plate inhomogeneity, is related to the strength changes across the width. To take this effect into account, a 3-level (DS, C, OS) categorical parameter is envisaged.
- The third one, also due to the plate inhomogeneity, is related to the periodic and incurred strength changes along the length. This parameter cannot be controlled, but it must nevertheless be considered in the future test plan.
- Finally, the fourth one is related to the fact that the 3 values of the categorical parameter are not independent, since they belong to the same treatment (i.e. weld). Consequently, a split-plot design[4-Ch.10] with hard- and easy-to-change parameters has to be considered.
Figure 5 – Building of the custom design of experiments – The Custom Design platform allows the creation of a completely customized test plan. The Responses part provides the list of responses to be optimized; in this case the goal is to look for the maximum strength. The Factors part presents how the four constraints have been addressed. As the position is uncontrolled, no values are entered for its limits. The Model part displays all the factors and interactions considered in the model. RSM (Response Surface Methodology) is used; the interactions between the laser power, the welding speed and the side have been removed as they were considered not significant. Finally, the Design Generation part proposes 8 trials and 24 measurements.

The creation of the custom design of experiments is explained in [Figure 5]. A total of 8 tests and 24 measurements is finally retained. The test plan is executed and, for each triplet (P, V, Side), the following data are recorded: the position of the weld (in mm, from one end of the plate) and the value of the strength (in absolute terms and in percent of the base material strength). The strength of the welds is then modeled using the Fit Model platform [Figure 6].

Figure 6 – Modeling of the weld strength – The results are presented in a dashboard. The irregular shape of the weldability lobe is recalled in the top right chart. The experimental points proposed by the Custom Design platform and the associated strength values, in percent, are summarized in the bottom right chart. Finally, the Fit Model platform on the left displays the modeling result. An explicative power R2 of 96% has been reached, meaning that only 4% of the variation escapes its predictive power. The Effect Summary shows that the main effects (laser power, welding speed and position) are significant. The side factor is not directly significant, but becomes so when associated with the position. The VIFs (Variance Inflation Factors, not displayed here) all have a value smaller than 1.6, showing no multicollinearity issue (no linear relationship among two or more explanatory variables).
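For readers who want to reproduce this kind of split-plot analysis outside of JMP, a random intercept per weld can stand in for the whole-plot error term. The sketch below assumes a results table with hypothetical columns power_kW, speed_m_min, side, position_mm, strength_pct and weld_id, and follows the model described above (RSM terms plus the position-by-side interaction, without the position-by-power and position-by-speed interactions).

```python
# Minimal sketch of a split-plot strength model with statsmodels; column and file names
# are assumptions, not the actual Clecim data set.
import pandas as pd
import statsmodels.formula.api as smf

runs = pd.read_csv("weld_strength_results.csv")       # hypothetical export of the DoE table

model = smf.mixedlm(
    "strength_pct ~ power_kW + speed_m_min + position_mm + C(side)"
    " + I(power_kW**2) + I(speed_m_min**2) + power_kW:speed_m_min + position_mm:C(side)",
    data=runs,
    groups=runs["weld_id"],                            # whole plot = one weld (hard-to-change P, V)
)
fit = model.fit()
print(fit.summary())
```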
The resulting model being of good quality, it can be used for prediction. After correcting for the effects of weld position and side, the trends attributable to laser power and travel speed are clearly visible in the Prediction Profiler [Figure 7].

Figure 7 – Checking the model's behavior with the Profiler – The upper profiler refers to low position values, the lower one to high position values. The model response (strength in %) is shown on the y-axis, the factors on the x-axis. Weld after weld (increasing position), the strengths on the DS and C sides remain mostly unchanged while the strength on the OS side changes dramatically, as observed visually. This phenomenon being well modeled, it is now possible to access the pure effects of laser power and welding speed. The weld strength increases when the laser power decreases and the travel speed increases.

For the considered material, there does not seem to be any interaction between the laser power and the welding speed. The weld strength therefore increases with the travel speed and when the laser power decreases.

However, the work is not over yet. The limits of the weldability lobe must also be carefully considered in the search for an optimum. On the basis of the Prediction Profiler alone, this is not easy, so it is the Contour Profiler's turn to play!

The use of the Contour Profiler makes it possible to superimpose the iso-resistance curves from the strength model with the weldability lobe [Figure 8]. Finding the optimal point requires locating a point that is not only within the weldability lobe but that also has the highest strength.

Figure 8 – Optimization with the Contour Profiler – The Contour Profiler displays the welding speed on the x-axis and the laser power on the y-axis, with the position and side values fixed. Arbitrarily, the side value was set to DS. As for the position, it was set to the latter. The weldability lobe, where the weld bead is free of defects, was reproduced in black using a script and the polygon drawing function. The iso-resistance curves of the model, in red, are also plotted, together with the associated resistance percentages. The welding speed and laser power sliders are set to the coordinates of the optimal point, materialized by the black cross in the center of the graph.
If we add the fact that the welding speed should be as high as possible for maximum productivity, the point of coordinates (11 m/min, 6 kW) proves to be ideal. Not only does it meet all the above criteria, but it also offers a satisfactory safety margin for an industrial process. Of course, these settings have been tested. The results are presented in [Figure 9] and [Figure 10]. In summary, the weld seam has a defect-free surface with a strength across the entire width comparable to that of the base material. The objective has been achieved!

Figure 9 – Optimum preset and weld seam appearance – The figure shows the upper (left) and lower (right) weld bead facies. They are free of the main welding defects.

Figure 10 – Optimum preset and weld strength – The figure shows the results of the 3 Erichsen-type cupping tests performed on the DS, C and OS sides. Visually, it can be seen that it is the material that breaks and not the weld. Moreover, all the tests show a strength level comparable to that of the base material.

6. Conclusion
Finding the optimal laser welding parameters for a given material is not easy. Fortunately, JMP offers a suite of platforms that, combined, provide a rigorous approach to achieving our goal.

The use of Graph Builder, custom map shapes and dashboards allowed us to visually organize our data, both in terms of welding defects (welding flaws map, weldability lobe) and strength (Erichsen-type cupping tests).

The ANOVA and Distribution platforms were used to make informed decisions about the equivalence of the plates to be processed and their level of strength.

Once the weldability zone was determined (the zone where the weld beads are free of defects), the strength of the weld bead was studied using the design of experiments methodology. In this paper, only 2 parameters were considered (laser power, welding speed). The Custom Design platform allowed for a high degree of customization of the tests in relation to the encountered constraints. The highly irregular shape of the study area gave the opportunity to use the candidate point method (covariates), in addition to other features such as split-plot design and uncontrolled factors.

The modeling of the weld strength via the Fit Model platform made it possible not only to understand the physical phenomena involved but also to proceed to a multi-criteria optimization via the Contour Profiler. Finally, the objective was achieved, since all these steps allowed us to propose and validate a set point with maximum productivity and a good weld, both defect-free and resistant.
This step is part of an extensive, very high-value data acquisition program that will allow, just as it did in 2019 with the laser cutting process, the development of a laser welding model that will provide robust welding instructions regardless of the incoming product, the final step to fully automate the machine.

7. About Clecim SAS
Clecim SAS, based in Montbrison (Loire), joined the Mutares group on April 1, 2021. It is an engineering and production company, bringing its expertise in services and manufacturing, in particular to the metallurgical industry. Its main activity is the operational support of the performance of its flat steel producer customers, in particular for the automotive market. This support takes the form of studies and advice on the improvement of their production tools, the supply of special machines to optimize performance and, if necessary, the supply of complete production lines based on the latest technologies.

For decades, Clecim SAS has been promoting innovation in the steel industry and is constantly looking for new solutions to provide metal producers with state-of-the-art equipment, allowing them to gain a competitive advantage. Our latest areas of focus include new technologically differentiated solutions, advanced process analysis and optimization. Of particular note in this area are world-renowned high-level solutions such as special laser welding machines, surface inspection systems, rolling equipment, and galvanizing lines for flat steel for the automotive market.

With its own factory, Clecim SAS is able to manufacture and test complete machines. The company has many skills (engineering, manufacturing, testing) allowing it to master the entire value chain. Clecim SAS can also provide its customers with a pilot rolling mill for the development and confirmation of flattening, rolling and tribology models.

Figure 11 – Clecim SAS (Montbrison, France)

About the author
A graduate of Grenoble INP in Materials Science and Process Engineering, Stéphane GEORGES (47) joined Clecim SAS in 2001. After holding several positions in the Automation, Modeling and Process Control department, and after 3 years of expatriation at Siemens in Erlangen, Germany, Stéphane moved to the position of R&D Project Manager. His various missions related to industrial processes have led him to use his skills in smart experimentation, statistical analysis and modeling, coding, machine learning and deep learning to make the most of data.

Acknowledgment
We would like to express our thanks and gratitude to Florence Kussener of JMP for her support in using the software and preparing us for our first ever participation in the Discovery Summit.

References
[1] André Augé, JMP Addict: Tips and Tricks Workshop: Customize Your Reports and Chart Builder Tips, webinar, 2021, shorturl.at/jqAZ2
[2] Ryan Lekivetz, What is a covariate in design of experiments?, article, JMP Community, 2021, shorturl.at/hqvH4
[3] Ryan Lekivetz, Developer Tutorial: Handling Covariates Effectively when Designing Experiments, webinar, JMP Community, 2021, shorturl.at/fvEZ0
[4] Peter Goos, Bradley Jones, Optimal Design of Experiments: A Case Study Approach, Wiley, 2011

CLECIM® is an internationally registered trademark owned by Clecim SAS. Clecim SAS, all rights reserved – 2022-feb-02
Sepsis is a life-threatening condition which occurs when the body's response to infection causes tissue damage, organ failure, or death. In fact, sepsis costs U.S. hospitals more than any other health condition, and a majority of these costs is for sepsis patients who were not diagnosed at admission. Thus, early detection and treatment are critical for improving outcomes. This presentation examines an actual clinical data set, obtained from two U.S. hospitals and recently published on Kaggle. In particular, a number of predictors, drawn from a combination of vital signs, demographic groups, and clinical laboratory data, are examined. Using JMP, such issues as missing values, outliers, and a highly unbalanced, categorical outcome variable are dealt with. In addition, this presentation shows how visualization, interactivity, and analytical flow can lead to a more compact and integrated analysis — and a shorter time to discovery.       Good morning. Good afternoon. Good evening, everyone. My name is Stan Saranovich, and I am the principal analyst at Crucial Connection, LLC. I am located in Jeffersonville, Indiana, right across the river from Louisville, Kentucky, United States of America. Today I'm going to talk about sepsis predictions from clinical data using JMP Pro 16. But almost all of what I am going to do here will be available in the standard version of JMP. Now let's talk about sepsis for a minute. To start off, sepsis is a life-threatening condition which occurs when the body's response to infection causes tissue damage and can cause organ failure or even death. In fact, sepsis costs United States hospitals more than any other health condition. So if we could predict sepsis and detect it early, we could improve the outcomes of critical care patients and also lower the cost of health care. So we're going to look at a data set today, and this is an actual data set of clinical data. It was collected from two hospitals in the Boston, Massachusetts, area in the United States. These two data tables were published as a contest on Kaggle by a cardiology group, and the results were eventually published in a cardiology journal. Now, there were three units involved in this study, and in this contest, I have the data for what we'll call unit one and unit two, which is data from two ICUs, intensive care units. The third group of data was not made publicly available and was held back for the contest, and it's still not publicly available. So what we'll do now is examine the data and see if we can predict sepsis and what variables we should be following to avoid sepsis, which, of course, can be a life-threatening condition. Now I have a data set in front of me, and this was downloaded from the Kaggle site and imported into JMP. Let's take a closer look at it. Usually I like to jump right into the data analysis, but in this particular case, that's not going to be a good idea, and after we look over the data, you'll see why. First of all, let's look over at the left of the JMP data table, where we have the columns window. I prefer to do a lot of the work from here, and here we see a number of variables. So let's start with heart rate right here. That's probably going to be an important predictor, but we don't know that yet. And we have the rest of the predictors over here. There are actually 40 of them. Well, no, 38, if you don't count the units, O2 saturation, et cetera. And we could go through the list right here: temperature.
But what you'll notice, and for this I'm going to have to scroll, is that the first six or eight columns are just clinical data. We have systolic blood pressure. We have respirations, diastolic blood pressure. And as we go across our data table, we see some lab data. We're talking about glucose and lactate levels, magnesium, phosphate, potassium, bilirubin, et cetera, et cetera. And finally, we have a set of columns which I guess we could call demographic data. We have the age right here, gender, and what unit they belong to. Now, while we're on units, let's take a look here. We have unit one and unit two. And we know from doing our background research, and we all do background research right before we start the analysis, that one is a cardiology unit and the other one is a surgery unit. So if they belong to that unit, we have a one, standard practice, and if they don't, we have a zero, also standard practice. But notice something else here. We have a lot of rows where there is no unit. Now, we don't know where that data came from, not unit one and not unit two. And the background that was published on the Kaggle site tells us that the third unit is going to be held back to score the model. So it wasn't made public. So we don't know where that data came from. And I'll discuss that in further detail in a little bit. Now let's scroll back over and look at some other things about this data table. We have a whole lot of missing data. Let's take a look at some of the columns here, and we can just scroll down. Here, the bilirubin direct: that's one where we had to go down 63 rows before we found one value. Here's another one at row 113. So there's not a whole lot of data in there. As a matter of fact, that column is only 3% populated, and there are a whole lot of other columns that are populated at a similar rate. That was the worst example, but there are a whole lot at five and ten percent. And we also have the problem with the admission unit assignment. So let me close that data table and I will open another one. There we are. Now, I made some modifications to that first table, and I decided to save some time and not make you sit through and just watch me clicking on columns. Notice over here in the columns area, we have these two symbols right here. One hides columns and the other one excludes them from the analysis. And if you'll notice this one particular one, EtCO2, it was right about here in the first data set and it's missing. So we hid those columns and we're going to exclude them from the analysis. I also did two things which are just personal to me. Number one, I moved our target variable, what we want to predict, is sepsis, to the left, so that when I view the data table, I can just scan across the rows and see some relationships if there's something I want to see. And I also moved this one over because I knew, just from, for lack of a better term, general intuition, that this was going to be an important variable, and that's ICULOS, which is intensive care unit length of stay. Now let's look at some other things around here. I want to note one other thing I excluded. Where was it? Right here, you can't see it: hospital admission time. And what that is, we think, is the time between when they were admitted to the hospital and the time they were admitted to the ICU. And a lot of those numbers are negative, but there's nothing in the documentation that tells us how you can have a negative time. So I excluded those also.
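A minimal sketch of this column screening in pandas, assuming the table uses the column names of the public PhysioNet/Kaggle sepsis data set (for example Bilirubin_direct and HospAdmTime); the 10 percent threshold is an arbitrary illustration, not the cut-off used in the talk.

```python
# Minimal sketch of profiling missing data and excluding sparse columns; file and
# column names are assumptions based on the public data set.
import pandas as pd

df = pd.read_csv("sepsis_clinical_data.csv")           # hypothetical file name

# Percent populated per column -- columns like Bilirubin_direct are only a few % filled.
populated = df.notna().mean().sort_values() * 100
print(populated.head(10))

# Drop columns that are less than 10 % populated (threshold is a judgment call).
sparse_cols = populated[populated < 10].index
df = df.drop(columns=sparse_cols)

# The hospital-to-ICU admission time contains undocumented negative values, so set it aside.
if "HospAdmTime" in df.columns:
    df = df.drop(columns=["HospAdmTime"])
```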
And the other columns that I excluded, like the bilirubin there right above it, it's the same deal with those. There's a lot of missing data, so I just excluded those. And let's see, is there anything else I need to do here? No, it's time to do the analysis. But before we do that, let me tell you what the overall plan is going to be. And that is, after we examine the data and clean it and prep it, which we've already done, we're going to look at the individual units. We're going to look at is sepsis, in other words, whether or not the people developed sepsis, and we're going to do some database management. So let's get started. The first thing I'd like to do is go up here to Tables and then go to Subset, and we get the pop-up window from JMP. It says create a new data table, et cetera, et cetera. So let me click on that. And of course, I want to check this box here that says Subset By. And I'll scroll down, because we knew there were two units and there was also, in effect, a third unit, which was neither unit one nor unit two. So let me just click on unit one, and I want to pull in all the rows from that one. And I'd like to keep a By column just as a safety check, and you'll find out why in just a minute here. And we could keep the dialog open, but I've run through this analysis before, so right now, hopefully, I won't make a mistake or change my mind, and there'll be no need to keep the dialog box open. And for right now, we'll skip the output table name, and I don't want to link it to the original data table, but I'll come down here to this box and save the script to the source table, take one last look over this, and everything looks okay. And I'll click OK. Now let me separate these a little bit. Remember, we had unit one, unit two, and missing, which was neither unit one nor unit two. And I've got three data tables here. And in the title, it says unit one. Well, let's just look at that. It says here unit one. And if I scroll over, it says unit one and unit two. And this is the reason I kept the By columns right here: it's all missing data. And we scroll down a little bit just to double check, and yeah, it looks like they're missing. So it says unit one. So what we want to do is relabel that: right click, hit edit, and we can change that to missing. And I'll type it in and we'll hit okay, and we now know what that is. It says unit one equals... I mistyped it, it should be missing, but I'll let it go for right now, because we can't analyze that, because we don't know where it came from, or rather, we don't want to analyze it. So I'll just close that out and I don't want to save the changes. Now, this one says unit one equals zero. And I'll come over here and, yeah, sure enough, there's zero in unit one, which means it does not belong to unit one. And over here it says unit two and there's one in the columns. Scroll down a little bit just to make sure, and yeah, it says unit two. Okay, so what we'll want to do is go up here and we can click that and we could edit it. And now I want to change that, and we'll do that. And here where it says subset script, we can go up here and I want to change this. I can edit that and I'll change that to unit two, but it won't do it right now. And it says right here, if we want to check, we're looking at unit, and right here it says keep by column one, unit one, and that's the column it is. So I'll just cancel out of that for now. And it's the same with this data table over here.
We can go to the source code, or rather the source edit. It says keep by columns one, and we'll cancel out of that. And that's it for the units for right now. Now, what we could do is go up here and do another Tables platform. We've got Summary and Subset and whatnot. We could click on Summary, and we could drag what we want to summarize in here, and we could pick our statistics, whether we want the count, the mean, standard deviation, the median, and a whole lot of other stuff, but we'll skip that for now. Let me cancel that out, and I'll close these two tables, and I won't save them, because I have a couple of tables down here where I already did this. So let me come up here, and I'll select unit one and unit two and open them. Now, if we wanted to see the difference in outcomes, for example, between unit one and unit two, we could analyze both of these data sets separately, but in the interest of time, we will not do that. What we will do is combine them and analyze them together. Now, let's show another feature of JMP that makes the data handling part of our job easy. Let me go up to Tables, and what we want to do here is Concatenate, and the helper window shows up. It says it combines rows from several data tables. We have a number of other selections we could have made to avoid having to write some SQL code. But here we want to concatenate. And unit one showed up on top, and that's good. And what we want to do is concatenate unit two. So I'll click on that, and let's give this a name. Let's call this, how about, both. I'll just type it in here now. And we could create a source column, again as a check, just to make sure everything is proceeding like we want it to proceed. I'll create the source column, and I could keep the dialog box open, that's right here, but I will close it for now since, hopefully, I didn't make any mistakes and won't have to go back. And we'll click the run button and let's see. It didn't come up. Let me try that again. Let me close that window. We'll leave it like that and let's see what happens. There we go. Must have fat fingers this morning. And this is the combined data table, and as I mentioned, I'd like to keep that open. It's our source table. Normally what I do is drag that somewhere off the edge of the table, but for right now, we'll leave it there, and we'll just scroll down a little bit, and we note that we have unit one, unit one, there we go, unit two. So that serves as a check. So we have that there. Now it's time, finally, to start the analysis. We can do a number of things here. First of all, we note that most of our variables are continuous, except for our target variable, which is binary: on or off, yes or no, et cetera. But let's just take a look at the distributions. I always like to point this out in the analysis. This is one of my favorite features of JMP. You can do a quick inspection right here to see if there's anything weird, and let's see. Okay, scroll back over. I don't see anything that pops out at me. It looks like unit one and unit two are right in there. One thing to note right here is sepsis equal to one: there are a lot fewer rows where the patient went into sepsis. And let's see, I think it was about 7%. So we're looking at 93% here and 7% here. So let me get rid of that. That's what I like to do. Now let's go up to the Analyze menu, and we're going to make use of this pretty much exclusively from now until the end of the presentation.
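A minimal sketch of the subset-and-concatenate steps in pandas, assuming the Unit1, Unit2 and SepsisLabel column names of the public data set; the source column mirrors JMP's "create source column" check.

```python
# Minimal sketch: subset by unit, then concatenate the two known ICUs with a source column.
# Column names are assumptions based on the public data set; adapt to the table at hand.
import pandas as pd

df = pd.read_csv("sepsis_clinical_data.csv")

unit1 = df[df["Unit1"] == 1].copy()
unit2 = df[df["Unit2"] == 1].copy()
unknown = df[df["Unit1"].isna() & df["Unit2"].isna()].copy()   # no unit recorded, set aside

# Keep a source column as a sanity check before combining the two units.
unit1["source"] = "unit one"
unit2["source"] = "unit two"
both = pd.concat([unit1, unit2], ignore_index=True)

print(both["source"].value_counts())
print(both["SepsisLabel"].value_counts(normalize=True))        # roughly 93 % / 7 % imbalance
```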
Now let's go up to the Analyze menu, which we're going to use pretty much exclusively from now until the end of the presentation. I'm going to choose Multivariate Methods from the drop-down, I get another drop-down, and I'm going to choose Multivariate. This window pops up and wants to know the Y columns. By the way, this is another reason I like to put the target variable over on the left: it's right here, and we can pop it into the Y columns without having to scroll and hunt for it. So let's see, we've got everything in there. We don't want unit one or unit two. What else do we have, gender, age? I'll tell you what, let's put them all in as Y columns and hit OK. And here's what we get: a correlation matrix. Let's take a closer look at it. It's a little confusing because we put a whole lot of variables in, but again, that's one of the advantages of JMP: you can pop them all in there without having to write extra code. We have our diagonal here, and the matrix is reflected along the diagonal, so it's the same data top and bottom, with some different colors. One means a perfect correlation, which is statistically significant and is what we expect; blood pressure, for example, should be correlated very well with itself. But we also note that right under it, blood pressure is correlated with something called MAP, which is mean arterial pressure, a mean of the systolic and diastolic pressures, so that makes sense. And we have DBP, which is diastolic blood pressure, so that makes sense too. There's not a whole lot else we can see here. Here's another correlation: that's BUN, which stands for blood urea nitrogen, the urea content of the blood, a byproduct of metabolism. It looks like it's correlated with something over here, potassium, but that's about it. Here's hematocrit, which looks like it's correlated with HGB right over here. Everybody see that square? Take a second to look at it. HCT is hematocrit and HGB is hemoglobin. Hematocrit, I believe, is the volume fraction of red blood cells, and hemoglobin tracks it closely, so you would expect them to be correlated. So there's not a whole lot to see here outside of that, and we may as well close it. Next, we're going to go back up to the Analyze menu and go to Screening, and again we get another drop-down. Let's hover over this one, Predictor Screening: it screens many predictors for their ability to predict an outcome. We want to be able to predict whether or not a particular patient is going to develop sepsis, so that looks like a good choice. We click it, and this is the window we get. Again, we want IsSepsis, where one is yes and zero is no, so we put that in as Y, and we're presented with the same list of variables we had before. Let's do what we did before: start here, go down to gender, ignoring the units again, hold the shift key, click to select all of them, and hit the X button. There's nothing else for us to take note of, nothing else to click, so we'll hit OK. And there we go: JMP tells us it's running a bootstrap forest. We could do a whole presentation on bootstrap forests, in fact we could probably do two or three or four, but we don't have time. And we are getting there.
It's scoring the results, and we just have to wait; it's taking a while for some reason. And there we are. Let's look at what we have. We have the contribution, which is the unscaled contribution to the model, and the portion, which is the scaled version. You can think of the portion as a weight fraction, or, if you prefer, multiply by 100 in your head to make it a percent. If we take a quick look, we're at about 0.61, then about 0.74, and it looks like that takes us up to about 0.8, or 80% of the explanation. So all of those make sense. Now let's look at ICULOS, the intensive care unit length of stay that I talked about excluding earlier. It looks like it predicts more than all the others combined. However, if we used it, that would be circular reasoning. If people develop sepsis, they're almost certainly going to end up in the ICU, and if they are really sick, they may be at higher risk of developing sepsis and are going to end up in the ICU anyway. So if they're in the ICU, they're probably pretty sick to begin with, maybe already developing sepsis, and they're going to be in there for a while. It also doesn't really help us, because it's not something we can measure the way we measure blood pressure; the patient is either in the ICU or not. So let's exclude it. We'll go back up to Analyze, down to Screening, Predictor Screening, put IsSepsis in as the Y response, and this time leave ICULOS out. We'll do everything exactly the same as before: shift-click from gender, select everything, hit the X button, nothing else to do, and hit OK. We'll just wait a little while; it looks like it's running a bit faster this time. And here we go. Now, with ICULOS completely out of the picture, we see something else. BUN, blood urea nitrogen, looks like it's in the running as a significant predictor. After that comes temperature, which makes sense: if you develop sepsis, you have an infection, so you're probably going to run a temperature. Creatinine is a byproduct of muscle breakdown, so that makes sense too; remember, we did our research before we started the analysis. After that come respirations, and shallow, rapid breathing makes sense; then hemoglobin and hematocrit, which we know are highly correlated; then blood pressure; and WBC, which I didn't point out before, is the white blood cell count, so that makes sense as well. Now we have a decision to make. BUN is obviously the most prominent; I don't want to say most important, we're not sure of that yet, but it's the most prominent. After that come temperature and creatinine, and then there's a large drop in the rankings and in the portion column. So let's start up there with blood urea nitrogen and pull in as much as we can, because JMP is going to make all the repetitive tasks and calculations easy for us by taking them away from us. Shift-click down to systolic blood pressure; that's probably going to play a role, because if you have sepsis you tend to have dangerously low blood pressure. We have some other measurements down here, but we'll skip those.
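JMP's Predictor Screening is driven by a bootstrap forest; a rough open-source analogue is a random forest, whose feature importances sum to one and can be read like the Portion column described above. A minimal sketch, continuing from the earlier hypothetical data frames and dropping the ICU length of stay for the reasons just given:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Continuing from the earlier sketches: 'both' is the combined table.
predictors = both.drop(columns=["IsSepsis", "Unit1", "Unit2", "SourceTable"],
                       errors="ignore").select_dtypes("number")
X = predictors.drop(columns=["ICULOS"], errors="ignore")
X = X.fillna(X.median())                 # the forest needs complete rows
y = both["IsSepsis"]

forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Importances behave like the Portion column: fractions summing to one.
importance = pd.Series(forest.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(10))
```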
By the way, two of those measurements down there are going to be correlated: one is the partial pressure of carbon dioxide in the blood and the other is the carbonate content of the blood, so those are related. It doesn't look like there's anything else of importance, and JMP puts in a handy link here that says Copy Selected, so let's do that. I copy the selected columns and leave that open for right now, and we go up to Analyze once again. What we want to do now is Fit Model; it says it fits a linear regression, so let's go there. Since we copied the selection in the previous window, JMP remembered it for us, and we click the Add button to put those columns into Construct Model Effects, meaning we want to use them as modeling variables. What else do we have here? Notice in the upper right-hand corner there's something called Personality; keep your eye on that corner for the next 30 seconds or so. I'll go over here, grab IsSepsis, and put it in the Y role. Looking at the Personality in the upper right-hand corner, I get some choices in the drop-down menu, along with an Emphasis drop-down, and we won't go over all of them right now, but what I probably want is Generalized Linear Model. If I hover over it, a pop-up window says it fits a generalized linear model, and I get to select the distribution and the link function and so on, which I'll go over in a couple of seconds. So let's start with the distribution. Remember, we expanded that top group and looked at the distributions, and there didn't seem to be anything weird, at least on a macro scale, so let's just pick Normal. And I want Logit for the link function, because we've got a binary variable that we're trying to predict. Let's see, it doesn't look like there's anything else left for me to do, so I'll hit the Run button, and here we go. Here's our generalized linear model fit. It gives us a summary up here of what we looked at, including a Chi-square, which I won't go over in any great detail, so let's scroll down a little more. Here we have an Effect Summary, and by the way, if we click on these triangles we can hide sections or make them appear again, depending on what we want to present; I'll leave them all open for right now. What we have up here is the source column, and there it is: blood urea nitrogen, BUN, is very high again, followed by temperature, creatinine, white blood cell count, heart rate, and some blood pressure measurements. All of this makes sense from what we know about sepsis. LogWorth is the contribution, and over here we have the p-values, and we see they're all very highly significant down to right about here, the white blood cell count. We also see a blue line here, and that is a LogWorth of 2. The reason we use the LogWorth is so we can display things on the graph on a log scale, which just makes it easier; otherwise the bar up here would be off the edge of my screen. That blue line marks a significance level: the 0.01 significance level corresponds to a LogWorth of 2, because the negative log of 0.01 is 2. So these are all significant down to HR.
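For comparison, here is a hedged sketch of a generalized linear model with a logit link fit outside JMP using statsmodels. It pairs the logit link with a binomial family, the conventional choice for a 0/1 target, rather than reproducing the exact distribution setting chosen in the Fit Model dialog above, and the shortlist of predictor names is hypothetical. The last line computes LogWorth values, so the blue-line cutoff of 2 discussed above can be checked directly.

```python
import numpy as np
import statsmodels.api as sm

# Continuing from the earlier sketches: X holds the screened predictors, y is IsSepsis.
top = ["BUN", "Temp", "Creatinine", "WBC", "HR", "SBP"]   # hypothetical shortlist
X_glm = sm.add_constant(X[top].astype(float))

# Binomial family with its default logit link -- the usual pairing for a 0/1 target.
result = sm.GLM(y, X_glm, family=sm.families.Binomial()).fit()
print(result.summary())

# LogWorth = -log10(p-value); the blue line at 2 corresponds to p = 0.01.
print((-np.log10(result.pvalues)).sort_values(ascending=False))
```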
And if we come down here, we have the Chi-square results, and we see some significance levels here too. Basically, we're looking at the respirations, the BUN, the creatinine, and the temperature; they're all highly significant, and I almost forgot the white blood cell count down here. I'm running out of time, so I won't explain a whole lot more about that, but if we scroll down a little further we get the parameter estimates for our predictor variables, along with more statistical detail on each. I'm starting to run out of time, so let me just minimize those windows, get rid of all the highlights, and recap what we did. Here is our original, highly cleaned and rearranged data table. We want to predict sepsis, which is binary. We ruled out the length of stay in the ICU, which is right here, because it didn't help us and it was circular logic. We've got our variables in three separate groups up here: we start off with the clinical measurements, then over here we have all the blood tests, and then the demographic data. We had two units, and we excluded the 25 or 30% of the data that wasn't assigned to either unit, because we don't know where it came from. Then we subsetted everything; remember, we actually got three separate subsets, because the third subset was the missing-data "unit," unit being in quotes. From there we went to Multivariate to look for correlations, then to Analyze, Screening, Predictor Screening, where we got what we figured would be our most valuable predictors of sepsis. And finally we went to Fit Model. Let me reiterate that: we went up to Analyze, Fit Model, clicked that, got this window, and put everything in there except what we wanted to exclude. We put IsSepsis right here as the Y variable, and remember, this is the area we had to focus on, up in the upper right-hand corner, where we had the Personality and a couple of other selections to make. We made those selections, and we got our results, which I went over a minute ago. That is the end of the presentation. I hope everybody enjoyed it and maybe even learned a little something from it. Thank you for watching, listening, and giving it your attention.
While basketball is a team sport, NBA analysts, coaches, and fantasy basketball enthusiasts are often interested in the performance of individual players. Optimizing or predicting individual performances can have an impressive impact on the outcome of a game. Individual player metrics can, for example, allow coaches to target a specific defender or avoid distributing the ball to certain offensive players based on their matchup. In this presentation, I demonstrate how we can quantify these matchups by using a Positional Matchup Model in JMP to predict offensive player performance, as summarized by an individual’s offensive rating relative to his primary defender’s statistics. In the model I have created, I focus on key indicators such as average blocks, steals, height, and defensive rating, using JMP to clean data accessed from basketball-reference.com. I then select the optimal predictive model using the Model Screening feature. The final model displays an offensive player’s predicted offensive rating based on his defensive matchup during a future game. These results can be compared to a player’s own average to determine how much better or worse he is likely to play, based on an individual matchup.       Hello, everyone. My name is Kevin O'Donnell. I'm a JMP Global Customer Reference Intern, and today I'll be presenting a personal analytics project of mine: an NBA player matchup model. To get started, I'd like to go over my idea and the motivation for building the model. I've always been incredibly interested in basketball; I've been a huge fan my whole life, and I was looking for a new analytics project to dive into, so I ended up asking this question: could we model a player's offensive success based on his own averages and the strength of his opponents? As most of us know from watching any sport, player performance varies greatly from game to game based on a variety of factors, one of these being a player's average points. That will be a very strong predictor of points in an NBA game, but it's far from the only influence on a player's offensive performance. So going into this project, I wanted to build a model that predicted a player's points in a game based on his points per game, but also on a lot of other variables, which we'll look into on the next slide. Ideally, this would be helpful for coaches and team analysts to determine, for example, which of their offensive players are likely to overperform based on their matchups, and vice versa, which of their players they might want to avoid going to on offense because they're being guarded by some of the best defensive players on the other team. This could inform play calling and game planning: if a particular defender is weaker, coaches may look to target that matchup or look to force a switch onto their best scorer. In the entertainment realm, fantasy basketball players could use this data to figure out who to start or who to pick up off the waiver wire, so it has a broad range of applications. Here's a quick overview of the data. I'll go through this quickly, and we'll see it in more detail as I get into the JMP demo, but we're going to predict points based on seven main categories of data.
So we're going to use player offensive averages for the season; this includes things like points per game, 3-point and 2-point percentages, and advanced metrics such as usage rate, which records how often an offensive player is used by his team. Player attributes such as height, wingspan, and vertical leap will also be included for both the offensive and defensive players, because these physical attributes, I would assume, also contribute to how many points a player will score. In terms of the offensive players, we'll also use career averages, the same averages we're using for the season, to add a little more robustness to the model. Overall team pace and offensive rebounding percentage will also be predictors, because they determine how many possessions there will be: if there are more possessions in the game, it's more likely that any given player will have more points. By the same token, defensive team pace and defensive rebounding percentage will also be used. In terms of individual defensive stats, things like steals, blocks, fouls, and advanced metrics such as defensive win shares, which measures an individual player's contribution to the team's wins on defense, are also going to be predictors, probably negating some of the points scored by an offensive player. And finally, the defender attributes will be used, as I mentioned earlier. Before we get right into the data and the modeling in JMP, I'd like to go over the matchup data that I used for the bulk of the model. The NBA recently implemented new personal matchup data collection based on detailed player tracking. It tracks the closest defender at every point, not just the primary defender on a play; it only tracks front-court time; and it tracks partial possessions. What this means is that a player could be guarded by as many as five different players on a single play, and each defender would be awarded the respective amount of matchup minutes for that possession. Here we have an example of Terry Rozier of the Hornets being guarded by four different players on different teams. In the first row, Steph Curry guarded him in only one game, for 2.7 minutes, and allowed two points. If we think of a hypothetical where the Hornets are playing Steph Curry's Golden State Warriors, and Curry guarded Rozier for 10 seconds and then Klay Thompson switched onto Rozier for another 10 seconds, both would be awarded those 10 seconds of matchup time. However, if Rozier scored a two-pointer at the end of that possession, the points would be marked against Klay Thompson. This is really cool tracking data and I love how specific it is, but it did cause some problems when I tried to model the points per minute for each player and defender combination, which was my original plan. I was originally going to use the offensive player stats and defensive player stats from the previous slide for every individual matchup. But since many defenders, as you'll see here, logged very small amounts of time (Stanley Johnson, for example, guarded Rozier for an average of under one minute a game), there's not enough time for some of these points-per-minute measurements to behave normally. Going back to that Klay Thompson hypothetical, if Klay only guarded Rozier for 10 seconds the whole game, and Rozier happened to score two points in that possession, then Rozier's points per minute against Klay in that row would be 12.
So obviously, extrapolating that to a full game, even excluding the back-court time as this data does, is very unrealistic. So I had to go with a slightly different, less ideal approach. Instead of using the individual defensive stats, I averaged them out for each combination of player and defensive team. So instead of using Rozier versus Curry and Rozier versus Thompson, I would use Rozier versus the entire Golden State Warriors average, weighted by the amount of matchup minutes each player defended Rozier. For example, if Steph Curry and Klay Thompson each guarded him for half of the possible matchup minutes, then the team average, let's say steals per game, would just be the arithmetic mean of Curry's steals and Klay's steals per game. The same goes for every defensive variable, and that's obviously a simple example, but the same applies for every player versus every team. This is not ideal, because it minimizes the individuality of the matchup, but I had to abandon the points-per-minute approach because the samples were too small and the response was heavily distorted. The model I've created with the aggregated data is more accurate than my initial attempts, even though it sacrifices some individuality. At this point, I'm going to switch into JMP, do a bit of a demo of how I built the model, and start with some exploratory data analysis. To begin, we're just going to look at some marginal relationships with points in Graph Builder. I could choose any variables I want, but for now we're going to look at three. The first is points per game. We see here a moderate, positive relationship between points and points per game, as we would expect: the average points someone puts up over the course of the season is obviously going to influence how many points they score in a game. Similarly, usage rate, the advanced statistic I was discussing earlier, also has a positive relationship, just slightly weaker than the relationship between points and points per game. Finally, if we look at career points per game, we again see a positive relationship, but a little weaker still, because it's averaged over the course of a player's career rather than the current season; hopefully, though, it can help adjust for some major differences in the points-per-game averages. Now that we have an idea of the data itself, we can move into the simple linear regression of points by points per game as a benchmark for this model. As I mentioned, you can predict based on just points per game and it will give you a decent prediction, but we're looking to improve on that by adding these other offensive and defensive variables. If we run this script here, it compares some runs of a simple linear regression on the training and validation sets, using k-fold validation throughout. If I run it using the hidden Validation 2 column that I'll be using for the remainder of these models, we can see the regression. Here's the regression plot: it looks pretty scattered, there's not a clear linear relationship, but the R-square is moderate, meaning that 46.5 percent of the variation in points is accounted for by this average points per game, which is pretty good. And the root mean square error is about 5.27.
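As a rough stand-in for this benchmark fit, here is a minimal sketch of a simple linear regression of game points on season points per game, with a 10-fold cross-validated R-square and root mean square error. The table name and column names are hypothetical placeholders for the matchup data, so the numbers it prints are not the ones quoted above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

game_df = pd.read_csv("matchups_2020.csv")   # hypothetical game-level matchup table
X = game_df[["pts_per_game"]]                # season average points per game
y = game_df["points"]                        # points actually scored in the game

cv = cross_validate(LinearRegression(), X, y, cv=10,
                    scoring=("r2", "neg_root_mean_squared_error"))

print("R-square:", cv["test_r2"].mean())
print("RMSE:", -cv["test_neg_root_mean_squared_error"].mean())
```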
The root mean square error is the standard deviation of the residuals, so it's essentially a measure of how well our model is predicting, and that standard deviation is around five. We can also look at other measures, such as the AIC, which is used to compare models for predictability. The AIC is another important measure: it measures how well the model will predict relative to the number of predictors in the model, just to make sure you're not adding too many. In this case the AIC is very high, and we can come back to this number when we compare it to the multiple linear regression that I'll show later. While I was trying to pick the model I was going to use, I decided to try the Model Screening feature in JMP, which lets you select your response variable and all the factor variables. I ended up putting in all of the numeric variables I had, as a full model, just to see which methods perform best initially. I was able to choose from a variety of methods, including XGBoost and Generalized Regression, and of course ordinary least squares. In the interest of time, since this would take forever to run because it's using some of these machine learning algorithms, I'm instead just going to pull up a quick screenshot of the model screening. Here we have the output. It shows the R-square (again, I was using k-fold validation with ten folds here) and the root mean square error for each run. We can see that the least squares fit actually had a very strong fit compared to some of the machine learning algorithms, which surprised me, but it actually helps with interpretability: some of these machine learning algorithms are more of a black box, and if I were to use them, it wouldn't be as easy to interpret the coefficients or see which variables are truly significant versus which are just being used for prediction. Because the fits of least squares and the Lasso regression, which I'll get into a little later, were so strong, it's actually a good sign that I can use them for better interpretability. As we can see, the most accurate model is this multiple linear regression with Lasso regularization. Lasso regularization is a statistical technique that regularizes the model and selects features to minimize multicollinearity. Multicollinearity is correlation between predictors, and it can negatively affect the model. Using this technique, we're able to take the full linear regression model shown here and remove some of the variables to reduce the multicollinearity and maybe better satisfy some of the linear model assumptions. With that information, I took the variables selected by the Lasso regularization and created a multiple linear regression using those variables. As you can see here, the Actual by Predicted Plot looks pretty similar; it might be slightly closer to the line of best fit, which is a good sign. We can see our effect summaries, but we're going to scroll a little past that to get to the summary of fit. Compared to the simple linear regression, the R-square and the adjusted R-square are very similar, and that's probably because points per game is such a heavily influential factor in this model, as it is in the simple linear regression. So the fit is not too much different.
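Before comparing the refit to the benchmark, here is a rough analogue of the Lasso selection step itself using scikit-learn's cross-validated Lasso, which shrinks coefficients and zeroes some out, leaving a reduced list to refit by ordinary least squares. This is a conceptual sketch, not JMP's Generalized Regression implementation; it continues the hypothetical game_df from the previous sketch.

```python
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Continuing from the previous sketch: game_df is the hypothetical game-level table.
X_full = game_df.drop(columns=["points"]).select_dtypes("number")
y = game_df["points"]

# Standardize, then pick the penalty by 10-fold cross-validation.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=1)).fit(X_full, y)

coefs = pd.Series(lasso.named_steps["lassocv"].coef_, index=X_full.columns)
selected = coefs[coefs != 0].index.tolist()   # variables to carry into an OLS refit
print(selected)
```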
However, adding these other variables shown here does improve the root mean square error a little: it went from about 5.3 to a little over 5.2, which is not a drastic improvement, but an improvement nonetheless. The real difference is in the AIC. Comparing these models in terms of predictability, the AIC dropped significantly from the simple linear regression to this multiple linear regression with Lasso regularization, which is definitely a good sign. We see the parameter estimates down here, and of course the cross-validation results for R-square and root mean square error. Looking at some of these parameter estimates: points per game is again very significant, at the .05 alpha level and well below, and that's to be expected, as we've already discussed. Turnover percentage is also significant, and it has a negative relationship with points. As one would expect, if you're turning the ball over more, that's fewer opportunities to shoot the ball and fewer opportunities to score, so that checks out with our knowledge of basketball. Here we have the defensive pace, the pace of the defensive team, which has a slightly positive relationship with points conditional on the other factors in the model. Again, that's to be expected, because the faster a team plays, the more chances there generally are, though it might not have as strong an effect as turnovers or points per game. And finally, defensive rebounding percentage: this is just how often the defensive team pulls in the defensive rebound. It prevents second-chance points for the offense, so it should have a negative relationship, which it does. All these things check out, and then some of these other variables are conditionally insignificant but included because they improve the predictability of the model. Something like usage rate might not be significant at the .05 level, but it nonetheless improves our predictions. Now, points per game might not be as reliable at the beginning of the season as it is near the end, because of the smaller sample size, so I looked to create an alternate model without it, to see if I could still predict better than the simple linear regression and similarly to the Lasso regression, but without depending on a season points-per-game measure. I created this alternate model using backward selection by AIC. Again, I left out the season points per game, but it still includes the other per-game measurements, like field goals attempted from three and from two, which along with some of the other variables can serve as a proxy for average points per game. So it's not completely robust against early-season fluctuations and small sample sizes, but it's possible these variables are a little more representative of how a player is going to perform in the long run. I'm thinking players might be a little more consistent in their attempt stats, the rate at which they're shooting the ball, than in the number of points they score, which could vary with a small sample size. These could vary as well, but I'm just using this model as an alternative, and it turns out that it actually predicts pretty similarly to the model with points per game in it. So it might not be favored if points per game is available and appropriate, but it provides a similar prediction.
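Backward selection by AIC can be sketched as a simple greedy loop: at each step, drop the predictor whose removal lowers the AIC most, and stop when no removal helps. This is a conceptual sketch with statsmodels rather than the JMP procedure, and it assumes the same hypothetical game_df, with the season points-per-game column (hypothetical name pts_per_game) left out as in the talk.

```python
import statsmodels.api as sm

def backward_aic(X, y):
    # Greedy backward elimination: drop the column whose removal most lowers the AIC,
    # and stop when no single removal improves on the current best AIC.
    cols = list(X.columns)
    best_aic = sm.OLS(y, sm.add_constant(X[cols])).fit().aic
    while len(cols) > 1:
        trial_aics = {c: sm.OLS(y, sm.add_constant(X[[k for k in cols if k != c]])).fit().aic
                      for c in cols}
        drop, aic = min(trial_aics.items(), key=lambda kv: kv[1])
        if aic >= best_aic:
            break
        best_aic, cols = aic, [k for k in cols if k != drop]
    return cols, best_aic

# Leave out the season points-per-game column, as in the alternate model above.
kept, aic = backward_aic(X_full.drop(columns=["pts_per_game"]), y)
print(round(aic, 1), kept)
```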
We can see that the R-square is pretty similar, and may even have increased a little, and the root mean square error is again similar. The AIC suggests that the other model is better for prediction given the number of variables included, so that is definitely something to note, but each model has its advantages. With this one in particular, we can see some of the conditional relationships of the other variables, particularly those on the defensive side, that we couldn't see as much in the other model because points per game was dominating so much. We see here that two-pointers and three-pointers attempted weigh in significantly for points, which makes sense: the more shots you're taking, the more likely you are to score more points. Then there are some other variables that make sense, and some that are maybe a little confusing at first sight. Offensive win shares being a significant variable makes sense; it measures how much a player is contributing to wins on the offensive side, so it makes sense that it has a positive relationship, as we can see down here. Defensive rebounding percentage down here decreases the estimate again, which is consistent with what we saw in the previous model. Average personal fouls per game down here has a negative effect, which I thought was interesting. A player or team with a higher foul total negatively affects the points scored by the offensive player. This might mean they're more aggressive: I would assume they're playing more intense defense, limiting points through steals, blocks, or heavily contested shots, and as a result they're getting more fouls called on them. However, this is not all good for the defense, because a player fouls out at six fouls, so there's a certain balance to strike on the defensive end: you don't want to be so aggressive that you give up easier points or lead your players into foul trouble. So that's an interesting variable to consider, and of course this is a conditional significance, so it might change slightly with the removal or addition of other variables. Finally, a couple of the defensive relationships are particularly confusing at first glance, specifically those involving blocks and defender height. We think of basketball as being very dependent on height: if you're taller, you're more likely to make the NBA; I think a seven-footer has something like a 20 percent chance of making the NBA just by being seven feet. So you would think that if you block more shots and you're taller, you're going to affect the offensive player's points negatively. However, these relationships are actually conditionally positive, which is very interesting: average blocks per game, defender height, and block rate all have positive relationships. Initially I was confused by this, but I think it says more about the players these defenders are guarding than about the variables themselves. What I mean is, when you consider that taller players with better blocking stats are big men playing power forward or center, they're guarding other big men, and then it makes a little more sense.
Guards tend to put up more points in the NBA with the emphasis on three-point shooting now, and a lot of offenses are run through these smaller players, who tend to be guarded by other smaller players, whereas the taller players with more blocks are guarding big men who often aren't the focal point of the offense, aside from certain players like Jokić and Embiid and Giannis. That produces a positive relationship, but it's really more a function of the position these players are playing. So I thought that was an interesting conditional relationship to highlight within the model; it involves a little deeper thinking about the relationship between blocking, height, and the points an offensive player puts up. In terms of the models overall, we can see that this one and the Lasso-regularized model are very similar in their predictions. We'll see that in more detail soon when I flip to the 2021 predictions, so the choice between them might not be too significant; both have their advantages. This one allows us to see the significance of more of the variables, specifically the defensive variables, whereas the first one has a slightly lower AIC and might be better for prediction if the data is available. Both leave a little to be desired in terms of predicting much more reliably than the simple linear regression. I would have liked to see the root mean square error decrease more, and that's something I would look into as I continue this project: gathering better data, trying to make the matchups more individualized without sacrificing the normality of the response variable, things like that. But with that said, these are the models I have now, and we can test them on the 2021 season so far. If I switch over here to the matchups for 2021, we have the same data table, just with this year's matchups. I'm on Josh Hart's matchups right now, because I go to Villanova and he's a Villanova great. Here we have him as a player, his team, the defensive teams, his stats, and their stats; as we've already seen, these are all the variables that could be included in the model. Apologies for the quick panning, but here we have the predictions at the end. These prediction columns are the model predictions. The first one is the optimal multiple linear regression, the one using the Lasso regression and including the points-per-game variable. As you can see, in this game he's predicted to put up 12.8 points, and the residual here is about five. The residual is the actual points minus the prediction, so with a prediction around 13 and a residual around five, he actually put up about 18 points instead of our predicted 13. That's obviously not a great prediction. Then we can look at the alternative model: its prediction is very similar, and we'll see this in greater detail as I flip back to the PowerPoint and show you a condensed version of this data table, because I know that in JMP it's a little overwhelming right now, with so many variables and so many numbers being thrown at you. So I'm going to switch back to the PowerPoint to show you some of the predictions for both Josh Hart and Kevin Durant. All right, now that we have our two models and the simple linear regression to compare them to, we can apply these predictions to games Kevin Durant has played this season.
Here are four games played against Atlanta, Charlotte, Chicago, and Cleveland. In the first one, he scored 31 points; both of our models predicted close to 26 points, so the residual is around five. However, we see that when Kevin Durant is close to his average of around 28 or 29 points per game, the predictions are very close, because points per game weighs so heavily in these models. For example, in this Cleveland game, our first model predicted 26.2 points and he actually scored 27, so the residual is less than one, and similarly the residual is less than one for the alternative model as well. So the models excel when players perform close to their average, which they will do most of the time. But there's obviously variation: in this game against Charlotte he had a particularly good game, scoring 38 points, a great game by any standard but particularly good by Kevin Durant standards, and so the model predictions are much farther off. We see the same thing with Josh Hart. His points average is a bit lower, because he carries a lighter load; Kevin Durant is a star player on the Nets, so he's getting a lot of touches and taking a good portion of the shots. Here we have Hart's production: his points average is closer to the 12-to-14 range, so the residuals here are very small. When he puts up 14, our models predicted around 13.5, and the residuals are around 0.5. So we have really good predictions when he scores close to an average amount of points, but when he scores much less or much more, the predictions tend to suffer a little. With all that said, I know I ran through these models quickly, so let's take a step back and look at the limitations in detail and some further study I could pursue. First of all, these models use averages by offensive player and defensive team, as opposed to offensive player and individual defender. As I've mentioned, the latter would be the ideal scenario, but it didn't work out that way. That could cause some problems, because it's duplicating some results: instead of a one-to-one prediction (if a player is guarded by this defender, this is how many points he will score), it's, if a player is guarded by some combination of players on this team, this is how many points he will score. So it's easier to predict, but maybe a little less precise. Additionally, the averages are for the entire season, which means predictions toward the beginning of the season may be less accurate, as I mentioned, which is something the career values try to remedy a little. It could be worth adding variables like points per game with a lag of up to five seasons, that is, the average points per game from last season, two seasons ago, three seasons ago, and so on, to try to capture a trend in player performance. Additionally, player performance depends on countless other factors, such as cold or hot streaks, how well a player has been performing lately, injuries on his own team, injuries on the opposing team that could increase or decrease his role, and minor injuries he is dealing with himself that could decrease or increase his role.
Then there are things like rest days, travel time, and a lot of other intangibles: NBA players are human, and we all know some days we're not feeling our best and other days we're feeling great and more energetic. Those types of things can lead to better or worse performance. These intangibles aren't something I can factor into the model, but it's important to recognize that they can still affect the points in a given game. In terms of further study, I would love to address these limitations and look to predict a more holistic variable such as offensive rating. Offensive rating measures a player's points contributed per 100 possessions, rather than just one aspect of the game in points. I would love to predict something like that, or flip it to the defensive side and predict a defensive rating based on who a player is likely to go up against on offense. Something like that would be really cool, and it would extend the application more toward coaches and team analysts instead of fantasy basketball players who are looking for a single measurement like points. With all that being said, I'm definitely going to continue working on this project. It was a lot of fun, and I love looking at these models and interpreting what's going on from a basketball standpoint. If you have any questions, please feel free to put them in the comments on my community post; I'll be happy to answer them. And if you have any suggestions for further study, I would be happy to take those on as well. Thank you so much for your time.
When you collect data from measurements over time or other dimensions, you might want to focus on the shape of the data. Examples can be dissolution profiles of drug tablets or distributions of measurements from sensors. Functional data analysis and regression-based models are alternative options for analyzing such data. Regression models can be nonlinear or multivariate or both. This presentation compares various approaches, emphasizing pros and cons and also offering the option to combine them. The underlying framework supporting this work is information quality, which permits us to consider the level of information quality provided by the two approaches and the possible advantages in combining them. The presentation combines case studies and a JMP demo.     Hi, I'm Ron Kenett. This is a joint talk with Chris Gotwalt on functional data analysis and nonlinear regression models. In order to examine the options and what we get out of each type of analysis, we will take an information quality perspective. In a sense, this is a follow-up to a talk we gave last year at the same Discovery Summit. I will start with simple examples to introduce FDA and nonlinear regression, and then Chris will cover a substantially more complex example of optimization, which includes a mixture experiment designed to match a reference profile. The story starts with data on tablets that are dissolved, with measurements taken at different time points: five, ten, 15, and 20 minutes, then ten minutes later at 30 minutes, and then 15 minutes later at 45 minutes. We have 12 tablets that are our product and 12 tablets that are the reference, and our goal is to have a product that matches the reference. With this type of data we have a profile, and we consider two options, FDA and NLR. In Chris's example, we will also talk about something called F2, which is a third option for analyzing this type of data. Here's what it looks like in Graph Builder: on the left we have the reference profiles, on the right we have the test tablets. This is an example from my book on modern industrial statistics with Shelley Zacks, which is now in its third edition. On the left you can see there is a tablet that seems a bit different; it's labeled T5R. And if we run a functional data analysis of this data, T5R does look different. We see that the growth part is different: it has a slow but consistent growth, and it does not have the shape we see in the other dissolution curves. This was done with a quadratic B-spline with one knot; the quadratic was, in this case, fitting the data better than the cubic, which is a bit unusual, but because of the shapes the quadratic B-spline was the better fit. If we look at T1R, the first tablet, it has yet another shape: it shoots up and then stays flat, so basically the tablet has dissolved, and beyond a high level of dissolution there is not much left to dissolve. So T1R and T5R seem different, with T5R standing out more than T1R. And yes, T5R does stand out in the cluster analysis on the functional principal components. The scatter plot of the first two functional principal components points to what we observed visually, and T1R, which is next to T2R, sits in a different cluster.
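A rough analogue of this functional analysis outside JMP is to smooth each tablet's dissolution profile with a quadratic spline and run an ordinary PCA on the smoothed curves; outlying tablets such as T5R then show up as extreme scores. This is only a conceptual sketch, not the Functional Data Explorer's method, and the data file is a hypothetical 12-by-6 matrix of dissolution percentages at the six time points.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.decomposition import PCA

times = np.array([5, 10, 15, 20, 30, 45])                 # measurement times in minutes
profiles = np.loadtxt("dissolution_reference.csv",        # hypothetical 12 x 6 matrix,
                      delimiter=",")                      # one row per reference tablet
grid = np.linspace(times.min(), times.max(), 50)

# Smooth each tablet's curve with a quadratic spline, then evaluate on a common grid.
smooth = np.vstack([UnivariateSpline(times, row, k=2, s=1.0)(grid) for row in profiles])

# PCA on the smoothed curves plays the role of the functional principal components.
scores = PCA(n_components=2).fit_transform(smooth)
print(scores)    # tablets like T5R appear far from the main cluster of scores
```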
We can also proceed with a nonlinear regression approach. Here we are fitting a three-parameter Gompertz model, with the asymptote, the growth rate, and the inflection point as parameters. This is the model, and when we fit the profiles, we again see that T5R stands out, so we have the same qualitative impression that we had with FDA. Now we have these three parameters listed, and because we now have a model, we can run a profiler on it. This is where T1R stands, and by running the profiler on the different tablets, we can also see how similar or different they look. This is the table that maps out the parameter estimates: T1R has a growth rate of 0.21; T5R, the tablet that stood out, has a growth rate of 0.075, very slow but consistent, and an inflection point of 11.5, way to the right. So we can see the difference through these parameter values. We can also pick out two tablets that stand out for growth rate: T2R at 1.77, and T8R with almost no growth. We'll get back to T2 and T8 in a minute. We can take the principal components of this three-parameter space, treating the parameters as if they were the measurements, and run a multivariate control chart. We can see T1, the first one, and T5, the blue one, the fifth one, which we already saw; they are within the control limits of the T-square multivariate statistical distance control chart. And T2 and T8, which I highlighted before, now stand out, and we can see qualitatively why. This is the model-dependent approach in the guidance documents that is used for modeling dissolution curves. In running such an analysis from an information quality perspective, the first question to ask is: what is the goal of the analysis? Then we can consider the method of analysis; here we are using nonlinear regression and functional data analysis, and Chris will get into how this is combined with data derived from an experimental design. We have a utility function, and the information quality is the utility of applying a method f to data X, conditioned on the goal. It is evaluated along eight dimensions, and Chris will talk about two of them, data resolution and data structure. So Chris, the floor is yours. Thanks, Ron. Now I'm going to give an example that is a bit more complicated than the first one. In Ron's example, he was comparing the dissolution curves of test tablets to those from a set of reference tablets. In that situation, the expectation is that the curves should generally follow the same path, and he showed how to find anomalous curves that deviate from the rest of the population. In this second example, we also have a reference dissolution curve, but we are analyzing data from a designed experiment where the goal is to find a formulation, in two polymer additives and the amount of compression force used in the tablet production process, that leads to a close match to the reference dissolution curve. The graph you see here shows the data from the reference curve that we want to match. To do this, I'm going to demonstrate three analyses of the data that use different methods and models to find factor settings that will best match the reference curve. In the first analysis, I'm going to summarize each of the DoE curves down to a single metric called F2, a measure of agreement with the reference that is typically used in dissolution curve analysis.
There, I'll use standard DoE methods to model that F2 response and then find the factor settings that are predicted to best agree with the reference. In the second analysis, I'll use a functional DoE modeling approach, where I model the curves using B-splines, extract functional principal component scores, and model those scores. I'll load the reference batch as a target function in the Functional Data Explorer platform and then use the FDoE profiler to find the closest match recommended by that model. These first two approaches use little subject-matter information about these types of tablets. In the third analysis, I'll model the curves using a nonlinear model that is known to fit this type of tablet well and use the Curve DoE option in the Fit Curve platform to model the relationship between the DoE factors and the shape of the curve. I want to credit Clay Barker for adding this capability to JMP Pro 16. I think it has a lot of promise for modeling curves whose general shape can be assumed in advance to come from one of the supported nonlinear models. At the end, verification batches were made using the recommended formulation settings for each of the three analyses, and we compared them to a new reference batch. What we found was that the nonlinear regression-based approach led to the closest match to the reference. What we see here is a scatterplot matrix of the four factors in the designed experiment. There was a mixture constraint between the two polymers, as well as a constraint on the total amount of polymer and the proportions of the individual polymers. Here's a look at some of the raw data from the experiment. At the top of the table, we have data from the reference that we wish to match. There are 16 DoE formulations, or batches, in the experiment; we can only see data from two of them in this picture. There were six tablets per formulation and four dissolution measurements per tablet. Here we see plots of the dissolution curves for each of the 16 DoE formulations, with the dissolution curve of the reference batch here at the lower right. Now I'm going to do a quick preliminary information quality assessment using the questions that you'll find in the spreadsheet you can download from the JMP user community page. The first part of the assessment relates to data resolution, and in this case I think we're looking pretty good. The data scale is well aligned with the stated goal because it's a designed experiment, the measuring devices seem to be reliable and precise, and the data analysis is definitely going to be suitable for the data aggregation level; we'll be illustrating different kinds of data aggregation as we extract features from these dissolution curves. As far as data structure goes, we're in pretty good shape: the data is certainly aligned with the stated goal, we don't have any problems with outliers or missing values, and the analysis methods are all suitable for the data structure, although we do see some variation in the quality of the results depending on the type of analysis. As far as data integration goes, this is a pretty simple analysis: we have multiple responses, and we're exploring different ways of combining them into extracted features. There's a common workflow to all three of the analyses I'm going to show. First, we have to get the data into a form that is analyzable by the platform we're using. Then there's a round of feature extraction. Then we model those features.
That's where there's a lot of difference between the methods, and then we use the profiler in different ways to find a formulation that closely matches the reference. First, I'm going to go over the F2 analysis. F2 is a standard measure of agreement of a dissolution curve relative to a reference dissolution curve. In the formula, the Rs are the means of the reference curve at each time point, and the Ts are the means of the non-reference curve. The convention is to say that the two curves are equivalent when F2 is greater than or equal to 50. It's important to point out that I'm including this F2-based analysis not just as an example of a dissolution DoE analysis, but more broadly as an example of how reducing a response that is inherently a curve down to a single number leads to a much lower quality analysis and results than a procedure that treats curves as first-class citizens. So now I'm going to share the F2 analysis of the dissolution DoE data. The first thing we have to do is calculate the batch means of the dissolution curves at the different time points. Then we create a formula column that calculates the F2 agreement statistic for each of these curves relative to the reference batch, and we model F2 using the DoE factors as inputs and use the profiler to find the factor settings that match the reference. Before the analysis, we use the Tables Summary feature to calculate the means of the dissolution measurements by batch and across each of the time points. We can save ourselves a little work by using all of the DoE factors as grouping variables here, so they'll be carried through into the subsequent table. Now we have a 17-row data set, and we hide and exclude the reference batch. Take note of the values of the dissolution means for the reference, because we're going to use those when we create a formula column that calculates the F2 agreement metric for each of the batches relative to the reference batch. Now we can use this F2 formula column as a response to be modeled. We use the model script created by the DoE platform to set up our model for us; we place F2 as our response variable, and we're going to analyze this data using the Generalized Regression platform in JMP Pro. When we get into the platform, we see that it has automatically done a standard least squares analysis, because it found that there were enough degrees of freedom in the data to do so, and it has given us an AICc of 155.6. I'm going to see if we can do better by trying a best subsets reduction of the model, and when we do that, we see that the AICc of the best subsets fit goes down to 136. Smaller is better with the AICc, and a difference of 20 is pretty substantial, so I would conclude that the normal best subsets fit is a better model than the standard least squares one. I'm going to try one more thing, though, and fit a LogNormal distribution with best subsets to the data. When I do that, the AICc goes down a little further, to 130.6. That's a modest difference, but it's good enough that I'm going to conclude we'll work with the LogNormal, especially because we know we're working with a strictly positive response, and the LogNormal distribution fits data that is strictly positive. From there the analysis is pretty straightforward, so I'm going to jump straight ahead to using the profiler. F2 is an agreement metric that we want to maximize.
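For completeness, the F2 similarity factor described above is commonly written as follows, with R_t and T_t the mean reference and test dissolution values at time point t and n the number of time points:

$$ F_2 = 50 \cdot \log_{10}\!\left\{ \left[ 1 + \frac{1}{n} \sum_{t=1}^{n} \left( R_t - T_t \right)^2 \right]^{-1/2} \times 100 \right\} $$

Values of 50 or more are conventionally read as the two curves being equivalent, which is the criterion mentioned above.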
So we get into the profiler, turn on desirability functions set to maximize, and then maximize desirability to find the combination of factor settings that this model says gives us the closest match to the reference; that would be at the combination of factor settings we see here. Now the F2 analysis is complete, and we move on to the second analysis, the functional DoE analysis. For this analysis, we work with the data in a stacked format, where all of the dissolution measurements have been combined into a single column and we have a time column as well. The first thing we do is go into the Functional Data Explorer platform. In the platform launch, we put dissolution as our response, time as our X, the batch column as our ID, and we supply the four DoE factors as supplementary variables. Once we're in the platform, we take a look at the data using the initial data plot. This particular data set doesn't need any of the cleanup or alignment options, but we are going to load the reference dissolution curve as a target function. For relatively simple functions like these, I typically use B-splines for my functional model. When we do that, we see our B-spline model fit, and the initial fit that comes up is a cubic model that is behaving poorly: it interpolates the data points well, but does crazy things in between them. So I'm going to change from the default recommended model to a quadratic spline model instead of the cubic one. We do that by simply clicking on Quadratic over here on the right of the B-spline model fit, and we see that this quadratic model fits the data well. A functional principal components analysis is automatically calculated, and we see that the Functional Data Explorer platform has found three functional principal components. The leading one is very dominant, explaining 97.9 percent of the functional variation, and it looks like a level shift up or down kind of shape component. The second one looks like a rate component, and the third one almost looks like a quadratic. Looking a little closer at this quadratic B-spline model fit, we see that it is fitting the individual dissolution curves pretty well. So now we're ready to do our functional DoE analysis. Each of our individual dissolution curves has been approximated by an underlying mean function common to all the batches, plus a batch-dependent FPC score times the first eigenfunction, plus another batch-dependent FPC score times the second eigenfunction, and so on with the third. What we're going to do is set up individual DoE models for each of these functional principal component scores as responses, using our DoE factors as inputs. The Functional Data Explorer platform, of course, makes all of this simple and ties it up in a bow for us, and when I say it ties it up in a bow, what I really mean is the FDoE profiler. This pane here shows our predicted trajectory of dissolution as a function of time, and then we can see how that trajectory would change by altering the DoE factors. That relationship with the DoE factors comes from these three generalized regression models, one for each of our functional principal component scores. If we want, we can open those up and look at the relationship between the DoE factors and that functional principal component score, and we could even alter the model by moving to other ones along the solution path.
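Conceptually, the functional DoE step fits one small regression per FPC score on the DoE factors and then rebuilds a predicted curve as the mean function plus the predicted scores times the eigenfunctions. The sketch below illustrates that idea only; it is not the Functional Data Explorer implementation, and the input arrays (factors, scores, mean_curve, eigenfuncs) are assumed to exist with the shapes noted in the comments.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed inputs (hypothetical): 'factors' is an (n_batches x 4) array of DoE settings,
# 'scores' is an (n_batches x 3) array of FPC scores, 'mean_curve' and 'eigenfuncs'
# are the mean function (n_grid,) and eigenfunctions (n_components x n_grid) on a grid.
models = [LinearRegression().fit(factors, scores[:, j]) for j in range(scores.shape[1])]

def predict_curve(x_new):
    # x_new: a single row of candidate DoE factor settings, shape (1, 4).
    pred_scores = np.array([m.predict(x_new)[0] for m in models])
    return mean_curve + eigenfuncs.T @ pred_scores

def distance_to_target(x_new, target_curve):
    # Mean squared distance to the reference curve; minimizing this over the factor
    # space mimics what the FDoE profiler does with the integrated target distance.
    return float(np.mean((predict_curve(x_new) - target_curve) ** 2))
```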
I just want to point out that it's possible to change the DoE model for an FPC score; in the interest of time, I'm going to move on and not demonstrate that. We have diagnostic plots, the most important probably being the Actual by Predicted plot. This has the observed dissolution measurements on the Y axis and the predicted dissolution values from the functional DoE model on the X axis. As always, we want the points to lie tight along the 45-degree line, and in this case the model looks pretty good. We don't want to see any patterns in the residuals, and I'm not seeing any bad ones here, so this model looks good and we're going to work with it. I've already explained how this pane represents the predicted dissolution curve as a function of time and the individual DoE factors. The other two rows appear because we've loaded the reference as a target function: this row is the difference of the predicted dissolution curve from the target reference curve, and the bottom pane is the integrated distance of the predicted curve from the target. When we maximize desirability in this profiler, it gives us the combination of factor settings that minimizes this integrated distance from the target. So I do that by bringing up Maximize Desirability, and now we see the results of the functional DoE analysis: 0.725 polymer A, 0.275 polymer B, a total polymer of about 0.17, and a compression force of about 1700 minimize the distance between our predicted curve and the reference. Now we've done two analyses. Both recommend going to the lowest setting of polymer A and the highest setting of polymer B; they differ in their recommendations for the total polymer amount and the compression force. The third analysis is the Curve DoE analysis. This is structured similarly to the functional DoE analysis in that we use the same version of the data, with the dissolution measurements all in one column and a time column. But we don't yet have a built-in target-function option in the Fit Curve platform, so the first thing we have to do is fit just the reference batch and save its prediction formula back to the table. Then we do a Curve DoE analysis, which is largely similar to a functional DoE analysis in that we're extracting features from the curves by modeling the curves, and then we go to the profiler under the Graph menu to find settings that best match the reference. The nonlinear model we're using is a three-parameter Weibull growth curve, which has a long history in the analysis of dissolution curves. Weibull growth curves have an asymptote parameter that represents the value as time goes to infinity; there is an inflection point parameter, which I think of as a scaling factor that stretches out or squeezes in the entire curve; and there is a growth rate parameter that dictates the shape of the curve. What I think is really valuable about this model, relative to the functional DoE model or the F2-type analysis, is that we're modeling features extracted from the data that have real scientific meaning, especially the asymptote and inflection point parameters. Now, the Curve DoE analysis doesn't have a target-matching capability like the Functional Data Explorer, so we begin the analysis by excluding all of the DoE rows in the data table.
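One common parameterization of the three-parameter Weibull growth model (my notation here; the platform's exact labels and form may differ slightly) is:

$$
f(t) \;=\; a\left(1 - e^{-(t/b)^{c}}\right)
$$

where a is the asymptote (the dissolution value as time goes to infinity), b acts as the inflection-point or scale parameter that stretches or compresses the curve in time, and c is the growth rate governing its shape.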
Those are the rows where the set column equals A, so I select a cell there, select matching cells, and then hide and exclude those rows so that only the reference batch remains unexcluded. Then I go to the Fit Curve platform, load it up, fit the Weibull growth model, and save that prediction formula back to the table. Once we complete the Curve DoE analysis, we'll compare the Curve DoE prediction formula to this reference predictor to find combinations of factor settings that get us as close to this curve as possible. So now we unhide and unexclude the DoE batches and go back into the Fit Curve platform. Just like in the Functional Data Explorer platform, we load the DoE factors as supplementary variables. Now that we're in the platform, we can fit our Weibull growth model. The initial fit looks pretty good; it looks like we're capturing the shape of the dissolution curves. One thing I like to do next is make a parameter table. This creates a data table with the fitted nonlinear regression parameters. I like to look at these in the Distribution platform to see whether there are any outliers or anything unusual, and I also like to look at the patterns in the Multivariate platform; it just gives you a better sense of what's going on with the nonlinear model fit. Once everything looks good, we can do our Curve DoE analysis, and this looks very much like the functional DoE analysis from before: we have a profiler that shows the relationship between dissolution and time and how that relationship changes as a function of the DoE factors, and we also have a Generalized Regression model for each of the three parameters that we can examine individually. The first thing I would do before using the model in any way is look at the Actual by Predicted plot, which is what we see here. These are the predicted values incorporating both the nonlinear model in time and the DoE models on the nonlinear regression parameters, and this plot looks pretty good. Because there is a fairly straightforward interpretation for the Weibull growth model parameters, it can be useful and interesting to open up the individual model fits for those parameters. For example, here are the coefficients for the inflection point model. Because the inflection point is a strictly positive quantity, a LogNormal best subsets model has been fit to the data by the Generalized Regression platform. We see that the mixture main effects have been forced in, and that the compression force by polymer A interaction is the only other term in the model. What this means is that if we hold the polymer proportions constant and increase the compression force, we would expect a larger value of the inflection point. One would observe this as a tablet that takes longer to dissolve, which is exactly what we would expect to happen. We can save the Curve DoE prediction formula back to the table and see, in all its gory detail, how the models for asymptote, inflection point, and growth rate are combined with time to produce the overall prediction of the dissolution curve from the DoE factors. Fortunately, with JMP we don't have to look at the formula too closely, because we have profilers that let us see the relationships visually rather than algebraically.
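One way to set up the comparison described next is a simple percent-difference formula column over the two saved prediction formulas. Here is a minimal JSL sketch; the column names are hypothetical placeholders, not the author's actual script:

```jsl
// Sketch: percent difference of the Curve DoE prediction from the
// reference batch's saved prediction, added as a new formula column.
// "Curve DoE Prediction" and "Reference Prediction" are assumed names.
dt = Current Data Table();
dt << New Column( "Pct Diff from Reference",
	Numeric, "Continuous",
	Formula(
		100 * (:Name( "Curve DoE Prediction" ) - :Name( "Reference Prediction" ))
			/ :Name( "Reference Prediction" )
	)
);
```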
To solve our problem of finding the combination of factors that gives us the dissolution curve closest to the reference, I created a formula column that calculates the percentage difference of the predicted curve, taking the DoE factors into consideration, from the reference. The last step of the analysis is to bring this percent-difference response up in the Profiler under the Graph menu, being sure to check the Expand Intermediate Formulas option. This leads to a profiler where we can see the percent difference from the reference as a function of time and the DoE factors; I've shaded the region where the difference is less than one percent in green. By manually adjusting the factors, I was able to find settings where the predicted curve is less than one percent from the reference across all time values. This looks really good, but in practice I suspect it is overly optimistic. Here we see the optimal factor settings from all three analyses. The Curve DoE analysis lands in the interior of the range for the polymers, its optimal value for total polymer is 0.16, which is close to the functional DoE result, and its compression force is in between the optimal values recommended by the F2 analysis and the functional DoE analysis. After this, we made new formulations based on the recommended factor settings from each of these models and measured their dissolution curves, and we also took a new set of measurements from the reference. Here we see a summary of the final results from the verification runs. The new reference dissolution curve is in black; the Curve DoE curve, in green, is the closest to it, followed by the FDoE curve in blue. The result of modeling F2 is in red, and it did the poorest overall. This should perhaps not be too surprising: the F2 approach was the simplest, reducing the data down to a single metric, and it did the poorest. The functional DoE model had to empirically derive the shapes of the curves and then model three features of those shapes, essentially using more of the information in the data. The Curve DoE led to the best formulation because it used the data most efficiently, via some prior knowledge about the parametric form of the dissolution curves. We also see that the result of the F2-based analysis is not equivalent to the new reference batch, while the approaches that treated curves as first-class objects are equivalent. What this means is that the F2 approach would have required at least another round of DoE runs, so an inefficient analysis leads to an inefficient use of time and resources. I'm going to close the presentation with a retrospective InfoQ assessment of the results. Overall, we found that the Curve DoE prediction generalized best to new data but was the most difficult analysis to perform. I want to note that if we didn't have a known nonlinear model that fit the data well, we could not have done that analysis; the functional DoE analysis and the F2-based approach can be used more broadly in other situations. The profiler leads to excellent communication scores for all three analyses, but the ability to see how the shape of the dissolution curve changes with the DoE factors in the functional and curve-based approaches leads me to give them better communication scores, with the Curve DoE approach highest by a little, because we're directly modeling more meaningful parameters than in the functional DoE approach. That's all we have for you today.
I want to thank you for your time, interest, and attention.
These days, nearly every type of process equipment is able to collect a lot of data and then make it available to export for further analysis. This is especially true in R&D labs, where the equipment replicates the manufacturing process on a smaller scale and makes the analysis of data more useful for understanding the process. However, many types of equipment use proprietary software, which can make it difficult to analyze the data coming from different systems. Exporting all the data into JMP data tables makes analysis easier, especially when it is in a familiar interface. It can also help link results from different process steps, driving us (hopefully) to discover unsuspected relationships. Over the last few years, we have imported data from several pieces of lab equipment into JMP, from the most automated solutions to the "do it yourself" ones. The result was always the same: better data exploration.       Welcome to my talk. I am Paolo. I work in a research and development laboratory in the pharmaceutical industry, at Menarini in Italy, and I will show you some of our work in [inaudible 00:00:16] with JMP. Nowadays, almost every piece of process equipment can collect a lot of data and make it available for export and further analysis. Especially in an R&D lab, where the equipment replicates the manufacturing process at a small scale, analyzing that data helps with process understanding. But each piece of equipment runs its own proprietary software, and data analysis can be uncomfortable on the onboard system, with its small screen or a touch screen that is not easy to use. Exporting all the data into a JMP data table makes analysis easier and more comfortable, and it can also help link results from different process steps, hopefully driving us to discover unexpected relationships between variables. We started using JMP with release 7, so we have a lot of examples to show you, and we start with the oldest one: bulk and tapped density. The bulk density of a powder is the ratio of its mass to its volume, including the contribution of the interparticulate void volume. The sample density is then increased by mechanical tapping. The interparticulate interactions that influence the bulking properties of the powder are also the interactions that interfere with powder flow, so a comparison of bulk and tapped density gives a measure of the flow properties. For this comparison we often use an index that describes the ability of the powder to flow: the Carr index, or compressibility index, calculated from the tapped and bulk densities, and there is a standard ranking of flowability related to the Carr index. We started by taking the volume readings by hand after 5, 10, 15 taps and so on, and recording them in a JMP data table. Later we tried a light sensor, like this one, that measures the distance of the powder surface from the top of the cylinder and stores the results in a CSV (comma-separated values) file. The results are essentially the same: whether the data come from automated or manual entry, they fit the Kawakita equation well. This is the Kawakita equation fitted with the Nonlinear platform in JMP. The equation describes how the powder settles during tapping and has three parameters. The first is the bulk density. The second is the Carr index, and here we see a value of 22, which indicates that this powder does not flow particularly well.
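For reference, the Carr (compressibility) index and a common textbook form of the Kawakita equation are shown below; the exact three-parameter version fitted in the talk is not shown, so the parameter mapping here is an assumption:

$$
CI = 100\,\frac{\rho_{tapped} - \rho_{bulk}}{\rho_{tapped}},
\qquad
C_N = \frac{V_0 - V_N}{V_0} = \frac{a\,b\,N}{1 + b\,N}
$$

where V_0 is the initial (bulk) volume, V_N the volume after N taps, a the maximum relative volume reduction (closely related to the Carr index), and b a constant related to how quickly the powder settles.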
The third parameter describes how quickly the powder settles during tapping. That is all for this first data-acquisition example. Now let's look at another instrument and talk about topical forms such as creams and gels. Rheological properties are important for topical dosage forms because viscosity influences production, but also the packaging and the use of a topical product; think of spreadability on the skin. So proper flow characterization is of fundamental importance during the development phases of a topical form. Nowadays, flow and viscosity [inaudible 00:05:17] at increasing or decreasing shear rate are easily obtained with automated equipment. Here is a picture of the 30-year-old rheometer that we had in our lab, which came with a dedicated computer system. Data could also be plotted manually on logarithmic paper, or, more simply, entered into a JMP data table. With Graph Builder we can reproduce the same output, the flow curve or the linear regression, using a logarithmic axis transformation. More importantly, using the Bivariate platform with a spline fit, we were able to estimate the shear stress as the shear rate approaches zero. This is the yield stress, or yield point: the maximum stress below which no flow occurs in the system, that is, the maximum stress below which the cream or gel does not move. This is important information when you plan volumetric filling of a fluid material, and it was not available from the old instrument, so JMP was very useful for understanding the behavior of our topical forms. Coming back to solid oral dosage forms, let's look at the single-station benchtop tablet press. This press is ideal for R&D because, very often, only small samples of active ingredient are available for the first tests. With it, we can independently control compression force and weight to meet the tablet requirements and specifications, and it works with the same tooling used on the manufacturing-scale press. We can plot tableting and formulation characteristics in order to eliminate or mitigate potential tableting deficiencies. On this model there is no automated data collection, so we simply enter the data in a file, preferably a JMP file. Here is the data table, and from it we get some important plots: the compressibility, compactibility, and tabletability plots, which describe the behavior of the formulation under compression. In this data table we also have equations that relate the compaction pressure applied to the formulation to the tablet characteristics. With a study covering 50 to 300 megapascals, we theoretically cover all the compaction pressures that can be applied to pharmaceutical tablets of every form and size. Now we close this one and move on to the moisture analyzer. The moisture analyzer is a balance that heats the sample with a halogen or infrared lamp and measures its moisture. The technique is also known as loss on drying, or LOD, because as the sample is heated it loses its moisture, and we can record the change in weight. It is important to know the residual moisture of samples, whether they are granules, powders, or other materials.
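As a reminder, loss on drying is usually expressed as a percentage of the initial sample mass (the standard definition, not specific to this instrument):

$$
LOD\,(\%) = 100\,\frac{m_{initial} - m_{dried}}{m_{initial}}
$$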
It can also be useful to see the rate of loss as a function of time or of temperature. The analyzer comes with software that collects the data in an XLS file, and importing that into JMP is very easy: we only need to set the row and column where the data start, click Next, and then Import. Here we have the data table with time and loss on drying. We have to tidy up the column names and so on, but here are the same data, cleaned. We repeated the measurement three times, so we have three replicates stacked in the same column, and with the Fit Y by X platform we get the function relating loss on drying to time. There is dedicated hardware and software, but I also tried to write a script to capture the data directly. It partially worked, but I never finished it, because the simpler and more effective way to collect the data is to import the XLS file. Roughly, though, you can do the same thing by opening a new data table, defining the columns you want, and having JMP wait for data from the COM3 port. The oral route of drug administration is the most convenient for the patient, so the tablet is the most popular solid oral dosage form, and that is why I talk so much about tablet presses. We saw a single-punch press, but in a manufacturing environment rotary tablet presses are used, and in the lab we also have a rotary tablet press with a large number of punches. Our press is equipped with strain gauges to measure the compaction force, the ejection force, and the force needed to detach the tablet from the punch surface. All these data are displayed and recorded by software and are very useful for monitoring and studying the tableting process. Normally the software displays the data and also uses them for real-time weight adjustment of the process, but our lab version also lets us analyze single events, get statistics, print and export reports, and so on. Here are some screenshots from the software. The raw data come as a text (txt) file, which is the reason to open it with JMP. Here is the txt file; I open it in JMP, select All Files, choose Data (Using Preview), and click Open. Here are the data coming from the software. The data start on the fourth line, so I correct that and click Next. This column is read as character, but we need numeric data. Column three is the same as column one, so I exclude it. This one is the compression force (sorry, the software is in Italian). This is time again, so I deselect it. This is the scraper force, which measures the force needed to detach the tablet from the punch surface. This is time again, deselected, and this is the ejection force. I click Import and get my data table. I have to correct a few things again, but nothing difficult: every column becomes numeric data. I have already prepared the JMP file. Here, using Graph Builder, I can show the process: this is the compression force for one tablet, this is the ejection force, and this is the scraper force. In the original software these are on two different pages; here I have all three variables on the same page, and nothing is lost. Now I have one tablet selected, but if I widen the X axis to the roughly 30 seconds that were recorded, I can see the whole process.
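A Graph Builder view like this one can also be produced by script. Here is a minimal JSL sketch with hypothetical column names; Position( 1 ) merges the extra force signals onto the first Y axis, and the exact element arguments may need adjusting:

```jsl
// Sketch: overlay the three force signals against time in Graph Builder.
// Column names are placeholders for the imported txt data.
Graph Builder(
	Variables(
		X( :Name( "Time" ) ),
		Y( :Name( "Compression Force" ) ),
		Y( :Name( "Scraper Force" ), Position( 1 ) ),
		Y( :Name( "Ejection Force" ), Position( 1 ) )
	),
	Elements( Line( X, Y( 1 ), Y( 2 ), Y( 3 ) ) )
);
```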
I also found in the software another report that records the peak values of each variable, and this can be useful too, so I import it into JMP. Here I have, for every punch, every station, and every rotation, the maximum compression, scraper, and ejection force. For example, here we changed the lubricant of the formulation, which decreases the ejection force, and we can see the effect of that change on the three variables; this is the ejection force, and the effect of the change is visible. Now I close this and we move on to NIR process monitoring. In the lab we have an NIR spectrophotometer with a small form factor; you can see it fits in the hand, and it has a Wi-Fi connection. Its main characteristic is flexibility: it can be installed on various pieces of equipment. The most common application is blend monitoring. A tumbling mixer for powder is generally a container rotating about its axis, like this; the most common are cube-shaped containers called bins. The NIR is mounted on the bin by a tri-clamp flange, and during the mixing operation the instrument collects a spectrum of the powder at each rotation. As mixing goes on and the system becomes more and more homogeneous, each spectrum becomes more and more similar to the previous one. To get the most from NIR data, it is essential to apply chemometrics, and of course we do that with the NIR software, but it is also possible to export the data to an XLS file and work in JMP. Importing the XLS file is very easy; we simply open it. Here is the raw data file: for every wavelength of the NIR spectrum we have the absorbance value at every rotation, and the whole process was 80 rotations of the bin. We can take a quick look at the spectra by selecting the wavelengths, putting them on the X axis, and choosing Parallel Merged; here we have the spectrum of each rotation. As I said, we normally apply chemometrics to NIR data to extract as much information as possible. A pretreatment commonly used in this type of analysis is the standard normal variate (SNV) pretreatment, which is a normalization of the spectra: each spectrum has its own mean subtracted and is divided by its own standard deviation. I wrote a script to do this pretreatment (a minimal sketch follows below). This is the raw data file; we run the script, select the wavelengths, and get the new, pretreated data. With Graph Builder we do the same Parallel Merged plot and can compare the raw spectra with the pretreated ones. We can see this better in a file where the spectra are colored by rotation, from red to green. The very first spectra are the red ones and the last spectra are the green ones, and the green ones are more and more similar to one another compared with the red and yellow spectra. We can see the same thing with a principal component analysis: this point is the first rotation, and so on, and the principal components become more and more alike as we approach the end of the process. Another way to see the end point of a process is the moving block standard deviation; here we see a plot from our NIR software, and I will show you more about it in another window.
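Here is the minimal SNV sketch referred to above. It is an illustration, not the author's actual script, and it assumes a table whose numeric columns are the spectral channels, with one row per rotation:

```jsl
// SNV pretreatment sketch: for each spectrum (row), subtract its own mean
// and divide by its own standard deviation. Table layout is an assumption.
dt = Current Data Table();
X = dt << Get As Matrix;          // rows = rotations, columns = wavelengths
snv = X;                          // matrix to hold the pretreated spectra
For( r = 1, r <= N Rows( X ), r++,
	row = X[r, 0];                // the full spectrum in row r
	snv[r, 0] = (row - Mean( row )) / Std Dev( row );
);
// snv now contains the standard-normal-variate spectra
```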
Now, the high shear mixer. Granulation is a really important step in pharmaceutical manufacturing. Granulation improves the physical characteristics of a powder mix, such as its flow properties and content uniformity. Granules can be used as they are in a delivery form such as a sachet or stick pack, or they can be pressed into tablets. High shear mixers are a key piece of equipment for wet granulation: the powder is granulated by a binding solution and by the shear force of a rotating impeller, and the wet granules are then dried in a separate step. In our high shear mixer, every process parameter can be controlled by software: the impeller speed, the chopper speed, the rate of addition of the binding solution. Moreover, the software records related variables such as product temperature and power consumption, and the data are stored in a CSV file, so it is very easy to import them into JMP. Here are the data imported into a JMP data table. The columns I colored yellow come from some calculations about the amount of water added and so on. The important point is that with Graph Builder we can see the whole process and its parameters; for example, we can see the torque measured by the software during the wet granulation and during the massing time, and how it changed. In this picture you can see that our NIR instrument was also fitted to this equipment to monitor the granulation process. Here are some results, again from the NIR software, where we gave each phase of the granulation a color. With principal component analysis we can see the start of the granulation; then, when we add water, there is a change in the physical properties of the granules, which become wetter and more agglomerated until the end of the process. Here we have the same data in a transposed matrix for better visualization in Graph Builder, and we can see the spectra of the different phases: the very first part of the granulation, when we are mixing the powder without adding water; the water addition; and the final massing time. We can see the peaks changing; here is the maximum absorption of water in the NIR spectrum, and here are the start, middle, and end points of the granulation. All these data can be summarized in a journal like this one, where we highlighted the variation of a peak depending on each step of the process. Next we have another two pieces of equipment, the fluid bed and the tablet coater, which are also very important in pharmaceutical manufacturing. We have a particular suite made of three units: a main control unit that is shared by the two processes, and two interchangeable units, one for each purpose. We start with the fluid bed. The fluid bed is another way to do wet granulation, because wet granulation is not done only in a high shear mixer; it can also be done in a fluid bed. Fluid bed technology means that the powder to be granulated is suspended and kept in motion by an upward flow of heated air; a binding solution is sprayed onto the suspended powder, and the air flow removes the solvent during the whole process. The onboard software gives us total control of the process parameters, every relevant variable is collected, and the reports are stored for future analysis and can be exported as a PDF file. Here is a PDF file, reduced for the purposes of this presentation, and I try to open it in JMP.
I know that on the first page there is no table I want to import, so I click Ignore Tables on This Page. On the next page, this small table is not of interest, so I ignore it, and I also ignore the very last one. Here there is a graph of the data, so again Ignore Tables on This Page. Here I have a preview of the table I am interested in; I click OK and I have my data in a JMP data table. There are a few things to fix, such as the data and modeling types, but that is not a problem. Now we move on to the coater system, which is used for tablet coating. Tablets can be coated for several reasons: the coating can have a specific function, for example delaying drug release; it may simply be needed to reduce dust during the packaging operation; or it can serve a cosmetic need, such as masking a bad taste. Whatever the reason, the coating is applied by spraying a coating suspension onto tablets rotating in a drum, while a flow of heated air removes the solvent during the process. Since the software is the same as for the fluid bed (they share the same control unit), the data and reports are the same as those we can get from the fluid bed. So here is a JMP data table obtained from the PDF file we saw before. Again using Graph Builder, we can get an overview of the process; here we see the spray rate, the product temperature, and so on. Here too we tried an NIR application: we simply put the instrument inside the pan during coating and took a measurement of the weight gain of the tablets at various time intervals. Here are the spectra and how they change during the process. The data shown here are pretreated with a first-derivative treatment of the raw data, which highlights the variation of the spectral peaks. Then we built a relationship between the sample spectra and the weight gain measured during the process, so for the next batches we will be able to predict the weight gain of the tablets simply by taking spectra during the process. Now I close this; that is enough on coating. Finally, we come back to topical dosage forms. The laboratory reactor that we have is useful for optimizing processes such as mixing, homogenizing, and dispersing at lab scale. The system can be adapted quickly and easily to a wide range of applications; our main use is to make topical forms such as gels and creams. It has an integrated scale and pH and temperature sensors. The onboard system allows process control and displays the process graphs, but it can also store every relevant process variable on a PC as an XLS file, so it is simple for us to import into JMP, and, always using Graph Builder, it is easy to see the whole process. For example, we can see the speed of the dispersing system, but also the torque or the viscosity trend of the gel during each phase. Again, we tried to use the NIR spectrophotometer to monitor the process. For this application we added a recirculation loop to the reactor, and the NIR was mounted on it with an appropriate flange. The data collected and processed with the NIR software can be imported into JMP, so here we have the spectra collected during the whole process and the principal component analysis. We gave a name to each phase, which we can see with this local data filter.
The first step is when we prepare an aqueous solution of the base; then we add the active ingredient; then the ethanol and the gelling agent; then we have the gelification of the system; and finally the finished product. Here we see 19 spectra that all sit at the same point of the principal component score plot. I spoke before about the moving block standard deviation. A moving block standard deviation is simply the standard deviation of a block of spectra; by comparing the standard deviation of the current block with the previous one, we can see how the variation of the system changes. As the system becomes more and more homogeneous, the moving block standard deviation becomes more and more stable. Here is the plot, and we see the same thing we saw with the principal component analysis: the five phases, with the aqueous solution, the second phase where the active ingredient is added, the additions of ethanol and of the gelling agent, the gelification, and finally the finished product, where the moving block standard deviation becomes very similar from block to block and very close to zero. Well, I think we have seen enough; we have looked at a lot of processes and a lot of equipment. Each of them, of course, has software specially designed to control the equipment and to collect and analyze process data. These programs cannot be replaced by another; they are needed to control and drive the equipment. But every system is standalone, so sometimes we cannot use the equipment for data analysis because it is busy with another project, or we need to merge data from different steps to get a more global overview of the product. We can do that easily with JMP: just import the file. I thank you, and goodbye.
In the journey of delivering new medicines to patients, the new molecular entity must demonstrate that it is safe, efficacious, and stable over prolonged periods of time under storage. Long-term stability studies are designed to gather data (potentially up to 60 months) to accurately predict molecule stability. Experimentally, long-term stability studies are time consuming and resource-intensive, affecting the timelines of new medicines progressing through the development pipeline. In the small molecule space, high-temperature accelerated stability studies have been designed to accurately predict long-term stability within a shorter time frame. This approach has gained popularity in the industry but its adoption within the large molecule space, such as monoclonal antibodies (mAbs), remains in its infancy. To enable scientists to design, plan, and execute accelerated stability models (ASM) employing design of experiments for biopharmaceutical products, a JMP add-in has been developed. It permits users to predict mAb stability in shorter experimental studies (in two to four weeks) using the prediction profiler and bootstrapping techniques based on improved kinetic models. At GSK, Biopharm Process Research (BPR) deploys ASM for its early development formulation and purification stability studies.     Hi, there. I'm Paul Taylor. I'm David Hilton. We'll be talking to you about predicting molecule stability in biopharmaceutical products using JMP, and we'll dive straight into it. One thing to mention is a little shout-out to the paper that's been published by Don Clancy of GSK, on which most of this work is based. Just to introduce what we do: we're part of Biopharm Process Research, based in Stevenage in the UK at our R&D headquarters. We're the bridge between discovery and the CMC phase (chemistry, manufacturing, and controls), which is the gateway into clinical trials and ultimately the release of the medicine. We have three main aspects. We look at the cells, working towards developing a commercial cell line; the molecule itself, by expressing developable, innovative molecules; and the process, in terms of de-risking the manufacturing, processing, and purification aspects for our manufacturing facility. The overview for today: why do we need to study the stability of antibody formulations, how do we assess product formulation stability, an overview of the ASM JMP add-in that we have, and the value of using such modeling approaches, with case studies. Just to reiterate about biotherapeutics: these are all drug molecules that are protein-based. The most common you'll probably know are the vaccines from COVID, which are based on antibodies and other proteins. These are very fragile molecules, and their stability can be influenced by a variety of factors during manufacture, transportation, and storage. Factors such as the temperature and the pH can cause degradation of the protein; the concentration of the protein itself can have a significant impact; the salt type and even the salt concentration matter; and exposure to light or a little bit of shear can have an effect. What that causes is aggregation and fragmentation of the protein, and it can cause changes in the charge profiles, which can then affect the binding and potency of the molecule. That can be driven by isomerization, oxidation, and a lot more.
They're fragile little things, but we also need to keep an eye on their stability to make sure they are safe and efficacious. One way of looking at stability is to subject the product to a number of elevated temperatures and take measurements at various time points. These long-term stability studies can run up to five years, taking up to 60 months, and give more real-time data, but the procedure is extremely resource intensive. At each time point we can use a variety of analytics, such as HPLC and mass spectrometry, essentially separating the impurities that could cause problems away from the main product and then quantifying them by mass spectrometry, light scattering, or simply UV profiles. We can separate by charge, size, or [inaudible 00:03:21]. A lot can happen in those five years while we're gathering that data: we could have improvements in the manufacturing or the formulation. Does that mean we have to repeat the five-year cycle to get the stability data? The short answer is no. The alternative is accelerated stability studies. These are shorter-term studies in which we apply more exaggerated, accelerated degradation temperatures over shorter time periods: instead of months and years, we can look at a matter of days, going from 7 to 14 to 28 days. This technique is commonly applied in the small molecule space but not so much in the large molecule, biopharmaceutical space, because small molecules mostly involve tablets and solid formulations; it is only starting to catch on in the biopharma industry with its mostly liquid formulations. In terms of the stability modeling, we fit our data using Arrhenius-based kinetic equations that can be either linear or exponential with respect to time. These are semi-empirical equations grounded in the physical behavior: an accelerating model suits the case where a nucleation point for aggregation causes exponential growth, while a decelerating model suits the case where a rate-limiting step causes the degradation to slow over time. All of these models are fitted and put through a fit quality assessment using the Bayesian information criterion, and we can also establish confidence intervals using bootstrapping techniques. This is where David Burnham at Pega Analytics comes in; he worked closely with Don Clancy on developing a JSL script, a JMP add-in, for us scientists to use in the lab. I'm going to give you a quick demonstration of the JMP add-in itself. During an ASM study you collect your data and put it into a JMP table. In this instance we're looking at size exclusion data, so we're looking at the monomer percentage, the aggregate, and the fragment. What we have is a JMP add-in that does the fitting for us. If I open it up, you go through a series of steps. You can select the type of product: in the small molecule space, dealing with solid tablet formulations, you could use a model based on aspirin, which is a generic approach, or a generic tablet model where the product is more novel and different from aspirin. But for us in biopharma, because these are all liquid formulations, we use the generic liquid option.
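As an aside, the Arrhenius relationship underlying the kinetic models mentioned above has the generic form shown below; the add-in's exact parameterization follows the Clancy paper and is not reproduced here, so treat this as an assumption:

$$
k(T) = A\,e^{-E_a/(RT)}, \qquad y(t) \approx y_0 + k(T)\,t \quad \text{or} \quad y(t) \approx y_0\,e^{k(T)\,t}
$$

where k is the degradation rate at absolute temperature T, E_a the activation energy, R the gas constant, and y the measured impurity, growing either linearly or exponentially with time.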
Now, looking inside the data, that data table I showed you earlier, you can also do a quick QC check. This is just a sanity check that everything matches up and is in order; if not, you can remove the table and replace it with a new one. The most laborious part of the add-in is matching up the columns. The monomer, aggregate, and fragment are the impurities we want to model, so we put those in the impurity columns, and we match up the other important variables such as time, temperature, pH, and batch, which can also matter. If you have a molecule with different lot releases and you want to see whether your lots are consistent, this is one tool you could use to check lot-to-lot variability. You can also enter specifications: say we have a target of no more than a 5 percent drop in monomer (we model 100 minus the monomer in this case), no more than 3 percent aggregate, and no more than 2 percent fragment. In the model options we can select all the models we want to fit and evaluate, and also select a generic temperature and pH to look at; you can change that later, so it is flexible, and there are different variants, such as temperature only, or temperature and pH. For fitting the data, you can either go for quick mode, a roughly two-minute fit that won't be as accurate, or the longer-running mode. To save you from watching the spinning wheel, I've already fit the data, so we can go straight into it. We fit all the models, and once it loads we have an overarching view of the prediction profilers for each model that has been fitted and evaluated. You can see that some have confidence intervals that are broader and some that are tighter, which can mean either that the model doesn't reflect the data very well or that it is overfitted. As scientists, we can then delve into selecting the candidate model, which is based on the Bayesian information criterion. You can look at those criteria and see which model is most appropriate, and you can use the drop-down to see how each model's fit compares with the actual values. Last but not least, when you select the preferred model, you can manually override which model you'd like to use. And here, in the prediction profiler, you can select the conditions you'd like and extrapolate: beyond the one-month study period, you can extrapolate all the way up to two years and see how the product fares in terms of stability. One last thing to add is the bootstrapping. If you want finer control of how the bootstrapping works and more accurate modeling of the confidence intervals, you can simply apply it to each of your impurities. There's the time to completion; it's done, and you can see that you can look into the results in great detail. Let's go back to the presentation. Our first case study looks at the stability of our formulations. Formulations play an important role in drugs in general: they not only help in biopharma by stabilizing the protein during storage and manufacture, but they can also aid drug delivery when the product is administered to a patient.
Formulations contain many components, called excipients. These are generally inactive components within the drug product, but they act as stabilizers; they include buffers, amino acids, stabilizers, surfactants, preservatives, metal ions, and salts. In formulation development you can screen many excipients to find the ideal formulation, and you can use design of experiments for that, but that's a different topic. One way to test and prove that your final formulation is fit for purpose is stability testing. In our case study we looked at three different pHs of formulation for this monoclonal antibody and stressed them at elevated temperatures. We ran our ASM study from time zero up to 28 days and analyzed the samples by size exclusion. Here you can see the same kind of snippet from the JMP add-in, showing how the models are fitted, and also the extrapolation from the prediction profiler to see how the monomer stability fares. When we look at the monomer and aggregate, we can take predictions from the prediction profiler at 5, 25, 35, 40, or other temperatures. Within that model we have an N1 value, which reflects how fast the degradation is in acidic or basic pH. What we found is a negative value, meaning faster degradation at acidic pH, so there was a higher risk at low pH rather than high. Our next case study, along similar lines, looks at a different monoclonal antibody, where we used an ASM stability study of up to 28 days. For that same molecule we had historical data with five years' worth of stability results. We took the data from both studies, put them into the JMP add-in, and compared them. Highlighted in green you can see the model prediction; at the bottom is the long-term, real-time study. In blue, you can see that the values compare quite well, and in red are the confidence intervals; they match up nicely, which is good. One of the downsides of ASM, because it is a short-term study, is that the fitted curves look quite linear, whereas the real-time data are a bit more curved and exponential. But in terms of getting a prediction back quickly, it is quite good, and that can support immediate formulation development work rather than waiting for long-term stability data. I'll pass over to David. Thanks a lot, Paul. Paul has given an example of how we can make long-term stability predictions based on a month's worth of data, where the intention is to design a formulation that hits a certain minimum threshold for stability. In this next case, we instead have a fixed formulation, and we're trying to use the technique to find out for what period of time the product stays within a defined threshold, and therefore how long we can hold the material. The material in question is generated during the manufacture of the biotherapeutic: essentially, you have different unit operations linked in series.
You complete one unit operation, and then, depending on shift patterns or the utilization of your facility, you may want to have holds in between different unit operations in order to regulate the timing of your process. One of the key things we need to know is that, if we're holding material between unit operations, what is the maximum period of time we can hold it for [inaudible 00:14:57]? The way we normally look at this is with a plot like the one on the lower right-hand side: we hold the material for a month in a small-scale study and make repeated analytical measurements of the product quality attributes of interest, to see how they change and whether they stay within tolerances. In this case, you can see the attribute plotted over time on the X axis; in all cases it falls within the red bands, so it meets the [inaudible 00:15:32] criteria we are after. What we looked to do in this study was, rather than testing exclusively at the standard conditions the material would be held at, which would be 5 degrees C, refrigerated, or approximately 25 degrees C, room temperature, to run parallel studies at 30, 35, and 40 degrees C, but only for a week, and then see whether that high-temperature data could be used to predict the low-temperature behavior. On this slide we have a few snapshots of the data we collected, visualized in Graph Builder. In the panel on the left-hand side you have the data collected at 5 degrees C, and the columns represent material coming from different unit operations. Each row corresponds to a particular analytic, for example a measurement of the concentration or of the level of [inaudible 00:16:38] species. On the right-hand side we have the equivalent data for the same unit operations and analytics, but at higher temperature. Just from a basic plot like this, the first thing we can see is that the general trends seem to be consistent. If we look at the purple plots, the first column, the first unit operation, shows a descending straight line, whereas the second and third unit operations show a slight increase. Qualitatively, it looks quite promising, in that an increase in temperature isn't changing the general trends we observe. For a more quantitative prediction, this is where we began to use the ASM add-in. What we did here was take the 30, 35, and 40 degrees C data and use that to train the model. In terms of the model fit quality, you can see from the predicted versus actual plot in the center that it appears to fit quite well, which is reassuring. If you look at the model fits in the table on the top left-hand side, the model with the lowest BIC score was the [inaudible 00:17:57] model. This study didn't have pH as a variable, which is why the BIC scores for those two models are the same: removing that parameter makes the linear kinetic model with and without the pH term essentially equivalent.
What we then did was use this high-temperature data to fit the kinetic model and determine what the kinetic parameters would be, the K1 and K2 in the kinetic equation shown on the bottom left of the slide. We then changed the temperature value to 5 and 25 degrees C and predicted what level of degradation we'd expect at those temperatures over a longer period of time. That's what is shown on the right-hand side: the red lines correspond to predictions from this equation based on the high-temperature data, overlaid on the experimental data at the lower temperatures to see how good the prediction is. In this case, you can see that the predictions appear to be quite good, which gives you some cause for comfort. Sometimes, however, we noticed that this wasn't the case. In this next example, again with the high temperatures, we were able to get good model fits, with a predicted versus actual plot that is quite strong. In this case, however, the model judged to be the best was the accelerating kinetic model, indicating that the reaction rate gets faster over time. When we applied the same procedure to this data set and tried to model what would happen at lower temperatures, we began to see that the prediction was a little erratic. In reality, the increase in the level of this particular impurity was fairly linear, while the model was beginning to overestimate it quite drastically at the later time points. One thing that's important to bear in mind is that you need to incorporate a level of subject matter knowledge when applying these kinds of techniques: you have to balance what is the best statistical model in terms of fit against what is most physically representative of the type of system you're dealing with. Another area where subject matter knowledge is important within this technique is the selection of the temperatures, and the temperature range, used in the study. There are two reasons for this, and they are competing forces. It's preferable to use as high a temperature as possible, because that means the reactions proceed at a faster rate. One of the issues we often encounter with this type of study is that at low temperatures, fortunately for us, we're often dealing with products that are quite stable; the inherent problem is that when you use quite short, narrow time series to measure these changes, you often end up caught in the noise, and your signal-to-noise ratio ends up being quite low. That's what is demonstrated in these plots. The plots on the left-hand side are the kinetic rate fits, or reduced plots. What you would typically expect with entirely temperature-driven behavior is a straight line, and if you look at the top left plot, you can see that that's the case: the first four blue points form a straight line, corresponding to 40 down to 25 degrees C, so across all of those temperatures we have entirely temperature-driven behavior. But then the 5 degrees C point on the right-hand side of that plot appears to be off.
But when we dove into the data to find out whether this was a genuine breakdown of the temperature dependence, and looked at the equivalent plot in the right-hand panel, we began to see that there is so much noise in the data that it is more of a fitting issue than a mismatch in the underlying behavior. If you look at the gradients of the red and blue lines, despite their intercepts being different, either one can fit that data, because there is simply too much noise to fully understand which one should be applied. In terms of general conclusions, I think what we've been able to demonstrate with this project is that JMP has a number of powerful built-in tools, and with some knowledge of the JMP scripting language, or someone who can do that for you, they can be compiled into a user-friendly package that makes quite complex analysis accessible to most users. It has also demonstrated that performing statistical fits to semi-empirical models brings a lot of tangible benefits: we're able to make predictions about the future that we were not able to make in the past, and potentially to significantly reduce our timelines for identifying liabilities with particular drug products. Finally, it demonstrates that in areas such as this you cannot rely exclusively on statistical models; you also have to incorporate your own subject matter knowledge to work out which statistical or kinetic model is most appropriate to the situation you have, and then which of those is the best fit. In terms of acknowledgements, a lot of this work has been based on an original paper that came out of GSK by Don Clancy, Neil, Rachel, Martin, and John, and it has been extended here to a biotherapeutic setting. We also thank George and Ana for supplying the data used in this project, and Ricky and Gary for the project endorsement [inaudible 00:24:20].
Resources in process development for fermentation processes are always limited, often because of time constraints or limited access to fermentation vessels. For this reason, optimal use of the data collected is of the utmost importance. This presentation shows how to maximise insights and understanding of the process by modelling data collected offline (like product titer and parameters of downstream processing) and online sensor data (like pO2) in the context of the Functional Data Explorer platform. The modelling approach is based on constant and functional factors and responses. Because of the limited design space, the new extrapolation control for the profiler plays an important role here.   Furthermore, this presentation highlights the advantages of using Graph Builder to lead a team and subject matter experts to a faster, easier and more efficient understanding of the produced data so that they may make better and more reliable decisions in the fields of bioprocess development.       How to make more from your online and offline fermentation data, and how to speed up your bioprocess development with statistical modeling. I am Benjamin Fürst from Clariant, and today I will show you how to do this in JMP. I will lead you through a hands-on presentation on how to combine different modeling techniques so that, in the end, you have a combined profiler where you can look at all the responses, online and offline, at one time and see what impact your process parameters have. I am a biochemical engineer by training. I work for Clariant in Group Biotechnology, and I am group leader of Bioprocess Design. The agenda for today is an introduction to Clariant and to the sunliquid process technology, and then I will dive into the topic of my talk, the statistical analysis of fermentation data. For modeling the offline responses, what I call standard statistical modeling, I will use the Fit Model platform and will focus especially on the extrapolation control, which came in with JMP Pro 16. The next level of analysis is modeling the online data; I will use the same data set and, for this, the Functional Data Explorer platform, also a JMP Pro feature. Let me share some key figures about Clariant. Clariant is a global leader in specialty chemicals. On the right you can see the Clariant business numbers: 3.9 billion Swiss francs in sales in 2020, with a 15% EBITDA margin and over 13,000 employees worldwide at 85 production sites. Clariant consists of three business units: Care Chemicals, Catalysis, and Natural Resources. Care Chemicals produces, for example, ingredients for the shower gels and shampoos you use in daily life. Natural Resources produces, for instance, bentonites; 40 percent of all vegetable oils are clarified with bentonites from Clariant. Catalysis covers all kinds of catalysts and also contains the business line Biofuels and Derivatives, which sells sunliquid. What is sunliquid? Sunliquid is a technology that came from the Group Bio R&D Center, where I work. Basically, it is a biotechnological process for producing bioethanol from non-edible biomass. How does it work? Just to stress this again, this is second-generation bioethanol production, so we use non-edible biomass feedstock, basically agricultural residues. This can be wheat straw, bagasse, corn stover, municipal waste, or forestry residues. We put that into the sunliquid process and first turn it into cellulosic sugars and then into cellulosic ethanol.
The nice thing about this process is that it can be used as a platform process for not only by bio-ethanol, but for other sustainable fuels or even bio-based chemicals. The development of this process started in the Group Biotechnology where I work. You can see if I'm not in home office due to corona, I'm working here at that corner of the R& D Center in Planegg. That's located just a few kilometers outside of Munich in Bavaria in Germany. The center was inaugurated in 2006 and we have over 100 scientists and technicians working there. The competence fields of Group Bio are biofuels and derivatives, sunliquid, industrial enzymes, and biobased chemicals. Within the Biotech R&D Center, we have all the expertise to develop new products and technologies under one roof, starting from small tiny microtiter plates over a shake flask to up to 100 liter bioreactors. Here in the picture, you have a small peek at our technical center fermentation. To the right here, you can see the sun liquid pre-commercial plant in Straubing, which is just one and a half hours from here. It was built in 2012 and can produce about 1000 tons of ethanol per year, and we were able to test the wide range of different biomass feedstocks there. Maybe some of you are aware that Clariant is commissioning a commercial- size bioethanol plant in Podari, Romania at this time. To say, as a biochemical engineer, seeing this plant, that is really, truly, impressive thing. The size of the buildings and the equipment there are, in biotechnological means, really tremendous. And I see this as a real flagship for biomass conversion here in Europe. And to my opinion, that is one of the major biotechnology projects we have in Europe at this time and it really makes me proud to see that we are turning our sun liquid technology from bench scale to production scale. The plant transforms a quarter of a million tons of wheat straw into 50,000 tons of bioethanol per year. And to give a more handful number, that means one huge bale of straw, about 500 kilos goes into that plant per minute. The sun liquid technology enables you to produce bioethanol in a very integrated way. Side products like lignin are reused in the CHP plant for energy supply.` So generating steam and power which goes back to the plant, of course. If you're curious to learn more about that technology, please look at the links or feel free to contact me or reach out in the contacts which are given on the home page. Now, why I'm showing you all of this? What do all the stages have to do with each other? At all stages of the development of our biotech process, you have data about the process. And it doesn't matter if it's a micro titer plate, a one- liter bioreactor, or a multi- cue production scale fermenter, you always have offline data like titers you measure, and you have online data coming from sensors, pressure, temperatures. And you always want to know: how can you achieve the most efficient process? How can you achieve the high yields? What the process influences in process parameters? And what sensor do you have to pay attention to to get the most of your process? All these points can be addressed with statistical modeling of your process, and JMP really gives you the opportunity to make more from your online and offline fermentation data and speed up your biotech process. Let's go. Since my talk is going to evolve about data coming from a fermentation process, I want to put everybody on the same page concerning how does a fermentation process look like. 
This is a basic overview of the fermentation process with DSP. I will focus on the main points here. So you come in with a test tube that's a few milliliters. You propagate your organisms in a so- called seed fermenter. And then you do, in the next scale, the fermentation, the main fermenter where the product is going to be produced and then you go to a downstream. The setup here really depends on what the specs of your product are. All along the process, you have different steps and you have factors, which are set for this step. On the top here, I show just some examples. So seed inoculum amount, how much biomass you put in here, pH, temperature setting of the fermenter, or even other humid operations, and the raw materials you put into the process: which type, at what concentration. And all of them here are constant, they have a fixed number. So I depicted them here with dots. And of course, you have responses. The responses are what the outputs of your process. This can be a harvest yield. So how much product you going to get in your fermenter or how good does your DSP perform in terms of some specific response you want to look at. Those are constant, again, because you just have a single point and you have online sensor data. For fermentation, you usually look at something like a dissolved oxygen called pO2, temperature, pressures. And the important point about that sensor data, it is functional. So you have your sensor data over duration. You have a curve of data. And I can really tell you if you're interested in knowing, how is everything connected there? How can I make the best of all of the data you have? And normally starts if you want to know from your fermentation parameters what is your response you're going to have, your offline response something like a yield product character. And maybe you want to know from a fermentation parameter how does a typical sensor curve look like so that you know how to go with your process in a good way. And last but not least, wouldn't it be great, here for the last point, knowing at one time point, online sensor has a critical impact on your harvest yield? Using standard software you will really not get far. You're probably going to lose your focus looking at all the points and JMP can really help you in getting those things done. In the following I want to show you, with a live demonstration then, how you can do all those things in JMP. There are several videos in the community really how to use the Fit Model platform or the FTE platform, but I'd like to combine these platforms to put all the results together. So standard statistical modeling with a Fit Model platform and with a focus on the extrapolation control. I want to use the fermentation parameters and model the harvest yield. That is where we're going to go for the first point. Normally the goal of doing this is, you want to know what process parameter give a high harvest yield. You want to find an optimum in the design space. Is that optimum stable? And what parameters have interactions? And what parameters are sensitive? And the way you do this, as I already told you, you're going to use the Fit Model platform. One important point here, I call it here detailed evaluation by and with subject matter experts. I believe that the real speed for process development comes from a mutual understanding among the experts you have. It doesn't matter if it's a fermentation engineer, data analyst, or a manager. Every expert must understand a little bit of the language of the other experts. 
So it's really crucial that the subject matters share a common language. And I believe using JMP enables you to speak that common language using the profiles JMP provides and the reports JMP has. Get back to the topic. This was called a real- life experiment, and the data was coming from a planned experiment and it was not intended for the usage of statistical modeling. That means the design space was not optimal. I put a plot of three parameters here. As you can see, of course, there was some kind of structured approach, but not in all dimensions, not in all parameters. You can see on this axis, we don't have so many points. So one of the challenges here was the limited design space in here, especially the extrapolation control. JMP Pro comes in very handy. Let's get into JMP. I'm going to load up JMP now. So that is my data table and of course already scripted the report. This is the profile for that model I chose. A t this point, I will not go into the details of the modeling. I assume you already looked at your plots. You've done, you chosen your models, the parameters for your model correctly. So how can you use this for your development? I think that's pretty obvious. You have here the harvest yield, which you're interested for your fermenter, and dependency of all of the parameters which I found significant for this model. And one way you can use this, I want to show you a very nice thing you can see with this model. If you look at the Seed Inoculum, that is basically the amount of biomass you put in the process. If you have a low biomass you put in the seed, it means you're not going to have so much biomass during your whole process. And you will see that this parameter alone doesn't have a huge impact on the harvest yield. But now if you see that moving, if you move that around, there is a strong interaction with component A in the main fermenter. So with less biomass in here you have, the more component A in the main fermenter you give in, the more products you will generate. You can basically say that's a limitation on component A that makes totally sense. Now the interesting thing is the interaction. If you going to put more biomass into the process, that turns around. So what happens here? I have to say you cannot see that here what happens there because you have a different limit. Now it gets to the point where it's interesting. You have to put more data into your model, for instance, online data. You're going to see that in a different sensor. That is something I will show you then if we go into the online data analysis. You can exactly see this behavior there. And just seeing this was quite obvious for me; an explanation of that was clear. But just the other day I talked to a monitoring and fermentation engineer and he just said, "Having that behavior for that other parameter and temperature, that's totally clear and we have to look at that." That's something that I talked about, that common language that I will be able, as data analyst, to understand some things but not to the same extent as the fermentation engineer does. So it's very crucial that everybody has access to this kind of looking at that. I wanted to talk about Extrapolation Control. Extrapolation Control comes with JMP Pro 16 and it can be found here. There are different criterion you can set. I set it here to the first one that basically is JMP to stay within the context of the data you have. If you want to learn more about that, there's a talk about control extrapolation by Laura Lancaster and Jeremy Ash. 
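For readers who want to see the shape of this modeling step outside JMP, here is a minimal sketch of a model with main effects and the kind of seed inoculum by component A interaction described above. The column names and numbers are invented for illustration; this is not the presenter's data or JMP's Fit Model output.

```python
# Minimal sketch: harvest yield vs. process factors, including the
# seed_inoculum x component_A interaction discussed in the talk.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 40
df = pd.DataFrame({
    "seed_inoculum": rng.uniform(0.5, 2.0, n),   # relative inoculum amount (invented)
    "component_A":   rng.uniform(10, 50, n),     # g/L in the main fermenter (invented)
    "temperature":   rng.uniform(28, 34, n),     # degrees C (invented)
})
# Synthetic response with a sign-changing interaction, purely for illustration
df["harvest_yield"] = (
    50 + 0.4 * df["component_A"]
    - 0.3 * df["seed_inoculum"] * df["component_A"]
    + rng.normal(0, 2, n)
)

# '*' expands to both main effects plus the interaction term
model = smf.ols("harvest_yield ~ seed_inoculum * component_A + temperature", data=df).fit()
print(model.summary())

# The interaction coefficient describes how the effect of component_A changes
# as seed_inoculum increases -- the 'turnaround' seen in the profiler.
print(model.params["seed_inoculum:component_A"])
```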
That talk and the related videos in the community explain in detail how you can use it and the statistical details behind it. As you have probably already noticed, when you move the parameters around, the traces here move in a way they didn't before, and that really gives you an advantage in using the model: you will not go outside of your design space, and you're not going to draw a wrong conclusion from your data. You can even see that JMP cuts off the traces where it thinks there is not enough data to extrapolate from. So for a limited design space, Extrapolation Control is a very neat way to go, and we're going to use it later. Next we need to save our prediction formula. I save it here: Save Columns, Prediction Formula, and then it turns up in the data table down here, because we're going to need it. Let me get back to my presentation. Just to give a short summary: the models shown revealed expected behavior and, most importantly, unexpected behavior that we can use for the next runs. The extrapolation control comes in very handy, especially with a limited design space, and we were able to find optimized parameter settings within the design space. So we identified potential parameter settings for a higher yield in the next runs. A nice side note on how to assess stability: I normally prefer to use the simulator here; you can simply add it to the profiler. I also like to use the contour plot, especially when you have multiple responses. So not only the titer, but imagine you also want to put some DSP performance into your model; then you put them all in a contour plot, limit the areas where you want the process to be, and you get the target region where you need to run your process to stay within all the specs. Especially here, because of the limited design space, you have to verify the results with additional experimental runs, because we were on the limit: the optima we chose with the profiler lie on the boundary of the model, so you have to make sure they are correct. In this case, we just used them to guide us in the right direction for the parameter settings. At this point, we have the profiler with the harvest yield as a response over the fermentation parameters. Now, I already hinted at it: wouldn't it be great to have all your online sensors in there too, so you can see other effects as well? Let's go there. The next step is online data analysis with the Functional Data Explorer platform. There you are able to use the same set of fermentation parameters, model the functional data, the online response data, and combine that with your model. To put it in a graphic: we have the fermentation parameters and a whole bunch of sensor data; I depicted all the batches here. From that data, we want to get to a profiler where the lower row is the model I just showed you and the other responses are there as well, so we can see the typical response for the set of parameters we've chosen. I drafted a short workflow of what we're going to do, and I'm going to show it in JMP, of course. We basically use what we did already, the Fit Model platform, and save the prediction formula. Then we use the FDE platform, fit splines, do the functional DOE, save that as well, and then put everything together in one place. Let's get back to JMP. Here is the data set; I normalized all the sensor data to a zero-to-one range.
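As an aside, that zero-to-one scaling is nothing JMP-specific. A minimal sketch of the idea, using an invented long-format table with two sensors, looks like this:

```python
# Sketch of zero-to-one scaling of online sensor columns (invented data and layout).
import pandas as pd

# Long format: one row per batch, time point and sensor reading
online = pd.DataFrame({
    "batch_id": ["B1", "B1", "B1", "B2", "B2", "B2"],
    "hours":    [0, 12, 24, 0, 12, 24],
    "pO2":      [95.0, 40.0, 20.0, 98.0, 55.0, 35.0],
    "CO2":      [0.1, 1.8, 2.5, 0.1, 1.2, 2.0],
})

def scale01(s: pd.Series) -> pd.Series:
    """Scale one sensor column to the 0-1 range (per-batch scaling is an alternative)."""
    return (s - s.min()) / (s.max() - s.min())

online[["pO2", "CO2"]] = online[["pO2", "CO2"]].apply(scale01)
print(online)
```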
I want to show you an example for the pO2, that is, dissolved oxygen. This is an important parameter in fermentation because the microorganism in this case needs oxygen, so having sufficient oxygen at all times is important. Essentially, if you want to get people to use JMP, I always like to stress Graph Builder, because it's just the easiest way to put data together and make nice graphs. Using other software, this would probably have taken me much longer than the 30 seconds it took here, and it's a very nice way to get an overview. Where do we want to go? We want to have another response here, the pO2. We want to collect all the information we have across all those batches and see what the typical curve looks like for the set of fermentation parameters we chose. The Functional Data Explorer platform is here under Specialized Modeling. You need your outputs, which are functional; I'm just going to take two here, pO2 and CO2. On the X, you put the duration, and you have to tell JMP how to distinguish between the batches, so you put your Batch ID here. Now, it's really important to add as supplementary variables everything you will need later: the fermentation parameters, or even other things like the harvest yield. Anything you want to do analysis on has to go in as supplementary at this point; if you forget something here, you have to redo a lot of work later. Then you get the first report of the Functional Data Explorer. JMP puts everything together here in one graph, which is not so nice, but here you can see all the data by Batch ID. I just want to point out that there are nice clean-up tools here; have a look at them and use them. You don't have to clean up your data beforehand, and you can even fill in your data here very easily. Now that we have the data, we have to build a functional model of it, and JMP does that with splines. There are different kinds of splines you can use. I'm first going to go to the simplest one, the B-spline. That is pretty fast, and you can see what it does. First, you see here in red the splines that were fitted over the actual data. Second, it does a functional PCA down here. PCA is basically about reducing dimensionality: JMP produces a set of eigenfunctions and a set of FPCs, functional principal components. If you multiply each FPC with its corresponding eigenfunction and add these up, you end up with a functional model, so you can use that to model each fermenter batch individually. The FPCs are the individual characteristics of a batch, and the eigenfunctions are valid for all of them; multiplying them gives you each individual curve. It's important to understand that the FPCs are the characteristics of the functional data you have. Now, looking into this, you basically see that this spline doesn't capture the data very well. I know that the ups and downs here are important, and this fit doesn't suit me. So I'm going to remove this fit and try something else. There are other models: a P-spline is a penalized B-spline model, and that will do the job here. I just want to point out that there are other ways as well. A direct functional PCA does it without first fitting a basis function; if you have a huge data set, that works faster. In my case, I already know from my data set that the P-spline is what I'm going to need. You can already see that this takes some time.
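While that fit runs, here is a rough conceptual sketch, outside JMP, of what the spline-plus-FPCA step is doing: smooth each batch curve, then run a PCA across batches to obtain eigenfunctions and per-batch FPC scores. All data below are simulated, and JMP's actual algorithms differ in detail; this only illustrates the idea.

```python
# Conceptual sketch of functional PCA on batch sensor curves (simulated data).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
t = np.linspace(0, 48, 97)                       # common time grid, hours
n_batches = 8

curves = []
for i in range(n_batches):
    shift = rng.uniform(-4, 4)
    raw = 100 / (1 + np.exp(-(t - 24 - shift) / 4)) + rng.normal(0, 2, t.size)
    smooth = UnivariateSpline(t, raw, s=len(t) * 4)(t)   # spline smoothing per batch
    curves.append(smooth)
X = np.vstack(curves)                            # rows: batches, columns: time points

mean_curve = X.mean(axis=0)
Xc = X - mean_curve                              # center on the mean curve
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenfunctions = Vt                              # rows: eigenfunctions over time
fpc_scores = U * S                               # one row of FPC scores per batch

# Reconstruct batch 0 from the mean curve plus its first two FPCs
recon = mean_curve + fpc_scores[0, :2] @ eigenfunctions[:2]
print(np.round(fpc_scores[:, :2], 2))            # batch-level characteristics
print(np.round(recon[:3], 1))                    # start of the reconstructed curve
```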
So think about maybe some data reduction that you can get faster through that process if it's important. But be aware that if you reduce your data, you will be losing information about your process and maybe those will be the important ones. So maybe just take your time and just wait for some minutes, grab a coffee. So you can see here already that those lines do fit very good. So this is something I was concerned with to use. Of course, you can look at the diagnostic plots and see if everything suits you well. I will focus on that where I wanted to go originally. Keep in mind, where did you want to go? Wanted to add a pO2 curve here depending on the fermentation parameters. JMP is going to give the option to do a functional DOE analysis. That is exactly where we want to go. Basically in the background, it does a generalized progression with a two- degree factorial model. And the estimation method, it usually depends on the amount of parameters you have. It's either best subset or I think it's a forward selection. Let me make that smaller. Don't need this. But we do need this. That's where we wanted to go. So exactly what we need. We have the pO2 , the online response, dependency of our parameters. Just as a side point, if you don't see this modelization fitting, you have to do that maybe more by using the fit model platform. But if it's fine for you, you can go with that. Again, you have to save your prediction formula as you want to put that together. Be aware that I'm now in different data tables than I used before. So you can hit save prediction formula, click, and then you 're going to end up with your prediction formula here in that table. And my original model was in another data type. So you have to just combine them whatever you like, however you like. I just basically copied the formula and that was it. Then, let me show you if you want to put them together. I put them together in my original table. I have them both here now: the prediction formula of the harvest yield and the pO2 of the prediction formula. And how to put that together? Very easy. Use the profiler, put the function, and very important, tick the expand immediate formulas because you have that Eigen function there, so you need to expand the immediate formulas. Click okay. I, again, prepared a strip because the original evaluation, that's not so colorful. So I prefer coloring at some point. So I just put those two in here. So this is where we wanted to go. We have our harvest here and we have our pO2, and we have all our parameters in here for the dependencies. Now I promised you that we're going to see more. That is the seed inoculum in the component A we looked before, if you remember, and we saw that for low biomass we have a positive impact of component A in the main fermenter. And having more biomass, that behavior turns and now we are able to see what's happening. You see here that is the typical pO2 response to that set of parameters. And already going down here, you can see what happens. I'm going to go here a little bit more down. The pO2 goes down. So in the fermenter, we don't have a limit. Now of component A, we have a limit of oxygen. That is just the reason why that is not good for the process. Now with modeling your online data and your offline data, you can go one step deeper to understand what exactly causes all your behavior. That is a really great example how you can approach the understanding, especially the speed- up of your process development. Going back to my presentation. 
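The functional DOE step just shown can be sketched in the same conceptual spirit: fit one small regression per FPC score against the process factors, then rebuild a predicted sensor curve for a chosen factor setting. The stand-in arrays below take the place of the FDE outputs, and plain least squares is used for brevity, whereas the talk describes generalized regression with a two-degree factorial model in JMP.

```python
# Conceptual sketch of the functional DOE idea (not JMP's implementation).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_batches, n_time, n_fpc = 8, 97, 2
t = np.linspace(0, 48, n_time)

# Stand-ins for the FDE outputs (in practice these come from the spline/FPCA step)
mean_curve = 100 / (1 + np.exp(-(t - 24) / 4))
eigenfunctions = np.vstack([np.sin(np.pi * t / 48), np.cos(np.pi * t / 48)])
fpc_scores = rng.normal(0, 1, size=(n_batches, n_fpc))

factors = rng.uniform(0, 1, size=(n_batches, 3))  # invented process factors per batch

# One regression per FPC: score_k ~ process factors
score_models = [LinearRegression().fit(factors, fpc_scores[:, k]) for k in range(n_fpc)]

# Predicted curve for a new setting = mean curve + sum(predicted score_k * eigenfunction_k)
new_setting = np.array([[0.5, 0.8, 0.2]])
pred_scores = np.array([m.predict(new_setting)[0] for m in score_models])
pred_curve = mean_curve + pred_scores @ eigenfunctions
print(np.round(pred_curve[:5], 1))
```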
Basically, that is just what I showed you, just with more parameters. I added more online responses and also some DSP responses in here. You can put that at whatever extent you like. But be aware, here is no extrapolation control. There are different models behind each response. So JMP cannot put that together into extrapolation control. So here you might limit your factors over the factor settings of the profiler that you stay within in your design space. You can use JMP Standard to look at the profiles. You just have to do the analysis with JMP Pro, the FDE analysis. Okay, one more thing. Wouldn't it be great now to know at which time points an online sensor has a critical influence on your yield? So that you basically have the sensor as an input parameter and your yield on this side. So you can exactly say, "Okay, at this point, it's very critical to have my pO2 up or down to have a good harvest yield." Let's go. Next level. Now, I want to use the online data to model the harvest yield. I wanted to, before we start, put that graphically again and on this side, we have all the pO2. The graph I already showed you before. That's the pO2 of all the batches. And you can see that in the end here, we have this one's going up; the pO2 sensor here, the pO2 is very low; and here it's somewhere in the lower region of the sensor. And in each of these cases, the harvest yield has different behavior. So somehow the individual curves at this time point have influence on the harvest yield. That is exactly what I want to model so you can say which sensor profile really leads to a good yield. This is where I want to go. So I can have all the responses here and my harvest yield of course depicted and modelized over, this case, the FPCs. Remember that I told that the FPCs are the individual characteristic of the response. That is exactly what you do. So I'm going to show that to you in JMP as well. We can start working through that workflow and basically, it's the same but you just use the data in a different way. You use your online data, the summaries of the FPCs to model the generalized regression. Then you put everything again together, save the to prediction formulas in one data table, and there you go. Okay, where did we leave? So this is where we stopped the last time. We had our pO2 over the fermentation parameters. We already have the model of pO2 over the FPCs. Now what we need is the individual function summaries. So we need to modelize the FPCs we have for that response with the harvest yield because that is what I want to know. How do the F PCs contribute to the harvest yield? Now you can set here what you want to export. Basically, tick all of this. You need to save formulas because you want to have the formula of the pO 2 dependent of the FPC, and you want to have all the FPCs. Now you have to extract that data because you want to modelize that, and either save data here or if you have modelize more than one response, you can do that up here in save data or save the summaries. Then you get everything neatly arranged in one data table. Click that and then you will end up with a table like this. So you basically have all your FPCs in here and the formulas you will need here are there. Now we have to model the FPCs to get dependency to the harvest yield. You just go to the fit model platform, then you will choose what you want to model and that is your... It's supplemented. That is why I hinted that you really have to be aware of what you supplement in the very first step of the FTE. 
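Conceptually, the modeling step described here, using the per-batch FPC scores of an online sensor as predictors for an offline response such as harvest yield, looks like the following sketch. The values are simulated; in the talk this is done with Generalized Regression in JMP Pro rather than the plain least-squares fit shown here.

```python
# Sketch: offline response (harvest yield) modeled on FPC scores of an online sensor.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
n_batches = 20
fpc_scores = rng.normal(0, 1, size=(n_batches, 2))            # FPC1, FPC2 per batch
harvest_yield = (80 + 5 * fpc_scores[:, 0] - 3 * fpc_scores[:, 1] ** 2
                 + rng.normal(0, 1, n_batches))

# Response-surface style terms: main effects, squares and the interaction
rsm = PolynomialFeatures(degree=2, include_bias=False)
X = rsm.fit_transform(fpc_scores)

model = LinearRegression().fit(X, harvest_yield)
print(dict(zip(rsm.get_feature_names_out(["FPC1", "FPC2"]), model.coef_.round(2))))
```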
Now we're going to need the harvest yield here. It has to be supplemented. And then, depending on how you want to approach it, I normally take the FPCs. I use the response surface model. That might be sufficient for you. If you think you need another model, you're free to use whatever modelization you seem necessary to. In this case, I've put them individually. Sometimes, even maybe just modeling the mean would be enough. The mean here, maybe just modeling the mean is enough. I know that in my case it isn't. By the way, at this point, I would like to say thank you to Imanuel Julio, JMP engineer, who helped me out in going through that procedure as well. Thanks, Imanuel, for giving me a heads up for FTE. We need a generalized regression here. Hit run. And then, that dialog opens where you can choose different estimation methods. Basically, you could just try them. Then, the model is going to be done. The nice thing is that up here you're going to have a model comparison. So if you put in more than one, you will directly see the comparison of them. So how they behave and which has the best information criterion. Then, you have to choose whatever you want to go with. Be aware that something like best subset can take some computational time. So maybe go to the see how far that should go. You chose, then, your model. You do, then, save the prediction formula of that estimation method you chose. You hit that and then JMP will put that here. Then you have everything together that you need, this time in one place. You're going to take the profiler, put in your prediction formula for the harvest yield, and of course, of your online responses as well, add them here depending on how many you have modeled. Expand immediate formulas, as well, and click okay. I, again, scripted that, having it a little bit nicer. And then this comes up. You have your prediction formulas for pO2, your online responses and your harvest yield; dependency of the FPCs down here. I will just show you a short... If you turn on the FPC 1, you can see what happens. So if you have a lower harvest yield, then it shows you that here, you are basically on a rather low level. We already learned that low levels of oxygen are not so good for fermentation process. So that can be seen here as well. And if you have high yield you have no more harvest, more tighter, more product than your fermenter, you're going to see basically, this goes up. That's good. Same behavior here. And you can already see here that here goes up. So it's positive for your process that the pO2 goes up in the end. Now this is the point where, at least, I have to go to the subject matter expert and to give him that data because he is the one who understands that this level of analysis may be not good for a manager. But here you can really go into detail with all the subject matter expert on the, let's say, fermentation level. And having this option, this really speeds up your process, understanding very much. And you can see the impact with the plots JMP gives you very easily. And sharing that profiler with a process engineer gives you a real head start in what is important during the course of the process, and gives you really the opportunity to save a lot of time in finding that out. And if you do that by chance, find it out by chance, that' s going to take so much longer, so don't do that. I'm heading towards the end of my talk. 
For the statistical analysis of fermentation data, JMP really gives you the power to explore and visualize these complex processes very easily. You can deepen your process understanding: which process parameters are important, which interact, and which do I have to look at, at which time points of the process. And with the profilers and the different setups, JMP gives you the possibility to speak that one mutual language across all levels, from technician to manager, so that really everybody can make more from the online and offline fermentation data and speed up your biotech process development. Thanks for listening. I'm Benjamin Fürst from Clariant; feel free to comment and ask questions through the beta channels.
After nearly two years, our experience during the COVID-19 pandemic has made us all experts in knowing the risk factors for developing a severe case. For example: male, advanced age, obese, hypertensive -- right? Well, it depends. When we analyzed data from the subgroup of the most endangered patients -- those who were already hospitalized and in critical condition -- we discovered some surprising differences with respect to the common risk factors. With a binary response (recovered/dead), fitting a logistic regression model seemed to be a reasonable approach. Due to the high dimensionality of the data, we used a penalized (Lasso) regression to select the most relevant risk factors. In our talk, we briefly introduce penalized regression techniques in JMP and present our results for critically ill COVID-19 patients.     Hello everyone. Thanks for tuning in to my talk. I'm David Meintrup, Professor at Ingolstadt University of Applied Sciences. Today, I will talk about A Lasso Regression to Detect Risk Factors for Fatal Outcomes in Critically Ill COVID-19 Patients. Over the last two years, something started to increasingly bother me, and it is not what you probably think now, the pandemic, or at least not only. The topic is connected to this. This is me giving a talk on deep learning and artificial intelligence at the Discovery Summit 2019 in the wonderful city of Copenhagen. Since then, it looks like AI has become the universal tool for everything, so let me give you an example. In May 2021, the Director-General of the World Health Organization said the following: "One of the lessons of COVID-19 is that the world needs a significant leap forward in data analysis. This requires harnessing the potential of advanced technologies such as artificial intelligence." To me, this feels a bit like: forget about the scientific method, defining the goal, specifying the tools, stating the hypothesis, et cetera. Just drop the magic words artificial intelligence and you are on the good side. So therefore, I decided to give this talk in the form of a dialogue between an AI enthusiast on the left side and a statistician on the right side. I'm going to talk a bit more specifically about statistical models, artificial intelligence, and penalized regression. Then, in the second part of the talk, I'm actually going to present the case study about the critically ill COVID-19 patients. So let's get started. Here's the first question from our AI enthusiast: in the era of artificial intelligence and deep learning, who needs statistical regression models? Here's my short answer, which I borrowed from Juan Lavista, Vice President at Microsoft: "When we raise money, it's AI, when we hire, it's machine learning, and when we do the work, it's logistic regression." I love this tweet because, in my opinion at least, it condenses a lot of truth into a very short statement. AI has become a universal marketing tool, but for the real problems, we still use traditional advanced statistical methods. A slightly longer answer could be the following. If I look at the typical tasks of engineers and scientists, they include innovate, understand, improve, and predict. Deep learning and artificial intelligence is mainly a prediction tool; for everything else, we still need advanced statistical methods like traditional machine learning, statistical modeling, and design of experiments. Okay, but you have to admit that there are very successful applications of AI and deep learning.
Well, there's absolutely no doubt about that. For example, predicting the next move in a game like chess and Go. The deep- learning algorithms do this way better than any human being. Or I would like to introduce my favorite artificial intelligence application, which is solving the protein folding problem. The protein folding problem has famously been introduced 1972 by the Nobel Prize winner, Christian Anfinsen, who said in his acceptance speech that a protein's amino acid sequence should fully determine its 3D structure. And over the last 50 years, this problem has basically been unsolved. And there was very little progress until DeepMind by Google developed an AI- based algorithm called AlphaF old that to a very large extent solved the protein folding problem. And you see two examples of this on the right side. This is very impressive and beautiful work. And I included it here because I wanted to clarify that this is the perfect deep- learning AI problem. We have a vast amount of data, we have a combinatorial explosion of options, and the result we are looking for is a prediction, the actual 3D structure of the protein. So for predictions, we should always use AI. Not at all. For example, in the dataset that I will present about the critically ill COVID-19 patients, we want the model to predict if the patient will survive or will die. But the pure prediction doesn't really help. If you know someone is going to die, what you want is you want to treat, you want to prevent, you want to know the risk factors, you want to be able to act and not just simply predicting death or survival. So what we need is a really interpretable model that will hint these things that we need like treatment, prevention, and risk factors. Another way of looking at it, let's have a look at typical data- driven modeling strategies. What is very typically done in deep- learning AI environment is that you take all available data, you throw it in the deep- learning AI algorithm. And it might predict very well the outcome that you're looking for, but you're getting a fully non-interpretable model. What's an alternative? An alternative is to already in the data collection process think carefully what data do you need with advanced statistical methods. Then apply a statistical model, and as a result, you get a fully interpretable model. Okay, says our AI enthusiast, But you are missing an important point here. Statistical models might be nice for small data sets, but for big data, they can't be used, right? Well, no. For large data sets, there are several intelligent ways to reduce the dimensionality before you start with the fitting process of the model. And I would like to introduce, at least shortly, three of these intelligent ways to reduce the model dimensionality. Number one, redundancy analysis, something you might have heard about. If you have a large data set, the price you pay is typically that the factors are highly correlated and you can measure the amount of correlation within the set of factors by the value that is called variance inflation factor. And then you can actually eliminate the factors with the highest variance inflation factors. Why? Because they don't add additional information to the set of factors that you are already looking at. So this is one classic way of reducing the dimensionality of your data. If you have categorical data, for example, you look at an X- ray of the lung and you can see different symptoms. 
Then you can have variables that describe, this symptom was there, this symptom was there, or another symptom was there, X 1, X 2, X 3. Maybe for your analysis, it's enough to distinguish a normal- looking lung and a lung that has some symptoms somewhere. In other words, you convert a row only with zeros to a zero. And if there's at least one one, you change it to a one, and you create a new variable catching this information. Or to give you an alternative, you could sum X 1, X 2, and X 3, and count the number of symptoms that you see on an X- ray. This procedure is called scoring and is a very efficient way of reducing dimensionality. Principal component analysis has exactly the same spirit. It recombines continuous variables. It takes a linear combination of continuous variables with the idea of catching the variation in one newly created variable. I call these dimensionality reduction methods intelligent, because when you apply them, you already learn something about your data. And that's the whole purpose of statistics, isn't it? Learning things from your data. So let me summarize the advantages of statistical models. First, they can be used for all kinds of tasks, not only for predictions. Second, the model itself is useful and fully interpretable. And third, you can start in a large dataset with intelligent dimensionality reduction before you actually fit the model. I'm still not convinced. Can you give me an example of a statistical model that you applied to a large dataset? Okay, so let's introduce logistic Lasso regression. Lasso is an abbreviation for least absolute shrinkage and selection operator, and why it is called like that, I will explain in a few moments. Let's introduce this Lasso regression in four steps. Step number one is to remind ourselves of the logistic regression model. In a logistic regression model, we have a categorical, in the easiest case, two- level factor... Sorry, a categorical response with two levels, and we have a goal to model the occurrence probability of the event. This is typically done with this S- shaped function that corresponds to the probability of the event actually occurring. The functional term is given here to the left, but the good news is that with an easy transformation that is called logit transformation, you can turn the original values into log it values, and then the result is a simple linear regression on the logit values. So the bottom line is logistic regression, is simply linear regression on logit values. Step number two. This is a classic situation of a two- factor linear regression model. And how do we fit this to a data cloud? Well, we do this with the help of a loss function. For example, we take the sum of squared errors, and then we look for the minimum of this function. This is the very famous and standard ordinary square estimator that is the result of minimizing specifically this loss function given by the sum of squared errors. Thirdly, something that is maybe less known, I would like to introduce concept f rom mathematics the norm of a vector that is actually just a representation of the notion of the distance of a length of a vector. Let's look at three examples. The first one that you see here in the middle is the classic distance that you all know and use. This is the Euclidean distance. It's calculated by taking the squares of the coordinates and then taking the square root. The unit circle as you know represents all points that have a distance one, from the center. This is the classic Euclidean norm. 
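Written out for reference, using the two-factor notation of the slides, the pieces introduced so far are the following (the notation is added here; it is not shown in the transcript):

```latex
% Logistic model and logit transform: p is the occurrence probability of the event
p(x_1, x_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}},
\qquad
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2

% Ordinary least-squares loss function
L(\beta) = \sum_{i} \bigl(y_i - \hat{y}_i\bigr)^2

% Euclidean (L_2) norm, whose unit circle is the classic circle
\lVert x \rVert_2 = \sqrt{x_1^2 + x_2^2}
```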
We can simplify this calculation by simply taking the sum of absolute values. So instead of taking a square and taking a square root, we simply sum the absolute values. This is called the L_1 norm. And what you see here, this diamond is the representation of the unit circle of this L_1 norm. In other words, all the points here on this diamond have distance one , if you measure distance with the L _1 norm. Finally, the so- called maximum norm where you continue to simplify. You just take the larger value of the two absolute values, x1 and x2 . If you think about what the unit circle is in this case, it will actually turn into a square. This square is the unit circle for the maximum value. So in summary, we can measure distance in different ways in mathematics, and what you see here, the diamond, the actual circle or square are unit circles. So points with distance one, just measured with three different norms, three distances, three different notions of what a length is. Finally, let's combine everything we've done so far. So we start with the logistic regression model. We add the loss function. And now, instead of taking the ordinary square loss function, we add an additional term. And this term consists basically of the L_1 norm of the parameter. You see that we add the absolute values, Beta_1 and Beta_2? So this is the L_1 norm of the parameters that we add to the loss function. Of course, this is just one choice. You could also square these. Then, you would get what is called a Ridge regression if you take the square of the parameters. This first one here, the top one that we are going to continue to use is called the Lasso or L_1 regression because this term here is simply the L_1 norm of the Beta, of the parameter vector. Now, overall, what this means is that you punish the loss function for choosing large Beta values. And this is why this penalty that you introduce leads to the term penalized loss functions. So if you have a punishment, a penalty for large Beta vectors, then instead of doing ordinary least squares, you do a penalized regression. Now, let's look a little bit closer to the effect that penalizing has. So this is once again the penalized loss function with this additional term here. Now, the graph that you see here is independent of the so- called tuning parameter, Lambda. The larger Lambda is, the more weight this term has, and the more it will force the Beta values to be small. This is why you see that the parameters shrink. And this is why this whole procedure is called absolute shrinkage. Secondly, in this graph, you can consider this area here, this diamond as the budget that you have for the sum of the absolute values of Beta. And on these ellipses, the residual sum of square is constant. So you're looking for the smallest residual sum of square within the budget. This in the case drawn here leads to this point here. And due to the shape of this diamond, these two will typically connect in a corner of the diamond. And what this means is that the corresponding parameter is set precisely to zero. And this is why this method is also good for selection because setting this parameter zero means nothing else than kicking it out of the model. So this is in summary why we call the L _1 regression Lasso. It has a shrinkage element and a selection element due to these two described features. One last practical aspect of Lasso regression is about the tuning parameter, this Lambda here. How do you choose it? 
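Before that question is answered, the penalized loss functions just described can be written out for two parameters as follows (strictly speaking, in the logistic case the sum of squares is replaced by the negative log-likelihood, a detail the slides gloss over):

```latex
% L_1 norm and maximum norm from the slides
\lVert \beta \rVert_1 = \lvert\beta_1\rvert + \lvert\beta_2\rvert,
\qquad
\lVert x \rVert_\infty = \max\bigl(\lvert x_1\rvert, \lvert x_2\rvert\bigr)

% Penalized loss functions: Lasso (L_1 penalty) and Ridge (squared penalty)
L_{\text{Lasso}}(\beta) = \sum_i \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \lVert \beta \rVert_1,
\qquad
L_{\text{Ridge}}(\beta) = \sum_i \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \lVert \beta \rVert_2^2
```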
Well, one very common approach is the following: you use a validation method, for example the Akaike information criterion, and you plot the dependency of the AIC on Lambda. Then you can pick a Lambda value that gives you a minimal AIC. On the left side, you see how the parameters shrink, and you can see the blue lines that correspond to the parameters that are still non-zero, while the other ones have already been forced to zero. I'm still not convinced. Can you show me a concrete case study? Of course. The data that I'm going to present here consist of 739 critically ill ICU patients with COVID-19, collected at the beginning of the pandemic between March and October 2020. We have one binary response with the levels recovered and dead, and we have 43 factors: lab values, vitals, pre-existing conditions, et cetera. This is the data that we are going to analyze now. Here you see the dataset. It has 44 columns, as you can see down here, and 739 patients. Now, let's familiarize ourselves a little with the data. We have the last known status, recovered or dead, age, gender, and BMI. Then we have additional baseline values, comorbidities, vitals, lab values, symptoms, and CT results. Let's look at some distributions. This is the distribution of the last known status; you can see that, unfortunately, 46 percent of these patients died. We see here the skewed age distribution with an emphasis above 60, and you can see that roughly 70 percent of the patients are male. Have a look at the additional baseline values. You can see here the body mass index, and you will notice a tendency toward a high body mass index above 25. We have quite a lot of ACE and AT inhibitors, and also of statins, so treatments for blood pressure and cholesterol, and some immunosuppressives. Next, comorbidities. You see that almost two-thirds are hypertensive, we have quite a significant amount of cardiovascular disease and of pulmonary disease, and about 30 percent of our patients have diabetes. Now, for the remaining four groups, vitals, lab values, symptoms, and CT, I'm going to show you one representative from each so that you can get a feeling for how these values are distributed. This is the respiratory rate; these are the numbers of lymphocytes. So that's a vital parameter and a lab parameter. Here you have a symptom, severe liver failure, which can occur in the ICU. And this is a CT result, areas of consolidation that can be seen on the CT of the lung. Okay, now we are ready and we are actually going to fit the model. I go to Analyze, Fit Model. I take the last known status as the response, I throw in everything else as factors, and I go to Generalized Regression, which is going to perform the Lasso regression. You can see here that the Lasso estimation method is already preselected. If I click on Go, the procedure finishes very quickly, and this is the result; this is the screenshot that you already saw on the slide. Now, I'm not going to work with this model, for the following reason. If I go up to the Model Comparison section, I see that I have 30 parameters in this model, so this model is still very big. If I go back down here, I can see that my AIC doesn't change much if I move further to the left. Instead of doing this manually, I'm going to change the settings in JMP so that it doesn't take the best fit, the minimal AIC, but instead the smallest model within the yellow zone.
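Outside JMP, the same "fit a Lasso path, then pick the tuning parameter" idea can be sketched roughly as below. The data are simulated, and the AIC uses the common approximation that the degrees of freedom equal the number of non-zero coefficients; this is only an illustration of the concept, not of JMP's Generalized Regression platform.

```python
# Sketch: L1-penalized logistic regression over a path of penalties, picked by AIC.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [1.2, -0.8, 0.6]                 # only three real effects
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

results = []
for C in np.logspace(-2, 1, 20):                 # C = 1 / lambda
    m = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    prob = np.clip(m.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    k = np.count_nonzero(m.coef_) + 1            # non-zero terms plus intercept
    results.append((2 * k - 2 * loglik, C, np.count_nonzero(m.coef_)))

aic, best_C, n_terms = min(results)              # smallest AIC along the path
print(f"best C={best_C:.3f} (lambda={1 / best_C:.3f}), AIC={aic:.1f}, "
      f"{n_terms} factors kept out of {p}")
```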
Changing this is something I can do in the Model Launch. I open the Advanced Controls and change Best Fit to Smallest in Yellow Zone. I click on Go, and now I actually have a very nice model with 16 parameters. This is the model I'm going to use. To be able to show you which factors are in this model, I'm just going to select them with Select Nonzero Terms. Now I have these 16 selected, and I can put them into a logistic regression and activate the odds ratios. At the top, we see the 16 effects in our model, and below, we can see the odds ratios. For example, the odds ratio for age is 1.07 per year. If you take this to the power of 10, it gives you roughly a value of two, which means that with 10 more years of age, your odds ratio approximately doubles: your chance of dying is twice as high as before. And below here, we have the odds ratios for the categorical variables. For example, coronary cardiovascular disease has an odds ratio of 1.62. Now, I would like to point out some of these results. Let's first look at factors that are well known from the general population. Here, you see the dependency of the last known status on age, and you can see how increasing age increases your chance of dying very significantly; as I said before, the odds ratio over 10 years is roughly two. On the left side, you see pulmonary disease and cardiovascular disease, which both also have a significant effect on the risk of dying from COVID-19. These factors are also valid in the general population. Now, more interestingly, we find that these three factors are not in our model: gender, BMI, and hypertension are not part of our model. How is that possible? Well, it's critical to remember that our population consists of ICU patients; they are already critically ill. We have 72 percent male patients, almost 80 percent with a BMI over 26, and two-thirds are hypertensive. So these factors do highly increase your risk of a critical course of the COVID-19 disease, but once you are critically ill, at that point they don't matter anymore. That was a very important result for us: which factors carry over from the general population, and which factors lose their importance once you are already critically ill? Finally, I would like to point out one more aspect, which is statins. Statins were entirely insignificant, with a p-value here of 25 percent, when you look at them univariately. But in our multifactorial model, they were highly significant, and as you can see here, the odds ratio is below one, reducing the risk of mortality. So this is a very important lesson. Sometimes people choose risk factors in a large dataset by first looking at them univariately. If you did that, you would be guaranteed to miss statins, because univariately they are completely irrelevant. But in our multifactorial model, we could show that statins have a protective effect against dying from COVID-19 once you are critically ill. This finding was later confirmed by others; just as an example, I included a meta-analysis from September 2021 that indeed confirms that statins reduce patient mortality, a very large meta-analysis with almost 150,000 patients. If you're interested in more details, we published our work in the Journal of Clinical Medicine. I would like to take the opportunity to thank my co-authors, in particular Stefan Borgmann and Martina Nowak-Machen from the clinic in Ingolstadt.
And I would also like to thank you very much for your attention, and I'm looking forward to your questions. Thank you very much.
An ultra high performance liquid chromatography (UHPLC) measurement system analysis (MSA) optimization and validation study in a quality lab is presented. Process settings of the analysis method are established in order to maximize measurement accuracy and resolution of two organic compounds. There are seven control factors, and optimal DOE is used to specify how the experiments take into account the specified model and experimental criteria. It demonstrates why OFAT is not appropriate and how to decide between a custom DOE and a DSD based on DOE diagnostics such as power, effect correlation and variance profiles. Using stepwise regression, good predictive models are obtained that are supported by validation experiments. The Profiler desirability function is used to determine the optimal and robust UHPLC settings for measuring both compounds. The particular importance of the sensitivity indicator for improving robustness is shown.     Hello everyone. This presentation will be about measurement system analysis and optimization of a UHPLC measurement system. The presenters are myself, Frank Deruyck from HoGent University College of Applied Sciences, and Volker Kraft from the JMP Academic Program, who will take care of the demos showing the JMP tools for data analysis. Okay, the problem statement and description. This presentation was inspired by a student internship at a chemical company, and of course the material had to be kept confidential, so I will just talk about "the chemical company", and the figures have been modified a little, but that is no problem. What is the problem statement? SPC revealed significant batch-to-batch variation in a raw material, which has caused problems in product quality. There was an issue with the supplier, so it became necessary to analyze all supplied batches. One problem, however, was that the existing GC analysis procedure was too slow. A fast UHPLC analytical method was in development, but it was not ready for validation because of too much measurement variation. The goal of this study is to specify robust and optimal settings of the UHPLC method so that validation of the new method becomes possible. Thanks, Frank. Working with the JMP academic team for more than ten years now, we have helped many university professors worldwide get access to JMP licenses for teaching, but also to teaching resources like the case study library. At the link jmp.com/cases, professors get free access to more than 50 cases, each telling a story about a real-world problem and a step-by-step solution, including the data sets and exercises. What we present today is available as a series of three case studies, focusing on statistical process control, measurement systems analysis, and design of experiments. While Frank will talk about the problem and the solution they developed for a pharma company in Belgium, I will demo some of the analysis steps using JMP Pro. Let me say thank you to Frank for sharing these cases with the academic community, who really welcome such real-world examples coming from practitioners in industry. I also want to thank Murali, from our academic team in India, who plays a key role in enhancing our case study library, including the development of these cases together with Frank. Okay, here in this plot you can see the problem illustrated very clearly.
What is shown here is a plot of the measurements of the new, non-optimized UHPLC method as a function of the measurements of the standard GC method, which was very accurate and precise. You can clearly see that there are some problems. Different operators made measurements on different batches, and you can see that the prediction intervals are quite large; you sometimes see a range of over 100 milligrams per liter. You can also see that it is sometimes not clear whether a measurement is within specification, as on the left graph, and on the right graph you can see that there are also issues with accuracy, meaning there is a serious problem. First of all, we will explore the variation root causes using measurement system analysis, and then we will use DOE for optimization, according to the statistical thinking concept illustrated in the next slides. Here you see the statistical problem-solving process flow. For the cause of the problem, the UHPLC measurement error, we will tackle this with measurement system analysis, and to address the variation root causes we will of course use DOE, also to optimize the process settings of the UHPLC system. The method we will use for quantifying the variation sources is measurement system analysis, and I will show some theory. It is about quantifying the components of the total variance. Total variance means the variance across all measurements, by different operators on different products. We have two components: the product variation, sigma squared product, and the measurement variation. The measurement variation, very importantly, is itself decomposed into two components: the repeatability, the variance due to lack of precision in repeated measurements, and the variation between operators, sigma squared capital R, which is the reproducibility. A very important criterion stating that the measurement system is suitable for detecting variation in the process, the process variation, is that the percent Gauge R&R, which is the measurement error divided by the total error, should be less than 10 percent. So if the fraction of measurement error is below 10 percent of the total variation, then we can use the method for process follow-up. If that is not the case, if it is higher, then we run the risk that we will control our process on measurement variation, which is of course not a healthy situation. So it must be lower than 10 percent; that's the main criterion. Okay, let me go to the next slide. For this we will use a Gauge R&R study. A Gauge R&R study is essentially an experimental design: we select three random operators, John, Laura, and Sarah, who each perform measurements twice on four different batches. With that, we are able to quantify the within-operator variation, the repeatability; the between-operator variation, the reproducibility; and also the product variation, the variation between batches. So, Volker, I'll leave the floor to you now. Okay. Thank you, Frank. Before I come to MSA, I would like to briefly cover what's included in the first part of the series, namely control charts and process capability. This is one of the data sets, measuring the two compounds, our continuous responses, compound one and compound two, using the good but slow GC method.
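For reference, the decomposition Frank just described can be written compactly as follows. The percentage is written here with standard deviations, one common Gauge R&R convention; the talk itself leaves the exact convention open.

```latex
% sigma^2_r = repeatability, sigma^2_R = reproducibility, MS = measurement system
\sigma^2_{\text{total}} = \sigma^2_{\text{product}} + \sigma^2_{\text{MS}},
\qquad
\sigma^2_{\text{MS}} = \sigma^2_{r} + \sigma^2_{R}

\%\,\text{Gauge R\&R} = 100 \cdot \frac{\sigma_{\text{MS}}}{\sigma_{\text{total}}} < 10\%
```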
Okay, thank you, Frank. Before I come to the MSA, I would like to briefly cover what's included in the first part of the series, namely control charts and process capability. This is one of the data sets, measuring the two compounds, our continuous responses, compound one and compound two, using the good but slow GC method. The data have been collected over eight days, with two batches per day and for two different vendors, A and B. When the team started, the first activity was to check and confirm the normality of the data. For this, they looked at normal quantile plots, and they also fitted a normal distribution followed by a goodness-of-fit test, and there was nothing critical from that analysis. Exploring the distributions, the problem became clear: looking at data from different days, we can see a huge batch-to-batch variation, even for batches coming from the same day. This means that process monitoring is not possible, because the variability of this GC method was high and the method was too slow; it needed a lot of time to monitor the batches. Therefore one team activity was to work together with the vendors and others to reduce that variation. Another activity, described in the other parts of the study, investigated a faster method, as mentioned by Frank: the new UHPLC measurement method. Looking a bit more into the old GC method, the team looked at one-dimensional control charts and process capability for both compounds, and they also looked at multivariate and model-driven control charts. Here we see that there were some extreme points in the multidimensional analysis, and they could also see the contributions of the different compounds, or responses. This is the conclusion of the first investigation using the old method: looking at the process performance, both processes, for compound one and compound two, were incapable but stable, and for vendor A and vendor B we see that compound one was even unstable, following these colors here. All of this motivated the team to improve the measurement process, and this brings us to part two of the series, which is about analyzing the measurement system. These data were collected for all the combinations, repeated twice, between four batches and three operators, using the new UHPLC measurement method. The goal was to measure all batches of raw material using this faster method and perhaps to allow some inline monitoring of the raw material in the future. To get started with the new data, the team looked at a two-way ANOVA, and this output may fool you. For both compounds, compound one and compound two, the batch effect is highly significant, so that's good news. The operator effect and the interaction between batch and operator are non-significant, but the RMSE is quite high. That means we may be looking at data where the effects we are interested in are simply hidden by noise. So before you look at such an analysis, the first question should be: where is the variation in our data coming from? And second: are we really measuring the signal, or are we just measuring noise? To get an idea about this, a perfect visualization of the patterns of variation is a variability chart. For these two sources of variation, batch and operator, we see all the data points for all operators and all batches; we have two measurements per batch per operator. For each pair we see the mean, we also see the group mean for each operator, and we see the overall mean, which is the dotted line for all our measurements.
We can look at this for both compounds, of course. Here, for instance, for compound two, we see that Laura has quite high variation, at least compared to the other two operators. That's a visual analysis. The analysis method to use for better insight into the measurement system's performance is an MSA, or measurement systems analysis, and this was also done for the non-optimized UHPLC method. Here, for instance, for the first compound, we see the average chart, which shows the data together with the control limits, or noise band. What we see here is not good news at all, because our data fall within this noise band, so it will be really hard to detect any signal with that noise level. Another output is the parallelism plot. Here we can check interactions between batches and operators; an interaction would be indicated if some of these lines are not parallel. And this is the EMP method, which stands for Evaluating the Measurement Process; you probably know this as Gauge R&R output, which is what Frank mentioned. Here we see the signal, the product variation, but we also see the measurement variation split into repeatability and reproducibility. For the first compound, we seem to have an issue with repeatability, that is, the same operator doing the same measurement again. For the second compound, there is a slight issue with reproducibility as well, that is, measurements between different operators. So the conclusion here is that the measurement process is unable to detect any quality shift caused by significant systematic variation between our batches. And with that, I hand back to Frank.
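As a rough JSL sketch of the kind of variability chart Volker describes, assuming hypothetical column names standing in for the confidential data (this is not the case study's actual script):

    // Variability chart for one response, crossed by operator and batch.
    // :Compound 1, :Operator and :Batch are assumed column names.
    dt = Current Data Table();
    dt << Variability Chart(
        Y( :Compound 1 ),
        X( :Operator, :Batch ),
        Connect Cell Means( 1 ),
        Show Group Means( 1 ),
        Show Grand Mean( 1 )
    );
    // The EMP / Gauge R&R report (repeatability vs. reproducibility) is then
    // requested from the platform's red-triangle menu, as shown in the demo.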
Okay, thank you, Volker. I think we can move past the next slides, because that's what we just discussed; it's what you showed, Volker, and it's also in the slides. Okay, yes, here we can start, and we start with the design of experiments: the optimization study, the process improvement study. First of all, in order to better specify our experimental goals, we should do a root cause analysis for the high measurement error. After a brainstorm with the lab team, two main root causes came out, one linked to the equipment and one linked to the method. The one linked to the equipment was the main source of poor repeatability, because there was a very strong issue with unstable column temperature and eluent flow rate. The UHPLC uses a column, since it is a chromatographic technique, and an eluent to carry the compounds through the column and make separation possible. As a matter of fact, there was drift between different experiments and even within one experiment, and this of course results in poor repeatability. So the first task of the lab team was to stabilize the column temperature and the eluent flow rate, because for any experimental design we need fixed settings of temperature and eluent flow rate, which are of course two important factors. The second issue concerned the method: besides the stability problem, which was fixed, there was also an issue with low resolution because of non-optimized analysis process settings. The resolution was quite low and also unstable, meaning that small shifts in flow rate and temperature sometimes produced huge shifts in resolution, indicating not only an optimization problem but also a robustness problem. So the goal was to specify not only optimal settings but also robust settings: the variation of the resolution should be minimal under some variation around the settings of the analysis process. Let's now go to the DOE. The goal of the DOE is to model the response variable Y, the compound concentration in the standard samples, as a function of the UHPLC control factors. Once we have the equations, we can go to optimization. What is the optimization criterion? We will use the P over T ratio criterion from quality practice, meaning that the fraction of the measurement error in the tolerance range should be lower than 10 percent. The tolerance ranges of our compounds are specified: compound one in standard sample one has a target of 300 milligrams per liter, plus or minus 200 milligrams per liter, and compound two in standard sample two has a target of 450, plus or minus 150 milligrams per liter as spec limits. If we want to reach 10 percent of these specification ranges, that gives us our desirability targets for optimization: Y should match the target compound concentration to within 10 percent of the tolerance, meaning that for standard sample one the result should be 300 plus or minus 20 milligrams per liter, and for standard sample two the compound concentration should be 450 plus or minus 15 milligrams per liter. That's the criterion for optimization. As for our model, put together with the lab experts, the factors are the main effects and all quadratic effects. The main effects are the temperature of the column, with a range of 25 to 35 degrees Celsius; the eluent flow rate, 5 to 15 milligrams per milliliter; and also a gradient. What is the gradient? There is an additive, acetonitrile, in the eluent, and the concentration of this acetonitrile increases as a function of the volume that has flowed through the column: between volume zero and one milliliter its range is 5 to 20 percent, and once the volume is five to six milliliters, the range is 35 to 70 percent acetonitrile in the eluent. Another important factor is the UV wavelength; detection is by UV, and it should be controlled between 192 and 270. Brainstorming with the lab experts, who had already done quite a few preliminary experiments and had experience with the UHPLC, only two interaction effects were selected: temperature with eluent flow rate, and eluent flow rate with the gradient factors specified above. The design chosen to meet these goals and to estimate the model parameters was the custom design, and Volker will illustrate what that design was about.
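The precision-to-tolerance reasoning Frank uses can be written out; this is the standard P/T formula, applied to the tolerances quoted in the talk:

    P/T = \frac{6\,\sigma_{\mathrm{Gauge\ R\&R}}}{\mathrm{USL} - \mathrm{LSL}} \le 10\%

With a tolerance of plus or minus 200 mg/L for compound one (width 400) and plus or minus 150 mg/L for compound two (width 300), matching the target to within 10 percent of the tolerance gives exactly the working windows quoted above: 300 plus or minus 20 mg/L and 450 plus or minus 15 mg/L.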
Okay, thank you, Frank. Talking about the third part of this case study series, which is about designed experiments: what we learned so far is that we have to reduce the measurement variation caused by this non-optimized UHPLC method. The method can be described, as Frank pointed out, by these control or process settings, like temperature and so on. We also want our responses to remain within their limits, following the 10 percent rule given by Frank, and these limits are also added to our data. To design such an experiment, the team looked at a definitive screening design, this one here, and also at a custom design, both with 25 runs. To compare both designs, they used the Compare Designs platform, and you can see several reasons in favor of the custom design. For the main effects we see slightly better power for the DSD, but for the higher-order effects we see a really strong benefit for the custom design. The same holds when looking at the fraction of design space plot: the custom design is doing better. You can also look at the correlation maps, and finally, the efficiency is also in favor of the custom design. For those reasons, the team used a custom design for these studies. Here we have the completed data for the custom design, completed with both response measurements, and for these we also have the corresponding linear models, one for the first response, compound one, and one for compound two, both with their profilers. And here is a combined profiler with both responses at the initial, mid settings. By maximizing the desirability, I get to the optimal settings, and we see that we are matching both targets, 300 and 450 respectively, perfectly. However, we also see quite large sensitivity indicators; these are the purple triangles, and they tell us that at the optimal point our response surface is quite steep in some dimensions, which reduces the robustness of our process in the case of random variation of the process settings. This can be further analyzed by adding the simulator to the profiler, which is done here. The simulator defines the random variation, which was specified by our process experts. Just keeping the mid settings plus this random variation and simulating 10,000 response values, we see that all of our response values are out of spec; these are all defects. Of course, we are just at the mid settings, so nothing better is to be expected here. Switching to our optimal settings and simulating again, we now see that the expected defect rate is above 12 percent. From here the robustness can be further improved, either manually, using the profiler, the simulator, and the sensitivity indicators, or automatically, by running a simulation experiment, which is also built into these profilers. The team used a manual approach, and these red settings here are the robust settings they came up with. If I simulate again, we see that the defect rate now drops below one percent. This is a Monte Carlo simulation; it's all random, so the defect rate changes slightly with each new simulation, and you can also see how the histograms of our simulated response data behave quite well: they stay within our limits, which support the 10 percent rule, at these robust settings.
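To make the defect-rate idea concrete, here is a toy Monte Carlo in plain JSL that mirrors what the profiler's simulator does: perturb the factor settings with random variation and count out-of-spec results. The prediction formula, settings, and variation below are made-up placeholders, not the fitted model or the experts' values from the study.

    // Hypothetical response surface for compound one (placeholder coefficients).
    pred = Function( {temp, flow},
        300 + 2 * (temp - 30) + 5 * (flow - 10)
    );
    n = 10000;
    defects = 0;
    For( i = 1, i <= n, i++,
        t = 30 + Random Normal( 0, 0.5 );   // assumed random variation around the chosen temperature
        f = 10 + Random Normal( 0, 0.3 );   // assumed random variation around the chosen flow rate
        y = pred( t, f );
        If( (y < 280) | (y > 320), defects++ );  // 10-percent-of-tolerance window, 300 +/- 20
    );
    Show( defects / n );   // estimated defect rate at these settings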
So, going back to the other profiler: here I also have some contour plots, using the contour profiler, and they can be used to better understand the best regions for configuring the process; these are typically the white regions, which show the in-spec regions for a combination of two control factors. I hope you liked this journey, and with this I hand back to Frank to discuss the outcome. Okay, thank you, Volker. Now we can go to the validation experiment. Once we have optimized the settings, we can run an experiment to check whether the measurements of the new UHPLC system and the GC analysis are really equivalent. For this we again set up a Gauge R&R study, quite similar to the one discussed before, but now we also make measurements with the GC in order to compare the two measurement methods. You see that the instrument is included here as one extra factor in the Gauge R&R. Here are the results, and you can see that we made quite an improvement. In the Gauge R&R results, the main variation is now product variation; the batch variation is no longer obscured by measurement noise. The gray noise band is quite narrow compared to the one we had before, and that is very good news: it is mainly product variation. The Gauge R&R ratio now matches our target, meaning the precision-to-tolerance ratio here is about eight percent: the precision-to-tolerance ratio is six times the Gauge R&R figure divided by the tolerance of the compound, 400, which gives eight percent. So the precision is okay, and this measurement system is suitable for use in quality control for compound one. Nice. On the parallelism plot you see just a little crossing for Sarah, indicating perhaps a small interaction between batches and operators. For compound two, we see the same thing, even better: a precision-to-tolerance ratio of only five percent, the same very narrow noise range, and no major crossing of the lines. They are quite parallel, indicating no operator bias and no interactions. Modelling the compound one analysis, we see that it is mainly influenced by batch, with a small batch-by-operator interaction effect, and the same for compound two. We can now see this very small interaction effect because we have reduced the measurement noise so much that very small effects become visible. Before, we could not detect it, because of very poor experimental power; by reducing the experimental noise we have seriously increased the experimental power, so this interaction effect now becomes visible. It is a small one, linked to Sarah, the green line, but there was also a little problem in the GC analysis, not only in the UHPLC. That's an issue to be tackled later on. These two graphs illustrate fairly clearly that both measurement systems are nearly equivalent, the UHPLC results versus the GC results. There is a very good correlation, with all points near the midline: the slope is nearly one and not significantly different from one, and the intercept with the y-axis is not significantly different from zero.
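A minimal JSL sketch of this method-agreement check, assuming hypothetical column names for the paired UHPLC and GC results (not the presenters' actual script):

    // Regress the new method against the reference method and fit a straight line.
    dt = Current Data Table();
    biv = dt << Bivariate(
        Y( :UHPLC Result ),
        X( :GC Result ),
        Fit Line
    );
    // Equivalence is then judged from the parameter estimates: the confidence interval
    // for the slope should contain 1 and the interval for the intercept should contain 0.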
So the method was ready for validation: the UHPLC is accurate, and the difference from the standard analysis is non-significant, which is a very nice result. We can say that tackling the problem with MSA and DOE was very powerful, leading to a very nice solution that could now be implemented in production. Thanks for your attention, and if there are any questions, please let us know.
Abstract: Managed, non-persistent desktop and application virtualization services are gaining popularity in organizations that wish to give employees more flexible hardware choices (so-called "BYOD" policies), while at the same time exploiting the economies of scale for desktop software management, system upgrades, scale-up and scale-down, and adjacency to cloud data sources. In this presentation, we examine using JMP Pro in one such service: Amazon AppStream 2.0. We cover configuration, installation, user and session management, and benchmark performance of JMP using various available instances.

JMP analytic workflow steps demonstrated: Data Access (database, Query Builder, PostgreSQL), Basic Data Analysis and Modeling, Sharing and Communicating Results, JMP Live.

Products Shown: JMP, JMP Live, Amazon AppStream 2.0.

Industries: General (can be applied to semiconductor, consumer packaged goods, chemical, pharma, and biotech workflows).

Notable Timestamps:
0:18 – Background of the case study for the talk. Red Triangle Industries, a fictitious manufacturing organization, has been growing its JMP use over the past six years and faces new challenges from growing data sizes, remote and flexible work environments, and its "BYOD" (bring your own device) laptop policy.
2:19 – Results of the JMP workflow assessment and interactive visualization using a custom map in JMP. Visualize the results of your own workflow assessment by downloading the custom map shapes and sample data here.
3:29 – Situation for Red Triangle and why they are considering non-persistent application virtualization technologies.
4:29 – Pains for Red Triangle and problems they are trying to solve.
5:37 – Implications for Red Triangle of adopting non-persistent application virtualization technology.
7:09 – Needs analysis and technology requirements.
8:00 – Introduction to the demo.
8:13 – Creating an image in the Amazon AppStream 2.0 UI.
8:50 – Naming images, choosing instance type, sizing, and picking IAM roles.
9:20 – Configuration of network access: VPC and subnet as well as security group; enabling internet access.
9:52 – Launching the image builder.
10:01 – Instantiating the image.
10:38 – Configuring jmpStartAdmin.jsl with server settings for the JMP Live instance and database access; IT configures this so that users don't have to.
11:36 – AppStream 2.0 Image Assistant: configuring JMP Pro to launch in AppStream for end users.
11:54 – Configuring the template user within AppStream.
12:10 – Testing the JMP Pro configuration using the template user.
12:30 – Testing the PostgreSQL configuration and access from JMP Pro.
13:14 – Testing the final app within the browser as an admin.
13:54 – Image configuration and naming.
14:19 – Creating the final production image.
14:45 – Creating a fleet of compute resources for end users.
14:59 – Always-on vs. on-demand vs. elastic fleet type options.
15:09 – Configuration of the on-demand fleet.
15:36 – Fleet capacity.
16:03 – Fleet VPC, subnet, and security group settings.
16:58 – Fleet creation.
17:20 – Stack configuration.
18:05 – Creation of storage and home folders within a stack.
18:17 – Configuration of user settings: clipboard, file transfer, etc.
18:44 – Configuration of the user pool.
19:01 – Creating a new user.
19:14 – End-user experience: launching JMP from Amazon AppStream 2.0.
19:32 – Application Catalog.
19:40 – Instantiation of an on-demand session.
20:05 – Launching JMP and using it within a browser.
20:23 – Retrieving data from a PostgreSQL database using the Query Builder in JMP.
21:06 – Data analysis of the resulting data set in JMP.
21:52 – Publishing results to Red Triangle's JMP Live instance.
22:22 – Verifying published results on Red Triangle's JMP Live instance.

This is JMP in the Cloud: configuring and running JMP for non-persistent application virtualization services. I'm Daniel Valente, the Manager of Product Management at JMP, and I'm joined by Dieter Pisot, our Cloud and Deployment Engineer, also on the Product Management Team. Today we're going to talk about how an organization called Red Triangle Industries, which has been a JMP user for the last several years, with JMP use growing in its R&D departments, its quality departments, and IT, is considering adapting to new, remote, and flexible work environments and hopefully solving some problems with new technology for virtualizing JMP in a non-persistent way. We're going to play the roles of two members of this organization, and we'll go through the configuration, the deployment, and ultimately the use of JMP in this way. So this is how Red Triangle has been growing. We started in 2015 with a core set of users, and every year we've been growing the JMP footprint at Red Triangle to solve problems, to visualize data, and to communicate results up and down the organization. Most recently, we've added JMP Pro for our Data Science Team to look at some of our larger problems. We've got an IoT initiative, and we're doing some data mining and machine learning on the bigger data sets from the sensors and equipment in our manufacturing plant. In the last year, we also added JMP Live to our product portfolio; you can see a screenshot of JMP Live in the back. What we're trying to do is automate the presentation of our JMP discoveries and our regular reporting in one central database, so that everyone in the organization has access to those data to make decisions about things like manufacturing quality, revenue, and other key business metrics, all shared in one single place with JMP Live. So how is JMP being used at Red Triangle? One thing we did in the past year in our IT organization, to which Dieter and I belong, is survey all of our users, and we put together an interactive visualization looking at which parts of JMP they use by department. This is called the workflow assessment. It's something we can send out, we get information back, and it gives us opportunities to look for growth and training opportunities. This is also how we found out that some of our users want to have JMP presented to them in different ways, which is why we're considering application virtualization. We've adopted a Bring Your Own Device policy, which lets our employees purchase their own laptop, and we want to be able to give them JMP on it. So this has produced a set of situations, pains, implications, and needs that we're considering for using JMP from an application virtualization standpoint. All right, the situation for us after this workflow assessment: we're profitable, we're a growing business in the manufacturing space, and we're adding more JMP users every year in different departments. As I mentioned, our core JMP use is growing year on year, we've added JMP Pro for our Data Science Team, and in the past year, JMP Live for enterprise reporting and sharing of JMP discoveries.
I'm playing the role of the CTO, Rhys Jordan, and I'm joined by Dieter, who's playing the role of Tomas Tanner, our Director of IT. We've been charged with finding ways of getting JMP more efficiently to remote employees. We want to be able to analyze bigger problems and also to support employees who want to take advantage of our BYOD, or Bring Your Own Device, policy in 2022 and beyond. Historically, our standard laptop deployments have used between eight and 16 gigs of RAM, and in some cases, especially with our larger manufacturing problems and sensors being put on much of our manufacturing equipment, we've got data sets we want to analyze that are simply bigger than that standard deployment can handle. We also want to support our employees in their flexible work environments, which means that if they purchase their own personal laptop, we want to be able to get JMP and other software onto it without physically being on site with them; we want to look into delivering that software by alternative means. Also, when new versions of JMP and other desktop software come out, we want to be able to apply those updates seamlessly to our entire workforce, in a way that minimizes the latency between the release and when our employees actually get the update. And finally, when an employee leaves Red Triangle or moves to another part of the organization that doesn't require JMP or another piece of software, we want to be able to retain those corporate assets with minimal operational burden. The implication is that we've been given a mandate, like many other organizations, to reduce our corporate technology spend, and we feel the biggest potential for reducing that spend is through automation. Looking at these non-persistent application virtualization tools should speed up the entire workflow of getting software to our end users efficiently. We want to lower the total cost of resource and computer ownership, which is why we've adopted the BYOD policy, but we also need to right-size the assets, even the virtual ones, to the needs of the users: our power users who analyze our biggest data sets will need more RAM and more speed, and for the casual users we can right-size accordingly. With employees in three different time zones, simply standing up a fleet of virtual machines for everybody at the same time doesn't make a whole lot of sense. Because we work around the global clock, we can design a fleet of virtual assets sized for the total number of concurrent users accessing it at once, and that's what we'll get to in the demo. Finally, a better rollout of software updates and transparency of usage to our Executive Team, who's using the software, how much they're using it, and so on, are further implications for us investigating this technology. As far as needs, we want to go with a cloud provider. We're not going to build this tool in-house, so we want to use one of the cloud providers and the out-of-the-box capabilities they have for application virtualization. Since we've moved a lot of our data sources to the cloud, to Amazon Web Services for example, we'd like to put our analytic tools close to those data sources to minimize the cost of moving data around.
Our IT department wants to centralize the management of JMP setup and license information, and also have seamless version control, so that as soon as a new version is released we can push those updates as efficiently as possible, and then look at usage tracking through things like cloud metrics and access controls. With this, I'm going to hand it over to Dieter to give a demo of running JMP in a non-persistent application virtualization tool like Amazon AppStream. Dieter. Thanks, Dan. The first thing we have to do is go to the image builder and launch it. We have to pick a Windows operating system; Windows Server 2019 is what we want here. There are several available; we just pick a generic, basic one like this one, move on, and give it a meaningful name and display name. Because we're Red Triangle, we use Red Triangle for this one. We have to pick a size for the image we want to configure, so we pick a standard one, medium. I'm going to add an IAM role because I want to connect to S3, where I keep all my installers; to make sure I can connect there, I add a role with access to S3. Then I have to define which private network I want to run my image builder in. I pick the public subnet so that I can actually connect to it from my local desktop, and a security group to make sure only I can connect and not everybody else. We're not worrying about the Active Directory setup, but we do want internet access so we can download things from the internet, like, for example, a browser. We check all the details, they're fine, so we launch our image builder. This is going to take a while; AWS AppStream is basically setting up a virtual machine for us that we can connect to and set up our application on. After it has started, we connect to the machine as an administrator. To save some time, I downloaded the JMP Pro installer already and installed JMP Pro just as you would on any other Windows desktop machine, and we have the application icon here. In addition, I created a JSL script, jmpStartAdmin, in the ProgramData/SAS/JMP directory with a few settings that make it easier for our users to do certain things. It contains a connection to a database and the JMP Live connection to our Red Triangle JMP Live site, so the users don't have to remember and type that in. So that's perfectly fine here. Then we go to the image assistant and configure our image. First, we add an application to our image; that's going to be the JMP Pro we just installed. We pick the JMP executable, give it some properties and a more meaningful display name, and save that. And here is our application that we want to make available to the user. The next thing we can do is test it and set it up as the user would see it. We have the ability to switch to a template or test user. The template user defines how the user will actually run the application: whatever we do here is remembered, and the user will have the same experience as our template user. So we can do a few things in the setup. We can also make sure here that our database connection is working; we could do this as the test user as well, but I'll just do it as our template user. So here we are: the application is perfectly connected to our database.
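As a hypothetical sketch of what such a jmpStartAdmin.jsl might contain, the names, DSN, and URL below are placeholders and not the presenters' actual settings; the real script's contents were not shown:

    // Hypothetical jmpStartAdmin.jsl, run when JMP starts in the AppStream image.
    ::rt_db_connection = "ODBC:DSN=RedTrianglePG;";              // placeholder PostgreSQL data source preconfigured by IT
    ::rt_jmp_live_url  = "https://jmplive.redtriangle.example";  // placeholder JMP Live site users publish to

    // Optionally verify that the database is reachable when JMP starts.
    Try(
        dt = Open Database( ::rt_db_connection, "SELECT 1 AS ok;", "connection check" );
        Close( dt, No Save );
    ,
        Write( "Red Triangle startup: database connection check failed\!N" )
    );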
And with that, we're fine with our setup, so we go back to the image assistant. I'm not going to use the test user; I switch users again, go back to the administrator, and continue with the setup of our image. So I switch here, again not going to the test user. Now we have to optimize and configure the application: we launch JMP, and once it's running and we're happy with all of this, we continue the setup of the image by clicking the Continue button. What AWS AppStream does now is optimize the application for the user. We just wait for that to finish, and then give our image a name and a display name as well; again, we're using Red Triangle here. We also make sure to use the latest agent so that we always have an up-to-date image. Next, review, and we disconnect and create the image. With that, we get disconnected from the image builder; we lost the connection, obviously, and our session expired. We return to the AppStream 2.0 console and see that our image has been created. It's pending right now; it also takes time to create it the way we want it, so we have to wait for that to finish. We're done, it has finished, and the next step is to create the fleet the images are going to run on. So we create the fleet and pick which type of fleet. We're going to go with an on-demand fleet because that's much cheaper for us: the instances only run when a user actually requests one, whereas always-on instances run constantly. Here we give it a name and a description, and then pick the instance type we want to give to our users. A bunch of other settings are available, like timeout settings and capacity; for now we just go with the defaults, and we can adjust to the needs of our users at any time if necessary. Click Next. We pick the image that we just created to run on our fleet. We define the VPC, the virtual network, and the subnet that our fleet should run in, picking the same ones we used before, and of course a security group to make sure that only the users and hosts we want can access our fleet. Again, we want to give the fleet internet access, so we check that to make sure users can publish to our JMP Live site. We could integrate Active Directory authentication here, but we don't want to do that; that would take some time, so we're going to go with some local users I have already created. We click Next, we see a review of what we did, and it's all fine, so we create the fleet. There is some pricing information we have to acknowledge, and with that, the fleet is starting up. Once that has happened, we can move on and create the stack. The stack helps us run that fleet and lets us define persistent storage for the fleet, for example. So here we create the stack and give it a meaningful name; since this is a Red Triangle site, we go with a very similar naming convention. We pick the fleet that we want to run in our stack. All looks good, we move on. Here we define the storage; we go with the default, which is an S3 bucket available to each of the users. We could hook up others, but S3 is fine for us at the moment. Then there are just a couple of settings on how we want to run our stack; all of them seem fine, so we go with the defaults. A quick review, everything's fine, and we create our stack. That's it, the stack has been created.
What we now need to do is go to the user pool I mentioned earlier, since we're not using Active Directory. In here, I have defined three users that can access our stacks, but we need to assign the stack to each of the users. In my case, I'm going to pick myself and assign the stack we just created. We could send an email to the user to make sure they are aware of what just happened and that the stack has been assigned to them. That's all we have to do to set this up. So if I now go to the link that was emailed to me, I can log into that AppStream session. I use the credentials my admin defined for me. Here are my stacks; I use the Red Triangle one, and here's the application that stack provides for me. This is going to take a while: as I said, it's on-demand, so it's like pulling a PC out of a pool and running JMP on that machine, so it takes a few minutes. The always-on option would be much faster, but again, it costs money because those instances run constantly, whereas on-demand runs only on demand. And here, in my browser, JMP has started and is running perfectly fine. So let's do some work. I'm going to connect to a database, and because my administrator has already set this up for me, there's not much for me to do: the connection to my database is already there, and my tables are available to me. I'm going to pick one of the tables in my database, which is a Postgres database, and import it right away, and here's my table. I've written a nice script to build a wonderful report, so I'm going to quickly create a new script. I cut and paste it from my local machine to my AppStream image using the menu that's available to me, paste it into the scripting editor, run it, and here's my report. That report I'm now going to publish to our Red Triangle JMP Live site. So I go to File, Publish, and because, again, my admin has set this up for me, the information about my Red Triangle site is already there, so I'm just prompted to sign in to make sure it's really me. In this case, I use our imaginary identity, enter the username and password, sign in, go through the publish dialog without changing anything, and just hit Publish. The report has been published. Now I can go to another tab in my browser and verify that the report has actually been published to our Red Triangle JMP Live site. I switch over, go to All Posts, and here's the report that Tomas posted a minute ago, and it looks exactly as it did in my virtual machine. Thank you very much.
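For reference, a hypothetical JSL sketch of the user-side steps Dieter just demonstrated, reusing the placeholder connection global from the jmpStartAdmin sketch above; the table and column names are assumptions, not the actual demo data:

    // Import a table from the preconfigured Postgres connection and build a quick report.
    dt = Open Database(
        ::rt_db_connection,                               // placeholder connection string set at startup
        "SELECT * FROM production_measurements;",         // hypothetical table
        "production_measurements"
    );
    gb = dt << Graph Builder(
        Variables( X( :Date ), Y( :Measurement ) ),       // hypothetical column names
        Elements( Points( X, Y ) )
    );
    // Publishing the report is then done through File > Publish, which uses the
    // JMP Live site information the administrator preconfigured.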
JMP is an all-in-one, self-service platform that offers a vast menu of easy-to-use analytical capabilities that can be assembled into an end-to-end analytic workflow. This presentation walks attendees through a case study where a workflow is built to solve a manufacturing challenge and to increase the value and understanding of analytics in an organization. The proposed workflow begins with data access components, then includes a combination of analytical methods to uncover discoveries in the data. It concludes by sharing these insights using the JMP Live platform.

Today we're going to cover the JMP Analytic Workflow, so that you can see firsthand how all these capabilities come together. Anybody who's doing data analysis has a shared objective: they're trying to take raw information, data, and turn it into a shareable or actionable insight. The only thing that differs is the steps we take as we move from one end of this process to the other. Whether you're new to the field of analytics and statistics and have simpler needs, or whether you're a more advanced practitioner with more sophisticated needs, JMP software offers the flexibility to meet those needs wherever you are in your analytics journey. The JMP Analytic Workflow is a quick and easy set of analytical capabilities to bring you from data to insights. We're going to cover a few workflows so that you can see how this can be implemented in practice. I want everybody to picture that we're responsible for this machine, and this machine produces product for a business. Recently, the performance of our product has been outside of our expectations, and we believe the answer to what is going on can be found by analyzing some of the data that's available on this machine. So we're going to build an analytic workflow to see if we can figure out what's happening with this issue. Building the workflow involves three steps. The first step is having an understanding of the data; the data we'll be using in this case are machine logs that are saved on the machine as Excel files. The second step relates to the analytical capability: in order to know which analytical capability we need in the workflow, we have to understand the question we're trying to answer, and the question here is a simple one: what is happening with this machine? The third element is a shareable insight: after we get an answer to this question, we'll have to share that insight with others. In this case, we'll be sharing our findings with management as a Word document. More specifically, when we look at the workflow, we'll be working with Excel files, we'll leverage JMP's data access capabilities to bring the data into JMP, we'll perform some data exploration and visualization, and lastly we'll share and communicate those results as a business document, which in this case will be a Word document. So to begin the process, we open our Excel file. When you open Excel files in JMP, a special tool called the Excel Import Wizard opens, and it allows us to do many things: we can access different worksheets in the Excel file, we can perform some very simple data cleaning steps before importing, and we can also preview the data.
As I look at the preview, I can see that I have information from June of last year, and I can see that I'm correctly capturing the measurements from our machine. I can now import this data into JMP, where we have a JMP data table. Now that we have our data, we can perform a visual exploration. I'll use the Graph Builder tool, available under the Graph menu, and plot our measurements over time for our piece of equipment. As I plot the measurements over time, I begin to see something quite surprising. The performance of our machine was meeting expectations initially, but over time the performance has slowly drifted, and now we're in a region where we're producing bad material. This is the first time that we've used data and analytics to understand what's happening in our process, and what we're seeing is that the machine has been running for quite some time without a calibration. If we calibrate the machine, we can get it back to the original performance we need in order to produce stable material. So this is quite a significant finding, and now we want to share it with our management so that we can take the additional action, which is to perform the calibration. When it comes time to share this insight, we can simply export it, in this case as a Word document, and share it with our management. So here we have the Word document. We've captured that visual, where they can see exactly what we saw in JMP: the performance of the machine has been drifting over time, and a calibration needs to be performed. This also represents the first time that management is starting to use analytics; they're now starting to see the value of data in their organization and how it can help them improve their business decision making. And they have a new ask for us: they want to know what else can be done with their data, and with analytics, to improve their manufacturing processes.
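Before the journey evolves, here is a rough JSL sketch of the simple Excel-to-Word workflow just walked through; the file names, worksheet name, and column names are hypothetical placeholders, not the actual machine logs:

    // 1. Import the machine log from Excel; Worksheets() picks the sheet.
    //    Opening the file interactively instead brings up the Excel Import Wizard.
    dt = Open( "machine_logs_june.xlsx", Worksheets( "Sheet1" ) );

    // 2. Plot the measurements over time in Graph Builder.
    gb = dt << Graph Builder(
        Variables( X( :Timestamp ), Y( :Measurement ) ),
        Elements( Points( X, Y ), Line( X, Y ) )
    );

    // 3. Export the report as a Word document to share with management.
    Report( gb ) << Save MSWord( "machine_drift_report.docx" );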
So now the analytics journey has evolved, and JMP is very much a part of that journey. It's not just a tool that gives you access to individual analytical capabilities; it's also part of the process, so that you know how and when to implement certain strategies. So you spend some time reviewing a variety of JMP resources. You review white papers to learn about best practices. You read through customer success stories to see how others in your industry are leveraging analytics and how that's improving their business. You participate in a complimentary statistics course where you learn about many things: predictive modeling and how it can help you root-cause production issues, and reliability analysis and how it can help you understand how your product will perform in the field over time. One of the most informative things you learn about is the field of quality analytics, and you apply that learning to exactly what you're responsible for in your process. With this new learning you now have a more advanced analytic workflow, with some new iterations. Data is no longer being stored and accessed as Excel files; all the data is centralized in a database, so that the integrity of the data is never affected and so that everybody can access the data without having to work with individual files. In terms of analytical capabilities, you now have a better understanding of what statistics and analytics can do, so your questions are more refined and specific. The question you want to answer now is: is the machine experiencing special cause variation? You've learned about the difference between common cause variation and special cause variation, and you know that it's special cause variation that ends up being problematic for your processes. The last element is the shareable insights. Before, when you were sharing your reports as Word documents, it was creating a lot of additional work for you: as people consumed those reports, you were inundated with requests to modify graphs and with questions about the location of the most recent outputs. What you want now is a better tool, one that lets you centrally store all those reports in one location and offers the people consuming them additional capabilities, so that they can perform their own exploration without having to come back to you with more requests. So the analytic workflow we're preparing now involves these steps. Our data is accessed from a database; we leverage JMP's database utilities to get the data imported into JMP. We continue our data exploration and visualization, but we also incorporate some quality and process engineering elements that we've recently learned about from JMP resources. And when we share analyses, we want to both manage the content and share the analyses with a wider audience in a way that offers them greater capabilities than their Word documents did, so we'll be using the JMP Live platform to do this. We begin the process by accessing our data. We use JMP's built-in Query Builder tool to access our data connection. Once we're connected to the database, we can access any of the data tables; here, we've selected the table that contains the data we're interested in. Now, unlike before, we're able to pull data from everywhere in the factory, not just from an individual piece of equipment. We import the data into JMP and can now perform our new analysis, which leverages some new tools we've learned about under the Quality and Process menu in JMP. Getting an answer to our question requires building a control chart. The control chart gives us a visual that looks very similar to what we created before, but it also gives us capabilities that we couldn't get from a plain graphical visualization. Built into the control chart are rules we can leverage to determine whether we're experiencing special cause variation, and that's the question we're trying to answer. So we enable some warnings, which are special customized tests, to signal to us if we are experiencing special cause variation.
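A minimal JSL sketch of this step, assuming a hypothetical ODBC data source, table, and column names rather than the presenter's actual setup:

    // Pull the factory data from the central database and build a control chart.
    dt = Open Database(
        "DSN=FactoryDB;",                                      // hypothetical ODBC data source
        "SELECT batch_id, measurement FROM machine_logs;",     // hypothetical table and columns
        "machine_logs"
    );
    ccb = dt << Control Chart Builder(
        Variables( Subgroup( :batch_id ), Y( :measurement ) )
    );
    // The special cause tests (warnings) are then enabled from the chart's
    // red-triangle menu, or saved into this script once configured interactively.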
Now that we've turned on that test, we can see that there are many batches where we're facing special cause variation. Had we been monitoring our equipment with this tool, we could have detected very early on that there was an issue and taken the appropriate action. So this is quite a significant finding, and it's something we want to share with a wider audience. This time, we share the report with JMP Live. We're going to publish it to JMP Live, so I connect with my account, create a new post, and share the post with everybody on my equipment team who's interested in these results and needs to know the new insights we've just discovered. I publish the report to our JMP Live, and now we can look at it there. JMP Live is a web-based tool that allows anybody to access the report from their browser, so here we can see the report in JMP Live. JMP Live also allows all the reports to be centralized, so we no longer have to pass around static documents and Word docs, where people can sometimes be consuming old results and not be up to date with the latest findings. Because everything is centralized in JMP Live, there's one version of the truth, and you always have access to the most recent files. You can also do things you would not be able to do with static versions of the analyses: JMP Live is still very interactive, and anybody consuming the report can perform their own exploration and answer their own questions without having to come back to you, the analyst who prepared the report, for additional modifications. And as management consumes these results and gets additional value, something very common happens: their needs change. Instead of seeing this information once a week in a weekly report, they want to see it more rapidly, daily or even hourly, and they don't want a chart for just one piece of equipment; they want a chart like this for every piece of equipment in the factory, because they recognize how powerful this analysis is. They ask us, "Is there a way we can do this?" JMP offers the flexibility to do this, because a critical part of the JMP Analytic Workflow is the ability to automate. As we were building these analyses, JMP was capturing the JMP Scripting Language in the background to automate all of these steps. By simply saving the script, we can stitch together all of these actions: the action to connect to the database and import the data, the action to generate the chart, and the action to upload the analysis to JMP Live. At the click of a button, we can have those analyses automatically created by JMP. In our case, we want these analyses produced every hour, so we can use the Windows Task Scheduler to run the script on our behalf automatically, so that we don't even have to do it manually; a rough sketch of such an hourly script is shown below. So very quickly, you've seen a variety of examples of how the JMP Analytic Workflow can be leveraged to solve a variety of problems, depending on where you are in your analytics journey. We can put together the workflow to save both time and effort. We can easily access data from a variety of sources and share discoveries with other team members. We can get more from your investment.
We can increase your efficiency without increasing head count, and eliminate the need for multiple tools. We can remove barriers and complexity. We can tackle problems of any size, as we've seen today, by using JMP's extensive suite of analytical platforms. And we can accelerate process improvement by leveraging automation to reduce time spent on repetitive tasks and get to those actionable insights faster. As we've seen firsthand today, your analytical needs might start off being very simple, but when you're ready to grow, we'll be ready for you.
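As a rough illustration of the hourly automation script mentioned above (a sketch under assumptions, not the presenter's actual script): the platform scripts that JMP saves can be collected into one JSL file, and a JSL file whose first line is //! runs automatically when JMP opens it, so the Windows Task Scheduler only has to launch JMP with that file on a schedule.

    //! Run automatically when opened by JMP (e.g., launched hourly by Windows Task Scheduler).

    // 1. Import the latest data from the central database (hypothetical DSN and table).
    dt = Open Database( "DSN=FactoryDB;", "SELECT * FROM machine_logs;", "machine_logs" );

    // 2. Rebuild the control chart from the saved platform script.
    ccb = dt << Control Chart Builder( Variables( Subgroup( :batch_id ), Y( :measurement ) ) );

    // 3. Publish the refreshed report to the team's JMP Live site.
    //    The publish step is the script JMP saves after publishing once interactively;
    //    it is omitted here because it is specific to the site and JMP version.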
This presentation showcases the design of a special music hearing test to assess a musician's ability to hear melodies. The Definitive Screening Design (DSD) platform in JMP was used to consider six music script input variables (step, speed, notes changed, note level, repeat, difficulty), and two more center points were added for evaluating Gage R&R performance. Each DSD run is a multiple-choice question that asks respondents to pick their answer from four available choices.

The JMP Hierarchical Clustering platform was used to group similar music scripts from the 20 scripts provided by the DSD runs and to assign the similar scripts to the three non-correct choices; the correct choices were then added to make each hearing question more challenging. Next, a stratified cluster hybrid sampling method was adopted to select 30 candidates to participate in the survey. Once the scripts were determined, a commercial music synthesis software program was used to create this DSD melody hearing test. After collecting the survey results, the Fit Definitive Screening platform in JMP was used to analyze them. The goal was to determine the best rater (the one with the highest propensity for accurate rating of musical melodies) to serve as the judge for the next project phase.

All right. Well, thanks, everyone, for joining us. The title of our project is Design a Digital Music Melody Hearing Test. I'm Patrick Giuliano, and my co-presenters are Charles Chen and Mason Chen, who couldn't be here today, so I'll be presenting on their behalf. This is a high-school STEM project inspired by the ESTEEM methodology, which is basically STEM with AI, math, and statistics well integrated. Just to introduce the project in project-management terms, with the project charter: the purpose of the project, in effect, is to design a test of a musician's hearing capability. The experimental design methodology we use is JMP's powerful definitive screening design capability, and we designed the test based on six music melody variables in order to test hearing capability, where each question starts with a short melody followed by four choices, of which only one is a repeat and the other three melodies are similar but not identical. For each question, the listener has to pick the best choice among the options available. Once we designed this test, we analyzed the survey results, built a sensitivity model in consideration of the six music hearing variables, and then screened the listeners to determine which ones performed best in the music hearing test. In that screening process, we analyzed the strengths and weaknesses of their hearing capability, in the service of ultimately creating an orchestra with a panel of highly capable listeners to evaluate it. In the service of the science, we have an introduction to the mechanism of hearing: the ear is basically a frequency-receiving apparatus that collects sound, the ossicles in the ear vibrate, and that mechanical vibration is converted into an electrical stimulus, which is carried by the auditory nerve and ultimately interpreted by the brain. Before we get into the experiment and the variables we analyzed, let's talk a little bit about the frequency range of hearing among individuals depending on their age.
People of all ages without hearing impairment should be able to hear at a frequency of approximately 8,000 Hertz, and gradual loss of sensitivity to higher frequencies with age is a normal occurrence. What the science tells us is that the auditory structures of younger people are typically more capable of absorbing and interpreting higher-frequency sounds, which is relevant in terms of which instruments people play: the violin has a higher pitch than the cello, so perhaps a younger person might be better suited to playing the violin than an older person. This gives you an idea that people in their fifties may only be able to hear up to about 12 kilohertz, that is, 12,000 Hertz, whereas people in their twenties can hear up to perhaps 18 kilohertz. To give some context, the average frequency range of the sounds we hear most often every day is between 250 Hertz and 6,000 Hertz. So what are some challenges associated with hearing sounds of different frequencies? People typically miss high-frequency sounds more often than low-frequency ones, and people with high-frequency hearing loss have trouble hearing higher-pitched sounds, which often come from women or children and are in the upper two to eight kilohertz range. What is also typical with high-frequency hearing loss in many people is the presence of a phantom sound, the condition called tinnitus, and that competing sensation of sound can further inhibit a person's ability to distinguish other high-frequency sounds. So clearly, age is an important factor in designing an effective hearing test and developing an effective panel of listeners who are attuned to music. Although we didn't explicitly consider age in our experiment, as you'll see in the subsequent slides, it definitely could be a factor we explore further in our sampling strategy in terms of the survey respondents we choose. The basic measure of hearing performance is called an audiogram. The graph on the right is a plot of hearing threshold level in decibels on the vertical axis versus frequency on the horizontal axis, and you can clearly see that as hearing loss progresses, the threshold level of sound in decibels starts to increase; the degradation in performance is shown by the lines, split by year, moving down and to the right. Just a little more background before we launch into the design of the survey and the analysis: the intent here is to emphasize that frequency interference can be a problem in producing a melodious harmony, in an orchestra in particular or in any musical composition. What we're basically showing here is the difference between what are called fundamental frequencies and harmonics in the context of a piano, at the note scale indicated at the bottom. So what do we know about the music note frequency spectrum? Each note has, not surprisingly, a particular frequency. As an example, middle C is at around 262 Hertz; higher notes have higher frequencies and lower notes have lower frequencies, and this slide gives you context for which frequencies the notes correspond to. Note A is much higher, around 440 Hertz, than note C at 261, shown in the second set on the right, in the lower portion of the slide. So there is a relationship between frequency and the number of notes: the frequency needs to double every 12 notes, and we have 12 notes in each octave, seven white and five black. You can see that the relationship frequency follows as a function of the note number n is a power-law type relationship.
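The doubling-every-12-notes relationship is the standard equal-temperament formula, stated here for reference rather than taken from the slides: with A4 fixed at 440 Hz and n counting semitones above or below A4,

    f(n) = 440 \times 2^{\,n/12}\ \mathrm{Hz}

so middle C, nine semitones below A4, comes out at 440 \times 2^{-9/12} \approx 261.6 Hz, matching the 261 to 262 Hz quoted in the talk.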
So note A is much higher, around 440 Hertz, than note C at about 261 Hertz, in the second set on the right in the lower portion of the slide. Okay, so there's a relationship between frequency and the number of notes. Frequency doubles every 12 notes, and we have 12 notes in each octave, seven white and five black. So you can see that frequency, as a function of the note number n, follows a power-of-two type relationship. All right, so taking us back now to the project, the implementation, and the analysis. The project plan has three phases. The first phase, which is the analysis I'm going to cover today, is effectively the process of identifying which people are the best hearing performers from the collection of survey results we get back from the survey that we designed. The second phase takes those best hearing performers from the survey results and has them serve as judges. In this phase, basically, we work on forming the orchestra prior to phase three, where we actually do the forming. In this instance, we're thinking about things like which instruments have any potential limitations. We may give the same melody to different test instruments, and not every instrument can play every melody, obviously. So the idea is, how do we know that the individuals playing these instruments are playing them accurately? Well, we need judges who have good listening capability, so the judges that we curate from phase one will provide that expert evaluation in phase two. Once we have that in place, in phase three we can actually form the digital orchestra. We'll think about things like how many players should be involved and who should play where, and we'll have a good understanding of how the melodies could be difficult for certain instruments. This is why we need phase two in the middle. Okay, so here's our survey question design. We've identified six variables for this hearing test related to the parameters of music: step, speed, notes changed, note level, a repeat variable, and a difficulty variable, which is categorical, easy or difficult. The experiment, as I mentioned before, uses JMP's DSD: we generate a default DSD and then, in effect, augment the design by adding two more center points. So we're doing an 18-run DSD, which includes one center point, row number three in this table, indicated with zeros and an arrow highlighting row three. Then we're adding two more center points at row 10 and row 20, respectively. The idea in placing these center points is that we want to get an idea of how consistent the results are throughout the experiment, so we try to put a center point roughly at the beginning, in the middle, and at the end of the experiment. This is analogous to getting a sense of whether a measurement process is stable, if you're in a manufacturing environment. The other important thing about our design is that we're randomizing the test sequence, which is something we can do in JMP through the generation of the design, and I'll show a little bit about that briefly in the next few slides. That randomization is really important because it helps eliminate any bias due to factors that aren't in the experiment when we run the test (a rough sketch of that randomization idea, outside of JMP, follows below). 
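Here is a minimal sketch of that idea, assuming a 20-question test and using numpy rather than JMP's design randomization: shuffle the run order, and draw the correct-answer position from a balanced pool of letters so that A, B, C, and D each end up as the correct choice equally often. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_questions = 20  # 18-run DSD plus two extra center points

# Randomize the order in which the 20 melody questions are presented.
run_order = rng.permutation(n_questions)

# Build a balanced pool of answer positions (five each of A-D), then shuffle it,
# so no letter is over-represented as the "correct" choice.
letters = np.repeat(list("ABCD"), n_questions // 4)
correct_position = rng.permutation(letters)

for q, (order, letter) in enumerate(zip(run_order, correct_position), start=1):
    print(f"question {q:2d}: presented as run {order + 1:2d}, correct answer = {letter}")
```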
That kind of bias is sometimes referred to as lurking variation, or variation due to lurking variables. Okay, so there's another consideration I touched on. It's related to randomization, but in a slightly different context, one that's more unique to this particular application and experiment. Basically, we generated an initial random variable, assigned a random sequence of one, two, three, four, and then did a recoding on that: we labeled one as A, two as B, three as C, and four as D, and that's what we see in terms of identifying the correct answer. So in the two columns at the right of this 20-row table, we're identifying what the correct answer should be in terms of the letter, which is associated with a random value of one, two, three, or four, where one corresponds to A, two to B, three to C, and four to D. We're doing this to ensure a uniform distribution of the location variable. In practical terms, that means A, B, C, and D all have an equal chance of being the correct position. This avoids the biasing situation where a student picks the same answer over and over again to possibly increase his or her chances of performing well, or where the survey respondent isn't paying attention or isn't engaged in the survey. All right, so here is where we come to an evaluation of the performance of the DSD. We approach this through three things: the statistical power of the experiment, shown in the panel on the left; the confounding pattern, or the extent to which factors are correlated in the experimental design, shown in the panel in the middle; and what we call the uniformity of the design, which is simply, what does the structure of the design look like in multivariate space? Have we covered all of the design points in an approximately uniform way, so that we're able to predict across the entire range of the experiment with the same degree of precision? Going back over to the left, the overall power for each of the factors in the experiment is greater than 90 percent, which is good. It shows us that we have good sensitivity to detect effects if they're actually there in the population. The panel in the middle shows that the risk of what we call multicollinearity, or excessive correlation among the experimental factors, is low, because most of the pairwise correlations in this correlation matrix are blue, where a more bluish square corresponds to a lower correlation and solid blue indicates zero correlation. Squares closer to a red shading indicate a higher extent of correlation among factors or terms in the experiment. Overall, we look for correlations that don't exceed 0.3, and that holds for all the squares in this plot with the exception of those slightly reddish squares where the correlation is a little higher. That's because we have at least one categorical factor in this experiment; if we didn't have a categorical factor, this plot would look even bluer (a rough illustration of this kind of pairwise correlation check, outside of JMP, is sketched below). 
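As a minimal sketch of that check, assuming a stand-in factor table rather than the real DSD generated in JMP, you could compute the pairwise correlations among the factor columns and flag anything above the 0.3 guideline. The column names and the randomly generated settings below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for 20 runs of factor settings (-1/0/+1); the real design
# table would come from JMP's Definitive Screening Design platform.
rng = np.random.default_rng(7)
design = pd.DataFrame(
    rng.choice([-1, 0, 1], size=(20, 5)),
    columns=["step", "speed", "notes_changed", "note_level", "repeat"],
)

corr = design.corr()                              # pairwise factor correlations
high = (corr.abs() > 0.3) & (corr.abs() < 1.0)    # flag anything above the 0.3 guideline
print(corr.round(2))
print("factor pairs above |0.3|:", int(high.to_numpy().sum() / 2))
```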
So we say that in a DSD, we don't recommend adding too many categorical variables to the experiment, because if we do, we increase this correlation problem, which affects our ability to produce precise estimates in our model and leads to inflation of the variance of our estimates. The final plot on the far right, which is an indication of the uniformity of this design, is a scatterplot matrix in JMP, and it shows each variable versus every other variable. What we're looking for is for white space to be minimal in this plot. I've drawn a little circle here which your eye can easily pick out: there's a little bit of extra white space at the intersection of Repeat and Step. That, again, is because we have a categorical variable in our experiment. So truthfully, there's no perfect zero in the main effects, no true center point in the main effects, due to the presence of that difficulty variable, the categorical variable. That's reflected in the slightly non-symmetric pattern of the scatterplot matrix on the right, where the asymmetry shows up as the white space I've circled. Okay, so before I discuss this slide, I just want to quickly show you how I got to these design diagnostics. What you're seeing here is the table that I just showed you. I generated this design using the DSD platform under the DOE menu, under Definitive Screening and Definitive Screening Design. After I complete the design table generation process and fill in the results, JMP generates a DOE dialog script and saves it to the data table, so I can relaunch the DOE dialog and I can also evaluate the design. So I'm going to go ahead and quickly click on Design Evaluation. This is just an overview of the design, and right here under Design Evaluation is where I get the diagnostics related to power, which I showed you in the left panel on that slide, and the diagnostics that indicate the extent to which factors or terms in the experiment are correlated, which is shown here in the color map on correlations. To generate the plot looking at the uniformity among the factors, I actually have to go to the Graph menu and use Scatterplot Matrix. So that's just some context for you. Now I'm going to quickly bring up the next slide and then come back to JMP to dynamically show you what we're doing. So here's probably the most interesting part of this experiment: how do we increase the survey test difficulty, and do it in a smart way? Well, we can use hierarchical clustering analysis to do that. Now, we already know the correct answer. It's indicated here in the corresponding Choice column, and the four columns on the right indicate the choices drawn from the 20 melodies. So we know, for example, that in the first row the correct answer, C, corresponds to melody one, where the C ID number is one. So we already know the correct answer, which we've assigned in terms of row order based on a random number, but how do we pick the other three answers? Well, based on hierarchical clustering, we can get a sense of how close each of the other candidate melodies is to the correct answer, and in this way we can make the test a little more difficult. So all the answer choices are drawn from the 20 melodies (a sketch of the same clustering idea, outside of JMP, follows below). 
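Here is a minimal sketch of that idea, assuming hypothetical melody features in place of the real DSD settings and using scipy's Ward linkage rather than JMP's Hierarchical Cluster platform. Melodies that land in the same cluster as the correct answer are natural candidates for the distractor choices.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical melody features standing in for the DSD variables of the 20 melodies.
rng = np.random.default_rng(3)
melodies = pd.DataFrame(
    rng.integers(-1, 2, size=(20, 5)),
    columns=["step", "speed", "notes_changed", "note_level", "repeat"],
    index=[f"melody_{i + 1}" for i in range(20)],
)

# Ward linkage groups melodies that are close together in this feature space.
Z = linkage(melodies.to_numpy(), method="ward")
melodies["cluster"] = fcluster(Z, t=8, criterion="maxclust")

# Melodies sharing a cluster with the correct answer make natural distractors.
print(melodies.sort_values("cluster"))
```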
So how do we pick the closer melodies for each question, or the closest melodies if you will, or maybe melodies that are relatively close together based on the clustering criterion without honoring that criterion strictly? This might seem a little nebulous, but in effect, all we're really doing is telling JMP to assign a clustering scheme by row based on some clustering criterion that we specify, and by default that criterion is Ward. So I'm just going to show that dynamically here. I have the table open, and all I did was run Hierarchical Cluster under the Clustering menu. Once I ran this, I went ahead and invoked Cluster Summaries, which I've turned on here. Then watch what happens when I click on each of these clusters. These are the clusters: seven and 18 are associated with each other, 14 and 17, eight and nine, rows two and 13, and so on. So this is the idea: we're using the power of JMP to identify rows that are associated with each other, and by arranging answer choices that are relatively close to each other, following a schema like this, we make the test more difficult. All right, let me launch back into the slides here. Okay. So basically the last step in completing this experiment is that, in addition to a passive criterion for increasing the difficulty of the test, we want an active criterion. We want to be able to separate, in effect, the beginner level from the advanced level. Think of it like this: if every question were super difficult, or if all the choices were very hard to discriminate among, then you wouldn't be able to distinguish between an advanced-level respondent and a beginner-level respondent, because everybody would miss all of the questions. Similarly, if you made all the questions too easy, then you'd have all experts and no beginners, and so you'd have no differentiation. Based on the science, we have a hypothesis that step and speed are the most important factors for hearing performance, for discriminating between a good musical composition and a bad one. So are we sure about that? Well, one thing we can do is recode step and speed with a 50 percent reduction whenever the difficulty level equals difficult. By doing that, we still have five variables, and those are indicated in the shaded columns: the recoded step and recoded speed are the two shaded columns, and then we have notes changed, note level, and repeat. So the DSD is still orthogonal. We still have three levels and five variables, but we could actually incorporate up to six in the DSD. So how do we increase our value, in effect, by increasing that variable count to six? Well, we can add the difficulty variable, the categorical variable, which indicates either easy or difficult. So we decided to use step and speed combined with the other three variables, and the total sample size is still 18 plus 2, or 20, with the two added center points and the one center point by default. But now we get five levels for speed and step, not three. So with this little transformation, we smartly create five levels on two variables instead of just the three levels we would typically have in a DSD (a small sketch of that recoding follows below). 
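As a minimal sketch, assuming a hypothetical slice of the design table rather than the real one, the recoding amounts to scaling step and speed by 0.5 wherever difficulty is "difficult", which spreads those two factors across five levels:

```python
import pandas as pd

# Hypothetical slice of the design table; the real recoding is done in JMP.
design = pd.DataFrame({
    "step":       [-1, 0, 1, -1, 1],
    "speed":      [ 1, 0, -1, 1, -1],
    "difficulty": ["easy", "easy", "difficult", "difficult", "easy"],
})

# 50 percent reduction of step and speed whenever the question is "difficult".
scale = design["difficulty"].map({"easy": 1.0, "difficult": 0.5})
design["recoded_step"] = design["step"] * scale
design["recoded_speed"] = design["speed"] * scale

print(sorted(design["recoded_step"].unique()))   # [-1.0, -0.5, 0.0, 0.5, 1.0]
```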
So I think this is a unique approach that's quite specific to this problem context and gives us more levels in our design. Okay, so this is our design. What software are we going to use to actually generate the hearing tests? Well, this is just an overview of the music software synthesizer that we used, a soft synth. We utilized it to create 24 multiple-choice music melody hearing tests. It's obviously convenient, portable, and fast. All right, so how do we distribute this survey smartly? Many people use just one sampling method, but our approach is to integrate different sampling methodologies, cluster sampling, stratified sampling, and some additional clustering within those strata, in order to distribute the survey to the right audience and make it the most useful. So when you're ready to send out the quiz, how do you do it? Well, I have some examples here. Who should play the music? There are people who know music and people who don't, and we only want to send the surveys to people who are already familiar with music, because ultimately we want to use these people to evaluate the performance of an orchestra. In the stratified sampling sense, we have different kinds of instruments: we may have five students in a particular pool who know how to play piano and two who know how to play violin, and we may want to sample smartly so that we only pick a certain number within each stratum of players, people who play particular instruments. So we may pick randomly within each of these strata at a certain sampling rate. And with respect to clustering, we can think of location, practice location or geography, as a selection from many different geographies. In this case, we cluster and limit our selection criteria to only the San Francisco Bay Area, because practicing in person is much easier than practicing virtually. Okay, so really the point is that this survey dissemination and data collection process is very holistic and increases our chances of producing an effective set of evaluators to help us form the most high-performing orchestra. Okay, so quickly, to wrap everything up: we studied the human hearing frequency range, the instrument frequency spectrum, and the music frequency formula, and we designed an innovative music melody hearing test using a DSD. We also implemented two interesting approaches to increase the difficulty of the test: hierarchical clustering, as well as rescaling the levels of the most important predictors for the test answers. We used the music synthesizer software to disseminate the hearing test across the six music melody variables, and in our strategy for dissemination we used a holistic sampling methodology. So, in closing, some of the approaches we used and the science we developed could be used to develop a hearing aid, a music melody hearing aid. In the current market that we're aware of, hearing aids are really designed for people with hearing loss, but the idea here would be, how about making a hearing aid that's about amplifying a certain signal out of noise? That would, in effect, increase music melody hearing and detection. 
And so the main objective here would be to block out noise that's extraneous, for example, noise from the audience, and then amplify the signal portion for the particular frequencies that are important for playing a particular instrument, or even using this type of technology to even out the pitch, to amplify the transition between melodies. And so in future work, a similar DSD design can be implemented in terms of developing this kind of technology. So thank you very much for listening and let us know if you have any questions.
A picture is said to be worth a thousand words, and the visuals that can be created in JMP Graph Builder can be considered fine works of art in their ability to convey compelling information to the viewer. This journal presentation features how to build popular and captivating advanced graph views using Graph Builder. Based on the popular Pictures from the Gallery journals, this seventh installment highlights new views available in the latest versions of JMP. This presentation features several popular industry graph formats that you may not have known can be easily built within JMP. Views such as dumbbell charts, word clouds, cumulative sum charts, advanced box plots and more are included, helping you breathe new life into your graphs and reports!     Welcome, everybody, to Pictures from the Gallery 7. My name is Scott Wise. I'm a Senior Systems Engineer on the US West Coast, and I'm joined by my daughter Samantha. So Sammy, I wanted to ask you, as a 16-year-old growing up during a pandemic, still going to school and trying to find your path in the world, what are you most concerned about for the future? Well, to start, I'm pretty worried about how we're affecting the environment, like deforestation, soil depletion, climate change. Additionally, I'm worried about sexism in the workplace, the gender wage gap, and things like that. Okay, well, that's a lot to think about. So you got me thinking as well about what we can all do to help make this a better place, and I thought I'd dedicate my presentation to emphasizing what we can do to save the planet. You used to say to me, "Be curious... and do something about it," right? So we can use our curiosity, our time, and our skills. I'm going to challenge everybody that's seeing this video, and myself as well, to share meaningful data. A good place to do that is the Data for Green JMP website. It's a good place to see what sources of data are out there for understanding our environment, as well as to share any meaningful data that we have or any meaningful graphs or reports that we're able to generate. So please do check that out; I'll leave the link in my Journal for you. Next, use your time. Here's a cool use of your spare time: instead of maybe playing games on your cell phone, you could go out to this IIASA website, where they give you satellite images of the rainforest and ask your help in identifying where there are roads and structures, but also where there are untouched portions of the rainforest. This feeds their artificial intelligence models to help them make a better estimate of the rate of deforestation in the rainforest. So use your time, and then definitely use your skills. If you're here, you've picked up some good JMP skills to analyze, graph, and explore your data, and you can get inspired by our friends at WildTrack, where Sky and Zoe are doing a great job, not only helping protect endangered species with their non-invasive wildlife tracking methods, but also using analytics and JMP in very novel and creative ways. So definitely read up on those stories to see how we can do better and inspire ourselves to use our skills to do better. Thank you, Sammy, for all that help. I'm going to continue on and show the pictures from the Gallery, and instead of just showing them on fun data, I'm going to show our advanced pictures on data that reflects some data for good, Data for Green topics. 
So what we've got keyed up: number one is equality, some data on the gender wage gap, for which we're going to do the interval chart, also called a dumbbell chart. That was rated number one in terms of selections this year. Number two is pandemic data, where we look at how teachers have been affected by teaching in the pandemic, and we're using a word cloud. Bet you didn't know you could do word clouds within Graph Builder. The third one is going to look at tree cover loss across our planet with a new feature, a smoother line that uses moving averages. Next to last is going to be some safety data, which uses the points cumulative sum chart; this is a new way of quickly looking at cumulative figures within your graph, even when you're just looking at the points element. Another thing you can now do in Graph Builder is, I think, better-looking and more advanced box plots, and we're going to do that on climate change, looking at some projections for city risk going out to 2050. And then if we have time, I will show you a little bit about summarizing vector data, such as wind direction and speed, which comes into play a lot when we get a lot of adverse weather due to climate change; a wind rose chart is what we're going to feature there. So these are the pictures from the Gallery, and I will go through each of them individually as time allows. Now, just so you know, this is a Journal I'm giving to you. It is already out there at the link, and you can download it. You will have full pictures of the graphs, you will have tips and tricks, and you will have full steps on how to create each graph. I will also leave you with the data, and you will not only have the data but also scripts in the data to regenerate the graphs, so you can build your own and compare to see if your graphs look like the ones we created. All right, so let's get going. Our first chart here is the gender wage gap, and it's a dumbbell chart, because we have these interval charts. With these interval charts, you might see that we have large points on the ends of the interval: you're trying to make a comparison between two points, and you can see the distance between them as a bar. Some people thought these looked a lot like weight lifting: you go to the gym, you're lifting free weights, and you pick up a barbell or a dumbbell with weights on the ends and the bar in the middle that you grasp. That's what they think it looks like; some people also call this an interval chart. So you can see it's easy to see that males make more than females in France, and there seems to be a large wage gap that doesn't seem to be closing as you go through the years. So let's see how to create this. I'm going to go back to the data table we just opened up from our Journal. The secret to making this is that you want two columns of information to compare, and you want them to be on the same scale. Female and male monthly nominal salaries are what we're comparing here; they normalized all the countries' currencies into US dollars, so I should be able to use those in our graph. So I'm going to open up Graph Builder, start with a blank Graph Builder, take female monthly and male monthly and put them both on the X axis, and put year on the Y axis. 
Now I'm going to copy both female and male monthly and move those into the Interval zone. Now it's actually doing the job, but it's hard to see, because we have so many countries represented for each year and their intervals are all on top of each other. So let's go under the red triangle where it says Graph Builder and add a local data filter, and let's just look at a certain country; we're going to look at France. Now I've got France here and it's starting to look like my interval chart. I'll say Done in Graph Builder to close the control panels. Most of the things I'm going to change here are directly on the graph. For instance, I'd like the ends of these intervals, the points, to be bigger. To do that, I'm just going to right-click on the legend, go to Marker Size, choose Other, and make it a big ten; you can see how much bigger it looks now. I'll do the same thing with male monthly: Marker Size, Other, 10. Pretty cool. Now, what about the bar? I'm going to right-click, go to Customize, and it's the second error bar you see that shows up on top here. The line style is fine with me; I'm just going to make it gray and increase the line width to a two or three. I'll do this one at a two, and now I get the view that I like. Now, a couple of things I'm going to do on the X axis. I'm going to right-click on the axis settings and put a reference line at the average salary of $3,500 monthly, and say OK. Now I have that line drawn in, so I can see whether things are increasing or decreasing over the years. But it looks like my years are running from the bottom up; I'd like them to go from the top down. So I'm going to right-click, go to axis settings, and just reverse the order on the scale, and you can see now I'm going from 2010 down to 2019. I click on that, and now I can truly judge what's going on. Now, one other thing I thought was pretty cool: if I switch the colors here, if I make the female red and the male blue, then I can bring in a picture. I've got this great picture, you can see a small icon of it, that I think will look cool in the background, so I click on it. You can't really see what's going on yet, but it's there. All I have to do is right-click, go to Images, Size and Scale, and size it to fill the graph, and there it is. I'm going to right-click again, go back to Images, go to Transparency, and make that a 0.3. Now you can see it's more muted in the background. You don't want to bring in background pictures if you're going to create a complicated chart, switch between a lot of different filter options, or have many panes open, because it can get tedious to keep resizing the picture, but it's great for a standalone chart. So here we go, we've got our standalone chart. I'm going to close this one down and show you that these are available to you directly as a link in your data table. The one I was looking at was without the picture, but I did a couple side by side, and this lets me look at France versus Germany versus Sweden. I can see that in France the gap is not necessarily as bad as I thought it was; there's a bigger wage gap in Germany, but in Germany everybody seems to be making more money (for anyone working outside JMP, a rough matplotlib version of this dumbbell view is sketched below). 
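As an aside for readers who want the same dumbbell view outside of JMP, here is a minimal matplotlib sketch with made-up salary numbers; the figures below are purely illustrative stand-ins, not the ILO data used in the demo.

```python
import matplotlib.pyplot as plt

# Rough non-JMP equivalent of the dumbbell view, with made-up monthly salaries (USD).
years = [2016, 2017, 2018, 2019]
female = [2800, 2850, 2900, 2950]   # hypothetical values
male = [3300, 3340, 3400, 3450]

fig, ax = plt.subplots()
for y, f, m in zip(years, female, male):
    ax.plot([f, m], [y, y], color="gray", lw=2, zorder=1)   # the bar of the dumbbell
ax.scatter(female, years, s=80, color="tab:red", label="female", zorder=2)
ax.scatter(male, years, s=80, color="tab:blue", label="male", zorder=2)

ax.invert_yaxis()                                  # earliest year at the top, like the reversed JMP axis
ax.axvline(3500, ls="--", color="gray", lw=1)      # reference line at an average salary
ax.set_xlabel("monthly nominal salary (USD)")
ax.set_ylabel("year")
ax.legend()
plt.show()
```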
Getting back to the comparison: if you go to Sweden, it's the females who make more than the males, and I thought that was very interesting. So here's some good information and data to play with. This data came from the International Labour Organization, and I've put all the links into the data table so you can find where they are housed publicly. All right, that was our most popular view. Let's take a look at our second most popular view: how do you get a word cloud out of Graph Builder? A lot of people asked about this. All you have to do is create some ordering columns that let JMP figure out how to display the words within a cloud-like shape. I'll show you what's going on. I'll open up this educator top five COVID words table. An education article went out in the state of Kentucky, and they asked school teachers, "What are the top five words you think about when you hear COVID?" School teachers have been by far adversely affected by the pandemic, having to teach as schools close, as they reopen, and as they use mixed media, some classes live, some classes on site, mixed classrooms. So it's been very difficult. You can see some of the words that popped out captured here in this column, and you can see how many respondents put those words in their top five. The word anxious, for example, showed up for 20 of the respondents as a top-five word associated with the pandemic. So I have that information, and I created two columns. The first column is a random column, just random normal values assigned to the rows. I got it just by creating a column, right-clicking, going into the column info, and initializing the data as random; you can pick random normal. I figure if you have a mound-shaped, bell-shaped distribution, it's kind of like a cloud: most of the stuff is going to be in the middle and less is going to be out toward the tails. So I did that, and that's how I came up with the random data. I'll go ahead and delete that one. For the order column, it was pretty easy: I just sorted by weight, and you can see anxious was 20, then 17, then 17. So the order is just a row-sorted order; anxious is number one because it was biggest, and that's going to help me later make the word cloud. So let's go ahead and click on Graph Builder. Before we do anything with the words, let's actually go to the random column and put random on the Y, and it's doing what we expected: it's distributing the rows out randomly on the Y axis, just separating them out. What if we swapped those points for the words themselves? If you go to Points under the red triangle and click on it, let's select Set Shape Column. When I click on that one, I'll click on the words, and there are the words. I'm going to take the weight of the words and use it for Size. Oh, that's starting to look like a word cloud, right? Now I'm going to take that weight again and maybe give it some scale coloring, so the more red, the more frequent that word, which is kind of cool. Now, what's going on is it's doing jittering, center grid jittering, which is actually the automatic default, so that looks pretty good. And look, when you're done, as you move it around, 
it will try to adjust the words within that shape and try to hold that scale. That's really easy to do. I'm going to go under the red triangle for Graph Builder, and under Legend Position I'm going to go Inside Bottom Right... it may be Inside Bottom Left; I think I like that one better, so I'll put it at the inside bottom left. Then I'm going to go to Legend Settings and just keep the bit that talks about the color. So that's pretty cool, and that's how you would do the random word cloud. How I would do the ordered one is: I'll go back to the red triangle under Graph Builder and say Show Control Panel, and I will swap out random for order. Now all the big words are on the bottom and the smaller ones are at the top. That makes sense. Maybe I'll right-click on this axis setting, as you saw me do earlier, and reverse the order, so it's going from 0 to 30 with the ones and twos and threes more at the top. There we go, that's exactly what I wanted to see, and now I have an ordered one. Now, with the ordered one, a lot of times it might make sense, if I really want it in ordered order, to use a certain jittering that justifies the ordering. So I'm going to show the control panel again and change the jitter style from center grid to positive grid, and now it really is ordered, going from left to right, top to bottom, with the number one item, then number two, then number three, and then on the next line number four, and so on. Sometimes people prefer this kind of word cloud because it makes it easier to judge which word is bigger than another when the size and the color are similar; now you can make that judgment. Now it's looking good in terms of what I wanted it to do, but it's not looking good on the graph. What are we going to do about that? I really would like to move this whole thing over to the right. To do that, I'll open my control panel back up and go under the X axis, and even though there's nothing there, I'll right-click, say Axis Settings, and put a negative 0.3 for the minimum. Now it's moved over, and this looks a little more like an ordered word cloud. And of course, you can bring in pictures as well. I did that on this one: I brought in a nice apple picture with a little bit of transparency to make the words pop out on top of it. All right, so the instructions are there if you want to try this one. I can see a lot of people having a bunch of words, categories, or phrases, and if you've got a weight or a count on them, you can make a word cloud. All right, so let's see where we are. We are ready to go to the moving average smoother chart, and this one's pretty cool. It's a new smoothing line option that lets you look for trends, and a popular way to look for trends is a moving average. When there's a lot of noise in the things you're plotting over time, sometimes you want to smooth it: you can see that blue line here is smoothed in between the ups and downs of the points collected over the years. The other thing it's doing here, as you can see, is that I don't have a legend; I've actually labeled the lines, and I'm going to show you how you can do that as well. All right, so here I'm going to click on this tree loss table. What we're looking at here are reasons for loss of tree cover, or deforestation. 
This came from Global Forest Watch, and they have tree loss in hectares. So we'll go ahead and show that one: here's tree loss in hectares, and I'm going to throw my year down here. I've got both points and smoothers showing automatically, so all I really have to do is take the drivers and overlay by them. That's pretty cool. Now I've got just the spline method, and I'm not as big on that format for showing a trend. So if I click on the options, here's where you have more options in JMP 16, and I can do the moving average. I'm going to do the moving average, move back this little local width toggle so it fills out the graph, and even include this confidence fit. This is looking pretty good, so I'm going to say Done. Now I'm going to go under the red hotspot for Graph Builder, go to the local data filter, add drivers, and just look at the first three drivers. Okay, these are looking pretty good. Now I've got this legend way over here, and it's not really adding to the graph; I would like to do something better. So I'm going to right-click right on the legend line, in this case for agricultural shift, and go to Label. This is brand new in 16: I can do min and maximum values, first values, but I can also do a name, and look where it puts it. It puts it to the right, in the graph. I'm going to do the same thing with each of the ones I have open. Now I'm going to go under the red triangle for Graph Builder, go to Show, and turn off the legend; I don't need it anymore and I don't want the labels off on the side. Look what happens when I start to move one back: you can see I can move it into the body of the graph. I'm going to stretch this out a little bit on our screen, and you can see agricultural shift; if you put the label close to the line, it tries to hug the slope of the line, which is really cool. I'm going to put agricultural shift there, maybe put commodity driven just over here, and forest driven right in this area here. And now you don't need the legend. You can just see what's going on with the lines and let people's eyes go where they have interest. In the agricultural shift, here was the real story: this was more of an assignable cause of loss of tree cover, but it has gone down a little bit in recent times. All right, so a pretty cool graph as well, the smooth line moving average. Next, the points cumulative sum chart. This was cool data, actually data on driving safety, and the idea was, if we point out over the years the times when there have been major releases in car safety, are we getting safer? As more vehicles are on the road, we expect there to be more crashes just from the volume increasing, but are we actually having fewer injuries as our cars help us with airbags and anti-lock brakes and these things? We did this with a points chart, and we really benefited from using cumulative sum options that are new in JMP 16, right off the points chart. You can always create a cumulative sum column very easily in JMP, but we think this is better, and I'll show you why. So if I go to this motor vehicle safety data, which came from the US Bureau of Transportation Statistics, I have crash rates and injury rates, and the rates basically take the total divided by the vehicle-miles in millions, so you get an index. So I'm going to go open up Graph Builder. 
I'm going to take crash rate and injury rate and put them on the Y axis, put year on the X axis, pull off the smoother line, and just leave the points. You can see why this is not so great: what's going on right now is I see a trend, a trend of crash rates going down over time, which is good news. But is it different from the slope of the rate of change for the injury rate? Scale can make it hard to make that judgment; you're just trying to compare patterns, and they're not next to each other, they're one above the other, so it's hard to see. With one selection here under Summary Statistic, there's a new Cumulative Sum. Now tell me which one you think has a steeper slope. The crash rate is going up at a pretty good cumulative growth rate, but you can see the injury rate is leveling out a little bit; it is going up, but not as steeply. So I could summarize that the advances are definitely helping: we're still getting into a lot of crashes because there are more cars on the road, but it seems like we are saving ourselves some injuries, and that's a good thing. And the fun thing I was able to do, as you can see back in the data, is put in some safety innovations. Like, in 1996 we had side impact testing, and we had dual front airbags in 1998. So we can go to our axis settings, and this is a good place for a reference line: at 1998 we'll put a dashed line, maybe a dashed gray line, and we'll label it air bags. We'll add those, and there you go; now we can see where the air bags came in and what the performance was. The other thing we can do, instead of using plain points, is go to the red triangle where the points are and use the shape column here as well, for something that's not even categorical, something continuous like year, and it will still bring in those values. I'll say Done. I'll make these a little bigger with Marker Size, make them pop out a little more on our screen, and maybe even tuck in the legend position, put it on the inside left. By the way, you see where these highlighted areas are on my screen: you can put the legend in the corners or back out on the side of the graph, just drag it there. But there we go, and now we can see how these early advancements have really done the job to protect us, so even though there are more of us, more traffic, more cars, and there are going to be more accidents, hopefully they're less serious ones. Very cool. All right, so hopefully you're enjoying these like I am. Remember, I'm giving you this Journal so you can go and recreate them at any time. Next we'll show you the fifth most popular chart, which is the advanced box plot. This chart uses some new JMP 16 features that give the Graph Builder graphics a lot of ways to visualize box plots, and you're going to see that this is also a good way to do some interactive labeling. I took this data that I found from Nestpick; it's a climate change city index, and it was quite interesting. They came up with a total climate change risk that went from 1 to 100, and they based that off of climate shift, temperature shift, potential sea rise, and water stress, and that last one was the one I really hadn't thought of: will there be water around your town in 30 years? 
So in 2050, it said that these were going to be the cities, based on total risk, that were going to have the most problems. Bangkok was number one and Marrakesh was number ten, going on down. So I turned on the labels for these rows so they will pop up in our graph. Let's take a look at it. I'm going to go to Graph Builder and take the total climate change risk on the bottom; you can see some things already starting to get labeled from having asked for labels on those rows in my data. I'm going to look at it by region, so I've got three areas here, looking at it by region. Now I can say, well, let's make this a different color, let's make it red. I'll right-click there, right on the points in the legend, and make them a little bigger. Now the points are standing out a little bit. Now let's hold the shift key down and add in a box plot. This is a typical box plot, and it's not that interesting. You might notice Amsterdam and Bangkok showing up twice; that's because the box plot is labeling the outliers. Well, the points are already represented, so we can turn that outlier option off. I don't love this view, though. I've always kind of liked the solid views, but I've got points showing up well with this open box plot. If I change the box style to solid, all of a sudden it's gray, it covers the points up, and I don't see the fences of the whiskers, where my normal spread of points should be; I lose that. Well, not exactly. If I go to the box plot's red triangle, I can do Notched and I can do Fences. The notch actually notches the figure right where the median is; my boss likes to call this a torpedo, because it looks like a torpedo. Then I can add the fences back to see where the ends of my whiskers are from my box plot, which really does help. Now I'm going to say Done. Now you're going to say, "Well, Scott, it's still covering up the points." When I make this really big on our screen, all you have to do is right-click on the graph, go to Points, and say, you know what, move that forward. And there you go; now I can see where all my top ten at-risk cities fall on this list. And it was interesting, there were some surprises I didn't expect. Like, I thought that New Orleans here in the US, because it's so low-lying and so prone to flooding, especially from natural disasters like hurricanes, would come out higher. And in 2050, maybe Boston is not a great place to be. So that was really interesting to me, fun data to play with, and these are some fun things to do with points and box plots. All right, so we are nearly out of time, and I will quickly show you the last one, which is the wind rose chart. The wind rose chart is a way of looking at vector information. In this case, you see all these little arrows; those are all measured wind speeds and directions that came out of looking at the Great Lakes area around Chicago. They wanted to summarize it, so they came up with a special type of pie chart called a coxcomb chart that lets you mimic a compass. The compass rose is what you're used to seeing on the face of a compass, right? You know, you get north, south, east, west. Well, we're doing the same thing with the wind rose, so it summarizes all this data. To make it, it's pretty easy: we've got our positions for our wind. 
We have the speed of our wind, but we also have, in this case, the direction of the wind, and this one points to 230 degrees on the compass, so that would be west-southwest. So when we go to the graph, all we have to do is take those sections we've identified, pop them down here, ask for the bar chart, ask for the coxcomb style, and then take the speed and put it in the overlay. Now it's easy to see that a lot of my data is coming out of the western section, the northwest in particular, especially with the larger orange segments where there's a higher count. And of course, I brought in a nifty background map, so you can make it look really cool. If you want to learn how to draw those arrows, I've included that as well; that is something you would do under Points, Set Shape, Expression, and I'm showing you how you can put in just a little JSL scripting to draw these wind directions, where the length of the arrow is the strength of the wind. That's pretty cool. All right, so we are right at time. I'm going to include in your Journal where to learn more, so you can learn from the other galleries, the other blogs and journals, the other presentations, as well as other tutorials on the JMP Community. So please do learn more about Graph Builder, and please do share your data at JMP Data for Green. Please email me or contact me if you'd like to talk more about Graph Builder or see any of these views differently. I thank you, I hope you enjoy your discovery, and please do go help save the planet: get curious and share your results.
Do you want to build an analytics culture within your organization? In this presentation, we discuss how to develop an analytics strategy and advocate. There are facets to an analytics culture that require significant change that must begin with leadership and advocates within the organization who can set the tone and lead from the front. The analytics advocate must work to promote data as a strategic asset.    This presentation addresses how advocates can facilitate change, overcome resistance, promote collaboration, and educate and empower their workforces. They must find additional stakeholders to help execute a unified vision for change within the organization and adopt a plan that can result in an educated workforce to foster a successful analytics program.   As a tool to help upskill your organization’s workforce, this presentation also outlines and highlights unique ways companies can use content from Statistical Thinking for Industrial Problem-Solving (STIPS), a free, online statistics course available to anyone interested in building practical skills in using data to solve problems better.       I'm Biljana Besker, and I'm a JMP Account Executive for Global Premier Accounts. My colleague Sarah Springer and I will introduce you today on how to become an analytic advocate in your company and what resources we have available at JMP to help you get there. So let's start. What defines an analytics advocate? The analytics advocate must be an advocate and a change agent who spreads the analytic strategy and fosters an analytic culture that everyone is comfortable using data -based insights to improve the quality and effectiveness of their decisions. Five characteristics to look for in an analytical advocate are, first, credibility. They are trusted and well respected because of a proven track record of managing difficult projects to successful completion. And they have empathy because they listen to and addresses fears and resistance to change as new steps are taken on this unfamiliar path. And of course, they are problem solvers. They are willing to roll up their sleeves and work to overcome technical and cultural challenges that arise through each stage of implementation. They always show commitment. They support the analytics strategy and promote consistent interpretation of the goals for analytics. And they are flexible. Data-driven decisions require ongoing evaluation of their effectiveness. An analytics advocate must recognize when the part of the analytic strategy is not working, and work with all parties to redefine the solution. What is considered to be the analytics advocate's role? As an analytical advocate, you must promote data as a strategic asset and you have to address resistance and promote collaboration. And there is a big need to promote a culture of evaluation and improvement and to educate and empower the workforce. So assets do not necessarily have essential value, and assets are associated with liabilities. So how to promote data as a strategic asset? Analytic is about having the right information and insight to create better business outcomes. Business analytics means leaders know where to find new revenue opportunities and which product or service offerings are most likely to address the market requirement. It means the ability to quickly access the right data points to find key performance and revenue indicators in building successful growth strategies, and it means recognizing risks before they become realities. So how can you address resistance? 
There are three levels of resistance you must overcome. The first and most important level is C-level resistance. Preparing the technical infrastructure for an effective analytics program may require significant resource investment for an unknown return. This should be addressed, where possible, with a small project requiring minimal infrastructure, to secure a quick win with a positive expected ROI. If this is not possible, then show examples where others in the same industry have benefited. Second is department-level resistance. Process owners may resist the perceived effort associated with the data governance processes needed to make data clean enough to support analytics. The analytics advocate must find ways to show how such efforts will result in recurring long-term benefits to the organization that will turn into rewards and recognition for the department. Again, quick-win projects can help. However, the analytics advocate should not stop there; important tasks are best accomplished with a dependable ally with shared interests. And last but not least, we have front-line worker resistance. As with business process owners, front-line workers are not interested in extra work, as we know, if it's not reflected in the metrics used to assess their performance. A smart analytics advocate addresses the question, "What's in it for me?" Integrating analytics solutions into existing workflows reduces incremental effort and empowers front-line workers to make more informed decisions and improve job performance. So how do you become an effective advocate of analytics? As an analyst, you are obviously aware of the power of data analysis. You know that the application of appropriate analysis techniques to a well-constructed, meaningful data set can reveal a great deal of useful information, information that can lead to new opportunities, improvements in efficiency, reductions in costs, and other advantages. While many organizations have adopted analytics on a wide scale, several others still employ it only in certain areas, and some, believe it or not, rarely use it at all. If you often get excited thinking about new ways of applying analytics in your organization and are eager to share your excitement with people you think would benefit from analytics, you are in a good position to become an analytics advocate in your company. So first, focus on the person's greatest challenges and most burdensome tasks. Everyone has something about their job that is a source of frustration, no matter how much they love what they do. For the person you are working with, a meaningful application of analytics is one that relieves his or her frustration or minimizes it as much as possible. As long as the application is also important to the overall business, this is a great way to begin to show someone the true value of analytics. It's also a good idea to start small and then work your way up to bigger projects later, so that you are not overwhelmed and don't run the risk of not being able to deliver. Second, incorporate their knowledge and expertise. You may be an expert on the application of analytics, but you are most likely not an expert on every functional area of your organization; not even the CEO can make that claim. Therefore, you must rely on the insight of others to help you understand all of the complexity that cannot be contained within the data set, including any legal, ethical, or other considerations that must be taken into account. 
What's more, you are demonstrating respect for their specific knowledge, which will help build trust and make them more eager to work with you. Third, learn to speak their language. Being able to understand and communicate in the terminology used by the people you are working with will demonstrate that you are willing to meet them on their terms. It's not a two-way street, however: avoid using analytical and statistical terminology as much as possible. If necessary, practice finding ways to explain difficult or complex concepts in an easy-to-understand manner; metaphors often work well for this. Fourth, publicize your victories and share the credit. Once you have successfully completed the project, be sure to tell your boss, and ask him or her to spread the word throughout the organization, and externally if possible. But make absolutely sure that the credit is shared with those who assisted you in the project. This will help build attention to the power of analytics within the organization, as well as make the people you've just worked with feel rightfully appreciated and respected. If you look closely at these four recommendations, you'll notice they all have one thing in common: they put the focus on what you can do to help others. Whether you follow these specific tips or not, as long as you promote the use of analytics as a service that can help a person solve a problem that is important to them, you will go a long way toward fostering a positive attitude toward analytics throughout your organization. But how do you become a successful advocate of analytics? Put user experience first. For companies, it can be tempting to overlook the role of the end user and focus solely on business outcomes, which is why the analytics advocate must ensure that the focus remains on the value and overall experience for end users, in addition to the positive business outcomes the company wants to achieve. To bring us back to our earlier discussion of low adoption: an analytics strategy that does not consider the user's position and needs is at risk of becoming a strategy that is technically capable but not valuable enough to keep users engaged. To mitigate this risk, the analytics advocate must be able to explain the benefit of the analytics strategy to the business, but also ensure that the strategy is beneficial for the end users who need to make business decisions. And push the analytics strategy to evolve. Of course, user and business requirements change over time, so once the strategy is launched, the analytics advocate must ensure that the strategy evolves to meet those demands. Without iteration, the strategy runs the risk of outliving its usefulness and driving adoption rates down as a result. Instead, the analytics advocate must monitor, manage, and drive the strategy forward to ensure ongoing utility and maximum business value. Companies that want to introduce an analytics strategy can make themselves much more likely to achieve success by putting that strategy in the hands of someone who can understand end users, push the project to improve experiences and business outcomes, and who understands that analytics represents a journey and not a destination. Successfully appointing an analytics advocate is the first step in this process. Let me summarize what we just learned. At the most strategic level, analytics allows organizations to unlock latent value from their data to gain insights, accomplish business objectives, and improve profits. 
While these insights should empower everyone in the organization, many organizations resist the cultural changes needed to benefit from an analytics program. As a first step, executive leadership must establish and support the analytics strategy. Then, designate an analytics advocate to engage stakeholders to unify that vision, understand and address pain points, overcome resistance to adoption, and demonstrate the value from analytics through quick-win projects. All organizations can better accomplish their mission by leveraging analytics within a data-driven decision process. Using analytics to achieve a sustainable competitive advantage and generate significant return on investment begins with a well-conceived analytics strategy and roadmap for success that is aligned with and supports the overall business strategy. And with that said, I would like to hand over to my colleague Sarah Springer, who will show you how JMP can help you to become an analytics advocate in your organization. Thank you. Hi, I'm Sarah Springer. Biljana, thank you so much for providing that great overview of what makes a good analytics advocate within an organization. I'm going to look a little more closely at a couple of those areas that Biljana touched on, and we're going to talk about a process and some tangible resources that will assist you and your organization in building a culture of analytics. So how can JMP support your organization in becoming more analytical, and how can we support an analytics advocate? We've outlined a process here to help you accelerate your organization's analytics growth curve. That process is going to go through a couple of steps. First, we're going to talk about how to build a team of data ambassadors. We're going to talk about the best way and some resources to identify key use cases and define success. We're going to talk about how to establish an efficient data workflow, how to educate and what resources we have available to educate and upskill yourself and your colleagues, how to socialize your analytics successes, and then how to democratize data and the process. So Biljana touched on this in her presentation. But what is in it for you? If you're an analytics advocate within your organization, what can this do for you as an individual? You can be a vision setter and a change agent throughout your organization. This is an opportunity for you to make a real impact on the lives and the well-being of the people in your organization. You can be a subject matter expert. If you identify a specific problem or a specific area of need and upskill yourself in that area, you can really be looked at as an SME within your organization and gain some recognition for yourself within your organization and within the JMP community. You'll be gaining credibility. You'll become a leader in teaching others the skills that you've learned and upskilled on. And then ultimately, we've seen a lot of our analytics champions throughout all of our organizations really build a strong resume and advance in their careers because of the great work they've been doing at their organizations in building a culture of analytics. Data is everywhere. Analytics is an important competitive tool, and it's really past the point of being able to not have analytics embedded in your organization. And so what we've seen is that individuals who have been an analytics advocate within their organization have been quite successful, and so this is a real opportunity for you.
But what Biljana mentioned is that it's really also about helping others and making an impact on your organization and the world around you. And so what's in it for your organization? Advocates play a key role in demonstrating value and ROI. You're able to pick a project, a real challenge that you or your organization is having, and show the true value of the impact that analytics can have on your organization. Adoption of JMP, an analytical tool, or an analytical culture can really help, again, bring the organization into the digital transformation age. We are at the point where we can no longer afford not to take advantage of all of this data that we have. And so in this role you have a real chance to make an impact at your organization, and again impact your organization's bottom line: save your organization money by improving processes or producing less waste. Other use cases we've seen are securing time-to-market, and you're really helping your company to stay competitive. So the first part of the process, as you're thinking, "How do I make an impact at my organization?" is to think about, as Biljana mentioned, who can come along with me on this journey? Who else is feeling the same pains? Who else can benefit from a strong analytics culture? So look beside you into other departments, but then also up. Who at the executive level? What executive sponsor might be interested in some of these pain points I'm having? How can I get stakeholders to support this movement? How can we get buy-in early? And so the goal is to find reliable, passionate, accountable people who are maybe having some similar challenges as you, to walk through this journey with you and to help, as Biljana mentioned, show the value and prove to leadership and to stakeholders that this work is valuable and deserves attention and investment. Once you have your colleagues, you have buy-in, and you have your team, the next step of the process is really looking at identifying key use cases and defining success. What success looks like is an important part of defining an analytics strategy. Think about some common use cases within your organization: maybe you have too much data, maybe there's data not being used, too many systems to get the job done, not a good way to share decisions, maybe there's a lot of wasted time sifting through all of that. So figure out: what is my organization's challenge? What would success look like? How can I move the needle? And as Biljana mentioned, starting small is important. We want to think maybe of something that's not a huge undertaking, but maybe has a broad impact. So as you're thinking of the right place to start, these are some things to consider. Some great resources that I would recommend as you're thinking through this: you can look at your organization's annual report or 10-K, if you're a public organization, or maybe there are some internal documents that outline for the year what your risk factors are as an organization. You can get a strong overview of some of the concerns that executive leadership has about the risks they could face this year. Often, we see some risks across R&D and manufacturing that might be very relevant, problems that could be solved with analytics. Maybe it's time-to-market. Maybe it's reducing defects in the manufacturing process, right? Those are things outlined in your organization's 10-K or your annual report.
And that could be a really great win for you, right? If you can pull some folks together who want to solve one of those problems or improve one of those processes. The other resource I would recommend taking a look at is going to the JMP website and looking at our customer success stories. There's a whole library across industry and challenge that could really get your wheels turning and give you some great ideas about some possible use cases and what success might look like. So once you have your team and once you have the goal, next you want to think about how to create and establish an efficient data workflow. In order to do great analysis, you want to have good data access. You want to be able to streamline that process. How are you pulling, analyzing, and sharing that data? How are you getting the right information to the right groups? Where is your data? Is it accessible? Is there anything you can automate? Can you make anything easier? Can you use JMP Live to share information? So there are a lot of things to take a look at. Tangible resources for this include conversations with IT. Maybe you can look at possible scripting or automation within JMP or within your analytical tools to really make an impact, to make this as easy as possible so that it's in the hands of the right people who can solve these real problems and contribute to your success and the success of your organization. So next, we're going to talk about a step that is very close to my heart: training and upskilling colleagues. I spent some years at SAS within SAS Education, helping JMP users do just this. And so I wanted to touch on, once you have your team, you've defined your goals, and you've got your data access in a good spot, how do we give our team and employees and users the tools and the knowledge to execute this plan? There was a survey done by HR Dive studioID and SAS that was conducted in October of 2020, and they found that this is a huge need. 88 percent of managers said they believed their employees' development plans needed to change for 2021. A lot of this was coming out of us shifting into a pandemic world and people working remotely. And folks are really asking for training and for development and for help. Out of the survey, 50 percent of managers said employees needed more upskilling, more reskilling, and more cross-skilling. And 41 percent of the employees themselves said the same thing. And when considering the types of skills employees should focus on, employees needed more technical skills. They really, really want to build their skill set. And so as you're thinking about how to build an analytics culture, training and upskilling are really important. And people want those more technical skills so that they can make a contribution in this age of digital transformation. The survey also brought to light five major learning and development trends, so I just wanted to highlight these as something to be thinking about as you think about your strategy. The trends were that companies are now expected to take on more responsibility for employees and society, making sure that they're getting what they need and that people are being taken care of. Another theme was that companies need to match their technological investment with the learning and development of their people. Learning and development are much more universal, and they are really strong recruiting and retention tools.
And again, these hard skills are really in demand. And so as you're thinking about your strategy, it's so important to think about how you can help your colleagues and your organization get the right knowledge and training in their hands so that they can really be impactful with all this data that we have. So here at JMP, we have a couple of resources I want to point out that can be really powerful to help upskill your organization. The first that we're going to touch on is the Statistical Knowledge Portal. Then, we're going to take a deeper dive into STIPS, which is our statistics course. It's a free online course called Statistical Thinking for Industrial Problem Solving. It is award-winning, it is self-paced, and it is a wonderful resource that many of our customers are using to provide analytical development to their employees. And then finally, we do have some formal SAS training and some resources that I do want to point out as we go through this process. So the Statistical Knowledge Portal is a great site. I've put the link there for you; it has information in all of these different areas that I've listed. So it's a great way, if you have somebody who needs to know a very specific skill, they can go on here. They can pull some resources and skill up fairly quickly. It's a great way to get them started, to get their feet wet, to develop some knowledge, to get some tips and tricks. I would highly recommend spending some time on this website. And then you, as the analytics advocate, can really help drive the person to the skill they might need based on your project. So I think there's a lot of collaboration that can happen here. But I do want to point out that this is a phenomenal free resource on the JMP website and has a lot of great statistical information for you. The next one I want to point out is STIPS. All of these modules that are on the right are self-paced. They're deep dives into the topic area. They have hands-on exercises. And it's really going to help you get up to speed and understand that statistical concept. So as you're working on your project, as you're working towards your goal, think about different areas of this course that might be helpful. There is a great overview module at the beginning as well that I would recommend; it talks about what different processes you can use to begin to think statistically throughout your organization. So it starts at the beginning, and then it goes all the way down to advanced modeling, so it can really meet you where you are. And we'll talk a little bit more about this course. What I love about this course is that this is really something that JMP has put out because they want folks to be strong in analytics and they want folks to understand statistics and to understand the why behind what we're doing. And so we've put out some additional resources to help companies upskill their teams. So you can take this course in a self-paced format. But we've had many, many customers want to use these materials in a different way. We have customers doing lunch and learns throughout their organization, having sessions where they'll take a specific concept from the course and have discussion groups. We have professors at universities using STIPS or some of this material as prerequisites or even within their statistics courses. And so what we've done is we've made teaching materials that put some of this material into PowerPoint slides for you to use at your organization for some internal training.
And we wanted to make that easy and accessible for you. So go to jmp.com/statisticalthinking; there's an online form on the right-hand side of the page where you can fill that form out and get these materials to use to help upskill your team. And then finally, the third training resource I wanted to touch on is formal SAS training. SAS has incredibly strong, relevant, hands-on trainings that provide real depth and understanding of different concepts. There are lots of different formats for individuals, large groups, and small groups. And I've put the link up here so you can go check out those courses as well. It's a really great way to upskill your team and make sure they have the right tools. And if you don't know where to start, I definitely wanted to highlight a tool that SAS Education offers called the Learning Needs Assessment. It's a data-driven survey that can be distributed to your team, organized by learning area into the major areas that we often see our customers needing, maybe the major courses that SAS Education offers around design of experiments, scripting, or ANOVA and regression. These are great resources. And if you don't know where to start, we want to be able to survey your team, identify what their preferred learning style is, identify their competency in these areas, and then put it in a report that's easy for you and managers and executive leadership to understand. And then we would work together with you to make great recommendations. Those recommendations could be use of STIPS, use of formal training, or complimentary resources from the Statistical Knowledge Portal, but it helps give you an idea of where to start if you're not quite sure, according to your project and your goals, what your skill gaps are; sometimes you need a little help identifying those. So that's what this is for. So once you've got your team upskilled, once your team is trained in the areas and you're working on your project, it's time to document those successes, right? Biljana touched on this as well. You want to be able to document that as a proof of concept, show the value to your organization, continue to get that commitment and investment in your work, your team's work, and the power of analytics, and continue to help your organization move towards digital transformation. And so being able to document these successes is important. A couple of resources that I've found helpful to do this: I think the main one is our customer success program. We do have a great program, I mentioned earlier, where you can get these stories published on the website, but we've also helped some organizations with internal stories. Ask your account team for help. We want to help you document these successes, and so we can certainly help you do that. And if you want to tell a story in JMP, we would love to help you show the impact that you've made for your organization. And then finally, being able to democratize data and the analytics process. This is one of those steps: how can we take what you've done as a group and then spread this further throughout your organization? Once you've done this and you've documented your successes, I'm sure you're seen as an effective leader, and you're probably well respected. So now you get the opportunity to make an even broader impact on your colleagues, bring them along with you in that success, and make an impact on your organization.
So what you can do from here is really empower more people to be more data-driven. And I think that means using things like I mentioned with some of the STIPS tools: maybe you're leading lunch and learns, maybe you're creating a user group, maybe you're doing an internal newsletter about the power of analytics, maybe you're working with the JMP team to run sessions around features that have been very helpful to you with your project. So this is your opportunity to help others, help others at your organization make an impact, and help your organization shift to be more data-driven in today's digitally transformed world. So I want to leave you with some tangible next steps. We've gone through a process of how to build a culture of analytics. And as the next step, JMP has a great resource, jmp.com/advocate, where you can go and learn more about each of the steps that we've outlined today and what resources are available to you that correspond to each step. Today has been a great overview, but if you do want to take a tangible step to move towards an analytical culture within your organization, I would highly recommend that you go here and check it out, and then don't hesitate to reach out to your account team. We're all here to help and we want to support you in making an impact at your organization and around the world. Here are sources for your viewing pleasure. And I just want to thank you very much for your time and attention today. Be well.
Monday, March 7, 2022
JMP provides many data importing methods, allowing you to retrieve data from large databases down to simple text files. But what if your data is in a unique file format that JMP cannot directly import? By using the powerful but little-known JSL Blob (Binary Large OBject), you can extract that valuable data for use in your analysis. This presentation offers a case study in developing an import function for a common but obscure data source: the metadata locked within a digital photograph. Modern cameras record much more data than just the image. Metadata -- how, when, and where the photo was taken -- is locked away within its binary EXIF tags; with JSL Blobs, we can import that data into JMP. From there, we can use JMP's analytic tools to measure metrics of our photo library, and even produce a map-based display of our photos.     Hello, I'm Michael Hecht, and I'm here today to talk to you about importing binary data in JMP using JSL. I'm on the team that develops the software for JMP, and let's get started. JMP can import lots of different file formats, everything from plain text to Excel spreadsheets. When those are imported into JMP, they're shown to you as a data table that you can then use for further analysis with all of JMP's capabilities. But what if there's a data type that you can't import? JMP doesn't know how to do it. In this case study, I'll be looking at the JPEG image format that is pretty common amongst all digital cameras and smartphones. I'm sure everyone's familiar with it. In fact, you might be saying, wait a minute, I thought JMP can open JPEGs, and in fact, it can open a JPEG. But when it does, you get an image like this. But there's more in that JPEG than just the image. For example, if I get information on this one, I see what kind of device the image was taken with, what lenses were used, and even the GPS coordinates of where I was standing when I took the photo. Now, how can we get that data imported into JMP? Well, we can do it through JSL, which I'll get to in just a minute. But if we open this file in a text editor, we see that it's not human readable text. It's a series of bytes that are shown as these unprintable characters, and we call that binary data. It's data that's outside the range of normal alphabetic text. It has a structure, though, and the data is locked inside there. If only we could determine how to unravel it. To do that, we need to know the specification of how this data is laid out. In the case of JPEG, that's defined by a specification called Exif or Exchangeable Image File Format, and we can download the spec for it. It's a document that's been around for about 20 years, and it's in use by all the devices that produce JPEGs. Not only hardware devices like cameras, but even Photoshop puts metadata in a JPEG in this Exif format. To access it from JMP, we need to use a JSL object known as the BLOB or Binary Large Object. This is just a JSL object that holds a sequence of bytes. Like the name says, that sequence of bytes can be large. We can actually create a BLOB by loading the contents of any file on your hard disk into it using this Load Text File function. Normally, that would return the contents of a file as text, but if we add this BLOB keyword as a second parameter, then the function returns a BLOB. We can take one BLOB and subset a part of it into another BLOB using Blob Peek like you see here. This is taking 50 bytes from b, starting at offset 100.
Now the offset for BLOBs always starts at zero, so the first byte in the BLOB is at offset zero. Now we could do both of those operations in a single function call by passing the offset and length as parameters to the BLOB keyword when we call Load Text File. You see here that says: open the file at this path, skip 100 bytes in, then read 50 bytes, and return that as a BLOB. Once we have a BLOB, we can convert it to a character string using the Blob To Char function. Here we're taking b2 and converting its bytes into a character string, assuming that those bytes are in the "us-ascii" charset. If we don't specify a charset, JMP assumes it's UTF-8. We could also consider that BLOB to contain a series of numeric values, all of the same type and size. Using Blob To Matrix here, we're taking b2, which we read in and set to be a length of 50 bytes, and interpreting it as an array of unsigned integers, each of which is two bytes long. We should get back a matrix with 25 numbers in it. Now that fourth parameter, the string "big," says that those unsigned integers are in big-endian format, meaning the first byte is the most significant, the highest part of the int, followed by the lowest part of the int. We could also specify the string "little" to specify little-endian format. Binary files have both of these kinds of representation of integers and other numbers. In fact, the Exif format uses both big-endian and little-endian. Let's take a look at these operations in action. I'm going to switch over to JMP, and I'm going to open this demo script here, Demo number 1. Now we see some of the code we just looked at. Here we are loading a text file, this file named Beach.jpeg. This is the same file that I used in my slide. It's right here and you can see it. You can also see that it has a size of about 3.4 meg. When I run this one line of code, the log tells me that b was assigned a BLOB of some 3.45 million bytes, or 3.4 meg. It doesn't show me all those bytes, but I can see how big it is. I can get that length using the Length function, just like you use for character strings. But when I use Length on a BLOB, it gives me back the number of bytes. I can get a sub-BLOB of the first six bytes in b using Blob Peek. We'll do that here, and I see a BLOB of six bytes was assigned. I can actually look at the value if I want by just submitting the name of the variable. I can see here are those six bytes in this "ascii-hex" format: FF-D8-FF-E1, et cetera. I can take these six bytes and convert them to a matrix, and I'm going to convert them to a matrix of two-byte unsigned ints, or shorts, in big-endian format. Given that there are six bytes here, I should end up with a matrix of three numbers. When I run this, sure enough, there's my three numbers. We can see those three numbers in hex, just to verify, with this little for loop that I wrote, so let's do that. There they are, just the same as before: FFD8, FFE1, 0982. So now, let's look at the next four bytes following those six in the file, and we'll get them in a sub-BLOB all by themselves so that we can then convert them to a character string using Blob To Char. When I run this, I get the character string "Exif." You may have noticed in the slide showing the binary file contents that that little string was up there near the top; it's part of the Exif file specification and identifies it as such. Let's go back to the slides. Those functions are powerful.
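For readers who want to try this themselves, here is a minimal sketch that strings together the calls just described; the file name and offsets are illustrative rather than taken from the demo script.

    // Minimal sketch of the BLOB functions discussed above (file name and offsets are examples)
    b = Load Text File( "Beach.jpeg", BLOB );                      // whole file as a BLOB
    Show( Length( b ) );                                           // number of bytes in the BLOB
    b2 = Blob Peek( b, 100, 50 );                                  // 50 bytes starting at offset 100
    m = Blob To Matrix( Blob Peek( b, 0, 6 ), "uint", 2, "big" );  // first three big-endian shorts
    s = Blob To Char( Blob Peek( b, 6, 4 ), "us-ascii" );          // the four-byte "Exif" marker string
    Show( m, s );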
They let us do what we need to do to manipulate and read data from a BLOB, but they're a little cumbersome to use. Let's write our own utilities to make them a bit more manageable. I'm going to start with a function that I've named Read Value, and it takes a BLOB, then some offset within that BLOB, then the numeric type I want to read, and the size of that type. It's going to read one value out of the BLOB. I pass my BLOB and offset and size into Blob Peek, get back a sub-BLOB of just those bytes, and then call Blob To Matrix, passing in the type. I use the same size, so the size of the BLOB and the size of an element are the same. I should get back a matrix of one value, and I pass in "big" because I'm just always going to use big-endian format. But I don't want to return a matrix. I want to return that one value, so I pull it out of the matrix and return that. This is called like so: I call Read Value, I pass in b, I read one unsigned int starting at offset zero that's two bytes long, and I get back that value FFD8. There's a problem with this code though, and that's in this parameter b. B is that BLOB that's 3.4 meg in size. The problem is that JSL, when passing a BLOB to a function, always passes it by value, meaning it makes a copy of it. For every single number I want to pull out of my BLOB using this function, it will make a copy of that 3.4 meg just to pull out two bytes, or four bytes or whatever, and then throw it away when the function returns. That's inefficient, wasteful, and probably really slow, so we don't want to do that. How can we get around that? Well, instead of passing it as a parameter, let's put it in a global. We'll make a global that we load with our BLOB, and then we can call the function and it'll just refer to the global instead of a parameter. In fact, we can make a bunch of globals. We can record the length of the BLOB, the offset to where we are currently processing data in the BLOB, and maybe even one for the endianness. The problem with globals, though, is that they are in the global symbol table, and they might interfere with other code that you have. In fact, we'd like to write our importing code as something that can be used by other clients, and those clients might have their own variables by these names, or they might be using other code libraries that would interfere. How do we get around that? Well, I've done it by creating a namespace. I call my namespace "EXIF Parser." Now, instead of globals, I put them all as variables inside my namespace, and now they're namespace globals with that prefix. Before I call my function, I need to initialize them. I'll load the "Beach.jpeg" file into the EXIF Parser BLOB, and I'll record its length. I'll start off my offset at the very beginning, at zero, and I'll set the endianness to "big," and then I can change my function, simplifying it a bit like this. Now, I've actually put my function in the same namespace, Read Value as part of the EXIF Parser namespace. Now, all I need to do is pass a type and a size. Blob Peek now uses the global BLOB that's stored in the namespace and the global current offset that we're reading from, and Blob To Matrix even uses the endianness, so we can parameterize that, we can change it. Once I've retrieved the result I want, I'll increment the offset by the number of bytes we just processed, and then I'll return it as before. Let's see what that looks like in action. We'll look at Demo 2, and here's my namespace. Here are my globals in that namespace.
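The sketch below condenses the namespaced reader just described; the member names inside the namespace are my own guesses, and the complete version is in the paper that accompanies the talk.

    // Condensed sketch of the "EXIF Parser" namespace and Read Value (member names are guesses)
    New Namespace( "EXIF Parser" );
    EXIF Parser:blob = Load Text File( "Beach.jpeg", BLOB );   // the shared BLOB
    EXIF Parser:length = Length( EXIF Parser:blob );
    EXIF Parser:offset = 0;                                    // current read position
    EXIF Parser:endian = "big";
    EXIF Parser:Read Value = Function( {type, size}, {m},
        // read one value of the given type and size at the current offset, then advance
        m = Blob To Matrix(
            Blob Peek( EXIF Parser:blob, EXIF Parser:offset, size ),
            type, size, EXIF Parser:endian
        );
        EXIF Parser:offset += size;
        m[1]
    );
    EXIF Parser:Read Value( "uint", 2 );   // for example, returns the first marker, FFD8 in hex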
Here's the Read Value function, just like we saw. I've got some more functions that I've put in the EXIF Parser namespace. Here's Read Short, which just calls Read Value, but it always passes unsigned integer of two bytes. Similarly, Read Long reads an unsigned int of four bytes. I've also got Read Ascii, which you pass a size in bytes; it makes a sub-BLOB of that many bytes at the current offset from the global BLOB and then calls Blob To Char to convert it into a string. It's using the "us-ascii" charset, because that's the charset that the Exif specification says all of its character data uses. Then, just like with Read Value, it increments the offset past the bytes that were already processed and returns the string. Let's submit all of this code so that those things are all defined, and then we can try to use them. First, we'll initialize our EXIF Parser globals, and then I'll read the first three shorts from the file just like we did before. But now, I'm going to call Read Short. We're starting off at zero, so I'm going to call it three times in this loop. It will read in each successive short, advancing the offset as it goes, and then print them out just like before. There they are. Now, our offset is sitting at offset six, just past the last thing it read. I can call Read Ascii for four bytes and I get back that same string. Okay, so let's go back to slides. Now, we have some tools we can use to start building our EXIF Parser. We need to dig into the specifications to see what the EXIF data format in this JPEG file looks like. Well, at the top level, it looks something like this. It starts with two bytes, which are what's called the start of image marker. We've already seen those two bytes. They're the value FFD8. If your file doesn't start with that, it's not a JPEG. Then there's a series of blocks of data, and each block starts with two bytes, which is a marker, then two bytes, which is a size, and then some data, which is however many bytes the size said there were. Now the size also includes itself, so really the data is size minus two. You can see there are a bunch of different block types defined, but some of them are optional, and some of them can be repeated. The ones that we care about are APP1 and APP2. That's where the EXIF data will be. Then there's a bunch of others that we don't care about. Eventually, we see one called Start of Scan, or SOS. When we hit that, we know that the next part of the file will be the actual image data, which is the pixels. When we hit that, we can stop. Then after the image data is end of image. We need an algorithm to read this data. Here's what we'll use. First, we'll read the first two bytes and see if it matches the start of image marker; then we know we have a JPEG. Then we'll have a while loop, where within the while loop, each time through, we'll process a single block. To do that, we will save the current offset position, read the two bytes for the next marker, and if that marker is SOS, we can break out of the loop. Next, we read the two bytes for the size. Now we have all the information we need to process the block. Whatever that entails, we'll do it. Then we can skip past the data. In case processing the block didn't change our offset at all, we'll explicitly move our offset to whatever it was at the beginning of the loop, plus the two bytes for the marker, plus the value of the block size. When we get out of the loop, we either ran out of data in the file to process or we hit that SOS marker, so we're done. Let's see what that looks like.
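Sketched out, the convenience readers built on Read Value might look like the following; again, this is only a condensed approximation, and the full definitions are in the paper.

    // Sketch of the convenience readers described above
    EXIF Parser:Read Short = Function( {}, EXIF Parser:Read Value( "uint", 2 ) );
    EXIF Parser:Read Long = Function( {}, EXIF Parser:Read Value( "uint", 4 ) );
    EXIF Parser:Read Ascii = Function( {size}, {s},
        // take size bytes at the current offset and decode them as "us-ascii" text
        s = Blob To Char( Blob Peek( EXIF Parser:blob, EXIF Parser:offset, size ), "us-ascii" );
        EXIF Parser:offset += size;
        s
    );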
Demo 3 has this code. You see, we have our namespace, and all this is the same as before. I'm going to run it just to make sure everything is defined. Now, we're adding a new function, which I've also put in the namespace, and it's called EXIF Parser:Get EXIF Data Raw, and I'm passing in the path to the file, that JPEG file that we want to process. Now, I've defined an associative array here that maps those magic marker codes to their abbreviations so that we can print them out in the log. I load up my EXIF Parser globals like before, only now I'm passing in the file path that I was given, and then I start interpreting what the data is. First, I look at the very first short and make sure it's the start of image marker. If not, I just return, because it's not a JPEG file. I'm going to write to the log that I saw it at offset zero. Then here's my while loop to walk through those blocks. At the top of the loop, I'm going to reset my endianness to "big," because some of the blocks, when we process them, will have their own endianness and change it to little. We want to know that the endianness is big at the top of the loop because the block structure always uses big-endian data. Then I'm going to save whatever the current offset is, and then I'm going to read the next marker. It's a short, and I look to see if it's equal to SOS, which is that magic number. If it is, I can break out of the loop after logging that I saw it. Next, I'll read the two bytes for the block size, and then I will process the block. Now, in this example, my processing consists of writing a message to the log, so I'll do that. Then I'm ready to skip past the block. I do that by changing my offset to be whatever it was at the beginning, plus the two bytes for the marker, plus the block size. When I break out of the loop, I reset my globals and I'm done. Let's define that function by submitting this. I'll run the script, and then I can call it passing in "Beach.jpg"; let's see what we get. It printed out to the log: at offset zero, there's start of image; then at offset two, there's APP1, and it has this size, 2,466 bytes. Then we get APP2, which has about 30K of data. That's most of it right there. Then we have a bunch of blocks that we don't really care about, but eventually we see SOS, so we break out of the loop. That's all working well. Let's go back to slides. Now, I'm going to skip ahead in processing some of this file format just for the sake of time. But if you download the paper that's associated with this talk, the full code is there in much more detail. I highly recommend that you do that, but I'm going to give you a flavor of it here. What we do next is we process each of those blocks that we have read, and some of them we can ignore. We want to filter out the blocks that do not contain Exif data. Then for the ones that do, we need to do our own parsing. What we've discovered when we look into the Exif specification is that these blocks contain their own set of blocks of data called Image File Directories, or IFDs. Those then contain individual metadata items, with tags saying what the data is, what format it's in, and then the data itself. We want to collect all of those things together into these lists. There'll be lists of lists of lists, a somewhat complicated data structure. But the list data structure is very generic in JMP and can hold all kinds of data, so that's what we want to use.
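A stripped-down version of that block-walking loop could look like the sketch below; the marker constants come from the JPEG/Exif specification, and the real Get EXIF Data Raw function in the paper does more per block (logging each one and collecting the APP1/APP2 contents).

    // Stripped-down sketch of the block walk (marker values per the JPEG/Exif spec)
    SOI = Hex To Number( "FFD8" );   // start of image
    SOS = Hex To Number( "FFDA" );   // start of scan; pixel data follows
    If( EXIF Parser:Read Short() != SOI, Throw( "Not a JPEG file" ) );
    While( EXIF Parser:offset < EXIF Parser:length,
        EXIF Parser:endian = "big";              // block headers are always big-endian
        block start = EXIF Parser:offset;
        marker = EXIF Parser:Read Short();
        If( marker == SOS, Break() );            // stop before the image data
        block size = EXIF Parser:Read Short();   // size includes its own two bytes
        // ... collect or log the block here ...
        EXIF Parser:offset = block start + 2 + block size;   // jump to the next block
    );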
It'll have those metadata items tagged with these numeric values that we call the raw tags, but we want to replace those with actual human-readable labels that identify what they are. Let's look at this in JMP, and I'll look at Demo 4. Now, at this point, I have taken all my code in the "EXIF Parser" namespace and put it into its own file. Demo 4 is a client of my code, which is in "EXIF Parser.jsl," so I can just include that. Now, the function that gets the data, I can call it passing in the name of the JPEG file. I've extended this function in here to actually process those blocks, break them down, filter out the ones that are not Exif, all the things we just talked about, and give us back that data structure. Let's run this and see what we get. Well, we get a lot of numbers, some strings. You see, this is lists within lists here; there's an outermost list, and then it contains different items which are lists. Each of those has these pairs of values. There's a number, which is the raw tag, and then the value. This one's a string. This one is also a string. This one's a number. This one's a matrix. The data can be different types, but the tags are all numbers. What we want to do next is convert those numbers into human-readable items. In fact, this whole list, we want to convert it into an associative array that indexes the data by keys. The first thing I'm going to do is define a mapping from these numbers to the human-readable keys we want to use. That's in this long associative array right here. Now, I'm actually using the hex values for the keys, because that's how they're specified in the Exif specification, and it makes it easier to follow along when you're looking at the spec. There's a bunch of those, and I'm going to start at the bottom and work my way up. Down here is this function, Label Raw Data. I pass in this whole data structure that we got back from parsing the BLOB. Here's the definition of Label Raw Data. I'm going to return a list as my result, so I'm going to walk through the list as my input and use it to build up the result list. I use this For Each construct, which is a new, modern JMP function that walks through a list, and for each element of the list, it pulls that out into this variable, raw exif, which I'll pass to this function. Then I want to append it to my result. I do that using this Insert Into line. I'm inserting it at the end of result, and I have to use Eval List to overcome something that JMP is doing to be helpful with list creation. Again, there's more detail about this stuff in the paper. It's worth downloading and checking into. But for here, we're just going to look at the call to Label EXIF, and that's right up here. Label EXIF is going to do a similar thing, where it's going to walk through each of these tag-value pairs. Instead of returning a list, it's going to return an associative array. Here, we are initializing result to be an associative array. That's what this token means: it's an empty associative array, and then we'll return it at the end. We'll also use For Each to walk through the list, and we know that each raw item is going to be a list of two elements. We get the first element, which is the raw tag, and then the second element is the data. We simply build up our associative array by adding an item keyed by the raw tag with the value of the data. Pretty simple, except we don't want the key to be the raw tag. We want to transform it using our lookup table. That's what Get Tag does, and that's defined next up here.
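The following is a simplified sketch of that tag-labeling step; only two Exif tags are shown, and for brevity it keys the lookup table by the numeric tag rather than by the hex strings used in the actual demo.

    // Simplified sketch of Get Tag and Label EXIF (two tags only; the real table is much longer)
    EXIF Parser:ifd tags = Associative Array(
        {Hex To Number( "0132" ), Hex To Number( "8769" )},
        {"DateTime", "ExifIFD"}
    );
    EXIF Parser:Get Tag = Function( {id},
        If( Contains( EXIF Parser:ifd tags, id ),
            EXIF Parser:ifd tags[id],
            Char( id )            // leave unknown tags as their raw number
        )
    );
    EXIF Parser:Label EXIF = Function( {raw ifd}, {result},
        result = Associative Array();
        // each raw item is a {tag, data} pair; key the result by the human-readable label
        For Each( {pair}, raw ifd, result[EXIF Parser:Get Tag( pair[1] )] = pair[2] );
        result
    );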
It simply takes the tag id we pass in and converts it to hexadecimal. This gives us back a hexadecimal string; we need the rightmost four characters from that, and then we look it up in our ifd tag array up here. I'm going to submit all this code to define it, and then we'll have it call Label Raw Data. Here's the result. You can see that it's similar to before, except now the topmost-level list doesn't contain another list; it contains an associative array. This topmost thing is a list of associative arrays. In this first one, we can see that the raw key, whatever the number was, got converted to DateTime, and there's its value, and so on. But we notice the second one looks like it has tags that didn't get converted. Why is that? Well, it's because this key, ExifIFD, has as its data actually another IFD. Yes, this is a recursive data structure that's defined in terms of itself. If we want to label the things inside here, we have to change our code to label recursively, and we'll get to that in a minute. But before I leave this, I want to show that I'm going to actually combine these two steps into a single function that I call Get EXIF Data, where I first get the raw Exif data out of the BLOB, then I label it, and then I return the result of that. Let's define and run that, and it should be exactly the same as what we just saw. Sure enough, it is. I'm going to close this and go back to slides. Skipping ahead again, as I mentioned, we have to do our labeling recursively. Some entries in our metadata, or IFDs, have as their data another IFD. That means we have to call our labeling routine recursively. The way that I do it is to use this JSL built-in Recurse, which calls the current function, and you can pass in separate parameters for the recursive call. There are more details on that in the paper, which I'm sure you've already downloaded at this point. Now, the one thing to be aware of is that with these embedded IFDs, most of them use the same lookup table that we already defined, but some of them have their own lookup table. We have to make sure we're passing the correct lookup table, with its own definition of tags, to our recursive call as we're going through the different levels of recursion. Then once we have a fully labeled data structure returned, we can extract pieces from it to get the things we're interested in. We can run that over a whole series of images and collect all that data into a data table or some other format. Let's look at that in JMP. We'll pull up Demo 5, where I've rolled all of that recursive labeling into my Get EXIF Data function. I'm going to include my namespace code and then run that function. Now we're getting back our fully labeled data structure. You can see that now this ExifIFD has labels in it. There is this big block of numeric stuff, and we look and we see that it's in this thing called Maker Note. Maker Note is a special extension to the Exif specification that allows the maker of a particular device, in this case Apple, the maker of an iPhone 12, to embed their own proprietary data. In some cases, a camera manufacturer might reveal what they've embedded there. In other cases, people have sort of guessed at it and come up with their best guess. That's the case with Apple. There are some things that are known and other things that aren't known. You see a lot of this is just untagged. But some things we can see: acceleration vector, and runtime, and whatnot in there.
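Building on the earlier sketch, the recursive version might look roughly like this; the test for whether a value is itself a nested IFD, and the handling of per-IFD tag tables, are simplified compared to the real code in the paper.

    // Rough sketch of recursive labeling with Recurse (nested-IFD test and tag tables simplified)
    EXIF Parser:Label EXIF = Function( {raw ifd}, {result, key},
        result = Associative Array();
        For Each( {pair}, raw ifd,
            key = EXIF Parser:Get Tag( pair[1] );
            result[key] = If( Is List( pair[2] ),
                Recurse( pair[2] ),   // treat a list value as a nested IFD and label it too
                pair[2]               // plain value: keep as is
            );
        );
        result
    );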
Anyway, we're going to ignore that for the most part and look at what this thing contains. I can see that it's an associative array in that first element, and that's where most of the things I want to deal with are, so I'm going to pull that out into its own variable right here. It has 14 elements, so we can see what those are, what their keys are, like so. There are those keys. If we want to pull something out like "Model," I can do that simply by subscripting into exif1 with "Model," and I can see it's iPhone 12 Pro. I can do the same thing to get the date time, and there it is. But this is in the date time format that the Exif specification defines. That's not a format that JMP recognizes, but I can use JMP's Informat function to convert it, using this format pattern option, which is a modern JSL thing that lets us specify the pattern of the date time data, and JMP will convert it to a numeric date time, which it recognizes as such and formats for the log. That worked. Now, I'm also interested in the GPS coordinates, and those are in this GPS IFD part of the Exif data. It is one of those entries that is itself an IFD. Let's access it, and then we can see what it contains. It has information about the altitude, differential, image direction, latitude, longitude, and speed. What we care about is the latitude and longitude, which is these four things. There's latitude, there's longitude, and then they have these associated Ref values. We need all four of those to compute the coordinates. Let's start with the latitude. We'll pull that out into its own variable, and we see it's a list of three elements. If we look at what those three elements are, we see that there are three vectors with two numbers in each one. The Exif specification refers to these as rationals, and it uses them a lot. But what we want is actual numbers instead of these numerator/denominator rationals. We can convert them using this Transform Each function, which loops across this list and processes each element after putting it into a local variable r. We want to process that by dividing the denominator into the numerator, and then Transform Each builds a new list of those results and puts it in our variable, like you see there. Now these three numbers are the degrees, minutes, and seconds, but we want to combine them all into a single value for JMP to use. We have to add them together, scaling each component appropriately. Do that there and we get the value. Now, if that's in the Northern Hemisphere, we're fine. But if it's in the Southern Hemisphere, it needs to be negative. If we look at that "GPSLatitudeRef," it's either N or S, which tells us if it's S, we need to negate it. We can do the same thing for longitude. Its "GPSLongitudeRef" will be either E or W: E is positive east values and W is negative west values. Here we see we had to negate the longitude because it was in the West. If we dump those two numbers out, we can use JMP's built-in formats for latitude and longitude, and we can verify that they match what the Finder can pull out of the file as those coordinates. We can see that it's North and West, so it's in North America. Now I'm going to skip to Demo 6, where we put all this together. I'm going to use that code to pull information out of a whole bunch of photos. In this folder, I have 16 of them, and they're photos that I've taken at previous JMP Discovery Summits in Europe in years back when we used to do them in person. It'll put all the info it finds into a data table.
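For reference, the degrees-minutes-seconds conversion just described can be sketched as below; gps ifd is assumed to be the labeled GPS IFD, and the key names follow the Exif tags mentioned in the talk.

    // Sketch of the GPS conversion; gps ifd is assumed to be the labeled GPS IFD
    lat parts = Transform Each( {r}, gps ifd["GPSLatitude"], r[1] / r[2] );    // rationals to numbers
    latitude = lat parts[1] + lat parts[2] / 60 + lat parts[3] / 3600;         // degrees + minutes + seconds
    If( gps ifd["GPSLatitudeRef"] == "S", latitude = -latitude );              // south is negative
    lon parts = Transform Each( {r}, gps ifd["GPSLongitude"], r[1] / r[2] );
    longitude = lon parts[1] + lon parts[2] / 60 + lon parts[3] / 3600;
    If( gps ifd["GPSLongitudeRef"] == "W", longitude = -longitude );           // west is negative
    Show( latitude, longitude );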
I'm going to run this and here's our data table, and we can see that I've captured the names of the image files and the timestamps. You can see it goes from 2016 to 2018, and the lats and longs are North and East, so that's Europe. I can even see the progression of various iPhone models I had across those years and how their lenses improved over time. I've set this up so that I can select a row and click this Get Info table script, and it opens a window for me that shows me the photograph and the metadata for it that I've captured. I even have a button here for showing Google Maps, so I can click that, and up pops a Google map of that location. It's right there with the red marker. I can see if I zoom in that this is the Hilton Amsterdam. That's where we had the conference in 2016. That all seems to work well. In this case, I'm going to add the photos themselves to my data table as an expression column. That's what this Add Photos table script does. For larger collections, I would not want to do that because it's actually making a copy of those photos into the data table. But for 16, it's fine. For thousands or even hundreds, you'd probably not want to do that. It also sets my new column to be a label column. Now I can run this Explorer script, which opens a Graph Builder of those latitudes and longitudes. I can see some points here. Here's Amsterdam. There's the photo we just saw, and here are some more. That's definitely Amsterdam. These are Brussels. Yes, that's Brussels. Over here, we have Frankfurt. That year, we got to go to the cool supercar museum. That was pretty neat. Over here we have Prague. I'm going to use the magnifier to zoom in on Prague a couple of times. At this point our detailed Earth background is not really helping us much, so I'm going to switch to the street map service. We can see, yes, that's definitely Prague, and here is where we rode the historic streetcars up to Prague Castle. Here are some JMP attendees crossing the bridge to Prague Castle. I can see John Sall there in the distance. Over here, we have a very nice reception that we held in the municipal hall. Here's me checking the time on my Apple Watch against the Orloj to make sure that it's right. That all seems to be doing what we want. In conclusion, I want to touch on the things that we learned. We learned about the JSL BLOB object, which is a good tool to have on your tool belt for manipulating arbitrary binary data. We used that to build up a little application for importing files. Along the way, we learned some things about namespaces, JSL recursion, specialized list handling, and some modern JSL things like Transform Each and For Each. Again, those things are covered in much more detail in the paper. But most importantly, I think, is that we saw a case study of how to take a difficult problem, like a complex file format, and break it into smaller subtasks that we could conquer. That's a skill that we all have and need to make use of in our professional work. But I think it's many times helpful to observe someone else doing it and pick up tips and tools and techniques that we can then use in our own work. Now I want to turn it over to you to take these tools, use them to write the code to import your own binary files, and solve your own problems. Thank you very much.
In industrial development, especially if experiments are conducted on production scale, the number of DOE runs required to cover the actual problem is always too high, either in terms of costs or time. At the 2021 Discovery Summit, the introduction of SVEM (self-validated ensemble modeling) caught my attention, due to its power to build accurate and complex models with a limited number of experiments. Especially in the area of costly experiments, SVEM opens a way to fit complex models in DOEs with a considerably reduced number of runs, compared to classical designs.   Three case studies are presented. The first two case studies deal with designs conducted on a production scale, first on a five-factor, 13-run design and then on a four-factor, 10-run design, each with nonlinear problems. Another use case shows a Bayesian I-optimal 15-factor, 23-run design for a nonlinear problem. Especially within the first use case, the excellent predictive accuracy of the models obtained by SVEM led to the discovery of faulty measurement equipment, as measurement results started to deviate from the predicted results. I'm convinced that SVEM has the potential to effectively change the way DOE will be applied in product development.     Hello, and welcome, everybody, to my presentation today. Thank you for joining. I will give you a short talk about the concept of SVEM, or self-validated ensemble modeling, and how this concept can be a path towards DOEs with complex models and very few runs. I hope I can show you, on some use cases, that this concept is actually the solution for these DOEs with complex models and very few runs. But before I start going into the details of my presentation, I want to give you a short introduction to the company I'm working for. It's Lohmann. Lohmann is a producer of adhesive tapes, so mainly pressure-sensitive adhesive tapes, but also structural adhesive tapes. The Lohmann Group made a turnover in 2020 of about €670 million. It consists of two parts: the Lohmann Tape Group, that's who I'm working for, and the joint venture Lohmann & Rauscher. The Tape Group had a turnover in 2020 of about €300 million, and last year, we celebrated our 170th birthday. Basically, the Tape Group is divided into two major parts. One is technical products, which is where I work. We offer mainly double-sided adhesive tapes, pressure-sensitive adhesive tapes, and structural adhesive tapes. We are solely a B2B company. And the other part is our hygiene brand. Ninety percent of our products are customized products. We are active worldwide. We have 24 sites around the globe and about 1,800 employees. Who are we? We are the bonding engineers, and as I mentioned before, 90 percent of our products are customized, so our main goal is to offer our customers the best possible solution, a customized solution very specific to their problem. In order to do that, we have a quite large toolbox. We are able to make our own base polymers for pressure-sensitive adhesive tapes. We can do our own formulation, to formulate either pressure-sensitive or structural adhesives. We can do our own coating to form the final adhesive tape. But we can also do all the lamination, die cutting, processing, and making of spools and rolls, offering the customer the adhesive solution in the form he actually needs or wants, in order to best satisfy his needs.
But we can also do all the testing and also help the customer in integrating our adhesive solution into his process. Having this quite large toolbox, these many tools to fulfill the customer's needs, also comes with a lot of degrees of freedom. This huge number of degrees of freedom, and the large value chain which we have, means dealing with a lot of complexity. One of the solutions for us to deal with this complexity is to use DOE, or design of experiments, in the development department, mainly for polymerization, formulation, or coating. That brings me back to the topic of my talk. We use DOE to tackle this large complexity and to be able to use all the tools in the best possible and most efficient way to fulfill our customer's demand, but we want to do that as efficiently as possible, tackling complex problems with the lowest amount of experimental effort. That, I think, is where SVEM as a concept comes in. As you all know, during process development or product development, you go through the same stages. We start in the lab, doing our experiments on a laboratory scale, switch to the pilot scale, and then take the final scale-up step when going to production in order to produce the final product. Along the way, the effort per experiment, be it time or cost, dramatically increases. In order to minimize these costs, we use DOE to minimize the experimental effort. But also, as you go through the steps from the lab over the pilot to production, the higher the effort per experiment is, the more critical the number of experiments becomes. Situations where the number of experiments is critical might be, for us but also for other industries, if you have to do experiments on a pilot or a production scale. Or even if you do experiments on a laboratory scale: if you have, for example, long-running experiments, or the analysis of an experiment takes very long, that might be a situation where the number of experiments is very critical. But also in combination with that, if you have complex models or complex problems which you have to address, if you, for example, need a full RSM model or if you want to do an optimization, you will run into this problem where you might always need a large number of experiments in order to model this complex problem. The best situation would be that you can generate DOEs which allow you to tackle a very complex problem, or to apply complex models, but at the same time keep the number of experiments as low as possible. Just to give you an example in our case, the production of adhesive tapes: if I do experiments on a laboratory scale, I need less than a kilogram of adhesive per experiment, and I definitely don't need more than one or two square meters of adhesive tape in order to do the full analysis which I want. If I then move to the pilot scale, depending on the experiment we want to do, you might need between 10 and 50 kilograms of adhesive, and you coat maybe 25-100 square meters of adhesive tape per experiment. If you go even further, to the production scale, you might need even more than 100 kilograms of adhesive per experiment and coat maybe 1,000 square meters of adhesive tape per experiment. But at the same time, I still only need about one or two square meters to do the full analysis of my experiment.
In that unfortunate situation, 99.9 percent of my product is basically waste. That's a lot of material for one experiment, and it also comes with an enormous cost per experiment. Just to give you an illustration, here is a picture of our current pilot line; for scale, that's the size of a door. Even for a pilot line it's quite large, so you can imagine that the amount of material you need is quite large, and those are the numbers per experiment. But even on a laboratory scale, you might run into situations where the number of runs is critical: either you have complex models or a large number of factors, as I'll show you in the last use case, where it's a chemical reaction in which we want to vary the ingredients as well as the process parameters; or you have long-running experiments that take more than one day, or the analysis of the experiment takes very long, while at the same time you have very limited project time or budget. In all these situations, it is very desirable to minimize the experimental effort, that is, to decrease the number of runs, as much as possible. At the 2021 Discovery Summit, I came across two presentations, one by Ramsey, Gottwalt, and Lemkus, and the other by Kay and Hersh, talking about the concept of SVEM, or self-validated ensemble modeling. For me, that raised the question: can that concept actually be the solution for DOEs with few experiments that at the same time keep the complexity of the model your problem requires? With that said, I want to switch over now to my use cases, to hopefully show you that this concept is, or might be, a solution to that problem. I want to switch over to JMP. The first example I want to show you today is a design of experiments we had to run on actual production equipment, at production scale. The second one was done on a pilot plant, and the third one was done in the lab. One piece of context: for the first two examples, the designs were created before I knew about the concept of SVEM, but the analysis was done after I had learned about it. The third example was done after I knew about SVEM, so I designed that DOE specifically with the SVEM analysis concept in mind. So, on to the first example. Here, we wanted to do product development by means of a process optimization. We could only do that optimization on the actual production equipment, so all the experiments had to be done at production scale. As you can imagine, you always have to find production slots and compete with actual production orders, the normal daily business. So the number of runs you can do on production equipment is always very limited and very costly, and you always have to fit in around normal production. For this first example, we had five continuous process factors, we expected nonlinear behavior, so we knew we needed the full quadratic terms and the expected interactions, and we also knew we had a maximum of 15 runs due to limited production time and money. What did we come up with? Again, as I said before, this was done before I knew about the concept of SVEM.
When we created the design, we set all the linear and quadratic effects we needed to "necessary" in the Custom Design platform and set the interactions to "optional," and that gave us the 15-run design we ended up with. Then we started the experiments. Unfortunately, only 12 runs could be accomplished, because we had limited time and capacity and ran out of time for the last three remaining runs. After 12 runs we basically had to stop and do the analysis, but fortunately we could do the remaining three experiments after the analysis was done. That put us in the position where these three extra experiments, conducted after the analysis, were effectively a test set on which we could check the validity of the model we had created using the concept of SVEM. The actual data analysis is shown in those two presentations from last year's Discovery Summit, and I've included the codes so you can find them easily on the Community page. The data was analyzed using SVEM: I did 100 iterations, used a full quadratic model with all the interactions possible between these continuous process factors, and used the Lasso as the estimation method. I've included the script showing how this SVEM analysis is done, but I also want to refer you to those two presentations, where it is all written down and explained in great detail. Coming to the results, I want to show you the profiler. What you can see is that we really do need second-order terms, so we have a quadratic dependency, and we also have interactions. Everything we expected was basically present. But let's check the validity. You might say, "Okay, you just overfitted the model. You used too many model terms, it's too complex; that's why you have second-order terms and interactions." Well, let's look at the predicted-versus-actual plot. For this example, we were in the good position of actually having a test set: the three remaining experiments, which the model has never seen, alongside the 12 training runs, shown as the red dots. As you can see, the predicted-versus-actual plot is not too bad; the prediction of the three remaining runs is pretty good. If I look at another response, the prediction is very, very accurate. As I mentioned, the model has never seen these runs. For us, that was quite amazing: having only 12 runs of what was in a sense a crippled experiment, because three of the runs were missing, and still getting this very good predictive capability from SVEM. So in general, the prediction was very good. We could predict the remaining three runs very accurately for every response except one, which didn't fit at all. We thought, "That can't be. We can't predict 10 of the 11 responses almost perfectly and have the 11th one not fit. Something has to be wrong, and it can't be the experiment, because the other responses fit and our prediction was very good; something in the measurement has to be wrong." So our experts dug a little deeper into the measurement equipment, and they actually found deviations in the measurement: the measurements for that last response were handled a little differently than for the first 12 runs.
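For readers who want to try the same recipe outside of JMP, here is a minimal sketch of the SVEM idea as described in those presentations: anti-correlated fractional weights for self-validation, a Lasso base learner over the full quadratic model, and averaging over 100 iterations. This is an approximation in Python/scikit-learn, not the FIT add-in or JMP Pro's Generalized Regression implementation; the file name, column names, and alpha grid are assumptions.

```python
# Minimal sketch of the SVEM idea (self-validated ensemble modeling) with a Lasso
# base learner. NOT the JMP implementation; file, columns, and settings are assumed.
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
df = pd.read_csv("doe_runs.csv")             # hypothetical: the 12 completed runs
X_raw = df[[f"X{i}" for i in range(1, 6)]]   # five continuous process factors
y = df["Y1"].to_numpy()                      # one response

# Full quadratic (RSM) model: main effects, two-factor interactions, squares.
expand = PolynomialFeatures(degree=2, include_bias=False)
X = StandardScaler().fit_transform(expand.fit_transform(X_raw))

alphas = np.logspace(-3, 1, 20)
n_iter = 100                                  # 100 SVEM iterations, as in the talk
coef_sum = np.zeros(X.shape[1])
intercept_sum = 0.0

for _ in range(n_iter):
    u = rng.uniform(size=len(y))
    w_train = -np.log(u)                      # anti-correlated fractional weights
    w_valid = -np.log(1.0 - u)
    best_sse, best_coef, best_int = np.inf, None, 0.0
    for a in alphas:
        m = Lasso(alpha=a, max_iter=50_000).fit(X, y, sample_weight=w_train)
        resid = y - m.predict(X)
        sse = np.sum(w_valid * resid**2)      # self-validation on the same runs
        if sse < best_sse:
            best_sse, best_coef, best_int = sse, m.coef_, m.intercept_
    coef_sum += best_coef
    intercept_sum += best_int

# Ensemble model = average of the per-iteration fits; use it for prediction.
coef_avg, intercept_avg = coef_sum / n_iter, intercept_sum / n_iter
```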
After the deviations were corrected, the prediction was again very good. In other words, the prediction was so good that it even led us to deviations in the measurements. From that first example, I would say SVEM works perfectly and gives you great insight and a very accurate model with very limited experimental effort. You could say, "Okay, you could also have used a different screening design with, for example, 11 runs, and you would have gotten exactly the same result." That might be true, but then I refer you to the third example. The second example is very similar. We wanted to do a DOE on a pilot plant, and we had four continuous factors. Two of those factors required specially produced ingredients, because those two represented a process variation of the operating window of our current process. We also expected nonlinear behavior, and we were told not to exceed 10 runs; in fact it was strongly recommended that fewer runs would be more than welcome, because of capacity issues. So we ended up with a hybrid production/laboratory approach: we only needed six runs on the pilot or production scale, which we boosted to 10 runs in the lab. That's basically the design we ended up with, and this is how I created it: I set the linear and some interaction terms to "necessary" in the Custom Design platform, and the remaining interaction and quadratic terms to "optional." Again, the creation of the design was done before I knew about the concept of SVEM, but the analysis was then done using SVEM. In this particular case, factors 1 and 2 represent the process variability. What we wanted to achieve was to minimize a certain response, response number 7, and its variability, without changing factors 1 and 2, because changing the process variability which we [inaudible 00:17:56] have is very difficult. But we had two additional factors, parameters of the coating equipment, which we could change very easily. The goal was to minimize response number 7 while keeping everything else within the window we wanted. The analysis was done pretty much the same way as before. We used SVEM as presented in last year's presentations: 100 iterations, the full quadratic model with all the interactions, and the Lasso as the estimation method. Looking at the profiler again, we had second-order terms and we had interactions, as you can see here. Everything we expected to be there, the quadratic dependencies and the interactions, actually was there. Our goal was to minimize this response without changing factors 1 and 2, because that's the process window we're currently operating in. Across that process range, if you look at this response, it changes quite a bit. Doing the optimization with this model fully operational, we found that if you just change, for example, factor number 3, the variability basically vanishes completely without changing the rest of the factors. So with only six runs at production scale, we found optimal settings [inaudible 00:19:36] to enhance or optimize a process quite considerably. You might again say, "Okay, four factors. You could again have used a definitive screening design."
Or you might say, "Ten runs is not too few runs for that." But that brings me to my last and final example. We were doing a design of experiments on the laboratory scale, and we wanted to optimize, or understand in more detail, a polymerization process. We wanted to cover all factors and variations within one single designed experiment, and we ended up with 15 factors: eight ingredients and seven process parameters. We did not use a mixture design; we handled the eight ingredients in a way that avoided ending up with one. As the experiment itself was very time-consuming, the number of experiments had to be low, ideally below 30, because we had limited project time and 30 was about what we could afford. By that point I already knew about the concept of SVEM, so I designed the experiment specifically with SVEM in mind. I chose a Bayesian I-optimal design. We knew there would be nonlinear behavior, so all the second-order terms and interactions were included. What we ended up with is the design you can see here: only 23 runs, with 15 factors of ingredients plus process parameters, so basically a full RSM model with only 23 runs. For me, it was pretty amazing to be able to do that. Let's look at the results and see whether it actually worked, because at the time we weren't so sure that it would. The analysis was again done the same way. The only difference is that I won't show you an interactive profiler for this example, because with 15 factors and quite complex models, the JSL implementation makes it very, very slow. But I was told this is going to be built into JMP 17, which will be much faster, so I'm really hoping to get JMP 17 as soon as possible. What we found is that nonlinearity was present: we again had interactions and second-order terms that were significant, so again we needed the full model. The experimental verification is currently ongoing; the first results look very promising, but I haven't included them here. In contrast, a classical approach would have required many more experiments than just 23. And again, to show you that we had very good prediction capability, here are some predicted-versus-actual plots. For one response it's very good; another is not quite as good as the first, but still good enough for us. So the predictions here are very good, and for the first verification runs it looks very promising. Here is an image of the profiler; it's unfortunately non-interactive. Second-order terms were present, interactions were present, so everything we expected was basically in the model. With that, I'm at the end of the use cases, and I want to switch back for a short summary. Kay and Hersh gave their presentation at last year's Discovery Summit the title "Re-Thinking the Design and Analysis of Experiments" with a question mark. From my point of view, you don't need the question mark; it is simply so. I'm convinced that SVEM will change the way DOE is used, especially if you go for very complex designs, a large number of factors, or very costly experiments. At least, that's my opinion.
It opens a new way towards DOEs with minimal experimental runs and effort, while at the same time letting you gather the maximum amount of information and insight, without having to sacrifice the number of factors or the complexity of your models. I'm convinced SVEM will change the way that at least I use DOE, and for me, it was pretty amazing to see it in action. With that, thank you very much for watching, and I hope I could show you that this concept is worth trying. Thank you very much.
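As a back-of-the-envelope check on why the 15-factor case above is out of reach for a classical one-model-per-term approach: a full quadratic (RSM) model in k factors has 1 + 2k + k(k-1)/2 terms, which for k = 15 is 136 terms against only 23 runs. A tiny sketch of that count (my arithmetic, not a figure from the talk):

```python
def rsm_terms(k: int) -> int:
    """Terms in a full quadratic model: intercept, k main effects,
    k squared terms, and k*(k-1)/2 two-factor interactions."""
    return 1 + 2 * k + k * (k - 1) // 2

for k in (4, 5, 15):
    print(k, rsm_terms(k))   # 4 -> 15, 5 -> 21, 15 -> 136
```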
Monday, September 12, 2022
For many applications, JMP is like a blank slate, waiting for you to decide what to do. However, sometimes it’s better to start with more than a blank slate. Sometimes you want a guided experience. For some time now, JMP has had guided experiences and wizards for such tasks as opening spreadsheets, importing data and accessing databases. But it’s time to take it to the next level for more general applications: Action Recording, introduced in JMP 16, has been a great enabler to bridge the two worlds of the interactive and the automated. It allows you to do the work interactively and then use the recording to make a script for the same work the next time. JMP 17 takes that to the next level, wrapping the recorded actions into a workflow environment where you can edit, customize, debug, generalize and save without ever needing to see the underlying code. Julian Parris demonstrates the new Workflow Builder in JMP 17. For engineers new to designing and analyzing experiments, JMP can now guide you through all the steps of the experiment. Joseph Morgan explains how Easy DOE makes this process easier than ever before by providing a guided workflow in a wizard-like unified environment that helps make all the best decisions for you, sparing you from the burden of having to figure it all out. There’s also a middle path for applications such as medical review of clinical data. The standardization of the data model, with CDISC, enables a semiautomated path to the analysis. JMP Clinical begins a new era with a much faster, more flexible, more full-featured environment for analyzing clinical data. Once you have accessed and refined your data, and produced a compelling analysis, what is the natural next step? Sharing your findings with others in your organization. JMP Live makes this natural next step easy and powerful. You can share an interactive JMP analysis with your colleagues, even if they don't use JMP themselves. In this way, you can use JMP Live to help drive decisions in your organization. In a JMP Live space, collaboration permissions can easily be turned on to allow you and your trusted colleagues to see, download, or even edit and replace each other's data and reports. These flexible collaboration spaces can increase the speed and the ease of learning in your organization. Whether your work can be planned ahead or is a wild path of exploration and serendipity, JMP has the modes to make your work speedy, efficient and productive, while keeping you in an environment of discovery.
Monday, September 12, 2022
According to Wikipedia, an unconference is a participant-driven meeting designed to encourage attendee involvement in topic selection and knowledge sharing. Join us for these lightly structured discussions to share your ideas and learn more about JSL scripting. We had some great discussions and much-needed time to reconnect and network with peers and colleagues from around the world. We mentioned many resources during the sessions for learning JSL, parsing report outputs, and more. Below are many of the content items that were mentioned or utilized during the sessions. Like-minded folks have a question, know the answer, or just want to hang out, learn, and follow the discussions. Others want to be part of a group where they can teach and share their knowledge and resources. For you, we recommend the JMP Scripting Users Group. The rich JMP Scripting Language (JSL) lets you work interactively and save results for reuse. It even allows you to develop new functionality to solve problems that core JMP does not address. We'd like to help you take full advantage of JMP and its automation capabilities. To do so, we've created the JMP Scripters Club: a community of JMP users who are eager to learn and leverage the JMP Scripting Language and to share their knowledge with fellow JMP users. Do you enjoy cooking? Or are you like me and love an appetizer? Maybe you are a fan of the full-course meal, or just want a sweet snack? Then hop on over to the JSL Cookbook, where you'll find tasty and delightful recipes using the extraordinary ingredients available in the JMP Scripting Language. @Wendy_Murphrey and @_jr showed us how to FLASH (our reports) and then parse them with XPath queries. We learned that JMP uses version 1 of XPath, and we saw on the fly how to implement a query to dynamically pull estimates from the report output. Namespaces, oh my! @drewfoglia shared options for launching JMP platforms with pre-populated selections. He also mentioned namespaces, and I refer the reader to the Namespaces online documentation for more information on the topic. Drew also shared his paper Object-Oriented JSL – Techniques for Writing Maintainable/Extendable JSL Code, and another great resource mentioned during the session was Essential Scripting for Efficiency and Reproducibility: Do Less to Do More (2019-US-TUT-290). Workflows for sharing shortcuts and developing repeatable analyses are available in the upcoming JMP 17 release. We heard from @Mandy_Chambers about Workflow Builder; I bet you, like me, are really excited to get your hands dirty with this great new capability. Looking to learn JSL? Head on over to the Learn section of the community, where you can find free on-demand courses, discount codes, and more. JMP Education shared some great news about our first free on-demand course, Introduction to the JSL Scripting Language, available immediately. We also heard about certification exams and how to receive a special discount code for 55% off the certification exam. Looking for syntactic sugar, or do you have literal-value dilemmas? Check out the information provided by @Joseph_morgan2. He shared a brief refresher from the JSL tutorial that he delivered as part of Discovery Summit Europe in Prague. He also reminded us that the JMPer Cable has excellent articles available for consumption in short snippets, for example, Expression Handling Functions: Part I - Unraveling the Expr(), NameExpr(), Eval(), ... Conundrum.
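To make the XPath idea concrete outside of JSL, here is a generic XPath 1.0 query sketched in Python with lxml against a made-up XML fragment; the element and attribute names are hypothetical, and this is not the JMP report structure or the exact workflow demonstrated in the session.

```python
# Generic illustration of an XPath 1.0 query, in the spirit of the report-parsing
# session above. Python/lxml on a made-up XML snippet; names are hypothetical.
from lxml import etree

report_xml = b"""
<Report>
  <Table name="Parameter Estimates">
    <Row term="Intercept" estimate="2.31"/>
    <Row term="X1"        estimate="0.87"/>
    <Row term="X1*X2"     estimate="-0.42"/>
  </Table>
</Report>
"""

tree = etree.fromstring(report_xml)
# Pull every estimate from the "Parameter Estimates" table.
rows = tree.xpath("//Table[@name='Parameter Estimates']/Row")
estimates = {r.get("term"): float(r.get("estimate")) for r in rows}
print(estimates)   # {'Intercept': 2.31, 'X1': 0.87, 'X1*X2': -0.42}
```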
Joseph also shared a link to Using JSL to Develop Efficient, Robust Applications. If you were lucky enough to attend the in-person or virtual event, please leave a comment below and let us know what you enjoyed most.
We lose approximately 920,000 shelter animals to euthanasia requests every year. Instead, these animals could have made 920,000 families happier. We would like to explore the current data from Austin's animal center to understand which conditions lead to a euthanasia request and whether measures can be adopted to prevent them. The data is sourced from Austin's Open Data Portal and consists of two tables, intakes and outcomes, dating from Oct 1, 2013, to the present. Intakes represent the status of animals as they arrive at the animal center, while outcomes represent their status as they leave. Each animal is identified by a unique Animal ID. Each table consists of 136K data points and 12 features. We first explore the distribution of data by categories such as breed, gender, age, and intake condition. Finally, classification models such as logistic regression and random forest classifiers are used to predict whether an animal will be euthanized. Understanding key factors like intake condition, sub-type of euthanasia, breed, and age could unveil crucial insights into why these animals are put down and consequently advise where to target funding for research and facilities. Hi, my name is Shalika Siddique. My name is Anand Manivannan, and we're both students from Oklahoma State University, currently pursuing a business analytics and data science degree. Today we are presenting a poster in which we explore euthanasia in animal shelters, and we hope to understand why cats and dogs are being put down. Every year, we lose about 920,000 animals. Using JMP Pro, we would like to identify the key factors that lead to euthanization of cats and dogs. Once we identify these key factors, funds can be channeled to relevant sectors to prevent the euthanization of animals that could have been saved. In addition to this, we aim to make predictions to identify which animals are most likely to be euthanized. A little information about our data set: we sourced the data from Austin's open data portal, and the animal shelter we used for the analysis is located in Austin, Texas. Overall, we had about 130,000 records. After cleaning and filtering, we focused on about 67,000 records that were specific to cats and dogs. Prior to our analysis, we explored the data set and attempted to derive insights, using JMP's Graph Builder to create visualizations such as bar graphs. Of the 67,000 records, about 3,171 animals were euthanized, which is about 4.7% of the animals at the shelter. In comparison to animals surrendered to the shelter by their owner, stray animals were most prone to euthanasia. When we compared the ages of the animals, we noticed that kittens under the age of 15 months contributed 25% of euthanizations, while pups contributed 13%. This bar graph is an example of one of the visualizations created using JMP's Graph Builder; the lavender bars represent cats, while the purple bars represent dogs. We can see that intact males, followed by intact females, are more prone to euthanization compared to neutered animals. Next, Anand will go over the modeling in detail. Thank you, Shalika.
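The exploratory summaries described above were built with Graph Builder; as a rough non-JMP sketch of the same kind of breakdown (euthanasia rate overall, by intake type, and by sex), here is some pandas code. The file and column names are assumptions, not the actual Austin Animal Center schema.

```python
# Rough pandas sketch of the exploratory summaries described above.
# File and column names are assumptions about the merged intakes/outcomes table.
import pandas as pd

df = pd.read_csv("austin_outcomes_intakes.csv")
cats_dogs = df[df["animal_type"].isin(["Cat", "Dog"])].copy()
cats_dogs["euthanized"] = (cats_dogs["outcome_type"] == "Euthanasia").astype(int)

# Overall euthanasia rate (~4.7% in the poster's 67K cat/dog records).
print(cats_dogs["euthanized"].mean())

# Rate by intake type (stray vs. owner surrender) and by species and sex.
print(cats_dogs.groupby("intake_type")["euthanized"].mean().sort_values())
print(cats_dogs.groupby(["animal_type", "sex_upon_outcome"])["euthanized"].mean())
```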
Yes, I'd like to talk a bit more about our approach to modeling using JMP. Before we could start modeling, we performed a few data preprocessing steps to prepare our data. We standardized units for certain variables, such as age, which was recorded in weeks, months, and years; we converted that to just months. We binned the age variable so we could convert it into a categorical variable with age ranges like 10-15 and 15-25. We grouped rare breeds and colors to reduce the number of categories. Additionally, we filtered down to just cats and dogs from all the other animals that went through the shelter. During the modeling phase, we noticed something very peculiar: a class imbalance in our target variable, which indicates whether an animal was adopted or euthanized. About 64,000 of the 67,000 records were adopted animals, and only 3,000 were euthanized. Since our model was to focus on predicting euthanasia, we had to resolve this issue, and hence we used JMP's Bootstrap Forest and Boosted Forest models, which use the concepts of bagging and boosting. Since bagging and boosting models don't leave a lot of room for interpreting what the variables do, we also used logistic regression to interpret these variables. After modeling, we tuned the parameters to get the best results and chose a few metrics to select the best model based on its performance on validation data. We used a 70:30 validation split, and prior to modeling we also tested the assumptions for logistic regression. At the top right, you can see that we tested for multicollinearity and independence among variables using JMP's contingency analysis, which produces a mosaic plot and gives a p-value and a correlation value that told us which variables were correlated with each other. Now I'd like to dig a bit deeper into each model and how we selected our models. At the top left, you can see that we chose metrics like specificity, misclassification, area under the curve, and R-square to decide which model performed best. These metrics were chosen for a particular reason that aligned with our goal: to predict which animals are likely to be euthanized. The cost of our model incorrectly predicting a euthanized animal as a non-euthanized animal would mean that animal would probably die and not be saved. Hence, we wanted to focus on increasing the accuracy on euthanized animals and reducing the misclassifications, and these particular metrics were chosen for that reason. First we ran the nominal logistic regression model, which you can see at the bottom left. The LogWorth immediately showed us which variables were the most important in predicting euthanasia: the sex of the animal, intake condition, intake type, and outcome age. A lot of these are not surprising, and they matched what research shows.
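For a sense of what this modeling setup looks like in code, here is a hedged scikit-learn sketch: a 70:30 stratified split, a random forest standing in for the Bootstrap Forest, and a logistic regression for interpretation, with class weighting as one simple way to handle the roughly 64,000-to-3,000 imbalance. Column names and settings are assumptions and differ from the exact JMP Pro options used in the poster.

```python
# Sketch of the modeling setup described above; not the exact JMP Pro workflow.
# A random forest stands in for Bootstrap Forest; class weights handle imbalance.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("austin_outcomes_intakes.csv")            # hypothetical file
cats_dogs = df[df["animal_type"].isin(["Cat", "Dog"])].copy()
cats_dogs["euthanized"] = (cats_dogs["outcome_type"] == "Euthanasia").astype(int)

X = pd.get_dummies(cats_dogs[["animal_type", "sex_upon_outcome", "intake_type",
                              "intake_condition", "age_group", "breed_group"]])
y = cats_dogs["euthanized"]

# 70:30 validation split, stratified so both classes appear in each part.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.30,
                                          stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                random_state=42).fit(X_tr, y_tr)
logit = LogisticRegression(max_iter=5000, class_weight="balanced").fit(X_tr, y_tr)
```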
The whole model turned out to be significant as well, with a p-value less than 0.001. Following that, we ran the Bootstrap Forest model, which was tuned to have a hundred trees and a feature-selection criterion value of three. We used the receiver operating characteristic, or AUC, curve to determine which classification threshold gave us the best classification results, and we ended up using 0.1, or 10%, as our classification threshold. Over to the right, you can see that we ran the Boosted Forest model with 87 layers and a learning rate of 0.179. At the bottom, we used the confusion matrix for all three models to calculate the specificity of each model, which tells us how accurately the euthanized animals were being predicted. We also used the misclassification rate and R-square from the overall statistics tab in JMP. On every metric, we found that our Bootstrap Forest model outperformed the other models, and hence we chose it as the winning model to make predictions on euthanasia. Next, I would like to go over some important results from the logistic regression. With regard to sex, we found that intact cats and intact dogs were far more likely to be euthanized than neutered or spayed animals. With regard to breed, we found that mixed cat breeds and Pit Bull mixed dog breeds were more likely to be euthanized than all other breeds. With regard to age, we found that cats aged 4.5 to 6 years are more likely to be euthanized than younger cats, and dogs under 1.2 years are the least likely to be euthanized. This was wildly surprising because it contradicts what we found during our data exploration phase. Similarly, with regard to intake type, we found that owner-surrendered animals are twice as likely to be euthanized as stray animals. This, again, is completely contradictory to what we found in the data exploration phase, which goes to show the power of statistical analysis in uncovering the true facts. Next, I will hand it off to Shalika again to go over the recommendations we can make to these animal shelters. Thank you, Anand. Based on our analysis, we have a few recommendations that animal shelters could use to lower euthanizations. We believe that animals taken into the shelter should be neutered or spayed; this is in accordance with medical research showing that intact animals are more prone to diseases. Animal shelters could also use our Bootstrap Forest model to prioritize which animals need to be saved, in case a difficult decision has to be made. In support of that, here are some recommendations for Austin's animal shelter. This particular shelter would need to prioritize cats over dogs, as they are more prone to euthanization. With regard to age, cats aged between 4.5 and 6 years and dogs over 1.2 years would require more attention. Owner-surrendered dogs need to be prioritized over stray animals. Finally, when it comes to breeds, Pit Bull mix dog breeds and mixed cat breeds are more prone to euthanization and would likely require more attention.
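The evaluation step above, choosing a probability cutoff of 0.1 and reading specificity, misclassification rate, and AUC off the confusion matrix, can be sketched as follows. The toy arrays stand in for validation-set labels and predicted probabilities; with euthanasia coded as the positive class, the "euthanized cases caught" metric is the sensitivity (the roles swap if the other level is treated as positive).

```python
# Sketch of the threshold-based evaluation described above; toy data, not results.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def summarize(y_true, p_euth, threshold=0.1):
    y_hat = (p_euth >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),          # euthanized cases caught (coded 1)
        "specificity": tn / (tn + fp),
        "misclassification": (fp + fn) / len(y_true),
        "auc": roc_auc_score(y_true, p_euth),
    }

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
p_euth = np.array([0.02, 0.20, 0.05, 0.35, 0.01, 0.08, 0.40, 0.03, 0.60, 0.07])
print(summarize(y_true, p_euth))
```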
That brings us to the end of our presentation. We hope that animal shelters can use this analysis to reduce the need for an animal to be euthanized. Thank you.
The crime rate in the United States is increasing every year, and this is something that needs to be addressed. A key objective of this paper is to identify factors that statistically impact the crime rate in each state and leverage that information in order to reduce crime rates. For our analysis, we make use of two datasets: U.S. Census data and Uniform Crime Reporting Program data collected by the Federal Bureau of Investigation in 2014. We were able to get crime statistics for all 50 U.S. states, along with a detailed breakdown of crimes and input variables such as income and literacy, to test their impact. Additionally, we intend to identify correlations between the types of crimes so that we can understand the core issues and identify crimes that may influence others. As a result, the allocation of resources and optimization of efforts for crime reduction could be improved. The findings of this paper will help us understand the various socio-economic and locational factors that influence crime and possibly break certain stereotypes. This could be valuable for government bodies in constructing rules to combat crime in their areas. We found that weapon- and drug-based crimes had a high correlation with the other crimes. After testing the various factors that determine the crime rate in any given state, the top three were weapons owned, literacy rate, and the percentage of people who follow a religion. Good afternoon. Today we're going to be talking about understanding crime rate and crime prediction. Before we go into the data, let me introduce the team. We have Karanveer, our data scientist and data modeling expert, and myself, Grant Lackey, as data researcher and data visualization specialist. Before we get into the data, let's do an overview of the entire presentation. We're going to begin with the background, which covers the initial data sets and why we chose our data. Then comes the data overview, which again goes into the reasons we chose our data and what we're trying to answer. The business problem covers the problems we had with our data, why we're trying to answer certain questions, and the overall idea of the entire project. Next are our methods and plans, which describe our procedure for answering the business problem, and then our results, which are the results of those methods and plans. Our applications are real-life implications of our results, and the post-analysis covers what we could add to our results to improve on this in years to come. Beginning with background: why should we care about crime rate? Crime matters to everyone, and it is everywhere in the United States. So what is crime rate, and how do we define it? We define crime rate as criminal activity divided by the population of each county or state; we're mainly going to be looking at it per state. So throughout this project, how are we going to identify factors that can reduce crime rates? Here, we're going to be talking about how certain crimes are more influential in certain states than others, and whether certain crimes influence other crimes.
So, for example, if there was a murder, would guns or theft be more influential in that murder, or would other crimes be more influential? Looking at our data overview: we started with our initial data set, the crime statistics data, and later added other data variables to it. The initial data set is 2014 data, and it was given to us by the Federal Bureau of Investigation, the FBI. We looked at 42 criminal activities, ranging widely from murder and theft to drug possession and other drug activity. We looked at about 3,200 counties across 48 states; we had to exclude Florida and Illinois because they did not provide crime data to the FBI. If you look at later 2018 data or earlier 2012 data, it's the same issue; they just don't seem to provide data to the FBI. With all of this, the 2014 data contains about 180,000 data points. Regarding the FIPS codes: this is how we identified criminal activity in specific counties. For example, we have state codes, such as 01 for Alabama; the states are numbered alphabetically, so Alabama is the first one. Then each county within a state has its own number; for example, Baldwin is 003. So Baldwin, Alabama would be 01003, and so on for every county in each state. Looking at our extra variables, we used the census data. Census data is always great for checking the age, population, and income per county or per state, and we also looked at other data sets covering gender, immigration, religion, marriage, unemployment, and literacy rates. These other data sets give statewide rates and aren't directly related to criminal activity, but we wanted to include them in our data set to see if there was any correlation. Going into our business problem: we want to answer which states in the United States have the highest and lowest crime rates, and why that is so. To answer the business problem, we have to answer these business questions: How can we identify variables that influence crime? Which are the most important factors? Are there crimes that influence other crimes? I'm going to hand it off to Karanveer to talk about plans and methods. Thank you, Grant. Our approach to solving this business problem was to build a regression model, and we used JMP to make it. First, as Grant mentioned, we connected the various databases, that is, the crime data set along with the extra variables such as religion, income, and so on, and we made sure the data looked clean. After that, we ran our regression model, which predicts the crime rate for us. With this, we know the various variables and their importance in determining the crime rate, and we can rank them by importance.
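As a small illustration of the FIPS construction and the crime-rate definition used here (state code 01 plus county code 003 giving 01003, and rate as arrests divided by population), here is a pandas sketch; the file and column names are assumptions rather than the actual FBI UCR or Census schemas.

```python
# Sketch of building 5-digit FIPS codes and a state-level crime rate.
# File and column names are assumptions, not the real FBI/Census schemas.
import pandas as pd

arrests = pd.read_csv("ucr_2014.csv", dtype={"state_code": str, "county_code": str})
census = pd.read_csv("census_2014.csv", dtype={"fips": str})

# 2-digit state code + 3-digit county code, e.g. "01" + "003" -> "01003".
arrests["fips"] = arrests["state_code"].str.zfill(2) + arrests["county_code"].str.zfill(3)
county = arrests.groupby("fips", as_index=False)["arrests"].sum()

merged = county.merge(census[["fips", "state", "population"]], on="fips")
state = merged.groupby("state", as_index=False)[["arrests", "population"]].sum()
state["crime_rate"] = state["arrests"] / state["population"]
print(state.sort_values("crime_rate", ascending=False).head())
```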
At the end, we'll also show you visualizations based on it. As Grant mentioned, we had 42 criminal activity variables. Some of these variables were very small, such as drug possession, drug consumption, and drug sales; in those cases, we simply grouped them so we could reach a conclusion, since the data was otherwise too thin for the subgroups. We look at them statewise, as we didn't have the extra variables on a county basis, but I feel that this is a good starting point for the project. Our target variable is the crime rate, which we defined as the number of arrests within a given population. Now, coming to the variables we are using: most of them have been normalized, and we used percentages, for example for immigration. For gender, we used two levels, male and female, and then religion, unemployment, marriage, and literacy. Most of these are normalized so that the analysis is not misleading. Coming to the final equation of our regression model: this is the equation of the model, with the values rounded off. As you can see, there are a lot of variables with a positive influence, meaning they increase the crime rate, and there are certain variables with a negative sign, which decrease the crime rate. Using this, we can see how to estimate the crime rate in any state or county. Coming to the results. Finding number one: we really wanted to see which states have the highest crime rates. The top five are Tennessee, Wyoming, Mississippi, Wisconsin, and New Mexico. The states with the lowest crime rates are New York, Alabama, Vermont, Massachusetts, and Michigan. Here is a visualization showing how the crime rate varies across the United States; as you can see, there is no clear pattern, and it's all over the place. Finding number two: using JMP and doing a log [inaudible 00:07:57] on the variables, we could see which variables have more importance. Number one was weapons owned, followed by literacy rate, then religion percentage, immigration, population density, and the unemployment rate. I think this is a great finding for any government body or organization that wants to allocate resources when trying to reduce or analyze the crime rate. Finding number three is something really interesting. Our goal was to see whether there are certain crimes that could help us address not just that crime but also other crimes they influence. Drugs and weapons were among them: we could see that drug and weapon crimes have a very high correlation with, say, theft, robbery, and murder, and using a chi-square test, we saw that the association is very strong. So if any organization wants a place to focus on and start with, I think drug and weapon crime is a great category for reducing the crime rate in any state or county.
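A hedged sketch of the two analyses described above, the state-level regression equation and the chi-square test of association between crime categories, might look like this in Python with statsmodels and SciPy; the variable names are placeholders, and binning arrest counts at the median is just one simple way to set up the contingency table.

```python
# Sketch of a state-level crime-rate regression and a chi-square association test.
# Column names are assumptions; an OLS model is one plausible reading of the
# "regression model" described in the talk.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

states = pd.read_csv("state_level_2014.csv")

model = smf.ols(
    "crime_rate ~ weapons_owned + literacy_rate + religion_pct"
    " + immigration_pct + population_density + unemployment_rate",
    data=states,
).fit()
print(model.params)                  # signs show which factors raise or lower the rate
print(model.pvalues.sort_values())   # smallest p-values ~ most important factors

# Association between drug/weapon arrests and theft, via a high/low contingency table.
tab = pd.crosstab(states["drug_weapon_arrests"] > states["drug_weapon_arrests"].median(),
                  states["theft_arrests"] > states["theft_arrests"].median())
chi2, p, dof, _ = chi2_contingency(tab)
print(chi2, p)
```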
This map shows the religion rate, weapons owned, and literacy rate, and how they vary across the United States. If we put it alongside the crime rate, we can see a certain pattern, which is actually explained by our regression model. Now coming to the implications: how we can turn our analysis into a real-world solution. Given the data set we used and the variables we connected, we would definitely want to work with governments, towns, and communities, because crime is a universal problem and something everybody wants to reduce. Resource allocation can be guided by this analysis, which in turn would result in a decrease in crime rate and an increase in happiness in the community. Post-analysis: there are a lot of things we would want to add to this project, and it offers great future scope as well. First, we would want to include more variables, such as weather and ethnicity, and the list goes on; we could even listen to government bodies and take their input on these variables. County-level, or at least city-level, detail: it's great to start with state-wise data, but we would definitely want to move to a more detailed level of analysis, so that we can apply these conclusions to the real world more clearly and precisely and have a greater impact. The data time frame: right now we have used data from 2014, which is an eight-year-old data set. We would definitely want to use a more recent data set, and one spanning several years, to give us more clarity. COVID has affected us in a lot of ways; it has changed how our lives work, and the crime rate and the way crime happens have changed as well. We would definitely want to focus on the post-COVID period, the last two to three years, for a post-analysis. That's all, and thank you.
Heart disease and strokes are two major diseases that have been around for years without a cure. Heart disease is the leading cause of death in the United States, resulting in one death every 36 seconds. Of these deaths, one in six is due to a stroke, which is also the leading cause of long-term disabilities. For our research project, we explore whether these two major diseases have common factors that can predict each other. First, we built a logistic regression model for each disease. Next, we made a new variable, which returns 1 if the person has both diseases and 0 if not. Finally, we did a final analysis to see which variables in these two models can predict both diseases in one equation. From our research, we identified that the variables general health, diabetes, and health coverage are the most useful in determining whether or not a person will suffer from heart disease or a stroke in their lifetime. Hi, my name is Brittany Burlison, and my co-presenter is... I'm Kailey Wilson. We are both second-year master's students at Oklahoma State University, pursuing a master's in business analytics and data science. Today, we are going to present our research on what is most important in determining heart disease and stroke. We will go over our research overview, the methods we used, our data overview, our data analysis, results and implications, and what we've done in JMP. Heart disease and strokes are two major diseases that have been around for years, and there is still no cure for them. Heart disease is the leading cause of death in the United States; a person dies every 36 seconds from heart disease. Of these deaths, one in six is due to a stroke, and strokes are the leading cause of long-term disabilities. For our research, we are looking to see whether these two major diseases have any common factors that can predict each other. We are interested in seeing which factors are most important in determining whether a person will suffer from stroke or heart disease in their lifetime. We want to take variables that correspond to the social determinants of health to see which play a bigger role in determining these major health issues. The data we will be using for analysis comes from the Behavioral Risk Factor Surveillance System, or BRFSS, from the CDC. This is a phone survey that collects data from citizens on a wide range of topics. We will be using data from 2016 to 2020, which contains over 500 fields and over 2 million observations. Some of the fields contain information about households, current health conditions, behaviors, and demographics. Additionally, some states have the option to ask more specific health questions, and those are considered too. We will be looking at the variables people are asked about. For our methods and plans: our data set contains over 500 variables, as we mentioned, so we narrowed that list down to 11 that we deemed the most important in determining heart disease or stroke. We referenced the social determinants of health to help us decide which variables to keep.
We determined a few that we'll go over on the next slide. We are using JMP, specifically the Fit Model platform and Graph Builder. The factors we are considering are a person's sex, age, and race. For our variable selection, we determined that income, housing, education, mental health, health coverage, overall general health, smoking status, diabetes status, divorce, and medical costs were the most important variables to look at. We will use stroke and heart disease as our response variables, and we will look at these variables by gender using the sex variable. We then concatenated all five years of our data in JMP and ran a Fit Model analysis to determine which preselected variables are the most important in determining heart disease and stroke. Kailey will go over our data analysis and what we found. Thank you, Brittany. The first response variable we looked at is heart disease. When sex is 1, that means it's a male. As we can see in our output, the most important variables, based on their LogWorth, were general health, diabetes, and whether the person was a smoker. Even though the R-square is pretty low, meaning only 8% of the variation is explained by these variables, the p-value is very small, which means that the variables we selected are very significant. We see the same for females: the most important variables are general health, diabetes, and smoking. We come to a similar conclusion that the R-square is very low, which makes sense since there are 500 variables, but the variables we selected are still very significant. Next, we wanted to look at heart disease by general health. General health is a variable split into nine buckets, one being excellent health and nine being very poor. When heart disease is one, that means they had heart disease, and when it's two, they did not. As we can see, when general health is two or three, meaning very good or good general health, those two buckets had the highest number of heart disease cases. Next, we wanted to look at stroke. For females, the most important variables among those we selected were diabetes, general health, and education. Again, we have a very low R-square, but the p-value is very small, so all of these variables are still very significant. For males, the most important variables are diabetes and general health. The R-square for this one is the smallest we have seen, but we still have a very small p-value, which means it is still very significant. As we did for heart disease, we built a graph based on general health to see where stroke falls, and again it falls into buckets two and three, meaning people who report very good or good health are the groups most likely to have had a stroke.
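A rough non-JMP sketch of this Fit Model step, a binary logistic regression for heart disease run separately by sex with a LogWorth-style ranking of the predictors, is shown below; the column names are simplified stand-ins for the actual BRFSS variable codes, and the response is assumed to be coded 1 = has heart disease, 0 = does not.

```python
# Sketch of a logistic regression per sex with a LogWorth-style (-log10 p) ranking.
# Column names are simplified stand-ins for the real BRFSS codes; heart_disease
# is assumed coded 1 = yes, 0 = no.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

brfss = pd.read_csv("brfss_2016_2020.csv")
formula = ("heart_disease ~ C(general_health) + C(diabetes) + C(smoker)"
           " + C(health_coverage) + C(education) + income + mental_health_days")

for sex, grp in brfss.groupby("sex"):
    fit = smf.logit(formula, data=grp).fit(disp=False)
    logworth = -np.log10(fit.pvalues.drop("Intercept"))   # larger = more important
    print(sex, logworth.sort_values(ascending=False).head(3))
    print("pseudo R-squared:", fit.prsquared)
```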
Then we created our own variable: when someone had both heart disease and stroke, it returns a value of one, and when they didn't have both, it is zero. Here we can see that for heart disease and stroke together, the most important variables are general health, diabetes, income, smoking, and education. This R-square is our highest, which is really good; it means more of the data is represented by this model, and the p-value is still very small, which means all of these variables are significant. Again, we made a graph to see where this falls in general health, and we can see that when someone has both heart disease and stroke, it falls at three, which is good general health. Our conclusions: we found that the most important variables in determining whether a person will have heart disease are general health, diabetes, smoking, and whether their parents are divorced, and that was for males; for females, it's general health, diabetes, smoking, and income. Looking at stroke, for females it's diabetes, general health, and education; for males, it's diabetes, general health, and health insurance. For both combined, the most important variables are general health, diabetes, income, smoking, and education. Drawing to a close, our overall implications: to help prevent heart disease, people should improve their overall general health, monitor their diabetes, decrease their nicotine use, and so on. To help prevent stroke, people should improve their general health, monitor their diabetes as well, and think about improving their health care plan. Overall, people should focus on their general health to prevent heart disease, stroke, and other diseases. We believe that if doctors and healthcare providers take these factors into consideration, since they are very important in determining whether a person will suffer from heart disease or stroke in their lifetime, they will be able to provide better health care options to their patients. Additionally, we feel that if the general public takes these factors into consideration, it can help reduce the overall risk of stroke or heart disease. We thank you for listening to our research, and if you have any questions, please let us know. Thank you.
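The combined target described above can be sketched in a few lines of pandas, assuming a cleaned extract where both conditions are coded 1 = yes, 0 = no:

```python
# Tiny sketch of the combined heart-disease-and-stroke indicator described above.
# File and column names, and the 1/0 coding, are assumptions about a cleaned extract.
import pandas as pd

brfss = pd.read_csv("brfss_2016_2020.csv")
brfss["heart_and_stroke"] = ((brfss["heart_disease"] == 1) &
                             (brfss["stroke"] == 1)).astype(int)
print(brfss["heart_and_stroke"].value_counts())
```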
With the rise in internet usage during the COVID-19 pandemic, it is no surprise that online chess also grew in popularity. In this study, we investigated and analyzed low to moderately rated online chess players and the games they participated in. We utilized data sets from Chess.com, which provided data on individual players, clubs, tournaments, teams, countries, daily puzzles, streamers, and leaderboards. We utilized JMP and Python to complete our analysis. We took a random sample of low to moderately rated players from October 4, 2020 to March 4, 2021 and noted the portable game notation and specific moves completed by each player. When beginner-level chess players utilize certain moves consistently, they are more likely to see consistent wins, thereby increasing their status from beginner to moderate player. When these moves are looked at on an individual basis, their impact on success is unclear; however, when move combinations were examined, the prediction of success was much more accurate. The results of our analysis allowed us to identify series of moves that most moderately rated players employ leading up to a game-losing move. Because the COVID-19 pandemic occurred during the data collection, the data may be skewed; external environmental factors such as the pandemic may lead to inaccurate results and findings. This research and analysis aims to help chess trainers and coaches better formulate strategies and training exercises to help beginner to moderately rated players improve their skills. Introduction: Chess is one of the oldest and most widespread sports in the world. With the introduction of new technology and increasing internet accessibility, people have been given the opportunity to play chess in virtually any area of the world. As popularity and access continue to increase, it is important for players to understand the best way to improve their game. In this study, we investigate chess players with a low or moderate rating. Games these players participated in are examined in depth to allow us to better understand the blunders and mistakes that determine the results of games and, in turn, change player ranking. To first grasp the reach of our study, we set out to understand external factors that have affected the population of the chess community, such as advancing technology and the COVID-19 pandemic. The research provided will help players and readers reach a better understanding of the openings and tactics that are most beneficial to low and moderately rated players when navigating online chess. Here, low-rated players are defined as players rated between 800 and 1000, while moderately rated players are defined as those rated between 1000 and 1300. Our research is limited to six countries: Canada, Australia, the United Kingdom, the United States, India, and Bangladesh. The overall purpose of this study is to pinpoint consistent blunder and mistake patterns in moderately rated players and use them to devise strategies that will increase competitiveness and wins. This study points toward a direction in which an optimal winning strategy can be determined, ultimately helping online chess players change their player ranking. Data Overview: The data collected for this research was provided by Chess.com, one of the top online chess communities, which offers players online chess games for free.
We accessed the website's public API, where we gathered data about individual players, such as their profile, titled-player status, stats, and online status. We also had access to specific games, including current daily chess, to-move daily chess, the list of available archives, monthly archives, and multi-game PGN downloads. To complete a more accurate and in-depth analysis, we also downloaded country-level data, including the country profile, the list of players in each country, and the list of clubs within each country.

Access to data: https://www.chess.com/news/view/published-data-api / https://lichess.org/api

Mined data:
Username: username of both players
Elo: Elo rating of both players
Result: result of the match
ECO Code: unique code indicating the opening employed in the game
PGN (Portable Game Notation): the entire series of moves in the game, in text format

Generated data:
Blunder PGN: PGN of the moves leading up to a blunder
Mistake PGN: PGN of the moves leading up to a mistake

Method: Our approach began with cleaning and mining the accessed data. The data were mostly clean when received, but minor edits and changes were needed before we could move forward with the analysis. After the data were cleaned, reviewed, and processed, we took a random sample of low to moderately rated players in the United States, United Kingdom, Canada, Australia, India, and Bangladesh from October 4, 2020 to March 4, 2021. For each randomly selected player, we investigated five rapid games that the player participated in. After our initial assessment and investigation, we used Python code to merge, join, and compare the data sets compiled for each country and its selected players. A random sample of 1,000 games was selected from the pool of users in our target rating range of 1000-1400. These games were then analyzed with Stockfish at a depth of 20. Using the Stockfish evaluation of the position after each move, we computed a score quantifying which player has the better position. The unit of this score is the centipawn: a score of +100 centipawns signifies an advantage of one pawn for the white player over the black player. After each move, a new score was calculated along with the change in score from the previous move. We define two classes of moves: a blunder is a move that has cost the player at least a 500-centipawn disadvantage, while a mistake has a threshold of 300 centipawns. Blunders create a worse position for the blundering player, leading to higher losing chances. By identifying blunders and mistakes, we generate the variable Blunder PGN, a PGN string with the series of moves leading up to the blunder; a sketch of this step is shown below.
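The centipawn bookkeeping described above can be sketched with the python-chess and Stockfish tools cited in the references below. This is a minimal illustration under stated assumptions, not the exact pipeline used in the study: the PGN file name, the mate_score convention, and returning the move prefixes as space-separated SAN strings are choices made here for brevity; only the 500/300-centipawn thresholds come from the text.

```python
import chess
import chess.engine
import chess.pgn

BLUNDER_CP, MISTAKE_CP = 500, 300          # thresholds described in the Method section

def classify_moves(engine, game, depth=20):
    """Return the move prefixes ("Blunder PGN", "Mistake PGN") for one game."""
    board = game.board()
    blunders, mistakes, san_so_far = [], [], []
    # Evaluation of the starting position, always from White's point of view.
    prev = engine.analyse(board, chess.engine.Limit(depth=depth))["score"].white().score(mate_score=10000)
    for move in game.mainline_moves():
        mover_is_white = board.turn            # who is about to move
        san_so_far.append(board.san(move))     # record SAN before pushing the move
        board.push(move)
        cur = engine.analyse(board, chess.engine.Limit(depth=depth))["score"].white().score(mate_score=10000)
        # Centipawns lost by the player who just moved.
        drop = (prev - cur) if mover_is_white else (cur - prev)
        if drop >= BLUNDER_CP:
            blunders.append(" ".join(san_so_far))
        elif drop >= MISTAKE_CP:
            mistakes.append(" ".join(san_so_far))
        prev = cur
    return blunders, mistakes

if __name__ == "__main__":
    # Assumes a local Stockfish binary on PATH and a PGN file of the sampled games.
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    with open("sampled_games.pgn") as pgn:
        game = chess.pgn.read_game(pgn)
        while game is not None:
            blunder_pgns, mistake_pgns = classify_moves(engine, game)
            print(len(blunder_pgns), "blunders,", len(mistake_pgns), "mistakes")
            game = chess.pgn.read_game(pgn)
    engine.quit()
```

In practice, running depth-20 evaluations for every position of 1,000 games is slow; the analysis limit can be tuned, but the classification logic stays the same.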
Results: Using the Blunder PGN and Mistake PGN, we were able to identify the series of moves that most moderately rated players employ leading up to a game-losing move. We identified the 3 blunder and 4 mistake PGNs that players struggle with most among all move combinations at our target rating level.

Pic 1: Scandinavian Defense and its success rate
Pic 2: Blackmar Gambit and its success rate
Pic 3: Center Game and its success rate
Mistake-prone openings

Implications: Black players should refrain from the Blackmar Gambit and the Scandinavian Defense. White players generally have an advantage but tend to struggle with the Center Game openings. While different openings present different problems, a general trend of weak opening principles is observed in blundering players, specifically: pawn sacrifices without compensation, queen safety, and development of pieces.

Conclusion: Moderately rated players play most accurately when they employ standard openings such as the London System and the Giuoco Piano Game, and hence should be trained on these fundamentals first before moving on to complicated openings.

References:
https://www.chess.com/analysis
https://python-chess.readthedocs.io/en/latest/pgn.html
https://stockfishchess.org/

All right, good afternoon. Today I'm going to be talking about the predictive analysis of online chess outcomes and success. My name is Allison Clift, and I had the opportunity to work on this project with another student in my business analytics program, Calbe Abbas Agaria; however, he is not with us here today. To begin, we analyzed low and moderately rated online chess players. Since the COVID-19 pandemic there has been an increase in internet usage, and with the advancement of technology, people have switched over to playing online chess as it is more readily available to users. We wanted to look at the effectiveness of different game strategies, specific moves, and individual techniques, and their impact on potential wins or losses in the game of chess. Player data were pulled from chess.com, where we were able to view the player's profile, titled-player status, statistics, and online gamer status. We utilized JMP and Python to complete the study. We noted the Portable Game Notation, also known as the PGN. This was used to determine the openings, blunders, and mistakes that occurred during each game. We learned that looking at individual moves on their own was not as predictive as looking at move combinations as a whole; the prediction of success was much more accurate when we looked at different move combinations. We were able to identify the moves that moderately rated players employ leading up to game-losing moves such as blunders, as well as the opening moves that led to more success. The analysis aims to help chess trainers and coaches find weak points in beginner to moderately rated players and help them increase their player rating. They will also be able to formulate better strategies and training exercises to help these players improve their skills. Like I said, the increasing popularity of virtual chess really encouraged us to complete this study. We wanted to investigate and understand the differing game strategies employed by beginner and moderately rated players, determine the optimal winning strategy for these players to help them increase their rating on the online platform, and learn how to help these players settle on a specific strategy to use moving forward. To begin with our methods, we started by sampling the data we received from chess.com.
After cleaning and mining the data, we collected a random sample of players from the United States, the United Kingdom, Canada, Australia, India, and Bangladesh. From our own research, we found that these are the countries where chess has been most popular in the past few years, so we really wanted to look at that data in particular. Specifically, we looked at data from October 4, 2020 to March 4, 2021. We did this in order to avoid potential implications from looking at data that occurred during the COVID-19 pandemic, when internet usage was at its highest. We also did some feature generation: we generated two features that let us determine the move combinations that led up to blunders or mistakes. Here we created the Blunder PGN and the Mistake PGN. The Blunder PGN is simply the record of moves a player made leading up to a blunder in chess, and the Mistake PGN is the collection of moves a player made leading up to a mistake. This is what allowed us to complete our analysis. Next, we used Python code to merge, join, and compare all of the data we collected. The data comprised five games per player from players rated about 1,000 to 1,400, and we randomly selected 1,000 games from this pool. We evaluated these games using Stockfish at a depth of 20. To describe these measures a little more: positions are measured in what we call centipawns in chess. A score of plus 100 centipawns signifies an advantage of one pawn for the white player over the black player. A blunder means that a move made by a player has cost them a 500-centipawn disadvantage, and a mistake is equivalent to a 300-centipawn disadvantage. A blunder is normally what occurs in a game-losing move. Down at the bottom you can see some of the analysis we conducted in JMP. The graph here shows the top ten most-used openings in blunders. As you can see, the number one opening that leads to blunders is the Queen's Pawn Opening: London System. Second is the Scandinavian Defence, which is often used and can also lead to blunders. I will mention these again later in the results and conclusions of our presentation. At the bottom you can see two graphs that show the number of wins occurring per level of player, from the lowest-rated players up to the highest-rated players, along with the average number of losses for comparison. Over on the right you can see the blunder flag, which is just the white player versus the black player, and at the bottom is the list of frequencies that occur for the moves listed on the left. For example, when we look at the London System opening, it is about half and half between white players and black players in the win-loss ratio. However, when we look at the Scandinavian Defence, we can see that white players make blunders more often than black players.
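The tallies behind this slide, how often each opening shows up among blunders and which color blundered, can be reproduced outside JMP with a small pandas aggregation. This is only a hedged sketch, not the authors' script: the flat table and its column names (opening, blunder_colour) are assumptions about how the analyzed games were stored.

```python
import pandas as pd

# Hypothetical flat table with one row per detected blunder; column names are
# assumptions about how the engine output was tabulated before loading into JMP.
blunders = pd.read_csv("blunders.csv")      # assumed columns: opening, blunder_colour

# Top ten openings by number of blunders (the first bar chart described above).
top10 = blunders["opening"].value_counts().head(10)

# White-versus-black split of blunders within each opening (the "blunder flag" panel).
by_colour = (blunders.groupby(["opening", "blunder_colour"])
                      .size()
                      .unstack(fill_value=0))

print(top10)
print(by_colour.loc[top10.index])
```

These counts are the raw material that a bar chart like the one described above would be built from.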
When we look at our results using the Blunder PGN and Mistake PGN features we developed, we were able to identify a series of moves that most moderately rated players employ leading up to a losing move. We identified three Blunder PGNs and four Mistake PGNs that players struggle with the most among all combinations. First, black players should refrain from the Blackmar Gambit and the Scandinavian Defence. The Blackmar Gambit results in only about 29.3% wins for black players, and the Scandinavian Defence yields only about 27.7% wins for players playing the black pieces. White players generally have an advantage here, though they do struggle with center openings: when we look at the moves and openings white players use when they push straight forward in the center, they tend to lose games more often. Lastly, weak openings and blundering players: we identified a few weak opening principles that consistently led to blunders for both colors. These were pawn sacrifices without compensation, queen safety, and the development of pieces. Looking at all of this data and all of our results together, we were able to come to a conclusion: moderately rated players are most accurate and successful when they employ standard openings. They should be trained on the fundamentals of chess before moving on to complicated openings. Some of the openings we suggest beginner players start off with are the London System and the Giuoco Piano Game. At this time, I would just like to thank you, and I will be happy to take any questions you have about the report.
JMP has been used by our interdisciplinary group at the NIH Clinical Center for the analysis of clinical research data to test and develop data-driven hypotheses supporting our bench-to-bedside-to-community-and-back translational model. We will present a workflow exemplar: a visualization of correlates between antibiotic use and patient-specific oral microbiomes. Starting with a spreadsheet of more than 2,000 entries of antibiotic medication use in patients with a rare disease, and a separate spreadsheet of bacteria present in the oral microbiome of each patient, we created a visualization of longitudinal antibiotic use through the course of the treatment program and correlated the use of the antibiotics with oral microbiome diversity metrics. It is well known that the use of antibiotics can perturb the normal human microbiota, yet its global effect on the oral microbiome remains unclear. We describe how the JMP Graph Builder tool was used to further explore whether antibiotics may have affected the oral microbiome in this rare-disease patient cohort. The graphical nature of JMP has been used as a tool for data analytics within our group for years and has facilitated the publication of frequently cited, peer-reviewed translational clinical research articles.

Hello. My name is Jennifer Barb, and I'm a research scientist at the National Institutes of Health Clinical Center. I'm going to talk to you today about how I used JMP to manipulate research clinical medication data and how I was able to create a publication-quality figure showing how patient medications were used through the course of a treatment protocol at the Clinical Center. Clinical data, especially in a research setting, can be extremely noisy. There are a lot of staff and personnel involved in research protocols, and the collection and storage of pertinent research data are not always streamlined. I will talk about how I used the JMP Graph Builder tool to visualize patient medication prescriptions through the course of a six-month treatment protocol, and how we were able to visualize what is called a Shannon Diversity Index in relation to the antibiotics that were prescribed to the patients [inaudible 00:00:55] in the clinical research setting. I will go through how I created the illustration in JMP using four patients with a very rare disease who were enrolled in the treatment protocol. As part of the treatment regimen of this protocol, the four patients were prescribed a range of antibiotics, totaling up to 21 different types of medications. The data were provided to me in a long format, including a start and stop date of medication administration. As you can see here, I have zoomed into the first figure of the poster, which is a snapshot of what some of the research data look like. In the long format, you see that there are repetitive rows of the patient ID and repetitive rows of the different medications that the patient received during the treatment protocol. There's a lot of redundancy here. In addition to that, we have a start-of-medication date and a stop-of-medication date for each prescription that each person received.
One of the first steps I had to take with the JMP data manipulation tools was to edit the medication name so that it did not contain so many words or include the dosage information, letting us use it as one of the axes of the graph I was going to make. In addition, I had to check the date of patient consent into the treatment program and see whether the start and stop dates of that person's medication administration fell within the treatment protocol. From that point, I had to normalize each person's medication start and stop so that everybody had a day one and everything corresponded to the same point in the treatment protocol. All of this information would be used to create the figure that I will show at the end. Once I had edited the medication names and created the normalized medication start and stop, I could then use the Graph Builder tool. I also wanted to talk about one other aspect of this particular research protocol: the fact that we wanted to look at the oral microbiome of the patients in the treatment program. What this means is that we took tongue brushings from each patient and then converted those into counts of the specific bacteria found in their mouths. We wanted to look at how the antibiotic treatment over the course of the treatment protocol might have affected the oral microbiome. As we know, antibiotics can drastically change your gut microbiome and can cause increases and decreases in different microbial diversity in the gut. But one question that has not been elucidated is whether antibiotic use would also affect the oral microbiome. What I'm showing here is that we have built a set of scripts within JMP, installed on the toolbar, that calculate the Shannon Diversity of the bacterial counts in the table associated with the medications I showed on the previous slide. Back to the medication table: the first step I took was to open up the JMP Graph Builder tool and drag and drop the medication start and stop dates onto the X-axis, as shown here. Then I went to the bar graph tool and clicked it to turn the data into a bar graph. The third step was to drag and drop the shortened antibiotic medicine name onto the Y-axis. Finally, in order to create the graph so that I could visualize the longitudinal duration of medication administration, I changed the bar type into stock. And as I mentioned earlier, because we coded the treatment time of the protocol based on the medication start and stop, we were also able to stratify the antibiotic use into the different time points of the treatment protocol, as shown here. If you are familiar with the JMP Graph Builder tool, you know there are many different possibilities for manipulating data to get the particular graph you want.
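For readers without JMP, the same interval-style medication timeline can be approximated in a few lines of Python. This is only a rough analog of the Graph Builder steps described above, not the authors' workflow; the file name and column names (patient_id, med_name, start_day, stop_day) are assumptions, with start and stop already normalized to protocol days.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical long-format table, one row per prescription, with start/stop
# already normalized to protocol days (column names are assumptions).
meds = pd.read_csv("medications_normalised.csv")   # patient_id, med_name, start_day, stop_day

fig, ax = plt.subplots(figsize=(8, 6))
order = {name: i for i, name in enumerate(sorted(meds["med_name"].unique()))}

# One horizontal bar per prescription, spanning start_day..stop_day,
# mimicking the interval-style bars built in Graph Builder above.
for _, row in meds.iterrows():
    ax.barh(order[row["med_name"]],
            width=row["stop_day"] - row["start_day"],
            left=row["start_day"],
            height=0.6)

ax.set_yticks(list(order.values()))
ax.set_yticklabels(list(order.keys()))
ax.set_xlabel("Protocol day")
plt.tight_layout()
plt.show()
```

Coloring each bar by patient, as described next, would simply mean mapping patient_id to a color and passing it to barh.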
And finally, one last thing we did was take the patient ID from the medication table and color each bar on the graph by patient. The final figure looks like this. What you see here is all of the different antibiotics that were prescribed in the treatment protocol. You also see time point B, which is the time point between baseline and the treatment phase of the protocol, and time point C, which is the intervention point, running from time point C to the end of the treatment protocol. Each longitudinal bar indicates the amount of time a person was on a given antibiotic, and each of these bars is stratified by patient color. This particular figure did end up going into the publication, and it was another way to take a large table of medications downloaded from our research database and put it into a graphical form to visualize all of the different medications that each patient received during the treatment. Now, finally, you might ask: why do we want to look at this? One thing of importance for us was to look at oral microbial diversity. As I mentioned, we were able to take a separate table that corresponded to the patients in the treatment protocol and calculate what is called a Shannon Diversity metric. A higher index indicates higher oral microbial diversity, and a lower index indicates lower microbial diversity. From within JMP, we were able to superimpose the treatment leg between time points A and B and the change in the diversity metric from the start of the treatment to the end of the treatment. We were also able to look, within one patient, at how the different antibiotics corresponded to this. Then, in the second leg of the protocol, we were able to see a slight rebound of the diversity index in correlation with the number of antibiotics that were used in that treatment leg. In conclusion, we were able to visualize patient-prescribed antibiotics through the course of a treatment protocol using the JMP Graph Builder tool. We took a table of 1,289 rows of medications employed in the protocol and created a simplified visualization. We were also able to calculate a Shannon Diversity Index on the bacteria data associated with each person's oral samples. We superimposed these two graphs, which allowed us to draw conclusions on how the antibiotics prescribed to each patient might have affected the oral microbiome of individuals in the treatment protocol. Finally, our group has used the graphical nature of JMP for many years as a way to translate complex medical research data into data-driven discovery and investigation. The use of JMP has facilitated many publications in highly cited research journals for our group. Thank you for your time today.
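For readers who want to reproduce the Shannon Diversity Index outside the JMP toolbar scripts described above, a minimal Python sketch of the same calculation follows. The counts file and its layout (one row per oral sample, one column per bacterial taxon) are assumptions made for illustration; the index itself is H = −Σ p_i ln(p_i), computed over the proportions of each taxon within a sample.

```python
import numpy as np
import pandas as pd

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over the non-zero taxon counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

# Hypothetical layout: one row per oral sample, one column per bacterial taxon.
otu = pd.read_csv("oral_bacteria_counts.csv", index_col="sample_id")
diversity = otu.apply(shannon_index, axis=1)   # one Shannon value per sample
print(diversity.head())
```

The natural log is used here; some groups report the index in log base 2, which only rescales the values.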