Biodiversity loss is a global challenge. Reliable data on the numbers and distribution of species are urgently needed to stem the species hemorrhage. But getting those data is hard. Endangered species are elusive and cryptic, and current monitoring techniques are often expensive and unreliable. Frederick Kistner is a founding member of the WildTrack Specialist Group. Together with colleagues Larissa Slaney from Heriot-Watt University, and Zoe Jewell and Sky Alibhai from WildTrack, he is developing a method to identify four overlapping species of otter that exist in southeast Asia: the Eurasian otter (Lutra lutra), the short-clawed otter (Aonyx cinereus), the smooth-coated otter (Lutrogale perspicillata), and the hairy-nosed otter (Lutra sumatrana). Morphometric features of otter footprints are extracted using a customized JMP script, implemented in the Footprint Identification Technology (FIT) add-in. The data are partitioned using the JMP validation column creator. Morphometric features are used as input variables to predict the otter species as the target variable. The automatic model selection feature in JMP then identifies the best model. Initial findings comparing prints from Asian short-clawed otters with Eurasian otters yielded 100% classification accuracy on the training set (50%) as well as on the test set (50%).

Hi, everyone. Thank you for joining us. The title of our talk today is It's Otterly Confusing! Short-Clawed, Hairy-Nosed, Smooth-Coated, or Eurasian? Just Ask JMP! These four species occur in the same habitats in Asia and Eurasia, and the question is, how do you tell them apart? Using our FIT technique, the footprint identification technique developed by WildTrack, we can actually tell them apart. The presentation will be given by four people: Fred Kistner, who is at the Karlsruhe Institute of Technology in Germany and also a member of the WildTrack Specialist Group; Larissa Slaney, who is a PhD candidate at Heriot-Watt University on the FIT Cheetahs research project and also a member of the WildTrack Specialist Group; and Zoe Jewell and myself, I'm Sky Alibhai, faculty at Duke University, founders of WildTrack, developers of the footprint identification technique, and also members of the WildTrack Specialist Group. I'm going to give a very, very brief introduction, and then I'll hand over to our next speaker. Apart from the otter species, WildTrack works on an extensive number of endangered species in different parts of the world, ranging from the Amur tiger in China to the black rhino in Africa to the jaguar in Brazil, all of them utilizing, in one form or another, our footprint identification technology. Now what does footprint identification technology actually do? One of the things about the footprint identification technology, FIT, is that it works as an add-in in JMP. It's designed to classify species or subspecies by using metrics from the footprints, classify sex, classify age class, and even classify individuals. Those are all elements that are required to understand the population dynamics of any endangered species, the essential foundation elements. As for the conservation applications of the footprint identification technique, the baseline data on numbers and distribution inform data-driven scientific conservation strategies, monitoring of trade in endangered species, and human/animal conflict mitigation, and all of these, in a way, will be shown in how otter conservation works.
Now I'll hand you over to Larissa Slaney, who will start the process of deconfusion. Thank you. Right. Thank you very much, Sky, for this great introduction. Thank you to JMP for allowing us to present our research here. We're very pleased about that. Thank you so much to all of you for being here and showing this interest in our research. Now, before we're going to jump, pun intended, into explaining our data analysis with JMP, I would like to give you some background to this research. We think it is really important to look at this in context, because it's not just about what JMP can do, but how this is applied in the real world and how we scientists can use it. It gives us the opportunity to make changes for the better, basically. Our research looks at footprint analysis, as Sky just said. We are looking at the footprints of Asian otter species. Here you can see four different Asian otter species. They're all classified as vulnerable or critically endangered by the IUCN, and their home ranges overlap, so they are so-called sympatric otter species. On the top left, you can see the smooth-coated otter. Top right, you can see the Asian short- or small-clawed otter; bottom left, the hairy-nosed otter; and at the bottom right, you can see the Eurasian otter. Now, to be able to monitor the different species, we need to collect data, which is difficult with such very elusive species. You hardly ever see them in the wild, but you do see their footprints. Looking at these footprints for each species, when you look at them here, they look very similar, don't they? It's quite tricky to tell them apart. Therefore, we set ourselves the task of finding out whether there is a way for FIT to distinguish between the footprints of these four otter species in a scientific and reliable way. Now, we do have an added problem here, because the front footprints of two of the species have a size overlap, and the hind footprint of another species is morphologically very similar, and also size-wise quite similar, to the front foot of another species. We've got a multi-class classification problem here. Here you can see a map which shows the distribution ranges of the different otter species. The blue area here, that's the Eurasian otter. The red area is the smooth-coated otter. The yellow down here is the small-clawed otter, and the pink over here is the hairy-nosed otter. Although it looks like a large area, it's actually just a few dotted islands there. What is really interesting about this map, though, is that it shows you where their home ranges overlap. There are six areas where at least two, if not even three, of the species overlap. For conservationists, it's really important to find out where the different species live, to what extent, how large the populations are, and to find out as much as possible about the different populations so that we have a good idea of how endangered they are. Now, why is otter conservation important? Well, first of all, otters are classed as keystone species, and that means that they have an effect on their environment disproportionate to their abundance. Just a few individuals can have quite a big impact. They play a really important part in the food chain and contribute to the environmental equilibrium. They're also seen as umbrella species, which means that they confer protection to a large number of other species. Basically, if something happens to the otters, that will have an impact on other species as well.
They're also an indicator species, so they actually indicate the health of their environment. They will not live in polluted waters or in polluted wetlands. Otters returning to an area is always a good sign, because that means water quality and wetland health is improving. Now, threats to otters. There are lots of different threats. Pollution is the one we've already mentioned just now. But another problem is the human-wildlife conflict, and habitat loss. With that also comes loss of prey. An increasing problem, especially in Asia, is the illegal wildlife trade. They are particularly after the fur, for the fur trade, and also after pets. Baby otters are taken out of the wild and used as pets, which is not good for otter conservation at all. Now, how do you approach a conservation project like this? Well, first of all, you need to think about how you want to monitor the population. What do you want to look at? Do you want to look at species distribution? Yes, almost always. Do you want to look at individual ID? Do you want to find out what the sex ratio within a population is? The next thing you need to decide is, do you want to use invasive methods that potentially stress or even harm the animals? Or do you want to use non-invasive technologies or methods to monitor the species, which will not stress or harm the animals? We at WildTrack focus on non-invasive ways to monitor species. In this particular project, we are completely focusing on footprint identification. Now, once you have made those decisions, you obviously need to train people to help you with the data collection, because you can't be everywhere and you can't go everywhere. During times of pandemics, it's even more difficult. So you need to train your team both in person as well as remotely, and that has been a bit of a challenge. Then you need to get all the data in, and the training and the data collection can happen in situ, which means in the field, or ex situ, which means in zoos and other conservation organizations. Once you get the data in, that is when you start the data analysis. In our case, that's when we start using JMP. Other typical issues for any conservation project are funding, of course, and also trying to get conservation policies and management strategies for conservation improved. That's basically our end goal. We are collecting all the data and analyzing all the data so that at the end of the day, we can give that information to governments or other organizations, and they can make an informed decision and make better conservation policies. Let me just go back one more time. On the left-hand side here, you can see one of our lovely zookeepers collecting footprints for us. On the right-hand side, you can see a footprint image that was sent to us from the wild. That's a mystery footprint. We were asked if we could please find out which species left that footprint behind. That's really what we want: that researchers start to send us footprints and we can help them find out which species lives in their area. Fred will, hopefully, later on help us reveal which species this footprint belongs to. Now, we've asked ourselves three research questions, and at the moment we are still focusing on one. This is an ongoing project. At the moment, we are focusing on species classification. Can FIT, the footprint identification technology, identify or distinguish between the four different species of otter we are looking at?
When we've got enough data, and enough data in particular where we definitely know the individuals, we will look at individual classification and also at sex classification. But that's going to be a bit further down the road. So far we have teamed up with nine zoos and otter conservation organizations. We've been training them to collect footprints following our FIT protocol. This has been, again, during COVID, quite challenging. I've not been able to see everybody in person, so some people I've had to train remotely, but they've all been absolutely fantastic, our zoos and zookeepers, and have really risen to this challenge and have started to send in a lot of images, as you can see here on the left. It's still a much smaller sample size overall than we want to have. As I said, it's an ongoing project, but it is enough to give us the ability to now share some preliminary results with you, so we can draw some conclusions. We've included three otter species in this so far. We've only just started to get hairy-nosed otter prints. There's only one hairy-nosed otter in captivity in the whole wide world. His or her, I'm not sure, prints are just starting to come in, and we will update the results with this fourth species at a later time. But for now, we're going to look at three otter species. Yes, so I think it's time to have a closer look at how we do the data analysis, over to Fred. Thank you, Larissa, and let's jump straight into action. Like Sky mentioned previously, FIT has been developed for a wide number of species. When it's fully developed and goes into production, it's an add-in in JMP. Today, I'm going to demonstrate some parts of the data analysis and some parts of the development before it goes into production. In general, I just wanted to give you a little bit of background on how this development is usually done. Our input data is collected with very little and very simple equipment. That's one of the main advantages of FIT, that it can be widely applied with very little equipment. You only need a smartphone and a ruler. If you want to develop FIT models for certain species, you start with an image database that is usually collected from known individuals, as Larissa mentioned. We therefore cooperate with zoos and other wildlife centers. These images are then processed within JMP to extract geometric profiles that capture a lot of measurements, angles, and distances. This data can then be used to develop FIT models. The general output is that you want to look at species, sex, and individuals. If you're able to identify individuals, you want to draw conclusions about population size. Once you have developed the method, you definitely want to test it on unknown individuals. Again, you look at images and get a prediction from the models. Advantages of FIT: it's based on biometrics and it's non-invasive. It's a standardized and cost-effective way to monitor elusive wildlife that cannot be monitored by direct observation. It can be implemented for almost any species that leaves a footprint. It can be combined with other non-invasive methods, and cross-validated models generally have a high accuracy. How to build these models is something that I would like to demo. What I'm going to demo today is technically looking at different footprints. On the top left, you see a hind foot of an Asian small-clawed otter.
On the top right, you see a left front of a smooth-coated otter; on the bottom left, a left front of a Eurasian otter; and on the right-hand side, you see a right footprint from an unknown otter from Nepal. What we are going to do today is process these images. Then I'm going to show you how to quickly develop a classification model within JMP. Then we'll see what the predictions of these quickly developed models are going to be. It all starts with image analysis. That's a script-based implementation within FIT. In the first step, you usually adjust the size of an image so that the footprint is clearly visible and the dominant part of the frame. In order to be replicable, it's important that footprints are aligned following defined rotation points. For otters, these are rotation points below the second and the fourth toe. Then you set a defined set of landmark points. Again, this is species-specific; for otters, I've chosen 11 landmarks. They're in the center. Sorry, I forgot one step. Of course, you need to define a scale first. Here we've got 10 centimeters; this is up here. You can add some additional information. Just to keep it simple, I will name this one Asian short-clawed otter. Then you set the 11 landmark points. You could, for instance, use the crosshair function if you want to make this as precise as possible, obviously, but for time reasons, I'll just quickly run through them. After setting the 11 landmarks, you derive additional points, which are helper points that are also used to extract biometric information. Once you've done that, you just start a new table or append a row. I'll just quickly run through three more images. Again, you need to resize them. Now, with this image, you can see it's upside down. What I like about JMP is that the image window can actually do some image pre-processing. Now, it's a right front. Sorry, I need to flip this one more time. You can do some image processing within JMP, so you don't have to change between software. That's something that I really like, that I can do all my work within one software rather than switching between several packages. Again, I set 11 landmark points. This time, I'll just go over them quick and dirty and hope that the prediction will still be accurate enough. That's a Eurasian otter. Again, I go to append, just two more times, one for the smooth-coated otter. Again, I will set the 11 landmarks, and what the landmarks are used for, I'll show you in a second. Derive points, append row. One last time, the mystery footprint that Larissa mentioned, which was sent to us from a project in Nepal that is, to my knowledge, doing some otter monitoring there. One of the species has not been seen there for at least 30 years. 1, 2, 3, 4, 5, 6. What's different in here is that you have a different scale. That scale factor is something that I need to adjust within here. Again, I'll quickly click through the images. This is normally done a little bit more tediously, but for the demo's sake, I'll try to click through them quickly. And this is an unknown species. This was the smooth-coated one. What you end up with is a big data table. These are points for evaluating the quality of the landmark placement, which I won't go into here. You get X and Y coordinates for each landmark. These X and Y coordinates are used to calculate a large number of measurements. There are more than 100 distances derived, some angles, and some areas. There's quite a lot of information extracted out of a single footprint.
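To give a rough idea of what gets computed from those landmarks (the actual measurement set is extracted by the FIT add-in's JSL script inside JMP, so this is only an illustrative sketch with invented coordinates), distances, angles, and areas can all be derived from the landmark X/Y coordinates along these lines:

```python
import numpy as np
from itertools import combinations

# Hypothetical example: 11 landmark points (x, y) in centimetres for one footprint,
# already scaled with the 10 cm reference and aligned on the two rotation points.
# (The coordinates below are invented purely for illustration.)
landmarks = np.array([
    [2.1, 5.3], [2.8, 6.1], [3.6, 6.4], [4.4, 6.0], [5.0, 5.2],
    [2.5, 4.0], [3.2, 4.4], [4.0, 4.5], [4.6, 4.1], [3.5, 2.8], [3.5, 1.5],
])

# All pairwise distances between landmarks (55 for 11 points; adding derived
# helper points is what pushes the count past 100 in the real FIT table).
distances = {
    (i, j): float(np.linalg.norm(landmarks[i] - landmarks[j]))
    for i, j in combinations(range(len(landmarks)), 2)
}

def angle_deg(a, b, c):
    """Angle at landmark b formed by the segments b->a and b->c, in degrees."""
    v1, v2 = landmarks[a] - landmarks[b], landmarks[c] - landmarks[b]
    cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))

def triangle_area(a, b, c):
    """Area spanned by three landmarks (shoelace formula)."""
    (x1, y1), (x2, y2), (x3, y3) = landmarks[[a, b, c]]
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0

print(len(distances), round(angle_deg(0, 2, 4), 1), round(triangle_area(0, 4, 10), 2))
```

The point is simply that each processed image is reduced to one row of such geometric measurements, which is exactly the table the modeling steps below work from.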
If you repeat the step that I've just shown several times, you'll end up with a data table like this. This is the data table that I'm going to demo the prediction model on. If you have a look at what we have here, if you look at the distribution of the species, our target variable, you see that I have 405 processed images of Eurasian otters, 278 of Asian short-clawed otters, and 127 of smooth-coated otters. The groups are not perfectly equally distributed, but at least each group has quite a significant sample size, which will hopefully work for modeling. Whenever you want to do any sort of supervised modeling, it's a good idea to split your data into training and test data. This can be done very easily in JMP: you can use Make Validation Column within the predictive modeling platform. What I've done is randomly split my data into 80 percent training data and 20 percent test data, on which I will test the models that we're going to build and see how they perform. All right, so I've previously done this. What I'll do now is select my training data, which is 648 rows, and I will just have a look at it in a data view. This is 648 observations. I'll quickly save this as my training set. Again, if you have a look at the distribution, you can see that we have 100 smooth-coated otter prints, 324 Eurasian otter prints, and 223 Asian short-clawed (ASC) prints. It's the same distribution percentages as in the previous data set. In the next step, I will skip a big part of predictive modeling, and that is variable selection. I assume that I have no prior knowledge, so I just add all variables that are available. I have no idea which model is going to work best. What I'm going to do here is use the Model Screening platform, which compares several different machine learning models that are implemented in JMP and compares their performance on this specific task. Again, my target variable is the species. This is the one that I would like to predict. In total, I have 209 measurements extracted from my footprint data, and these are my X variables. These are all factors that can potentially be used. What you see down here is that you can choose the methods that you would like to run, and you can basically choose from all the prediction methods. But for argument's sake and for runtime, I'll only run methods that I know will run through quickly. You can again make it reproducible by setting a random seed. What's also good about the Model Screening platform is that you can add an internal validation step. We already split our data into training and test sets. But in order to have an internal validation of the models that I'm going to develop, I'll add k-fold cross-validation. I'll just put a tick in here so that the models are evaluated using the k-fold cross-validation method. Okay, I just click Run. I get a summary overview. I see that four of my models have been evaluated. You get R-square values, you get performance metrics for those models, and you can see which one is the best. You just click Select Dominant. You can look into the training or into the validation set, which will also give you a misclassification rate. You can see that the misclassification rate for the Bootstrap Forest was quite impressive: almost 95 percent of the data was correctly classified in the validation set.
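For readers who want to see the shape of this workflow outside JMP, here is a minimal scikit-learn sketch of the same idea: a stratified 80/20 split followed by a cross-validated comparison of a few classifier families, and then one final check on the held-out test data. The arrays, class proportions, and model list are placeholders (the real analysis uses the 209 FIT measurements and JMP's Model Screening platform and Formula Depot), so treat it as an illustration of the screening logic, not of the actual results:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder data standing in for the real table: 810 processed prints
# (405 + 278 + 127) described by 209 FIT measurements.
rng = np.random.default_rng(1234)
X = rng.normal(size=(810, 209))
y = rng.choice(["Eurasian", "ASC", "Smooth-coated"], size=810, p=[0.50, 0.34, 0.16])

# 80/20 stratified split, mirroring the validation column made in JMP.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1234
)

# Rough analog of the Model Screening step: compare several classifier families
# with internal 5-fold cross-validation on the training data only.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Bootstrap Forest (random forest)": RandomForestClassifier(random_state=0),
    "Discriminant Analysis": LinearDiscriminantAnalysis(),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    acc = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    print(f"{name}: CV misclassification rate {1 - acc.mean():.3f}")

# Analog of the Formula Depot / Model Comparison step: refit on all training
# data and score once on the held-out 20 percent.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test-set misclassification rate {1 - model.score(X_test, y_test):.3f}")
```

Whichever tool you use, the screening step just produces comparable cross-validated error estimates for several candidate methods before you commit to any of them.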
For argument's sake, let's say that the Bootstrap Forest, the Decision Tree, and the Discriminant Analysis were the three best models, and I'm not sure which one is the best. I can just run those three models as selected. They'll pop up in their respective platforms. What I can do is save the prediction formula into my Formula Depot. I'll do this for the Decision Tree model, so I can close this. I'll do this for the Discriminant Analysis, which is done here. Last but not least, I can do this for the Bootstrap Forest model here. What I have now is... some of these I can close, sorry. I have three models in my Formula Depot. I now want to evaluate how these models perform on new data. Again, I go into my initial data table and I select my test set, so all the rows that have been randomly chosen as the validation set. I'll just quickly save this as the test set. In the next step, I can technically open my Formula Depot and run all those models on the new data table. I will run them on the test set. I want to run all three of the models. I can do a model comparison where I run all three models on the test set. That's actually what I wanted to show. You'll see down here — I hope you can see this; I'll try to zoom in a little bit — that the misclassification rates of these models were actually quite low. The test set is new data that the models have not seen, and there was a misclassification rate for the Decision Tree model of 16 percent, while the Discriminant Analysis and the Bootstrap Forest only had a misclassification rate of 12 percent. You have the highest R-square value for the Bootstrap Forest, and there are several other metrics that you may want to look at. For us, the most important metric is whether the prediction is right or wrong. I usually look a lot at the misclassification rate. But obviously, all the other metrics can be used to evaluate and generate good models. Last but not least, we want to have a look at how these models actually perform on the footprints that were just processed. Again, we go back into our... now, where is it? ...into our Formula Depot, and we'll run this on the four images that we previously processed. If we open this data table, you can have a look at the predictions. What can we see here? I'll make it a little bit more obvious. This was the original data, and if I just look at the species for the first model, everything was predicted correctly. The first model was the Decision Tree, which predicted ASC for ASC, Eurasian otter for Eurasian otter, and smooth-coated for smooth-coated, and the same with the Discriminant Analysis and the same with the Bootstrap Forest. All three models predicted the right species for the images that we processed, and all models were also consistent in their prediction when it comes to the Eurasian otter. I'm doing my PhD mainly on the Eurasian otter, and this was also what I would have guessed as an expert on this particular species' footprints. So it's quite consistent and it works really well. Obviously, you can add more steps and fine-tune your classification rate even more if you look more into feature selection. But I think what I wanted to show you with this demo is how easily the question of what species it is can be answered using JMP. I hope this was quite intuitive. Let me draw some conclusions from that, some results. What I really like about JMP is that it's a great all-in-one solution. I can extract biometric data from images within the FIT add-in. I don't have to switch software to do the data analysis. My extracted biometric data can be analyzed directly. And not only analyzed in a descriptive way — I can build classification models.
There are state-of-the-art machine learning models implemented in it, so it's an all-in-one solution. There are obviously other ways to do this as well, but I just like the practicality. For our particular research question, the otter species classification models at this stage, with the amount of data we have, have performed very well. They're able to predict the species of single unknown otter footprints with a high classification rate. For instance, the neural net, which I did not run in this example because the runtime is a little bit longer, had a misclassification rate of only 10 percent on the same test set. That would increase the classification accuracy by another two percent. If you dive in deeper, you can increase this a little bit more. One thing that, in our experience from working with footprints in the field, increases the classification accuracy by quite a lot is to work with trails rather than with single footprints. An animal, depending on where you are and depending on the substrate, doesn't only leave a single footprint. Without going too much into detail, a footprint can be quite a complex matter, because it varies a lot with the substrate the animal is moving over, with the speed of the animal, and with the gait. There can be quite a bit of noise and background variation. If you work with multiple footprints and take an average of the predictions instead of using single footprints, you can address this variation quite a bit better. This becomes especially important when you want to look at individual identification. Last but not least, JMP is also great for gaining insights into the importance of variables. There are several ways within JMP to look into which variables are actually contributing the most to the predictions. You can look at tree-based methods, where you look at the classification trees and where the splits are made. You can look into column contributions, again for tree-based methods, if you work with Bootstrap Forest, or XGBoost, or something like that. You can see which columns are actually chosen, and how many times. You could look into the prediction profiler if you use the normal modeling platforms within JMP; you can technically have a look at how the prediction of your model is going to change if you change certain values. Or you can do something like a Discriminant Analysis, where you just follow the F-ratios of how variables are selected. This will then, again, give you a lot of insight, because it really depends on your question. Do you want to have a prediction from an external advisor of what species you're looking at? Or do you want to give a guide for working in the field, of which measurements are worth looking into when you're in the field and want to make the classification on the spot, just based on expert knowledge? Yeah, so that's basically it from my side. I would like to thank all the contributing zoos and wildlife parks who allowed us to work with their beautiful animals and shared data with us. Especially, I would like to thank Grace Yoxon from the IOSF, the International Otter Survival Fund, who got us into contact with many of them; KHYS, the Karlsruhe House of Young Scientists, who actually funded this study; and most importantly, Joseph Morgan from JMP, who has been very helpful when it comes to modeling and FIT. He has given me advice more than once when I've run into JSL scripting issues. Yeah, that's pretty much it from my side. Thank you very much for listening so far, everyone.
I just really want to round up here. I think that Sky, Larissa, and Fred have outlined beautifully the challenges we're facing in identifying these different species of otter around the world and the way in which JMP can help us classify them and bring some clarity to this picture: where are they, and how many are there? I'd like to just quickly talk about what's next. How are we building on this? One of the things we're doing is building artificial intelligence into this picture, so that it will allow us to filter and sort a much greater volume of data as it comes in. As both Fred and Larissa have said, we need more data. Artificial intelligence has that potential, which we've already explored in some early training, test, and field data, as you can see at the bottom of this screen. We're getting reasonably good accuracy in our initial trials with AI. We think that it will never give us quite the resolution that JMP will give us. What we're aiming to do is have an AI platform which will be easy for citizen scientists to feed data into. We'll have JMP as the top-level classifier on that platform by integrating it. But the key here really is that this cryptic ground evidence left behind by otters and all the other species is there for us to decode, if we can find a way to do it. It really is transformative in conservation to have a cheap and quick technique to know where these endangered species are. We're very optimistic that using a baseline AI classifier with JMP as a final classifier, we'll be able to make that technique not only deliver the data we need, but integrate people all over the world as citizen scientists to be part of that. We're really grateful to the JMP community for supporting us through this whole journey. We're constantly making new strides. We know that there's interest in this community, and we hope that you will join us when our new mobile app comes out, to even start collecting data yourselves and pushing this forward to where we want it to be, which is a classifier for all endangered species all over the world. Here's to lots and lots of points on the map, and thank you all for listening.
SPC and control charting is a common procedure in industry. Normally, you are controlling and observing a single measure over time. These data are displayed with ±3s limits around the mean on a chart. However, when kinetic curves or other time-dependent behaviors are a matter of quality and consistency of a process, they are much more difficult to display in an SPC chart. These curves are often displayed within their maximal and minimal specification for each time point, which makes the off-spec curves visible. But how can "off-spec" curves be defined while they stay within these max-min limits? The first method to try might be principal component analysis (PCA). If the runtime stamps are always at the same intervals, it's easy to achieve results. However, if they vary, interpolation of the Y values onto the same time stamps is needed, which complicates data preparation. With the Functional Data Explorer in JMP Pro, it becomes very convenient to display the different curves as principal components in a T²-control chart. This presentation shows how we used this tool for quality control of a pressure leakage test and how we made it simple for the practitioner to use.

Okay. Hello. Thanks for the nice introduction. Today I want to present, together with Stefan Vinzens from LRE Medical, work we've done in the last year on control charting kinetic and other time-dependent measurements. So why is that important to present? In case you have time-dependent curves and the curves themselves are important for the quality of your product, it's often very hard to define any kind of specification. In our case here, these curves were evaluated by specialists. Every measurement was sent to a specialist, who looked at the curve and said, "Yes, okay" or "not okay." This is pretty time-consuming and cost-intensive and so on. And on top of that, it's even worse when the person is sick or on vacation. Also, it's a person-dependent thing. It often happens that the person has different moods or different obligations or different priorities and so on. So let's say the assessment of the curve may vary a little bit. It's kind of a reproducibility problem, and we wanted to stop that. That was the reason why we started with this. So, for example, here you see a selection from hundreds of these curves we measured. You see it's a pressure-holding measurement: we have pressure versus time in seconds, and here you see a bunch of curves. The green ones, you see, are labeled "true." True means accepted; they are good. And "false" means rejected; they are not good, so those are bad products. What you see here is that the green ones are relatively tight together. Here in this highlighted screenshot, you see it better: they're pretty close together. Then we have a selection of red ones, most of which are clearly apart from the green ones, but there are also two which are more or less in the same regime. As you see, though, they have completely different shapes: there are some with edges and other S-shaped curves and so forth. So if we would make just a simple ±three-sigma limit around the good ones, for example, as in the upper case here, we would also include the non-good red ones here from the lower picture. A simple ±three-sigma limit approach around this series of curves would not be effective. So we need something that captures the position, which is basically what is done here.
But we also want something that takes care of the shape of these functions, of these groups. How do we analyze the shapes and positions? There are actually two approaches. One has been known for a long time already: principal component analysis. That's the first one. And in more recent times, JMP came up with the Functional Data Explorer, which also gives us the possibility to do the same. But as we have seen, it is not really the same; it is different. So let's start with the old approach, principal component analysis. For doing so, you need to transform the long table, which I show you in the next picture here on the left side: a long table where you have columns with the part number, for example, and with the test date, but also, and this is the important part, the runtime and the pressure, and this for each value. And you see here, this is in seconds, so we measured every few milliseconds. As you also see, it is hardly ever the case that for the next part it's exactly the same series of numbers here, exactly the same time points for the pressure values. And this is actually needed when you want to do a principal component analysis, because you have to transform this long table into a wide table where you have one row per part, and then, for each time slot, the data points. So what we need to do first of all is bring all these runtimes onto the same scale. Here, we have done that by just setting the longest time as 100% and the shortest as zero, and then every number is transferred from milliseconds or seconds onto this percent scale. Then we interpolated, because the values were still not on the same time slots. We had to bring all the Y values onto the same time points, which means we needed to do a kind of interpolation to get them all onto the same grid. And when we have that, we created a new column with the standardized relative times — you'll see them in the example on the next slide, so 0%, 1%, and so on — and then we transposed them into this wide form. From there on, we could do the principal component analysis, save the principal components, calculate the T² values from them, and build the control charts from these T² values. So here you see the slots 0%, 1%, 2%, and so on, and here are the pressure values. In a parallel plot, it looks like this. We have the different slots here on the X-axis, and the pressure is still the same numbers as before. You see that the curves look pretty much the same as before, but now we have about 100 data points, and before we had thousands. So the density of the points is a little less, but still the curves and the shapes and the positions and everything are the same as before. And we now take this wide table, and on all these different time slots, we do a principal component analysis. Then we get this score plot for it. What you already see here is that in the middle part, we have all the green dots — the green ones, as you see here, are the accepted ones, the good parts — surrounded by the red dots, the rejected ones, the false ones. And as you see in this example, we can stay with two principal components, because the first principal component already covers 98% of the variation and the second one 2%, and the other ones you can neglect. So I think it's good enough to save just the first two principal components. And then from these two principal components, we calculated the T² values. This means the T² is principal component one squared plus principal component two squared; that is then the T² for this data point.
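For reference, here is a compact sketch of that whole pipeline in Python with synthetic curves (the curve shapes, counts, and noise level are made up; the real analysis was done in JMP). It normalizes each part's runtime to a 0–100% scale, interpolates onto common slots, runs a PCA, and forms one T² value per part. Note that if the saved principal component scores are standardized to unit variance, the T² reduces to the simple sum of squared scores described in the talk:

```python
import numpy as np

# Hypothetical long-format data: for each part, arrays of (runtime, pressure)
# sampled at irregular, part-specific times (all values here are synthetic).
rng = np.random.default_rng(1)
parts = {}
for part in range(30):
    t = np.sort(rng.uniform(0.0, 60.0, size=2000))               # seconds, irregular
    p = 5.0 * np.exp(-t / 40.0) + rng.normal(0.0, 0.02, t.size)  # placeholder pressure curve
    parts[part] = (t, p)

# Step 1: rescale every runtime to a common 0-100 % scale and interpolate the
# pressure onto the same 101 relative time slots (the long-to-wide transformation).
grid = np.linspace(0.0, 100.0, 101)
wide = np.vstack([
    np.interp(grid, 100.0 * (t - t.min()) / (t.max() - t.min()), p)
    for t, p in parts.values()
])                                                               # shape: (n_parts, 101)

# Step 2: PCA on the wide table via SVD, keeping the first two principal components.
centered = wide - wide.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt[:2].T                                     # PC1 and PC2 score per part
explained = S[:2] ** 2 / np.sum(S ** 2)

# Step 3: Hotelling T² per part from the two scores (equal to PC1² + PC2² when the
# scores are first standardized to unit variance), ready for a control chart.
t_square = np.sum(scores ** 2 / scores.var(axis=0, ddof=1), axis=1)
print(explained.round(3), t_square[:5].round(2))
```

In the talk, the control limits for this T² series are then computed from the accepted (green) runs only, as described next.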
So for every data point, we calculated this T² and brought it onto a control chart. Here, you see the control limits calculated only from the good ones — from the green dots, from the good runs, from the good curves. You see it down here, from the moving range. So this control limit represents the normal, natural variation that we expect from the good ones, and all the non-good ones, the red ones, should be outside of this regime. And you see that this mostly works, but not for these two points here, part number 14 and part number 15. They're inside the control limits, but that is not what we want to see. So first of all, we have to understand what these two are. If you have a look at the next picture and highlight just these two, 14 and 15, then we see, "Ah, okay." These are the ones here which are really within the regime of the green ones. The other ones, which are further apart from it, are easily — not excluded, but let's say distinguished — with this control chart. But these parts were not. So what we learned from this is that with the principal component approach, as we have done it here and in former times, there is no information about the shape of the curve. It's more information about the position, where it is on this Y scale. So we need something else which takes care of the shape of the curve. So we tested the FDE, the Functional Data Explorer, and the first piece of good news is that you can just take the long table as is. There's no need for data transformation or bringing the different time points onto the same slot number or whatever. You just take the raw data as they are. Perhaps you will exclude some real outliers, where the machine has given wrong numbers or whatever, but all the other stuff you can just take as is. And perhaps sometimes you want to do a transformation or so, but it's not really necessary; in this case, we have just taken the raw data as they are. So, starting an FDE on this, we see here again our pressure-time curves, as we've seen them before. But now you see these blue verticals here, and these blue verticals represent the number of knots. So what is a knot? Here we are fitting a spline curve. A spline curve is, let's say, like a ruler that you make a bit more flexible so you can bend it onto all these different curves — let's say, to adjust it the best way to the data. And the more knots this ruler has, the more flexible it is, and the better you can adjust it to all these different points. Here, we used 20 knots — those are the 20 verticals here — and with the Bayesian information criterion, you see that of all the different functions, we get the smallest BIC. And just to have a visual check on this, you see on the lower left all the data points separated out for all the different curves, with a red line on top representing the spline curve which we fitted. It doesn't matter if the curve is pretty straightforward like here, or has more edges, or whatever — it's perfectly aligned. So just optically, it looks very good. And you can also check that with the diagnostic plots. For example, here for this spline fit, the predicted values are displayed versus the actual ones, and you see they are perfectly aligned around this 45-degree line. And also the residuals — the ones left and right here from this prediction line — are pretty small, so there's not a lot of error left that is not represented by this spline. So the fit looks really, really excellent.
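To make the knot idea concrete, here is a small illustrative sketch (synthetic data, and scipy instead of the Functional Data Explorer, so only an analogy to what JMP Pro does internally) of fitting a cubic spline with 20 interior knots to one irregularly sampled pressure curve and checking the residuals:

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

# One synthetic pressure-versus-time curve with irregular sampling (no common
# time grid is needed here, unlike the PCA route).
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0.0, 60.0, size=2000))
p = 5.0 * np.exp(-t / 40.0) + rng.normal(0.0, 0.02, t.size)

# Cubic B-spline with 20 interior knots, loosely mirroring the 20-knot basis
# chosen via BIC in the Functional Data Explorer.
knots = np.linspace(t.min(), t.max(), 22)[1:-1]
spline = LSQUnivariateSpline(t, p, knots, k=3)

# Diagnostics analogous to the actual-by-predicted plot: residuals should be small.
residuals = p - spline(t)
print(f"RMSE of the spline fit: {np.sqrt(np.mean(residuals ** 2)):.4f}")
```

The basis selection via BIC and the subsequent functional principal components are handled inside JMP Pro; the sketch only mirrors the per-curve fitting step.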
So we have now taken the parameters for this split — sorry, for the spline; well, I'm completely confused today — and did a principal component analysis on these. We separated the eigenvalues, which are the weighting factors, and the eigenfunctions for each of these curves. Then we can display them on a score plot — here, the eigenvalues for each of these curves. And you see here, again with the BIC criterion, how many functional principal components you should optimally use, and you see the minimum is at two. So we stayed with two again, as before. And now we have this score plot, and it looks pretty much the same as the one before. We have here in the center part the green curves, which are good, surrounded by the red ones, which were rejected by the specialist. Now we saved all these functional principal components and built a T² plot on them. And here again, the same picture for the T². If we take all the data points to calculate the control limits, then besides these two, every part is in control. But this is actually not what we want to have. We want to understand what the normal variation of the good ones is, to differentiate them from the variation of the non-good ones. So down here, on the second plot, we calculated the control limits only from the good ones. And then you see, okay, nearly all red curves are outside — they're out of control. But there are two: number 14 and number two here are again directly on the borderline, while all the others are clearly separated. So let's have a look at which curves number two and number 14 are. And then we see here: ah, okay. They are really different from the red ones, that's clear, but also kind of different from the majority of the green ones. Number two, which is defined as being true, is the one with a steeper slope compared to all the other green ones, and number 14 is this guy here, which is more or less at the upper end of the regime of the greens. So obviously, these little differences in shape are not strong enough to really show up in this statistical calculation. But we could really separate this guy here, with the strong edge, from the others. And perhaps this one — if the expert had had a look at it a second time, perhaps he would have rejected it, because the slope is too steep or whatever. So these are borderline cases, I would say, both with the statistical approach and, I guess, with the manual approach. But we could clearly detect number 15, which is the one here in the middle, as non-normal behavior with this tool. So you see, the FDE also has some limitations. But overall, in addition to the standard PCA approach, it comes up with this shape information of the curves. And this is also part of distinguishing, in a control chart, whether a curve or a measurement is varying normally or not.
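For reference, the control limits "calculated only from the good ones" are, as I read the description, the standard individuals-and-moving-range limits applied to the T² values of the accepted runs (standard SPC formulas, not numbers from this data set):

```latex
\overline{MR} = \frac{1}{m-1}\sum_{i=2}^{m}\left|T^2_i - T^2_{i-1}\right|,
\qquad
\mathrm{UCL} = \overline{T^2} + 3\,\frac{\overline{MR}}{1.128}
             \approx \overline{T^2} + 2.66\,\overline{MR}
```

Here the mean T² and the average moving range are taken over the m accepted (green) runs only, so the limits describe the natural variation of the good parts; the rejected runs are then judged against these limits and should fall above the UCL.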
To conclude: the standard PCA approach combined with the T² analysis or control chart is good for detecting the position of a curve and distinguishing it from curves which are not in this regime. But as we have seen, it is lacking if the shape of the curve is of importance. With the FDE approach, again combined with the T² analysis, first of all, you need much, much less data preparation. On top of that, it's good for checking the position, as above, but because it also carries shape information about the curves, it includes that in the good/non-good assessment as well. In our work, we stopped at this point, but in principle, you could build a kind of automation: if you use these curves, or these principal components, together with the information on good and bad, you could automate the good/bad decision, because then you would have a model behind it which predicts what you will see. But we haven't done that. We stopped here at the T² chart, to understand the variation and what is varying more than the normal variation. Thank you very much for your attention. And if you have questions, I'm open to answer them now. Thank you.
Repeated k-fold cross-validation is commonly used to evaluate the performance of predictive models. The problem is, how do you know when a difference in performance is sufficiently large to declare one model better than another? Typically, null hypothesis significance testing (NHST) is used to determine whether the differences between predictive models are "significant," although the usefulness of NHST has been debated extensively in the statistics literature in recent years. In this paper, we discuss problems associated with NHST and present an alternative known as confidence curves, which has been developed as a new JMP add-in that operates directly on the results generated from JMP Pro's Model Screening platform.

Hello, my name is Bryan Fricke. I'm a newly minted product manager at JMP, focusing on the JMP user experience. Previously, I was a software developer working on exporting reports to standalone HTML files, JMP Live, and JMP Public. In this presentation, I'm going to talk about using confidence curves as an alternative to null hypothesis significance testing in the context of predictive model screening. Additional material on this subject can be found on the JMP Community website and in the paper associated with this presentation. Dr. Russ Wolfinger, who is a distinguished research fellow at JMP, is a co-author, and I would like to thank him for his contributions. The Model Screening platform, introduced in JMP Pro 16, allows you to evaluate the performance of multiple predictive models using cross-validation. To show you how the Model Screening platform works, I'm going to use the Diabetes data table, which is available in the JMP Sample Data Library. I'll choose Model Screening from the Analyze > Predictive Modeling menu. JMP responds by displaying the Model Screening dialog. The first three columns in the data table represent disease progression in continuous, binary, and ordinal forms. I'll use the continuous column named Y as the response variable. I'll use all the columns from Age to Glucose in the X, Factor role. I'll type 1-2-3-4 in the Set Random Seed input box for reproducibility. I'll select the check box next to K Fold Cross Validation and leave K set to five. I'll type three into the input box next to Repeated K Fold. In the Method list, I'll unselect Neural, and now I'll select OK. JMP responds by training and validating models for each of the selected methods using their default parameter settings and cross-validation. After completing the training and validating process, JMP displays the results in a new window. For each modeling method, the Model Screening platform provides performance measures in the form of point estimates for the coefficient of determination, which is also known as R squared; the root average squared error; and the standard deviation of the root average squared error. Now I'll click Select Dominant. JMP responds by highlighting the method that performs best across the performance measures.
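As an aside for readers who want to reproduce the flavor of this outside JMP: scikit-learn ships essentially the same Efron et al. diabetes data as JMP's Diabetes.jmp sample table (442 rows, ten predictors, continuous response), so a rough analog of the screening step looks like the sketch below. This is not the Model Screening platform itself, the model list is my own choice, and an ordinary linear model stands in for Fit Stepwise:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# 442 rows, 10 predictors (Age ... Glucose), continuous response Y.
X, y = load_diabetes(return_X_y=True)

# 3 x 5-fold repeated cross-validation with a fixed seed, mirroring the
# Model Screening settings used in the demo (K = 5, 3 repeats, seed 1234).
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1234)

models = {
    "Linear model (stand-in for Fit Stepwise)": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Bootstrap Forest (random forest)": RandomForestRegressor(random_state=0),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"{name}: mean R^2 {r2.mean():.3f} over {r2.size} folds (SD {r2.std(ddof=1):.3f})")
```

Either way, what you end up with is a table of cross-validated point estimates per method, which leads straight to the question of how to judge the differences between them.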
What's missing here is a graphic that shows you the size of the difference between the dominant method and the other methods, along with a visualization of the uncertainty associated with those differences. But why not just show P values indicating whether the differences are significant? Shouldn't a decision about whether one model is superior to another be based on significance? First, since a P value provides a probability based on a standardized difference, a P value by itself loses information about the raw difference, so a significant difference doesn't imply a meaningful difference. But is that really a problem? I mean, isn't it pointless to be concerned with the size of a difference between two models before significance testing is used to determine whether the difference is real? The problem with that line of thinking is that it's power, or one minus beta, that determines our ability to correctly reject a null hypothesis. Authors such as Jacob Cohen and Frank Schmidt have suggested that typical studies have power to detect differences in the range of 0.4 to 0.6. Let's suppose we have a difference where the power to detect a true difference is 0.5 at an alpha value of 0.05. That suggests we would detect the true difference on average 50 percent of the time. In that case, significance testing would identify real differences no better than flipping an unbiased coin. If all other things are equal, Type One and Type Two errors are equivalent. But significance tests that use an alpha value of 0.05 often implicitly assume Type Two errors are preferable to Type One errors, particularly if the power is as low as 0.5. A common suggestion to address these and other issues with significance testing is to show the point estimate along with confidence intervals. One objection to doing so is that a point estimate along with a 95 percent confidence interval is effectively the same thing as significance testing. Even if we assume that is true, a point estimate and confidence interval still put the magnitude of the difference and the range of the uncertainty front and center, whereas a lone P value conceals them both. Various authors, including Cohen and Schmidt, have recommended replacing significance testing with point estimates and confidence intervals. Even so, the recommendation to use confidence intervals raises the question, "Which ones do we show?" Showing only the 95 percent confidence interval would likely encourage you to interpret it as another form of significance testing. The solution provided by confidence curves is to literally show all confidence intervals up to an arbitrarily high confidence level. So how do you show confidence curves in JMP? To conveniently create confidence curves in JMP, install the Confidence Curves add-in by visiting the JMP Community homepage. Type "confidence curves" into the search input field. Click on the first entry that appears. Now click the download icon next to the Confidence Curves add-in file, and then click on the downloaded file. JMP responds by asking if I want to install the add-in. You would click "Install." However, I'll click "Cancel," as I already have the add-in installed. How do you use the add-in? First, to generate confidence curves for this report, select "Save Results Table" from the top red triangle menu located in the Model Screening report window. JMP responds by creating a new table containing, among others, the following columns: Trial, which contains the identifiers for the three sets of cross-validation results; Fold, which contains the identifiers for the five distinct sets of subsamples used for validation in each trial; Method, which contains the methods used to create the models; and N, which contains the number of data points used in the validation folds.
Note that the Trial column will be missing if the number of repeats is exactly one, in which case the Trial column is neither created nor needed. Save for that exception, these columns are essential for the Confidence Curves add-in to function properly. In addition to these columns, you need one column that provides the metric used to compare methods. I'll be using R squared as the metric of interest in this presentation. Once you have the Model Screening results table, click "Add-Ins" on JMP's main menu bar and then select "Confidence Curves." The logic that follows would be better placed in a wizard, and I hope to add that functionality in a future release of the add-in. As it is, the first dialog that appears requests that you select the name of the table that was generated when you chose "Save Results Table" from the Model Screening report's red triangle menu. The name of the table in this case is Model Screening Statistics Validation Set. Next, a dialog is displayed that requests the name of the method that will serve as the baseline from which all the other performance metrics are measured. I suggest starting with the method that was selected when you clicked Select Dominant in the Model Screening report window, which in this case is Fit Stepwise. Finally, a dialog is displayed that requests that you select the metric to be compared between the various methods. As mentioned earlier, I'll use R squared as the metric for comparison. JMP responds by creating a confidence curve table that contains P values and corresponding confidence levels for the mean metric difference between the chosen baseline method and each of the other methods. More specifically, the generated table has columns for the following: Model, in which each row contains the name of the modeling method whose performance is evaluated relative to the baseline method; P Value, in which each row contains the probability associated with a performance difference at least as extreme as the value shown in the Difference in R Square column; Confidence Interval, in which each row contains the confidence level we have that the true mean is contained in the associated interval; and finally, Difference in R Square, in which each row is the maximum or minimum of the expected difference in R squared associated with the confidence level shown in the Confidence Interval column. From this table, confidence curves are created and shown in a Graph Builder graph. What are confidence curves? To clarify the key attributes of a confidence curve, I'll hide all but the Support Vector Machines confidence curve by clicking on Support Vector Machines in the local data filter. By default, a confidence curve only shows the lines that connect the extremes of each confidence interval. To see the points, select Show Control Panel from the red triangle menu located next to the text that reads Graph Builder in the title bar. Now I'll Shift-click the Points icon. JMP responds by displaying the end points of the confidence intervals that make up the confidence curve. Now I will zoom in and examine a point. If you hover the mouse pointer over any of these points, a hover label shows the P value, confidence interval, difference in the size of the metric, and the method used to generate the model being compared to the reference. Now I will turn off the points by Shift-clicking the Points icon and clicking the Done button.
Even though the individual points are no longer shown, you can still view the associated hover labels by placing the mouse pointer over the confidence curve. The point estimate for the mean difference in performance between Support Vector Machines and Fit Stepwise is shown at the 0 percent confidence level; it is the mean value of the differences computed using cross-validation. The confidence curve plots the extent of each confidence interval from the generated table between the zero and the 99.99 percent confidence level, which is an arbitrarily high value. Along the left Y axis, the P values associated with the confidence intervals are shown. Along the right Y axis, the confidence level associated with each confidence interval is shown. The Y axis uses a log scale, so that more resolution is shown at higher confidence levels. By default, two reference lines are plotted alongside a confidence curve. The vertical line represents the traditional null hypothesis of no difference in effect. Note that you can change the vertical line position, and thereby the implicit null hypothesis, in the X axis settings. The horizontal line passes through the conventional 95 percent confidence interval. As with the vertical reference line, you can change the horizontal line position, and thereby the implicit level of significance, by changing the Y axis settings. If a confidence curve crosses the vertical line above the horizontal line, you cannot reject the null hypothesis using significance testing. For example, we cannot reject the null hypothesis for Support Vector Machines. On the other hand, if a confidence curve crosses the vertical line below the horizontal line, you can reject the null hypothesis. For example, we can reject the null hypothesis for Boosted Tree. How are confidence curves computed? The current implementation of confidence curves assumes the differences are computed using R-times repeated K-fold cross-validation. The extent of each confidence interval is computed using what is known as a variance-corrected resampled t-test. Note that authors Claude Nadeau and Yoshua Bengio observed that a corrected resampled t-test is typically used in cases where training sets are five or ten times larger than validation sets. For more details, please see the paper associated with this presentation.
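As a rough sketch of that computation (my reading of the Nadeau–Bengio correction; the add-in's exact implementation may differ, and the per-fold differences below are made up): the corrected variance of the mean difference is the sample variance of the k·r per-fold differences multiplied by (1/(k·r) + n_test/n_train), and each confidence level's interval endpoints come from the t distribution with k·r − 1 degrees of freedom.

```python
import numpy as np
from scipy import stats

def confidence_curve(diffs, n_train, n_test, levels=None):
    """Confidence-interval endpoints for a mean performance difference estimated
    with r-times repeated k-fold cross-validation, using the corrected variance
    (1/(k*r) + n_test/n_train) * var(diffs) attributed to Nadeau and Bengio.

    diffs: the k*r per-fold metric differences (method minus reference).
    """
    diffs = np.asarray(diffs, dtype=float)
    m = diffs.size                                          # k * r folds in total
    mean = diffs.mean()
    se = np.sqrt((1.0 / m + n_test / n_train) * diffs.var(ddof=1))
    if levels is None:
        levels = np.concatenate([np.linspace(0.0, 0.99, 100), [0.999, 0.9999]])
    t_crit = stats.t.ppf(0.5 + levels / 2.0, df=m - 1)
    lower, upper = mean - t_crit * se, mean + t_crit * se
    p_value = 2.0 * stats.t.sf(abs(mean) / se, df=m - 1)    # two-sided, null of zero
    return mean, p_value, levels, lower, upper

# Made-up example: 15 per-fold R-squared differences from 3 x 5-fold CV on a
# 442-row table, so roughly 354 training and 88 validation rows per fold.
rng = np.random.default_rng(0)
diffs = rng.normal(loc=-0.02, scale=0.04, size=15)
mean, p, levels, lower, upper = confidence_curve(diffs, n_train=354, n_test=88)
print(f"mean difference {mean:.3f}, two-sided p-value {p:.3f}")
```

Plotting the lower and upper endpoints against the confidence levels reproduces the funnel shape of a confidence curve: the point estimate sits at the 0 percent level and the intervals widen toward the arbitrarily high 99.99 percent level.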
In fact, with this data set, it turns out that we can be about 81 percent confident that Fit Stepwise is at least as good as, if not better than, every method other than Generalized Regression Lasso. Now let's consider the relationship between confidence curves. If two or more confidence curves significantly overlap and the mean difference of each is not meaningfully different from the other, the data suggest each method performs about the same as the other with respect to the reference model. For example, we can see that on average, the Support Vector Machines model performs less than 0.5 percent better than Bootstrap Forest, which is arguably not a meaningful difference. And the confidence intervals do not overlap until about the 4 percent confidence level, which suggests these values would be expected if both methods really do have about the same difference in performance with respect to the reference. If the average difference in performance is about the same for two confidence curves but the confidence intervals don't overlap much, the data suggest the models perform about the same as each other with respect to the reference model; however, in this case we are confident of a non-meaningful difference. This particular case is rarer than the others, and I don't have an example to show with this data set. On the other hand, if the average difference in performance between a pair of confidence curves is meaningfully different and the confidence curves have little overlap, the data suggest the models perform differently from one another with respect to the reference. For example, the Generalized Regression Lasso model predicts about 13.8 percent more of the variation in the response than does the Decision Tree model. Moreover, the confidence curves don't overlap until about the 99.9 percent confidence level, which suggests these results would be quite unusual if the methods actually performed about the same with respect to the reference. Finally, if the average difference in performance between a pair of confidence curves is meaningfully different and the curves have considerable overlap, the data suggest that while the methods perform differently from one another with respect to the reference, it wouldn't be surprising if the differences are spurious. For example, we can see that on average, Support Vector Machines predicted about 1.4 percent more of the variance in the response than did K Nearest Neighbors. However, the confidence intervals begin to overlap at about the 17 percent confidence level, which suggests it wouldn't be surprising if the difference in performance between each method and the reference is actually smaller than suggested by the point estimates. Simultaneously, it wouldn't be surprising if the actual difference is larger than measured, or if the direction of the difference is actually reversed. In other words, the difference in performance is uncertain. Note that it isn't possible to assess the variability in performance between two models relative to one another when the differences are relative to a third model. To compare the variability in performance between two methods relative to one another, one of the two methods must be the reference method from which the differences are measured. But what about multiple comparisons? Don't we need to adjust the P values to control the familywise Type One error rate?
In his paper about confidence curves, Daniel Berrar suggests that adjustments are needed in confirmatory studies, where a goal is prespecified, but not in exploratory studies. This suggests using unadjusted P values for multiple confidence curves in an exploratory fashion, and only a single confidence curve, generated from different data, to confirm your finding of a significant difference between two methods when using significance testing. That said, please keep in mind the dangers of cherry picking and p-hacking when conducting exploratory studies. In summary, the Model Screening platform introduced in JMP Pro 16 provides a means to simultaneously compare the performance of predictive models created using different methodologies. JMP has a long-standing goal to provide a graph with every statistic, and confidence curves help to fill that gap for the Model Screening platform. You might naturally expect to use significance testing to differentiate between the performance of the various methods being compared. However, P values have come under increased scrutiny in recent years for obscuring the size of performance differences. In addition, P values are often misinterpreted as the probability that the null hypothesis is false. Instead, a P value is the probability of observing a difference as or more extreme than the one observed, assuming the null hypothesis is true. The probability of correctly rejecting the null hypothesis when it is false is determined by power, or one minus beta. I've argued that it is not uncommon to have only a 50 percent chance of correctly rejecting the null hypothesis with an alpha value of 0.05. As an alternative, a confidence interval could be shown instead of a lone P value. However, the question would be left open as to which confidence level to show. Confidence curves address these concerns by showing all confidence intervals up to an arbitrarily high level of confidence. The mean difference in performance is clearly visible at the zero percent confidence level, and that acts as a point estimate. All other things being equal, Type One and Type Two errors are treated equivalently, so confidence curves don't embed a bias towards trading Type One errors for Type Two. Even so, by default, a vertical line is shown in the confidence curve graph for the standard null hypothesis of no difference. In addition, a horizontal line is shown that delineates the 95 percent confidence interval, which readily affords a typical significance testing analysis if desired. The defaults for these lines are easily modified if a different null hypothesis or confidence level is desired. Even so, given the rather broad and sometimes emphatic suggestion to replace significance testing with point estimates and confidence intervals, it may be best to view a confidence curve as a point estimate along with a nearly comprehensive view of its associated uncertainty. If you have feedback about the Confidence Curves add-in, please leave a comment on the JMP Community site, and don't forget to vote for this presentation if you found it interesting and/or useful. Thank you for watching this presentation, and I hope you have a great day.
As one of the global technology leaders in the wafer industry, Siltronic AG already has strong analytics capabilities. In this presentation, I share my experience in establishing the use of JMP as the standard in my company, as well as my current view on JMP usage and my future roadmap. I hope to discuss and exchange experiences with other users.   JMP has been used at Siltronic for many years now; I have personally used it for more than 10 years. As we serve the quality-focused semiconductor industry, we need to have good tools and establish advanced methods to be successful. At Siltronic, several other tools are also used for data analytics, but many of them require advanced skills. As a result, they are not always accessible by many of the process engineers, most of whom have experience with Excel and basic statistical knowledge. Simply establishing JMP as the standard for statistical analysis does not by itself guarantee that Siltronic realizes the full potential of data analytics. However, many features in JMP help accelerate learning and speed up the deployment of advanced data analytics to improve processes.       Hello everyone, thanks for joining us today. I really appreciate having the opportunity to tell you my story: how to enhance data analytics skills by advocating JMP in a company. So this is the outline. After introducing my company, Siltronic, and myself, Georg Raming, I will talk about data analytics at Siltronic, how I started with JMP and what my path was, what my first target was, my second target, and what our current approach is. So, about Siltronic. We have four world-class production sites: in the United States; in Europe, in Germany, at Burghausen and Freiberg; and in Asia, in Singapore. We have around 4,000 employees with global scale and reach, and profound knowledge in silicon technologies going back more than 50 years. So this is the history of Siltronic. The first silicon wafers were developed in 1962, and the first 200 millimeter wafer in 1984. Meanwhile, we have founded further sites in Portland, United States, and Freiberg in Germany. The first 300 millimeter wafer was developed in 1990, and the first 300 millimeter production started in Freiberg in 2004. And currently, as written here, we are developing a new fab in Singapore (2021). So this is the electronics value chain. Starting from the raw material, ultra-pure silicon, worth about $1.2 billion, semiconductor silicon wafers are worth around tenfold that, about $11.2 billion. The semiconductors themselves are worth much more again, and the electronics are worth around $1,650 billion. The high demand for these products drives our business with silicon semiconductor wafers. This is where our sites are, in more detail. We have a 200 millimeter fab in Portland, United States. In Burghausen and Freiberg we have 300 millimeter wafer fabs and small-diameter fabs, as well as crystal pulling. And in Singapore we have a 200 millimeter wafer fab, a 300 millimeter wafer fab, and 300 millimeter crystal pulling. The Singapore fabs are among the world's newest and largest, and the central R&D hub is in Burghausen, Germany. This is how silicon wafers are produced. Starting from the raw material, ultra-pure silicon, we have two methods for growing single crystals: Czochralski pulling and Float Zone pulling. After growing the ingot, the mechanical preparation takes place, like ingot grinding, multi-wire slicing, and edge rounding.
And then come the wafer steps, like laser marking, lapping, cleaning, etching, and polishing, and for a part of the products, epitaxy. Our product portfolio is mostly 300 millimeter wafers made with the CZ (Czochralski) process for memory, logic, and analog, and a smaller part is 200 millimeter and 125 millimeter, pulling Czochralski ingots and Float Zone ingots. There we have applications like logic, analog, discretes, image sensors, power, optoelectronics, and IGBTs, and special products like highly doped wafers as well. So our key requirements on the ingot side are purity, homogeneity, mechanical stability, oxygen content, and more like this. On the wafer side, we have flatness, uniformity, edge flatness, surface cleanliness, and the like. And to make the requirements a little more impressive and understandable: what does a purity of one part per trillion mean? It is not more than three to four dissolved sugar cubes in a lake like the Chiemsee in Bavaria, Germany. And flatness of a wafer means 20 nanometers in height on a wafer, like a flat leaf on the surface of the Chiemsee. Now about me. I'm an electrical engineer with a PhD in simulation of electrothermal processes. I also have some statistical background, like a Six Sigma Black Belt, and my task is the development of silicon single crystal growth processes at Siltronic in Burghausen. I also have many years of experience in data science-like tasks, mainly building my own working environment and the environment for my group and others, and I'm responsible for the JMP software at Siltronic, for more than 200 users. Data analytics at Siltronic. We have data science professionals, and they provide services to all. If we as engineers need some reports, those are mainly static, and the definition of new reports takes some time and is not as flexible as we would need, so we often have to do it ourselves. The professionals are using server technologies like Cognos Analytics, Python, and others, but we are lucky to have most of our data in databases. On the other side, JMP is the standard statistics tool for everyone; Excel is used additionally. With JMP there are always some teething problems, because some activation energy is needed to get, let's say, new things working with JMP. But JMP allows full-scale data analytics for everyone: data acquisition, manipulation, data exploration and visualization, advanced statistics, modeling, DOE, and more. How did I make my start with JMP? I have been working at Siltronic since 2001, and as far as I remember, I have always been looking for a good, general, full-scale tool. Around 2009, after already using JMP for some years, I was attracted by the nice explorative possibilities of Graph Builder in JMP, but I did not feel comfortable with the data table due to a lack of understanding, and I felt the data-in procedure was complicated, because I had my tools in Excel dragging data from the database and I had to push it through to JMP. What then really gave me a boost was understanding how to directly import data from the database into JMP. And after I understood how to do that, I decided to use JMP as my standard tool for data analytics. I very much appreciate the ability to store the queries in the JMP data table, to have documentation of where the data comes from, and the ability to update. It is also nice that JMP saves graphs and other evaluations as scripts, and I use that a lot. My first vision: I was alone.
So I only saw myself, but I decided to become an expert in JMP. I wanted to know every button in JMP, but this isn't possible, as I learned later. I did not see the others, only my own environment, and I didn't yet have the idea of collaborating internally. External collaboration is always difficult due to the confidentiality of data. I use data tables with queries a lot, like the one shown here, with the nice scripts to update data from the database, and the JMP table to work on, like here for the famous Big Class data table. I started to explore the features of JMP, but the requirements of my work by far did not cover JMP's full range. So I started to learn in the community, on the web, and also to explore colleagues' use cases, just out of interest. Later I saw that deploying many of these features was also beneficial to my own work. Meanwhile, I like the JMP Starter window a lot; it shows the dynamic range of the software, and I use it a lot in training to show what is possible and to give an overview. My second vision came when I recognized the others. I felt that I could support others in using advanced data analytics, and I started some activities, like a JMP workshop. It was one show for all, so I invited all interested colleagues, but I got only very few people presenting. It's difficult to get people involved in that, and the skill levels were too different to make the show efficient. It is even difficult to get representative data on the skill levels of the participants. We also offered some special one-to-one support, and this worked well for a few people; it was important for me and the other trainers to learn what the colleagues' requirements are, what they really need. Additionally, we offer basic training, and this turned out to be the most important and effective measure, also for getting into contact with new staff and other people. There was also a nice story about how to get others involved as trainers: we tried to encourage recently hired staff, because they are eager to learn, they have available time and resources and good communication skills, and it was quite successful to encourage these people. Last but not least, the involvement of management is important to establish visible collaboration and to justify the effort that is put into this. And all this is not a self-seller; a driver is needed. My current vision is more about establishing a network and creating something like a snowball effect, because with a growing number of users it's no longer possible for one person to address all the people using JMP. So the workload has to be distributed, and more communication lines are needed. My current target is to make sure everyone knows a JMP expert, to offer easy access to JMP knowledge internally without the previously mentioned know-how problem, to increase the usage and knowledge of JMP, and to make the whole story visible, including to management, and included in the procedures. That's why we built up a communication structure like the one shown here. At the top, there is a JMP component owner, which we have for each site. The component owner is responsible for the technical things, software topics, knowledge, training, and so on. Then we have the power users, who are in good contact with each other and with the component owner. These are the people that should be known by every user; every user should know a power user in his or her department whom they can reach out to when any questions occur.
And there are other current measures, where we also get good support from the JMP team, so thanks to JMP for that. Beginners' training I already mentioned; this is really the most important measure and also the easiest to establish. It's a network for free, and you get high visibility. We also included STIPS by JMP in our training program, and this is excellent for learning statistics and JMP. With Martin Demel from JMP, we installed a jour fixe every month, and there we have very good discussions, and more and more people are encouraged to participate in this meeting. It works very well. We included the courses in our internal training system. We also installed a ToolBox. This is a JMP script that collects all the files in a folder structure and makes data and analyses accessible to all users. And there are other measures, of course, depending on the company, like special workshop courses with other focuses, such as the SQL database language, the infrastructure of the data, and statistics in general. My summary is that learning and implementing JMP in a company takes time. It does not come for free, and it needs a lot of personal engagement. But it's worth doing; it will enhance the data analytics skills in a company. You need management support, last but not least to pay for the licenses, but also for other things, like making the effort more visible and keeping it running. And all the solutions, of course, depend on the company and on the people. I felt it was a good idea to start some projects small and to see how they developed. It's important to build and enhance networks around this and to evaluate the interactions, what is happening with the people using JMP and how they interact, and, if necessary, to rethink the strategy. It's worth doing, and it will pay off after a short time through enhanced evaluation possibilities and better decisions. And last but not least, you can see it in the ease of use of JMP, resulting in fun. Okay. Thank you for listening. I'm finished with my presentation, and I would be happy to answer questions if there are any from you, or to hear how others approach this topic. Thanks.
This presentation demonstrates how consumer research methods, choice design modeling and reliability analysis platforms in JMP 16 were used to help high school students optimize their utility and satisfaction in purchasing the right laptop for school usage, pick the right school courses, maximize their exam testing performance, and optimize their time spent on high school STEM projects (with a STEAMS approach). For example, Choice Design and Model platforms were used to conduct survey analysis of laptop purchasing preferences.  This analysis was supplemented with Reliability forecast modeling (Life Distribution) and back-of-envelope calculations in JMP to provide greater context for optimal decision-making in purchasing the best laptop for use. Subsequent phases of the project included MaxDiff Design and Model platforms, which were used to conduct survey analysis of the popularity and difficulty of school courses. The Item Analysis platform was later used to study exam question profiles to help students and instructors assess exam difficulty. The Latent Class Analysis platform was used to study multiple choice exams. Finally, using Explore Patterns/Explore Outliers, potentially unusual patterns in responses among examinees were detected, thus uncovering possible evidence of exam cheating. In this phase of the project work presented, we demonstrate how a modern choice design can be used to optimize survey methodology that avoids sampling bias, and we show how to use the Choice Modeling platform to appropriately analyze survey data.       Hi. Thanks everyone for joining us. The title of this presentation is Choice Design and Max Difference Analysis in Optimizing a High School Laptop Purchase in the Context of a STEM Project or STEAMS Project. My name is Patrick Giuliano. I am a co-author and co-presenter for this presentation. And of course, this is JMP Discovery Summit Europe 2022, and I'm happy to be presenting today. So before I get into our project definition or project charter, I just wanted to mention some general context for this project. This project has a STEAMS orientation, which is basically a STEM framework, but with the addition of a focus on practical AI and statistics, and through the lens of JMP. All right. So the opportunity statement for us here is that every year, students in grade nine at Stanford Online High School need to take core courses and do a series of projects. In fact, there are many projects per year, as many as 150, and many of them require the collection of survey data. In the context of survey data collection, JMP has a powerful Choice Design and Choice modeling platform, as well as Max Difference design capabilities, and those can be used both to optimize survey methodology and to analyze survey data. Within this particular use case, we're going to use JMP 16's Choice Design and Choice modeling platforms to study consumer research, and specifically to assist with the optimal choice of a laptop for a student. We're also going to take this a step further, look at some reliability questions, and do some calculations to look at the opportunity costs associated with purchasing a warranty at different stages of ownership. All right. So here's a quick orientation to our STEM diagram. We had ten respondents in the context of our example, our sample data set. In fact, this is a JMP sample data set, which I will provide on the user community and which is also available in the JMP Sample Data directory.
There are ten respondents in this survey. As you can see here in the lower left-hand corner of the slide, there are four attributes that change within a choice set. There are two profiles per choice set, eight choice sets per survey, one survey total to be distributed, and, of course, ten responses, as we indicated. So in terms of the technology associated with the laptop, we're looking at four key attributes: hard disk drive space, processor speed, battery life, and the computer cost. And we can see a picture of the design, as well as the different choice sets that are paired by number with the different attributes in the columns. And then, quite nicely, we see a probability profiler, which really just shows the opportunity space: how changes across the four parameters on the right result in a change in the probability, or likelihood, of purchase. Okay. So let's provide an orientation to the science and the statistics with respect to consumer research. We're going to think about collecting information, and we want that information to somehow reflect how customers use their particular products or services in general. We want to have some understanding of how satisfied customers are with their purchase, what features they might desire, and what insights can be used to improve the problem statement they're working on in the context of the purchase they're making. In this project, we use JMP's Consumer Research menu, and specifically, we're going to focus on the Choice Design platform. Okay, so here are some nice graphics that highlight the consumer research process. Like many processes, we can see that it's very iterative and cyclical. It's focused on strategy development, decision making, improvement, and solving challenges. But as we'll see later in the analysis, there are specific modeling considerations that Choice Design takes into account to make our modeling procedure a little bit simpler and more effective for these particular types of problems. All right, so going back to our overview of the study. The voice of the customer really speaks to how the manufacturer decides to construct the design in our case. And so we have two sets of profiles, again, that will be administered to ten respondents. The goal is to understand how laptop purchasers view the advantages of a collection of these four attributes. All right. So our particular use case here is going to be the Dell Latitude 5400 Chromebook. This is a very common budget laptop that would be considered appropriate for student usage. Okay, so here's an overview of the Choice Design modeling platform. We can see that the four parameters are specified under the Attributes section: HDD size, processor speed, battery life, and sale price, and the high and low levels of the design are specified over at the right under Attribute Levels. You can see that by generating this design, JMP gives us a preview of the design in the Evaluate Design platform, with our choice sets, our hard disk size, our speed, our battery life, and our corresponding price for each of the choice sets, where each choice set represents a choice between one computer possessing a certain set of features and another. So as I mentioned before, we're going to bring in the sample data and actually point to the specific location where the data is located. We're going to go ahead and put that in the community for our users to practice with this data.
Okay. Here is the display of the model specification window for Choice Design. After we've generated the design, we have to fit the design. We can see that the structure of the design is ten respondents whose profiles are paired, so we're going to set the data format to one table, stacked. Our selected data table populates into the Laptop Results data table. Our response is related to the probability of purchase, which is related to price. We put subject into Subject ID, choice set into Choice Set ID, and the respondent into Grouping. We cast our four attributes into the X role here. And then we have subject effects, or we can have subject effects, which are optional and which we didn't consider in this particular context. We also have an option for missing value imputation, which is nice. One of the things you'll notice here, though, if you're familiar with JMP, is that the Run Model option is here, but none of the other options found in many other modeling platforms. In JMP, we have what's called personality selection; in this particular context, we're limited to a specific model under a specific framework only. Why is that? Well, from our perspective, this is likely because this modeling strategy is very specific to consumer research, and we would really only want to consider main effects between choices, because those effects reflect the choice modeling structure that we're implementing: we pick from either one set of features or the other. Okay. So principally, what mathematical or statistical model is underpinning this type of procedure? It's really a logistic regression type of model, because our Y response is a probability. Our response can be thought of as whether we're going to make a purchase or not, and therefore as the likelihood of making a purchase. What we see in the summary of the model output is that a negative estimate on a term in the model indicates that the probability of purchasing is lower at that particular attribute level. We can see from the summary, the parameter estimates, and the ranking in the effects summary that buyers generally prefer a larger hard drive size, faster speed, longer battery life, and a cheaper laptop. And we can see that speed and price are really the most significant predictors of the probability of purchase. We spoke a little bit earlier about why there isn't an interaction term. We think that it's really because, in the context of this research problem, it's not a practical consideration. It's certainly not likely due to a lack of degrees of freedom, because in this particular data set we had over 100 observations and we're only fitting four terms. Okay. So another thing to notice here, I think, that's not immediately obvious on the slide, is a note at the bottom of the parameter estimates that says "converged in the gradient." That really speaks to the fact that this model estimation procedure, this likelihood-based modeling procedure, is iterative. It's not necessarily deterministic, and it involves iterating to find an optimal solution.
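To make that concrete, here is a minimal sketch, outside JMP, of the kind of likelihood such a paired-choice (conditional logit) model maximizes. The function and variable names are hypothetical, and this is not JMP's implementation, just the standard textbook form of the model the talk alludes to.

```python
import numpy as np
from scipy.optimize import minimize

def fit_paired_choice_model(x_chosen, x_rejected):
    """Main-effects conditional logit for paired choice sets.

    x_chosen, x_rejected : (n_tasks, n_attributes) arrays holding the coded
    attribute levels of the selected and non-selected profile in each task.
    Returns the estimated utility coefficients (part-worths). This is an
    illustrative stand-in for the likelihood-based, iterative estimation the
    platform performs, not JMP's own code.
    """
    diff = np.asarray(x_chosen, float) - np.asarray(x_rejected, float)

    def neg_log_likelihood(beta):
        # P(choose A over B) = exp(U_A) / (exp(U_A) + exp(U_B))
        #                    = logistic((x_A - x_B) @ beta)
        z = diff @ beta
        return np.logaddexp(0.0, -z).sum()  # sum of -log logistic(z), numerically stable

    start = np.zeros(diff.shape[1])
    result = minimize(neg_log_likelihood, start, method="BFGS")
    return result.x

# Hypothetical usage with effect-coded columns for HDD size, speed,
# battery life, and price (one row per answered choice task):
# part_worths = fit_paired_choice_model(x_chosen, x_rejected)
```

Iterating from a zero starting vector with a quasi-Newton optimizer mirrors the "converged in the gradient" message: the routine stops when the gradient of the log-likelihood is essentially zero.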
Okay. So let's take a look at the effect marginals analysis in the context of this report. We can see that the report shows the marginal probability for each of the four attributes at their different levels. What I'm highlighting here is that the marginal probabilities that are the most different from each other indicate where there's the most differentiation in terms of making the purchase decision. So clearly, price and computer processing speed are the most important in terms of really driving that probability decision. As an example, you can see that 71% of buyers may choose an 80 gigabyte over a 40 gigabyte hard drive size, as indicated by the marginal probability of .7129, and 68% of buyers may choose, for example, a $1,000 price over a $1,200 or $1,500 price, as indicated by the marginal probability of .6843. So we can look within each of these marginal probability panels to see what the preference would be on the basis of a given factor, like price or speed, and also look across the effect marginals panels, specifically at the differences in marginal probability or marginal utility, to see which factors are most differentiating in terms of driving the purchase decision. Okay. So the next thing I'm going to talk about is the utility profiler. In addition to the response probability, or likelihood of purchase, we also have something called utility. Utility is really something like probability, but it's defined differently. The utility profiler report shows, in effect, a measure of the buyer's satisfaction in a particular scenario. In the context of utility, a higher utility indicates higher happiness, if you will, and a lower utility, below zero, for example, indicates relative unhappiness. And so we can see from this profiler that the utility is increased, or, if you will, maximized, when purchasers spend the least amount and have the longest battery life, the highest processor speed, and the largest disk size, which is completely intuitive in this context. We can think about utility very much like we think about desirability in the context of traditional experimental design, where we want to maximize the utility function in order to maximize buyer satisfaction. And as I mentioned here, there is a relationship between probability and utility. Mathematically, what is it? Well, we don't get into that here, but it is articulated in JMP's documentation, and it is something that I'm going to be thinking about as part of the dialogue for this talk when it's archived in the user community, so look forward to that. Okay. So now, let's look at the probability profiler. The probability profiler is similar to the utility profiler I discussed on the prior slide, but it's, of course, a little bit different. So what is it, practically? Well, with the profiler set at these particular settings of X, the response probability is 12%. The way we can interpret this is to say that 12% of buyers would consider spending $1,500 to get a laptop with a 40 gigabyte disk, a 1.5 GHz processor, and a four-hour battery life. Okay. So the way we like to think about this is that for any special condition where you want to know the probability at a specific set of factor levels, the probability might be more useful than, for example, utility, which describes a measure of the buyer's overall satisfaction. And you'll notice clearly that the profiler is limited to just two levels. Again, this goes back to the nature of choice design and consumer research: we really want to hone in on the buyer's interest by giving them a successive series of dichotomous choices to choose between, like how, when we're at the optometrist, the optometrist does the lens flipping and says, "Is A better or B better?" A or B, and then you go on to the next one, and then she asks for a similar selection between A or B.
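For reference, in the usual logit formulation of a choice model (stated here as a general textbook relationship, not a quotation of JMP's documentation), utility and choice probability are linked by a softmax:

$$
P(\text{choose } i) \;=\; \frac{e^{U_i}}{\sum_{j \in C} e^{U_j}},
\qquad
U_i \;=\; \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4},
$$

so with two profiles A and B in a choice set, $P(A) = 1/(1 + e^{-(U_A - U_B)})$. That is why a higher utility maps monotonically to a higher purchase probability, and why maximizing utility and maximizing the probability of purchase point to the same attribute settings.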
Okay, so the next part of the project is really around the warranty consideration. I spoke about this a little bit at the beginning, so let's go into it in a little more depth before we conclude. Suppose the optimal choice for the consumer was a laptop sold at a price of $1,000, and suppose that the consumer purchased an extended warranty protection plan: one of two years' duration would be $102, and one of three years would be $141. One of the key questions for the consumer is, after they buy the laptop, do they buy the warranty or the extended warranty as part of the purchase, right after the laptop; do they wait; or do they just not buy the warranty at all? And I think it's pretty obvious to everyone that if you buy a higher-priced laptop, your warranty is going to be higher priced more often than not. So this is also a consideration in terms of making the final purchase decision that we didn't necessarily incorporate in this hypothetical experiment. Okay. So this slide just goes into a little more detail about the warranty. There's a lot of information here and I'll let our audience take a look at it later, but a lot of it comes from a typical computer manufacturer's website, like Dell's. And like anything, there's a lot of language that is specific to an original limited warranty and the different terms and conditions associated with the warranty. Some things that I think I should highlight are the idea of onsite services: the fact that more and more now, we have services where a service provider will come to you and do the repair or exchange of the product at your location. There are also the customer carry-in and mail-in services, very popular now. Even within the context of Amazon, mail-in services have been around for a long time, and customer carry-in is now a part of Amazon's ability to provide service, in the context of going to, for example, a Whole Foods Market for a return. And then, of course, product exchange. But these onsite service models and mail-in service models are very interesting and much more common today. The other thing worth highlighting is that limited warranty services depend on how you purchase your original warranty policy: if warranty services are limited, those services are stipulated in the original warranty policy. Okay. So let's talk a little bit about JMP's reliability and forecasting. In the context of this particular analysis, we're going to use the Life Distribution platform in JMP to do some reliability and forecast calculations. What we're going to try to do is predict future failures of components, or of the computer as a system, to help us get a better sense of whether we should purchase a warranty, and at what time in the life of the product. We can think of this analysis in terms of what we call reliability repair cost: we want to compare the original warranty to the extended warranty protection. And of course, if the failure rate is too high, then we won't want to purchase warranty protection at all. And if the failure rate is very low, then there would also be no need for us to purchase any warranty, because we would have a product that lasted for a long, long time.
You can think of each product, or each computer in this case, as having its own unique reliability model, and that's something to think about in this context. So if you've made a particular purchase in this hypothetical scenario, you can construct a reliability model on the basis of this particular computer. Okay. So this slide shows a nice visual representation of a reliability lifecycle, if you will, or failure rate over time for a particular product. We basically have three phases: a startup and commissioning phase, which is like a burn-in phase for a product; then a normal operation phase; and then an end-of-life phase. These phases can be referred to as, first, the running-in or burn-in phase, like the infant mortality phase in the context of survival analysis for clinical studies; the normal active operation phase, which can be thought of as the phase where random failures may happen; and the end-of-life phase, which is really the wear-out period, the period before the product completely wears out and fails. Corresponding to these periods, we can consider a general range of limited warranty over time, and then a transition point somewhere, where that limited warranty becomes an extended warranty protection policy, so where we go from a short-term warranty, maybe a year, to a long-term warranty of two or three years. So like we said, if the failure rate is very low and the product is expected to last more than two years, then maybe you don't need a warranty policy, because you may plan to replace a product like this every two years anyway; it's just the opportunity cost of price versus the benefit of new technology. The important thing to think about, again, in the context of making the initial purchase, is looking at the startup and commissioning phase of the product. If we purchase an original warranty, which is what we commonly do, it usually covers maybe a year of service, and it's part of the initial purchase of the product. Okay. So now what we're going to do is switch over to a different data set. This is also sample data, which we will point you to on the community. We're going to assume that we have a database related to the Dell laptop. It lists the return months, the quantity returned, and the sold month over on the right. And so what we can do is graph this information. The Xs here are, in essence, the inputs for the failure rate, the reliability model, and how many parts we shipped; that's what these three variables speak to. And based on this probability, we can weigh whether to purchase the warranty policy or not against the repair cost. How do we choose a warranty policy based on the failure rate or return rate? We can use JMP's reliability forecasting capabilities to do this. So here's a picture of the model that we used to fit this data. We have probability on the Y axis, with time in months on the X axis. What JMP does is apply multiple models; we fit all the available models, at least those that produced valid, non-zero estimates, and what we can see here is the ranking of potential models for this reliability data. Weibull is the top choice here based on the AICc, BIC, and -2 Log Likelihood ranking, so we went ahead with the Weibull for our subsequent analysis.
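As an aside, for readers who want to reproduce this kind of distribution ranking outside JMP, a rough sketch follows. It treats the return months as exact failure times (real warranty data are typically censored, which the Life Distribution platform handles properly) and compares a few candidate distributions by AICc; the function name and the candidate list are illustrative.

```python
import numpy as np
from scipy import stats

def rank_life_distributions(failure_times):
    """Fit a few candidate life distributions by maximum likelihood and rank
    them by AICc, loosely mirroring the model ranking described in the talk.
    Failure times are treated as exact; censoring is ignored for simplicity."""
    t = np.asarray(failure_times, dtype=float)
    n = t.size
    candidates = {
        "Weibull": stats.weibull_min,
        "Lognormal": stats.lognorm,
        "Exponential": stats.expon,
    }
    ranking = []
    for name, dist in candidates.items():
        params = dist.fit(t, floc=0)              # location fixed at zero
        loglik = dist.logpdf(t, *params).sum()
        k = len(params) - 1                       # location was not estimated
        aicc = -2 * loglik + 2 * k + 2 * k * (k + 1) / (n - k - 1)
        ranking.append((aicc, name, params))
    return sorted(ranking)                        # smallest AICc first

# For the Weibull, params come back as (beta, 0, alpha); beta > 1 suggests
# wear-out, beta near 1 a roughly constant failure rate, and beta < 1
# early-life (infant mortality) failures.
```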
Okay. So in the upper left-hand corner of this slide, what we see is the actual Weibull failure probability model fit, with its parameter estimates for beta and alpha, the shape and scale. Then on the lower right-hand side of the slide, we estimated the probability of failure at specific months, again using JMP's reliability tools. So how do we tie this analysis into something practical? Well, beta is a very important parameter for the Weibull distribution. A beta less than one might indicate a product that really doesn't survive phase one, that initial burn-in phase that we showed in the bathtub curve; in that case, we wouldn't want to buy an extended warranty at all. A beta approximately equal to one would indicate a product that's in the middle of the curve, in that steady-state period, in which case we wouldn't necessarily want to buy a warranty either, because we have a very reliable product and we wouldn't want to invest money in a warranty we don't expect to use. Only when beta is greater than one, and you can see in this case it's noticeably greater than one, do we really want to consider purchasing a warranty. In this particular example, 1.6 being higher than 1.5 probably indicates that the product is entering that wear-out period, the third phase of the bathtub curve. So if we look here at, say, 35 or 36 months, our failure probability is maybe 20%. That's how we would look at this, right? You go over, you look at the time, and you read off the failure probability in the first two columns of the estimated probability output there. In this particular example, it's not completely clear whether we should purchase a warranty or not, and we likely need more information. If we had seen a beta in the three to four range, that would probably suggest we would want to purchase an extended warranty, because we would anticipate that wear-out would be inevitable. In the next few slides, we're just going to derive a simple decision model to go with this reliability analysis, in the spirit of the choice analysis, the choice modeling methodology, that we applied at the beginning. The consumer decision model has to consider a number of factors: the survival probability at each month and the failure probability at each month, which are, in effect, two views of the same thing; the market value of the laptop each month between one year and three years, since, of course, price depreciation happens; and the monetary loss, if not purchasing the extended warranty protection, should a repair be needed. Then we want to compare that monetary loss to the expense of purchasing the warranty, so it's like a cost-benefit analysis, or a risk analysis, that we're making. So we show here, on this slide, the months after purchase, and then the survival probability at a particular month, the survival probability of the prior month, and then the conditional failure probability at that month. We just use simple conditional probability to calculate that column, the one I should have indicated with the two, all the way over on the right. What we're doing here is using the previous Weibull estimate at each month to calculate the conditional failure probability at the subsequent month. All we're really doing is using the Lag function to generate that column: we take the difference between the survival probability at each month, let's say month 13, and its prior month, month 12, then 14 and 13, and so on, and divide by the survival probability of the prior month. This is conditional probability.
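To make the conditional-probability step explicit, the standard formula (using the month numbers from the example) is:

$$
P(\text{fail in month } t \mid \text{survived through month } t-1) \;=\; \frac{S(t-1) - S(t)}{S(t-1)},
$$

where $S(t)$ is the fitted Weibull survival probability at month $t$. The month-13 entry, for instance, uses $S(12)$ and $S(13)$, and the lagged column $S(t-1)$ is exactly what the Lag function supplies.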
Okay. So let's talk about the market laptop value, which is really the second thing we discussed on this extended warranty protection slide. We can create a simple linear model to describe the decline of the value over time, and that's, in fact, what we did. For the slope of that model, we used four points, and the slope indicates the percent drop every month on average. You can see the slope is about 1.3%, so we expect about a 1.3% drop in value per month on average. Note that we only really care about the decline from 12 months onward, because the first twelve months are typically covered under the original warranty. Okay. So how do we model the cost of not purchasing the extended protection? Well, we can compare to a two-year warranty policy, with failure at two years as the worst case. If we look at the extended warranty protection plan at two years versus three years, $102 versus $141, we get a delta of around $40. Similarly, the cost of not purchasing the two-year protection plan is $48 at two years versus $95 at three years, so that difference is on the order of $40 to $50 as well. What this shows us is that there's maybe a $50 gap between the warranty plan and the estimated cost, and that may be attributed to the services and other fixed costs. But really, to make the best decision about whether to purchase a warranty or not, we want to consider the cost of not buying a warranty in this framework together with the magnitude of beta. Okay. So this is nearly the end of our analysis, and I wanted to just highlight one other thing here. We can use the forecast capability to show us how to determine the return rate and what resources we need, and a lot of that depends on the performance of the service, how good the service is. If we have too many returns and we don't forecast, we may not have enough technicians to do the work. So this is the type of analysis where we're considering the producer; it is from the standpoint of the service provider and the producer, whereas in the prior analysis we were considering everything from the perspective of the purchaser or the consumer. So this is really a producer cost model: if they don't purchase the warranty, what's the labor cost and what's the material cost, that is, the labor cost to handle all the repairs and the material cost to replace parts for repair? We can see that there's a slight upward-sloping trend on the long-term repair forecast, and that trend really tells us what the value proposition is. As a manufacturer, you may be making revenue in the beginning, but you may lose money in the long run if you're doing significant repair work, or, as I said before, you may not have the capacity to do the repair work that you're obligated to do because of the reliability problems with the product.
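One plausible back-of-the-envelope formulation of the consumer-side comparison described above, offered as our reconstruction rather than the exact calculation on the slide (the prices and the roughly 1.3% monthly depreciation are the example values from the talk), is sketched below.

```python
def expected_loss_without_warranty(surv, price0=1000.0, monthly_drop=0.013,
                                   start_month=13, end_month=24):
    """Rough expected out-of-pocket loss if no extended protection is bought.

    surv : mapping from month t to the fitted Weibull survival probability S(t),
           including months start_month - 1 through end_month.
    For each month after the original 12-month warranty, the loss from a
    failure is approximated by the linearly depreciated market value of the
    laptop, weighted by the probability of failing during that month.
    """
    loss = 0.0
    for t in range(start_month, end_month + 1):
        p_fail_in_month = surv[t - 1] - surv[t]           # P(fail during month t)
        market_value = price0 * (1.0 - monthly_drop * t)  # simple linear decline
        loss += p_fail_in_month * market_value
    return loss

# Hypothetical comparison: if this expected loss over months 13-24 is well
# below the $102 two-year plan (or, over months 13-36, the $141 three-year
# plan), skipping the extended protection looks reasonable; if it is
# comparable or larger, the plan may pay for itself.
```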
Okay, so just to conclude, I wanted to share the overall key learnings of our project. We used the STEM, or STEAMS, framework to break this project up into a number of different elements and apply an interdisciplinary approach. We used Choice Design to help consider survey design methodology as well as the analysis of survey data. We also augmented our design with a reliability performance model to qualify our purchase and whether or not it was a good purchase. Of course, the project, in the context of the co-authors, including Mason Chen, was very useful for motivating high school students at SOHS, and teachers, to learn new methods. The final thing is that one thing we could consider in the future is increasing the number of levels to choose from, which would bring our model into more of a traditional modeling framework, a framework that's more like a least squares regression model or another popular modeling framework that looks at continuous data. And so in closing, I just wanted to highlight our references and our statistical details; we're definitely going to provide those to you. Thank you very much for your time, and I look forward to any questions.
CLECIM® Laser Welding Machines Finding the optimal parameters for laser welding of steel plates with JMP Stéphane GEORGES R&D and Data Science Project Manager, Dept. of Technology and Innovation Clecim SAS., 41 route de Feurs, CS 50099, 42600 Savigneux Cedex, France Purpose – To be considered good, a weld bead must meet two criteria: it must be free of defects (such as spatter, humping, underfill, holes, etc.) and resistant (assessed by means of an Erichsen-type cupping test). The search for the optimal parameters for laser welding steel plates is already extremely demanding due to this double constraint. But if, on top of that, you consider the productivity of the processing line and the quality of the incoming material, then the task becomes a challenge! Approach – And it is precisely this challenge that was overcome with the use of JMP. To achieve this result, many steps were implemented, all of them requiring the use of a JMP platform or feature: Base material strength analysis, qualification of the two plates to be welded [Graph Builder, Map shape, ANOVA, Dashboard] Synthesis of the visual observations, production of the weld defects map, which determines a study area of irregular shape where the weld seam is flawless [Graph Builder, multiple-picture hover labels] Weld strength analysis and optimization on a non-homogeneous material and on the defined defect-free zone, given as a set of candidate points [Custom Design, split plot, covariates, uncontrolled factor, Fit Model, Prediction and Contour profilers] Findings and Value – For the given material, the objective was achieved, since all the steps allowed us to propose and validate a set point with maximum productivity and a good weld, both defect-free and resistant, JMP being pliable and able to adapt to all the constraints of the process and the material.   Key words: JMP, laser welding, design of experiments, DoE, covariates, LW21M, LW21H   Hello everyone. Thank you for attending this experience-sharing session on JMP. My name is Stéphane Georges, I'm an R&D project manager at Clecim, and I'm very keen on data science. Today, we'll talk about the design of experiments methodology, and more specifically about a case we encountered when trying to find the optimal parameters for a laser welding process. During this presentation, I will show you how we used various JMP platforms, and how JMP adapted to the reality of the field by taking into account a very irregular study area and a very imperfect study material. Without further delay, I will start my presentation by telling you who we are and what we do. Clecim is an engineering and production company making equipment for the steel industry. We are located in Montbrison, France, in the surroundings of Lyon. The history of Clecim is not new: five years ago we celebrated our 100th anniversary. The area of the site is about 12 soccer fields, and 230 employees work there, mainly managers and technicians. As we like statistics when working with JMP, here is a first one: our population is composed of about 80 percent men and 20 percent women. Concerning Clecim's activities, our first activity is studies and consulting for our flat steel producer customers. We supply individual machines, or we supply complete production lines such as pickling lines, annealing lines, galvanizing lines, painting lines, and so on. We also have a services activity for the supply of spare parts, for expert missions, maintenance, and so on.
I put on this picture a typical layout of a galvanizing line, to give you an idea of such a processing line. This one is dedicated to the automotive market, and the length of such an equipment is about half a kilometer, so a very large industrial plant. When I talked previously about machines, I had in mind rolling equipment such as rolling mills, plate levellers, automated strip surface inspection systems, and even laser welding machines. It is on this last piece of equipment that we are going to focus right now. I will now talk to you about the autogenous laser welding process; autogenous means without filler wire. I will also talk about the parameters and factors that govern this process. But first of all, I would like to introduce our machine, the subject of our study. On the left part of the slide, you can see our welding machine, or in fact not only our welding machine but its containment; the machine is inside the containment for safety reasons, because we are using a laser. The dimension of the door here gives you an idea of this welding machine at scale one: it is a huge industrial machine, our welding machine. On the right part of this presentation, you can see a partial view of the inside of this welding machine, where you can see the clamps and the [inaudible 00:04:37] of the machine. Inside, you see the two portions of the strip, the head and the tail of the strip, that will first be cut, also with the laser, and finally brought together in order to be welded. I will now talk to you about our targets and constraints. Of course, our objective, our target, is to have a good weld. To do that, we need to achieve two objectives. The first one is to have a weld seam which is defect-free. Here on this slide, I put an example of such a weld; this is picture number one. You can see on this picture that the weld seam is quite nice, without any defect. When I talk about defects, here is a list of the typical defects that we encounter when trying to weld with a laser. Typically, we could have some spatters; this is picture number two, a top view. In such a case, this is molten material which is ejected from the top of the weld. We could also have a chain of pearls; this is picture number three, this time a bottom view, and these are droplets at the bottom of the weld seam. We could also have other defects, such as humpings, underfillings, or even holes; this is picture number four here. Typically, this is the case when we have a very low travel speed and a very high power density: instead of welding, we are drilling into the material and we create holes. Of course, this is what we absolutely want to avoid; otherwise, we will decrease the resistance of our weld. This is a transition to the second objective, because we want not only the weld seam to be perfect, but we also want it to be resistant. This is evaluated via an Erichsen-type cupping test, which I will describe a little bit later. Our target is to have a strength as close as possible to the one of the base material. I will now talk to you about the laser welding parameters, the factors governing the process. On the left part of this slide, I put a very schematic view of the process. In gray at the bottom, you can see the two pieces of material that we want to weld together, which may or may not be of the same nature and thickness. In yellow is the laser welding head, which is connected, in blue, to its laser source.
To give you an idea of the kind of power that we use for such an application, imagine that a laser pointer, typically used for a presentation, has a power of just one milliwatt. The laser source we use here is 12 million times more powerful than such a small device, on the order of 12 kilowatts. This is just to explain that we have a very large amount of power available, which is what we need to cut our material and also to weld it. On the right part of this presentation, I put the typical process parameters, which are, first of all, the laser power, the travel speed of the welding carriage, the focusing distance, the gap between the plates, the thermal treatment that we can apply afterwards, and so on. But in fact, for simplicity reasons, in the rest of this presentation we will focus only on the two main ones, which are the laser power and the travel speed. We will also consider that the materials are identical and of the same thickness. You will see that just with these two parameters we will have enough to do. Okay, so the picture is set. We have two targets: one is to have a weld seam free of defects, and we also want it to be resistant. We are now going to focus on our case study towards a good weld. Our first target is a weld which is defect-free, so we are going to search for what is called the weldability lobe. To do that, we need to get some data, and to get these data we will use the so-called power jump procedure. In that case, nothing to do with JMP, even if JMP is a powerful software; this is just how the procedure is called. The picture at the bottom gives you an example of such a procedure. At a fixed speed, we perform 11 successive power jumps; in that case, we switch from two kilowatts to eight kilowatts to three kilowatts, and so on. The target is to reduce the number of welds we have to do: in just one weld, we will have 11 samples and 11 observations to make. Afterwards, we visually examine the upper part of our bead and the lower part of our bead for each slot of this sample. All of these data are collected into JMP, and we use the Graph Builder platform in order to display this map. This is what I'm going to show you right now. I go to my JMP journal here. We have four steps to follow, and this is our first step: building the weldability lobe. I will open my table. I collected all the data in this table. I have all my parameters here, the laser power and the welding speed. In these columns, I inserted my visual observations: Is there any penetration? Yes/No. Do we have material loss? Yes/No. Humping? Yes/No, and so on. At the end of this file, as you can see, I also added two additional columns of expression vector type, where I have inserted the pictures of all my observations. As you can see here also, I requested to have this information displayed in the hover label. Now we are ready to open our Graph Builder. Here I can launch it, okay? All the data have been collected in this map. You can see that on the X axis I put my welding speed, and on the Y axis I have my laser power. I have associated a color or a shape with each defect. Also, we can have a combination of color and shape, which is a convenient way because we can overlay four different types of defects at the same time for each point. This is what I'm going to show you. For instance, if I take this blue point here, according to the legend, we have top spatters. This is exactly what my pictures show you.
Here, this is the picture of the upper part of my weld; here is a picture of the lower part of my weld. Here we can see that we effectively have top spatters, whereas the bottom part is defect-free. For instance, if I take another one, if I take this purple one here. So purple, this is the association of the blue defect and the red defect. We have top spatters and bottom spatters, which is effectively what we can see here on that picture. This is a convenient way to check that I have effectively made no mistake and to see the magnitude of the defect. This feature using the pictures is also very convenient because, for instance, if I take this point here and I pin the pictures, and if I take this additional point here and I also pin the pictures, we can see that, for constant laser power, I can compare the pictures and see what the effects are when varying the welding speed. In that case, when we increase our welding speed, we can see that the width of our weld seam decreases both on the top and on the bottom. This is a convenient way, let's say, to dig into the understanding of our process. I will close that. I'll come back to my case study. We are interested in the good weld area. This is the area that I'm going to highlight here. This is the black area. Okay, like that. Okay, this is our area of interest. Now, what we want to do is, let's say, to investigate the behavior of our weld seam from the resistance point of view in that particular area. Of course, we want to do it with a minimum number of tests, and we will perform a design of experiments on this very irregular study area. Okay, so I go back to my presentation. But before entering into the conception of the design of experiments, we have another interesting topic to address with JMP, because we first have to study the strength of our base material. We have to evaluate the base strength of the material via an Erichsen-type cupping test. Why are we doing that? We have three targets, three objectives. The first one is to establish a reference from the strength point of view. In that way, we will be able to compare the resistance of our base material with the resistance of our weld seam. This is the first point. The second point is to be able to compare the two pieces of material that our customer sends us. We want to be sure that these two pieces have the same behavior. For that, we have to ensure that they can be considered comparable. The last point is that we also want to check that the plates are homogeneous from the resistance point of view and that they do not present any resistance profile in their width or in their length. We do our Erichsen-type cupping tests on the base material. This is what is highlighted here in the first three pictures. We do not have any weld; we are just performing this Erichsen-type cupping test. For simplicity reasons, I will call this procedure the ball test in the remaining part of the presentation. These three ball tests, we do them at three different positions on the material: one located at the center of our sample, one located on what we call the drive side of the machine, and one located on the operator side of the machine. For one sample, we do three tests. What is a ball test? In fact, this is explained in the pictures located at the bottom of this slide. We simply take a ball made of titanium, we press it from the bottom, and we register the deformation of the material and the breakage force.
We do that for our two plates, and we register all of that in JMP and analyze the results in JMP. We will use the Distribution platform and the Fit Y by X platform. This is what I'm going to show you right now. I go back to my JMP journal. This is our second step, analyzing the base material. I will open my file. Here, I put all my data: my plate ID, plate number 1 and 2. For each plate, I do that twice. For each sample, we perform the measurements at three different locations, and here are the recorded values, the recorded strengths. We will analyze all of that, and I store everything into a dashboard. First of all, it is interesting to see our results visually. I will focus, first of all, on this custom map shape. In blue, you have the data for the first plate, where I have my first sample and second sample. For each sample, I have my three ball tests, one located on the operator side, one located in the center, and one located here on the drive side. We have the same for the second piece of material. What we can see here is that the resistance spans the following range, from nine to 10.4 tons. Nothing really particular to see, except maybe that here on the operator side we have, on the same side, the extreme values. Here is the lowest value and here is the maximum value, so maybe that will be something to look at, but we will come to this a little bit later. Our target is to perform an ANOVA in order to see if our two plates can be considered comparable. But before doing an ANOVA, we need to ensure that our data follow a normal distribution and that our variances can be considered equal, so this is what we are going to do right now. Here are the distributions for plate number 1 and for plate number 2. Okay, I know that I do not have a lot of data, but we will consider that we have enough to perform the tests. We will look at the two Anderson-Darling statistics here. What the p-values tell us is that we cannot reject the hypothesis that the data are normally distributed, so that's good. Then, concerning the variances, we use another platform, but we go directly to the end: we perform the variance analysis here, and we will look at the F test, and the F test tells us that we can consider our variances as equal. As our data are normally distributed and our variances are equal, we can safely apply our ANOVA. This is what is mentioned here. On the top part, you have the drawing. Here are the associated data. I will not focus on the data. We will just have a look at the pictures. Here, what we can see is that the extremities of the two diamonds overlap. We cannot conclude that our two plates are different, so this is good. We therefore reach the conclusion that our two plates are equivalent. We can now aggregate all the data. This is what is done here in the distribution. We put all our data together, and finally, we have a global resistance of our plate of 9.76 tons plus or minus 0.17 at two standard deviations. This is the first point, and we will use this information a little bit later. Another interesting thing, and this is what is mentioned here, is that we can also perform the ANOVA taking into account the position. And, as we have previously observed, we can see that the variation on the operator side is a little bit higher compared to the drive side and the center of our plate. We want to understand a little bit why such things happen. To do that, I will come back to my presentation, and we will have a look at the plate we are studying at the moment.
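As a rough equivalent of the checks just described (normality, equal variances, then ANOVA on the two plates), here is a minimal sketch using scipy; the cupping-test values are invented, and the Anderson-Darling call returns the statistic and critical values rather than the p-value shown in JMP.

```python
# Minimal sketch of the base-material checks: normality, equal variances, one-way ANOVA.
# The strength values (in tons) are made up for illustration.
import numpy as np
from scipy import stats

plate1 = np.array([9.6, 9.8, 9.7, 10.0, 9.5, 9.9])   # 2 samples x 3 positions, invented
plate2 = np.array([9.7, 9.9, 9.6, 9.8, 9.8, 10.1])

# Normality: Anderson-Darling statistic per plate.
print(stats.anderson(plate1, dist="norm"))
print(stats.anderson(plate2, dist="norm"))

# Equality of variances: two-sided F test built from the sample variances.
f = plate1.var(ddof=1) / plate2.var(ddof=1)
dfn, dfd = len(plate1) - 1, len(plate2) - 1
p_var = 2 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
print(f"F = {f:.2f}, p = {p_var:.3f}")

# One-way ANOVA: can the two plates be considered equivalent in mean strength?
print(stats.f_oneway(plate1, plate2))
```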
Here is the appearance of our sheet metal. What we can see is that on the drive side here, our plate is nearly flat, I would say. But on the contrary, on the operator side, we clearly see that the plate has some waves. I do not know exactly what the history of this material is, but we can clearly imagine that there was trouble at the rolling mill or the plate leveller and that a higher force was applied on the operator side, leading to this kind of periodic modification of the resistance. This is a new constraint, because we have to take this new information into account in our design of experiments. I will sum up all the information we have before building our design of experiments. First of all, I remind you that we have a very irregular study area. This is the black area that is shown here in the drawing. The traditional way to deal with such things in JMP would be to fill in linear constraints. But here, due to the shape of this area, it's a little bit difficult. Instead, we prefer to use the candidate points technique, which is also called covariates. I remind you also that we have an inhomogeneous plate, and to deal with this phenomenon, we will have to introduce a few parameters into our design of experiments. First of all, we have to take into account the strength variation in the width. To do that, we will introduce a categorical parameter, a 3-level categorical parameter, and the three levels are drive side, center, and operator side. In order to deal with the periodic variation of the resistance along the length of the plate, well, this is a little bit difficult, because, in fact, we do not control this parameter. We have to undergo this variation. For that, we will introduce the weld position from the head of the plate, in millimeters, and we will introduce it as an uncontrolled parameter. Finally, this is not finished. This is what you can see in the last picture at the bottom. This is typically a picture of a weld, with the three ball tests located above it. Well, these ball tests are not independent. They belong to the same treatment. They belong to the same weld. They are at the same weld position. We are in the presence of a split-plot design, where we have hard- and easy-to-change parameters. This is a lot of constraints we have to take into account. Now, I will show you how to do that with JMP, as sketched below. I can go back to my JMP journal, so this is our third step. But I will show you from the beginning, and I will go back to this step. I come back to the file I had previously. I will select here all the rows with a good weld. I will also select my laser power column and my welding speed column. In the Tables menu here, I will extract a subset of this table. I will extract the selected rows and the selected columns here. Okay, I will build my subset of candidate points. But here is the tricky thing, because, in fact, as we also have a split-plot design, I need to tell JMP that it will have the possibility to select each point three times. I will multiply this number of points by three. There are probably a lot of ways to do that. In my case, I will just create three columns, one called drive side, one called center, one called operator side, and I will just stack these three columns. Here I have created my set of candidate points. With the Graph Builder, I will check that everything is okay. On the Y-axis, I will put the laser power. On the X-axis, I will put the welding speed.
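A minimal sketch of this candidate-set construction outside of JMP, assuming the flaws-map table has been exported with hypothetical column names good_weld, laser_power_kW and welding_speed_m_min; the cross join plays the role of the three stacked side columns.

```python
# Minimal sketch of building the candidate points, with hypothetical file and column names.
import pandas as pd

flaws = pd.read_csv("welding_flaws_map.csv")          # hypothetical export of the JMP table

# Keep only the defect-free (P, V) pairs of the weldability lobe.
good = flaws.loc[flaws["good_weld"] == 1, ["laser_power_kW", "welding_speed_m_min"]]

# Replicate each candidate point once per side, so the design can pick a point up to
# three times (drive side, center, operator side) -- the "stack three columns" trick.
sides = pd.DataFrame({"side": ["DS", "C", "OS"]})
candidates = good.merge(sides, how="cross")

print(len(good), "good-weld points ->", len(candidates), "candidate rows")
candidates.to_csv("candidate_points.csv", index=False)
```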
Here, we can recognize the shape of our irregular study area. I will also put the label here into the Color section. Each of my points has been multiplied by three. This is exactly what I wanted. I can use this as a starting point. This is my set of candidate points. I could have added additional points. I could have refined all of that a little bit. I could have added some points here, for instance, and so on. But to be honest, the discretization steps are fine for me. I will keep all the points like that. I will now open my design of experiments menu and go to my Custom Design platform. In the Responses, I want to measure the strength of the material. In the Factors, I will add my first parameters, laser power and welding speed, as Covariate. I select Covariate, and I select laser power and welding speed. Automatically, JMP fills in the lowest and highest values for these two parameters. I will also add my 3-level categorical parameter. I call it side, and my levels are drive side, center, and operator side. I do not forget to also add my uncontrolled parameter. This is the position of the weld. This is an uncontrolled parameter. I do not know the limits, so I put nothing in these boxes. I also do not forget to change this here in order to take into account the split-plot feature. I mention here that my laser power and welding speed are hard to change compared to the side. Everything is correct here, so we can run. Concerning the constraints, I do not need this area, because I already took my constraints into account when selecting my covariates, so I do not need anything here. Concerning the model, I immediately choose RSM. But in that RSM, I will suppress the interaction of the laser power with the position and the interaction of the position with the welding speed, because there is clearly no interaction at all. I will immediately click on the Make Design button because it will take a moment. Here we can see that JMP proposes that I perform eight tests with 24 measurements. This is fine for me. I keep these default parameters. I make my table here. Okay. This is the design concerning the laser power, welding speed, and side. I have the columns where I will record the position of my weld and where I will record the strength, so everything is okay. I will now visualize the points in the Graph Builder. I put, once again, the laser power on the Y-axis and the welding speed on the X-axis. Perfect, and we recognize the curve. I will also put the side here into the Color area. I will add some details. Okay, here we can see that these are the points that have been selected by JMP within the framework of our candidate set. For each point, I have to perform three measurements, on the drive side, center, and operator side, and JMP wants me to perform this test twice. Okay, so this is what we have done. I will now show you the results. I go back to my JMP journal here. This is our last step, the results analysis. I will open the associated table. Here, this is the same table as previously. I have recorded the positions. I have recorded the strength in absolute value, in tons. I have inserted two extra columns here. This is our reference strength, the strength of our base material that we have previously determined, 9.76 tons; in fact, I use it in order to create this extra column, which is the strength in percent compared to the base material. It is this last column that we will take into account in our optimizations. I store my analysis into a dashboard here. Here are the results.
As a reminder, I put the irregular area on the right part. Here are the experimental points that we have performed. I have colored the points using the strength in percent. I use the Fit Model platform in order to create my model. Step by step, I have suppressed the non-significant interactions of our parameters using the p-values here. Finally, I have a model with an explicative power R² of 96 percent, which is quite good because, in fact, it means that only four percent of the whole variation escapes our prediction power. Concerning the collinearities, if I look at our VIFs, our variance inflation factors, all of them are below three. We can now be confident in our model and we can use it in prediction. We can go to the prediction profiler here. First of all, I will focus on this part, on the interaction of the position with the side. What you can see here is that, if I move the position of the weld, it seems that the resistance is not sensitive to the position on the drive side and in the center. But on the contrary, on the operator side, the resistance is clearly influenced by the position. This is exactly what we have seen on our plate. This is a modeling of our waves. Here we do not see this kind of shape, but this can be easily explained. Our welds are not so long; each sample is only six centimeters. We do not consume a lot of material, so we do not follow the whole wave; we go from the bottom to the top of it only. We are quite happy, because we were able to analyze and to model this behavior correctly. This is interesting because we can now have access to the pure effect of the laser power and welding speed. This is what is mentioned here. For this particular material, we can conclude that we can increase the resistance by decreasing the laser power or by increasing the welding speed. Now what we have to do is to determine an optimal point using all the information we have. To do that, we prefer to use the contour profiler. This is what I have mentioned here. In this contour profiler, once again, I put the welding speed on the X-axis and the laser power on the Y-axis. I have reproduced my area, my irregular area where I am defect-free. To do that, I have simply implemented a script. Here is a list of points, and I just asked JMP to use these points and to draw a polygon. I have my black area where I am defect-free. On that drawing, I have also inserted the iso-resistance curves. Here in red, you can see the values of this iso-resistance. We can see that, here, we go from 50 percent to nearly 100 percent. Before doing the optimization, I will add another constraint, because this is not enough. From the productivity point of view, we want, of course, to go as fast as possible, and we want to have the highest welding speed. Using all that information, we have selected the point at six kilowatts and 11 meters per minute. This is the point that is located here. Why? Because this point is located in the black area, where the weld seam is free of defects. We can see, using our model, that we expect this point to have a resistance of at least 90 percent of the resistance of the base material. What is also interesting is that at this point we have enough safety margin around it. Of course, we have tested this point, and this is the result that I'm going to show you right now. I go back to my presentation, which is located here. This is the result of the optimal point we have chosen.
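As an illustration of this lobe-plus-iso-resistance overlay, here is a minimal sketch in Python. The quadratic strength model and the lobe vertices are invented stand-ins for the fitted JMP model and the real weldability lobe, and the last lines simply pick the fastest in-lobe grid point above 90 percent.

```python
# Minimal sketch: overlay iso-resistance curves of an assumed quadratic strength model
# with a weldability-lobe polygon, then pick the fastest feasible point above 90 %.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.patches import Polygon

def strength_pct(P, V):
    # Hypothetical RSM fit: strength (% of base material) vs power P (kW) and speed V (m/min).
    return 95 - 4.0 * (P - 6) + 1.0 * (V - 10) - 0.3 * (P - 6) ** 2 - 0.1 * (V - 10) ** 2

lobe = np.array([[5, 4], [9, 5], [14, 7], [14, 10], [8, 9], [5, 6]])  # (V, P) vertices, made up

V, P = np.meshgrid(np.linspace(3, 18, 200), np.linspace(2, 12, 200))
S = strength_pct(P, V)

fig, ax = plt.subplots()
cs = ax.contour(V, P, S, levels=[50, 60, 70, 80, 90, 95], colors="red")
ax.clabel(cs, fmt="%d %%")
ax.add_patch(Polygon(lobe, closed=True, fill=False, edgecolor="black", linewidth=2))

# Keep only grid points inside the lobe with strength >= 90 %, then take the fastest one.
inside = Path(lobe).contains_points(np.column_stack([V.ravel(), P.ravel()])).reshape(V.shape)
ok = inside & (S >= 90)
i = np.argmax(np.where(ok, V, -np.inf))
print("candidate preset:", V.ravel()[i], "m/min,", P.ravel()[i], "kW")

ax.set_xlabel("Welding speed (m/min)")
ax.set_ylabel("Laser power (kW)")
plt.show()
```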
First of all, on this slide, you can see on the left part the upper weld bead pictures and on the right side the lower weld bead pictures. You can see that the weld seams are free of defects. We have no spatters, no droplets, no chains of pearls, no holes, et cetera. This is what we wanted. From a resistance point of view, first of all, if we focus on the pictures, we can see that it is the material that breaks and not the weld seam that opens, so this is the first good point. Concerning the resistance, we can see that each of them is higher than 90 percent. This is exactly what we wanted, so let's say that we have achieved our target. This is the end of this presentation. It is now time to conclude. First of all, going back to my JMP journal, I would like to mention that with this presentation you have the possibility to download an article that will be located on the website. Here you have a full article explaining the whole case study. I have added additional material. If you are interested in knowing more, please feel free to download it. As a conclusion, I would like to mention that if some of you are interested in knowing more about covariates, there are two available sources. The first one is an article that you can find on the JMP user community entitled "What is a covariate in design of experiments?" Also, from the same author, you have a webinar entitled "Handling Covariates Effectively when Designing Experiments." To conclude, I put here a quote from Mark Twain that humorously tells us that "Facts are stubborn things, but statistics are pliable." Inspired by Mark Twain, I would like to say that facts are certainly stubborn things, meaning complex, but don't panic, because, in fact, JMP can easily adapt to the reality of the field. In my case, it was able to adapt to a very irregular study area and also to a very imperfect study material. This is the end of my presentation. I will now answer your questions, so please feel free to ask. If we run out of time, I mention here my contact information, so please feel free to contact me. I'm now waiting for your questions. Thanks a lot. Bye bye.

1. Background and introduction
Steel strip manufacturing constantly reinvents itself by proposing new metallurgical concepts, which requires tackling the technical limitations of production systems. As a provider of mechatronic solutions for steel plate processing, Clecim SAS recently expanded its laser welder line with a next-generation machine capable of cutting and welding heavy plate using a 12 kW laser source. Addressing the usual drawbacks in maintenance, operation and safety of current welding systems based on mechanical cutting and CO2 laser welding, the newly developed LW21H (Heavy) welding machine benefits from a smarter approach by processing thicker strips up to 9 mm with solid-state laser cutting and welding. This new generation of welders, heir to Clecim SAS' 20 years of experience in welding and in particular its little sister, the LW21M (Medium), pushes back the current limits of performance and the technological drawbacks observed in solutions for thicker materials. It is materialized by a 1:1 scale pilot designed, manufactured and tested in Clecim SAS workshops [Figure 1].

Figure 1a – An external view of the containment of the heavy laser welder at the Montbrison workshop. The size of the door gives an idea of the dimensions of this industrial welder.
Figure 1b – A partial view of the inner part of the machine.
Head and tail of the two plates will be cut by laser technology and then welded together.

In 2019, the laser cutting process was extensively studied and the use of machine learning techniques allowed for the conception of a model capable of delivering robust cutting presets across the thickness range. Today, the focus is on the laser welding process and the acquisition of high-quality data that will soon allow the creation of a welding model, the final step towards a completely automated machine. To achieve a good weld, two criteria must be met: a weld seam free of defects (such as spatters, droplets, etc.) and a good strength. To reach this result on a given material, several steps have been followed:
- The determination of the welding flaws map and the weldability lobe, the area where the weld seam is defect-free
- The determination of the base material strength, to ensure that the pieces of material are identical and homogeneous
- The analysis and modeling, via a DoE, of the weld seam strength on the previously determined weldability lobe, which usually has a highly irregular shape

Let's now dive into the details of these exciting steps, all of them requiring the use of a JMP platform.

Notation
M - Material type
H - Material thickness
P - Laser power
V - Travel speed of the welding carriage
F - Focusing distance
G - Gap between the plates
T - Thermal treatment

2. Laser welding process and factors
The welding process is made of 3 parts: the two plates to be welded, which can be of the same nature and thickness or not, the laser welding head mounted onto a travelling carriage, and its 12 kW laser source. To give an idea of the delivered power, a classic laser pointer used for a presentation typically has a power of 1 mW. In comparison, the laser source used by Clecim SAS to cut and weld the pieces of material is 12 million times more powerful.

Generally speaking, the influencing parameters of laser welding belong to two categories, namely those related to the material to be welded itself, such as its nature M and thickness H, and those related to the process, such as the laser power used P, the speed of the welding carriage V, the focusing distance F, the heat treatment T or the spacing between the sheets G. To a lesser degree, other parameters are involved, such as the inclination of the laser welding head, the type of shielding gas, its pressure, etc. Within the framework of this paper, only the laser power P and the travel speed of the welding carriage V will be considered. The materials to be welded will be identical and of the same thickness.

To put it in a nutshell, for the given pieces of material (M, H), two factors (P, V) have to be optimized with the goal of getting a flawless and resistant weld seam.

3. Weldability lobe
Figure 2 – Welding flaws map – JMP Graph Builder is used to view the weld defect map. The major defect areas can easily be recognized: partial penetration (yellow), holes (orange), spatters (blue, purple), chain of pearls (horizontal stripes), defect-free area (black).
Pictures of the top and bottom weld seam are displayed in the tooltip area when moving the mouse over them.

The first step of the experimental approach consists in performing tests in order to build a map of welding defects and thus determine the weldability zone, i.e. the defect-free zone. Depending on the thickness of the material, the number of tests to perform can quickly become large. Indeed, the goal is to test all the pairs (P, V) and to visually observe the quality of the weld bead to know whether the combination (P, V) generates a defect or not. In order to drastically reduce the number of tests and to save time, the so-called "power jumps" procedure is used. In a single trial, at fixed speed, 11 power jumps, from 2 to 12 kW in 1 kW steps, are carried out, giving the possibility to perform 11 tests in one. Regarding the welding speed, steps of 2 m/min were used from 3 to 18 m/min. In the end, the upper and lower parts of 88 weld seams were visually inspected and qualified. The results were stored in a JMP table and evaluated using the Graph Builder platform [Figure 2]. The welding speed V is shown on the x-axis and the laser power P on the y-axis. For a given speed, we find the 11 visual observations corresponding to the 11 power jumps of the test protocol. Thanks to the association of a color and a shape in a single marker, it is possible to represent four welding flaws at the same time and hence to visualize the major defect areas. For each pair (P, V), pictures of the top and bottom weld seam have also been taken and stored in two expression/vector columns so that they appear simultaneously in the tooltip area. By moving the mouse over the points, the pictures are displayed. This functionality makes it easy to compare the influence of a factor change on the weld bead facies and thus to progressively enter into the understanding of the laser welding process.

4. Base material strength analysis
Before going further in the analysis of the welds, it is necessary to evaluate the strength of the base material, and this for 3 reasons:
- The first reason is to establish a reference strength so that we can make comparisons.
- The second one is to make sure that the 2 plates sent to us by our customer are comparable.
- The third one is to make sure that the plates are homogeneous and that they do not have any resistance profile in their width, for instance.

To do that, Erichsen-type cupping tests are performed on plates without any welds. Stamping is done via a ball and the breakage resistance is automatically recorded. The protocol provides for three measurements across the width of the plate. Positions are respectively DS (drive side of the welding machine), C (center) and OS (operator side). The various results are stored in a JMP table and summarized in a dashboard [Figure 3].

Figure 3 – Base material strength analysis – The dashboard is composed of various JMP platforms: Graph Builder, Distributions and ANOVA. The custom map shape[1] of the Graph Builder displays the two samples corresponding to each of the two plates and the position of the various cupping tests, colored by strength.
In the ANOVA, the overlap of the two diamond tips shows that the plates can be considered identical. The chart on the right shows that the strength variance is higher on the operator side (OS). Once aggregated, the data in the bottom distribution show an average strength of 9.76±0.18 tons (at 2σ).

In summary, the two plates to be welded can be considered identical, but further investigation is needed to understand why the strength variance is higher on the operator side.

Figure 4 – Appearance of the plate – The plates present a relatively flat aspect on the drive side and waves on the operator side. The history of the plates is unknown, but there must have been a rolling or planishing issue with a higher force applied on the operator side, which created this appearance and a periodic modification of the strength.

To understand the differences in resistance on the operator side, it is necessary to pay attention to the visual aspect of the plate [Figure 4]. Due to potential force variations during its treatment, the plates are inhomogeneous in terms of strength in their width and length.

5. Weld bead strength analysis
The construction of the test plan requires taking into account all the various constraints, 4 in number:
- The first constraint is related to the irregularly shaped region[4-Ch.5] of the weldability lobe. The traditional way to handle it would be to delimit the study area using multiple linear constraints. Although possible, it is the technique of the candidate points, also called covariates[2,3,4-Ch.9] in JMP, that has been chosen for simplicity's sake.
- The second one, due to the plate inhomogeneity, is related to the strength changes across the width. To take this effect into account, a 3-level (DS, C, OS) categorical parameter is envisaged.
- The third one, also due to the plate inhomogeneity, is related to the periodic and incurred strength changes along the length. This parameter cannot be controlled, but it must nevertheless be considered in the future test plan.
- Finally, the fourth one is related to the fact that the 3 values of the categorical parameter are not independent, since they belong to the same treatment (i.e. weld). Consequently, a split-plot design[4-Ch.10] with hard- and easy-to-change parameters has to be considered.
Figure 5 – Building of the custom design of experiments – The Custom Design platform allows the creation of a completely customized test plan. The Responses part provides the list of responses to be optimized; in this case the goal is to look for the maximum strength. The Factors part presents how the four constraints have been addressed. As the position is uncontrolled, no values are entered for its limits. The Model part displays all the factors and interactions considered in the model. RSM (Response Surface Methodology) is used; the interactions between the laser power, the welding speed and the side have been removed as they were considered not significant. Finally, the Design Generation part proposes 8 trials and 24 measurements.

The creation of the custom design of experiments is explained in [Figure 5]. A total of 8 tests and 24 measurements is finally retained. The test plan is executed and, for each triplet (P, V, Side), the following data are recorded: the position of the weld (in mm, from one end of the plate) and the value of the strength (in absolute terms and in percent of the base material strength). The strength of the welds is then modeled using the Fit Model platform [Figure 6].

Figure 6 – Modeling of the weld strength – The results are presented in a dashboard. The irregular shape of the weldability lobe is recalled in the top right chart. The experimental points proposed by the Custom Design platform and the associated strength values, in percent, are summarized in the bottom right chart. Finally, the Fit Model platform on the left displays the modeling result. An explicative power R2 of 96% has been reached, meaning that only 4% of the variation escapes its predictive power. The Effect Summary shows that the main effects (laser power, welding speed and position) are significant. The side factor is not directly significant, but becomes so when associated with the position. The VIFs (Variance Inflation Factors, not displayed here) all have a value smaller than 1.6, showing no multicollinearity issue (no linear relationship among two or more explanatory variables).
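For readers who want to reproduce this kind of split-plot analysis outside of JMP, a random intercept per weld can stand in for the whole-plot error term. The sketch below assumes a results table with hypothetical columns power_kW, speed_m_min, side, position_mm, strength_pct and weld_id, and follows the model described above (RSM terms plus the position-by-side interaction, without the position-by-power and position-by-speed interactions).

```python
# Minimal sketch of a split-plot strength model with statsmodels; column and file names
# are assumptions, not the actual Clecim data set.
import pandas as pd
import statsmodels.formula.api as smf

runs = pd.read_csv("weld_strength_results.csv")       # hypothetical export of the DoE table

model = smf.mixedlm(
    "strength_pct ~ power_kW + speed_m_min + position_mm + C(side)"
    " + I(power_kW**2) + I(speed_m_min**2) + power_kW:speed_m_min + position_mm:C(side)",
    data=runs,
    groups=runs["weld_id"],                            # whole plot = one weld (hard-to-change P, V)
)
fit = model.fit()
print(fit.summary())
```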
The resulting model being of good quality, it can be used for prediction. After correcting for the effects of weld position and side, the trends attributable to laser power and travel speed are clearly visible in the Prediction Profiler [Figure 7].

Figure 7 – Checking the model's behavior with the Profiler – The upper profiler refers to low position values, the lower one to high position values. The model response (strength in %) is shown on the y-axis, the factors on the x-axis. Weld after weld (increasing position), the strengths on the DS and C sides remain mostly unchanged while the strength on the OS side changes dramatically, as observed visually. This phenomenon being well modeled, it is now possible to access the pure effects of laser power and welding speed. The weld strength increases when the laser power decreases and the travel speed increases.

For the considered material, there does not seem to be any interaction between the laser power and the welding speed. The weld strength therefore increases with the travel speed and when the laser power decreases.

However, the work is not over yet. The limits of the weldability lobe must also be carefully considered in the search for an optimum. On the basis of the Prediction Profiler alone, this is not easy, so it is the Contour Profiler's turn to play!

The use of the Contour Profiler makes it possible to superimpose the iso-resistance curves from the strength model with the weldability lobe [Figure 8]. Finding the optimal point requires locating a point that is not only within the weldability lobe but that also has the highest strength.

Figure 8 – Optimization with the Contour Profiler – The Contour Profiler displays the welding speed on the x-axis and the laser power on the y-axis, with the position and side values fixed. Arbitrarily, the side value was set to DS. As for the position, it was set to the latter. The weldability lobe, where the weld bead is free of defects, was reproduced in black using a script and the polygon drawing function. The iso-resistance curves of the model, in red, are also plotted, together with the associated resistance percentages. The welding speed and laser power sliders are set to the coordinates of the optimal point, materialized by the black cross in the center of the graph.
If we add the fact that the welding speed should be as high as possible for maximum productivity, the point of coordinates (11 m/min, 6 kW) proves to be ideal. Not only does it meet all the above criteria, but it also offers a satisfactory safety margin for an industrial process. Of course, these settings have been tested. The results are presented in [Figure 9] and [Figure 10]. In summary, the weld seam has a defect-free surface with a strength across the entire width comparable to that of the base material. The objective has been achieved!

Figure 9 – Optimum preset and weld seam appearance – The figure shows the upper (left) and lower (right) weld bead facies. They are free of the main welding defects.

Figure 10 – Optimum preset and weld strength – The figure shows the results of the 3 Erichsen-type cupping tests performed on the DS, C and OS sides. Visually, it can be seen that it is the material that breaks and not the weld. Moreover, all the tests show a strength level comparable to that of the base material.

6. Conclusion
Finding the optimal laser welding parameters for a given material is not easy. Fortunately, JMP offers a suite of platforms that, combined, provide a rigorous approach to achieving our goal.

The use of Graph Builder, custom map shapes and dashboards allowed us to visually organize our data, both in terms of welding defects (welding flaws map, weldability lobe) and strength (Erichsen-type cupping tests).

The ANOVA and Distribution platforms were used to make informed decisions about the equivalence of the plates to be processed and their level of strength.

Once the weldability zone was determined (the zone where the weld beads are free of defects), the strength of the weld bead was studied using the design of experiments methodology. In this paper, only 2 parameters were considered (laser power, welding speed). The Custom Design platform allowed for a high degree of customization of the tests in relation to the encountered constraints. The highly irregular shape of the study area gave the opportunity to use the candidate point method (covariates), in addition to other features such as split-plot design and uncontrolled factors.

The modeling of the weld strength via the Fit Model platform made it possible not only to understand the physical phenomena involved but also to proceed to a multi-criteria optimization via the Contour Profiler. Finally, the objective was achieved, since all these steps allowed us to propose and validate a set point with maximum productivity and a good weld, both defect-free and resistant.
This step is part of an extensive, very high-value data acquisition program that will allow, just as it did in 2019 with the laser cutting process, the development of a laser welding model that will provide robust welding instructions regardless of the incoming product, the final step to fully automate the machine.

7. About Clecim SAS
Clecim SAS, based in Montbrison (Loire), joined the Mutares group on April 1, 2021. It is an engineering and production company, bringing its expertise in services and manufacturing, in particular to the metallurgical industry. Its main activity is the operational support of the performance of its flat steel producer customers, in particular for the automotive market. This support takes the form of studies and advice on the improvement of their production tools, the supply of special machines to optimize performance and, if necessary, the supply of complete production lines based on the latest technologies.

For decades, Clecim SAS has been promoting innovation in the steel industry and is constantly looking for new solutions to provide metal producers with state-of-the-art equipment, allowing them to gain a competitive advantage. Our latest areas of focus include new technologically differentiated solutions, advanced process analysis and optimization. Of particular note in this area are world-renowned high-level solutions such as special laser welding machines, surface inspection systems, rolling equipment, and galvanizing lines for flat steel for the automotive market.

With its own factory, Clecim SAS is able to manufacture and test complete machines. The company has many skills (engineering, manufacturing, testing) allowing it to master the entire value chain. Clecim SAS can also provide its customers with a pilot rolling mill for the development and confirmation of flattening, rolling and tribology models.

Figure 11 – Clecim SAS (Montbrison, France)

About the author
A graduate of Grenoble INP in Materials Science and Process Engineering, Stéphane GEORGES (47) joined Clecim SAS in 2001. After holding several positions in the Automation, Modeling and Process Control department, and after 3 years of expatriation at Siemens in Erlangen, Germany, Stéphane moved to the position of R&D Project Manager. His various missions related to industrial processes have led him to use his skills in smart experimentation, statistical analysis and modeling, coding, machine learning and deep learning to make the most of data.

Acknowledgment
We would like to express our thanks and gratitude to Florence Kussener of JMP for her support in using the software and preparing us for our first ever participation in the Discovery Summit.

References
[1] André Augé, JMP Addict: Tips and Tricks Workshop: Customize Your Reports and Chart Builder Tips, webinar, 2021, shorturl.at/jqAZ2
[2] Ryan Lekivetz, What is a covariate in design of experiments?, article, JMP Community, 2021, shorturl.at/hqvH4
[3] Ryan Lekivetz, Developer Tutorial: Handling Covariates Effectively when Designing Experiments, webinar, JMP Community, 2021, shorturl.at/fvEZ0
[4] Peter Goos, Bradley Jones, Optimal Design of Experiments: A Case Study Approach, Wiley, 2011

CLECIM® is an internationally registered trademark owned by Clecim SAS. Clecim SAS, all rights reserved – 2022-feb-02
Sepsis is a life-threatening condition which occurs when the body's response to infection causes tissue damage, organ failure, or death. In fact, sepsis costs U.S. hospitals more than any other health condition, and a majority of these costs is for sepsis patients who were not diagnosed at admission. Thus, early detection and treatment are critical for improving outcomes. This presentation examines an actual clinical data set, obtained from two U.S. hospitals and recently published on Kaggle. In particular, a number of predictors, drawn from a combination of vital signs, demographic groups, and clinical laboratory data, are examined. Using JMP, such issues as missing values, outliers, and a highly unbalanced, categorical outcome variable are dealt with. In addition, this presentation shows how visualization, interactivity, and analytical flow can lead to a more compact and integrated analysis — and a shorter time to discovery.       Good morning. Good afternoon. Good evening, everyone. My name is Stan Saranovich, and I am the principal analyst at Crucial Connection, LLC. I am located in Jeffersonville, Indiana, right across the river from Louisville, Kentucky, United States of America. Today I'm going to talk about sepsis predictions from clinical data using JMP Pro 16. But almost all of what I am going to do here will be available in the standard version of JMP. Now let's talk about sepsis for a minute. To start off, sepsis is a life-threatening condition which occurs when the body's response to infection causes tissue damage and can cause organ failure or even death. In fact, sepsis costs United States hospitals more than any other health condition. So if we could predict sepsis and detect it early, we could improve the outcomes of critical care patients and also lower the cost of health care. So we're going to look at a data set today, and this is an actual data set of clinical data. It was collected from two hospitals in the Boston, Massachusetts, area in the United States. These two data tables were published as a contest on Kaggle by a cardiology group, and the results were eventually published in a cardiology journal. Now, there were three units involved in this study, and in this contest, I have the data for what we'll call unit one and unit two, which is data from two ICUs, intensive care units. The third group of data was not made publicly available and was held back for the contest, and it's still not publicly available. So what we'll do now is examine the data and see if we can predict sepsis and what variables we should be following to avoid sepsis, which, of course, can be a life-threatening condition. Now I have a data set in front of me, and this was downloaded from the Kaggle site and imported into JMP. Let's take a closer look at it. Usually I like to jump right into the data analysis, but in this particular case, that's not going to be a good idea, and after we look over the data, you'll see why. First of all, let's look over at the left of the JMP data table, where we have the columns window. I prefer to do a lot of the work from here, and here we see a number of variables. So let's start with heart rate right here. That's probably going to be an important predictor, but we don't know that yet. And we have the rest of the predictors over here. There are actually 40 of them. Well, no, 38, if you don't count the units, O2 saturation, et cetera. And we could go through the list right here: temperature.
But what you'll notice, and for this I'm going to have to scroll, is that the first six or eight columns are just clinical data. We have systolic blood pressure. We have respirations, diastolic blood pressure. And as we go across our data table, we see some lab data. We're talking about glucose and lactate levels, magnesium, phosphate, potassium, bilirubin, et cetera, et cetera. And finally, we have a set of columns which I guess we could call demographic data. We have the age right here, gender, and what unit they belong to. Now, while we're on units, let's take a look here. We have unit one and unit two. And we know from doing our background research, and we all do background research right before we start the analysis, that one is a cardiology unit and the other one is a surgery unit. So if they belong to that unit, we have a one, standard practice, and if they don't, we have a zero, also standard practice. But notice something else here. We have a lot of rows where there is no unit. Now, we don't know where that data came from, not unit one and not unit two. And the background that was published on the Kaggle site tells us that the third unit is going to be held back to score the model. So it wasn't made public. So we don't know where that data came from. And I'll discuss that in further detail in a little bit. Now let's scroll back over and look at some other things about this data table. We have a whole lot of missing data. Let's take a look at some of the columns here, and we can just scroll down. Here, the bilirubin direct: that's one where we had to go down 63 rows before we found one value. Here's another one at row 113. So there's not a whole lot of data in there. As a matter of fact, that column is only 3% populated, and there are a whole lot of other columns that are populated at a similar rate. That was the worst example, but there are a whole lot at five and ten percent. And we also have the problem with the admission unit assignment. So let me close that data table and I will open another one. There we are. Now, I made some modifications to that first table, and I decided to save some time and not make you sit through and just watch me clicking on columns. Notice over here in the columns area, we have these two symbols right here. One hides columns and the other one excludes them from the analysis. And if you'll notice this one particular one, EtCO2, it was right about here in the first data set and it's missing. So we hid those columns and we're going to exclude them from the analysis. I also did two things which are just personal to me. Number one, I moved our target variable, what we want to predict, is sepsis, to the left, so that when I view the data table, I can just scan across the rows and see some relationships if there's something I want to see. And I also moved this one over because I knew, just from, for lack of a better term, general intuition, that this was going to be an important variable, and that's ICULOS, which is intensive care unit length of stay. Now let's look at some other things around here. I want to note one other thing I excluded. Where was it? Right here, you can't see it: hospital admission time. And what that is, we think, is the time between when they were admitted to the hospital and the time they were admitted to the ICU. And a lot of those numbers are negative, but there's nothing in the documentation that tells us how you can have a negative time. So I excluded those also.
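A minimal sketch of this column screening in pandas, assuming the table uses the column names of the public PhysioNet/Kaggle sepsis data set (for example Bilirubin_direct and HospAdmTime); the 10 percent threshold is an arbitrary illustration, not the cut-off used in the talk.

```python
# Minimal sketch of profiling missing data and excluding sparse columns; file and
# column names are assumptions based on the public data set.
import pandas as pd

df = pd.read_csv("sepsis_clinical_data.csv")           # hypothetical file name

# Percent populated per column -- columns like Bilirubin_direct are only a few % filled.
populated = df.notna().mean().sort_values() * 100
print(populated.head(10))

# Drop columns that are less than 10 % populated (threshold is a judgment call).
sparse_cols = populated[populated < 10].index
df = df.drop(columns=sparse_cols)

# The hospital-to-ICU admission time contains undocumented negative values, so set it aside.
if "HospAdmTime" in df.columns:
    df = df.drop(columns=["HospAdmTime"])
```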
And the other columns that I excluded, like the bilirubin there right above it, it's the same deal with those. There's a lot of missing data, so I just excluded those. And let's see, is there anything else I need to do here? No, it's time to do the analysis. But before we do that, let me tell you what the overall plan is going to be. And that is, after we examine the data and clean it and prep it, which we've already done, we're going to look at the individual units. We're going to look at is sepsis, in other words, whether or not the people developed sepsis, and we're going to do some database management. So let's get started. The first thing I'd like to do is go up here to Tables and then go to Subset, and we get the pop-up window from JMP. It says create a new data table, et cetera, et cetera. So let me click on that. And of course, I want to check this box here that says Subset By. And I'll scroll down, because we knew there were two units and there was also, in effect, a third unit, which was neither unit one nor unit two. So let me just click on unit one, and I want to pull in all the rows from that one. And I'd like to keep a By column just as a safety check, and you'll find out why in just a minute here. And we could keep the dialog open, but I've run through this analysis before, so right now, hopefully, I won't make a mistake or change my mind, and there'll be no need to keep the dialog box open. And for right now, we'll skip the output table name, and I don't want to link it to the original data table, but I'll come down here to this box and save the script to the source table, take one last look over this, and everything looks okay. And I'll click OK. Now let me separate these a little bit. Remember, we had unit one, unit two, and missing, which was neither unit one nor unit two. And I've got three data tables here. And in the title, it says unit one. Well, let's just look at that. It says here unit one. And if I scroll over, it says unit one and unit two. And this is the reason I kept the By columns right here: it's all missing data. And we scroll down a little bit just to double check, and yeah, it looks like they're missing. So it says unit one. So what we want to do is relabel that: right click, hit edit, and we can change that to missing. And I'll type it in and we'll hit okay, and we now know what that is. It says unit one equals... I mistyped it, it should be missing, but I'll let it go for right now, because we can't analyze that, because we don't know where it came from, or rather, we don't want to analyze it. So I'll just close that out and I don't want to save the changes. Now, this one says unit one equals zero. And I'll come over here and, yeah, sure enough, there's zero in unit one, which means it does not belong to unit one. And over here it says unit two and there's one in the columns. Scroll down a little bit just to make sure, and yeah, it says unit two. Okay, so what we'll want to do is go up here and we can click that and we could edit it. And now I want to change that, and we'll do that. And here where it says subset script, we can go up here and I want to change this. I can edit that and I'll change that to unit two, but it won't do it right now. And it says right here, if we want to check, we're looking at unit, and right here it says keep by column one, unit one, and that's the column it is. So I'll just cancel out of that for now. And it's the same with this data table over here.
We can go to the source code, or rather the source edit. It says keep by columns one, and we'll cancel out of that. And that's it for the units for right now. Now, what we could do is go up here and do another Tables platform. We've got Summary and Subset and whatnot. We could click on Summary, and we could drag what we want to summarize in here, and we could pick our statistics, whether we want the count, the mean, standard deviation, the median, and a whole lot of other stuff, but we'll skip that for now. Let me cancel that out, and I'll close these two tables, and I won't save them, because I have a couple of tables down here where I already did this. So let me come up here, and I'll select unit one and unit two and open them. Now, if we wanted to see the difference in outcomes, for example, between unit one and unit two, we could analyze both of these data sets separately, but in the interest of time, we will not do that. What we will do is combine them and analyze them together. Now, let's show another feature of JMP that makes the data handling part of our job easy. Let me go up to Tables, and what we want to do here is Concatenate, and the helper window shows up. It says it combines rows from several data tables. We have a number of other selections we could have made to avoid having to write some SQL code. But here we want to concatenate. And unit one showed up on top, and that's good. And what we want to do is concatenate unit two. So I'll click on that, and let's give this a name. Let's call this, how about, both. I'll just type it in here now. And we could create a source column, again as a check, just to make sure everything is proceeding like we want it to proceed. I'll create the source column, and I could keep the dialog box open, that's right here, but I will close it for now since, hopefully, I didn't make any mistakes and won't have to go back. And we'll click the run button and let's see. It didn't come up. Let me try that again. Let me close that window. We'll leave it like that and let's see what happens. There we go. Must have fat fingers this morning. And this is the combined data table, and as I mentioned, I'd like to keep that open. It's our source table. Normally what I do is drag that somewhere off the edge of the table, but for right now, we'll leave it there, and we'll just scroll down a little bit, and we note that we have unit one, unit one, there we go, unit two. So that serves as a check. So we have that there. Now it's time, finally, to start the analysis. We can do a number of things here. First of all, we note that most of our variables are continuous, except for our target variable, which is binary: on or off, yes or no, et cetera. But let's just take a look at the distributions. I always like to point this out in the analysis. This is one of my favorite features of JMP. You can do a quick inspection right here to see if there's anything weird, and let's see. Okay, scroll back over. I don't see anything that pops out at me. It looks like unit one and unit two are right in there. One thing to note right here is sepsis equal to one: there are a lot fewer rows where the patient went into sepsis. And let's see, I think it was about 7%. So we're looking at 93% here and 7% here. So let me get rid of that. That's what I like to do. Now let's go up to the Analyze menu, and we're going to make use of this pretty much exclusively from now until the end of the presentation.
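A minimal sketch of the subset-and-concatenate steps in pandas, assuming the Unit1, Unit2 and SepsisLabel column names of the public data set; the source column mirrors JMP's "create source column" check.

```python
# Minimal sketch: subset by unit, then concatenate the two known ICUs with a source column.
# Column names are assumptions based on the public data set; adapt to the table at hand.
import pandas as pd

df = pd.read_csv("sepsis_clinical_data.csv")

unit1 = df[df["Unit1"] == 1].copy()
unit2 = df[df["Unit2"] == 1].copy()
unknown = df[df["Unit1"].isna() & df["Unit2"].isna()].copy()   # no unit recorded, set aside

# Keep a source column as a sanity check before combining the two units.
unit1["source"] = "unit one"
unit2["source"] = "unit two"
both = pd.concat([unit1, unit2], ignore_index=True)

print(both["source"].value_counts())
print(both["SepsisLabel"].value_counts(normalize=True))        # roughly 93 % / 7 % imbalance
```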
Now let's go up to the Analyze menu, which we're going to use pretty much exclusively from now until the end of the presentation. I'm going to choose Multivariate Methods from the drop-down, I get another drop-down, and I'm going to choose Multivariate. This window pops up and wants to know the Y columns. By the way, this is another reason I like to put the target variable over on the left: it's right here, and we can pop it into the Y columns without having to scroll and hunt for it. So let's see, we've got everything in there. We don't want unit one or unit two. What else do we have, gender, age? I'll tell you what, let's put them all in as Y columns and hit OK. And here's what we get: a correlation matrix. Let's take a closer look at it. It's a little confusing because we put a whole lot of variables in, but again, that's one of the advantages of JMP: you can pop them all in there without having to write extra code. We have our diagonal here, and the matrix is reflected along the diagonal, so it's the same data top and bottom, with some different colors. One means a perfect correlation, which is statistically significant and is what we expect; blood pressure, for example, should be correlated very well with itself. But we also note that right under it, blood pressure is correlated with something called MAP, which is mean arterial pressure, a mean of the systolic and diastolic pressures, so that makes sense. And we have DBP, which is diastolic blood pressure, so that makes sense too. There's not a whole lot else we can see here. Here's another correlation: that's BUN, which stands for blood urea nitrogen, the urea content of the blood, a byproduct of metabolism. It looks like it's correlated with something over here, potassium, but that's about it. Here's hematocrit, which looks like it's correlated with HGB right over here. Everybody see that square? Take a second to look at it. HCT is hematocrit and HGB is hemoglobin. Hematocrit, I believe, is the volume fraction of red blood cells, and hemoglobin tracks it closely, so you would expect them to be correlated. So there's not a whole lot to see here outside of that, and we may as well close it. Next, we're going to go back up to the Analyze menu and go to Screening, and again we get another drop-down. Let's hover over this one, Predictor Screening: it screens many predictors for their ability to predict an outcome. We want to be able to predict whether or not a particular patient is going to develop sepsis, so that looks like a good choice. We click it, and this is the window we get. Again, we want IsSepsis, where one is yes and zero is no, so we put that in as Y, and we're presented with the same list of variables we had before. Let's do what we did before: start here, go down to gender, ignoring the units again, hold the shift key, click to select all of them, and hit the X button. There's nothing else for us to take note of, nothing else to click, so we'll hit OK. And there we go: JMP tells us it's running a bootstrap forest. We could do a whole presentation on bootstrap forests, in fact we could probably do two or three or four, but we don't have time. And we are getting there.
It's scoring the results, and we just have to wait; it's taking a while for some reason. And there we are. Let's look at what we have. We have the contribution, which is the unscaled contribution to the model, and the portion, which is the scaled version. You can think of the portion as a weight fraction, or, if you prefer, multiply by 100 in your head to make it a percent. If we take a quick look, we're at about 0.61, then about 0.74, and it looks like that takes us up to about 0.8, or 80% of the explanation. So all of those make sense. Now let's look at ICULOS, the intensive care unit length of stay that I talked about excluding earlier. It looks like it predicts more than all the others combined. However, if we used it, that would be circular reasoning. If people develop sepsis, they're almost certainly going to end up in the ICU, and if they are really sick, they may be at higher risk of developing sepsis and are going to end up in the ICU anyway. So if they're in the ICU, they're probably pretty sick to begin with, maybe already developing sepsis, and they're going to be in there for a while. It also doesn't really help us, because it's not something we can measure the way we measure blood pressure; the patient is either in the ICU or not. So let's exclude it. We'll go back up to Analyze, down to Screening, Predictor Screening, put IsSepsis in as the Y response, and this time leave ICULOS out. We'll do everything exactly the same as before: shift-click from gender, select everything, hit the X button, nothing else to do, and hit OK. We'll just wait a little while; it looks like it's running a bit faster this time. And here we go. Now, with ICULOS completely out of the picture, we see something else. BUN, blood urea nitrogen, looks like it's in the running as a significant predictor. After that comes temperature, which makes sense: if you develop sepsis, you have an infection, so you're probably going to run a temperature. Creatinine is a byproduct of muscle breakdown, so that makes sense too; remember, we did our research before we started the analysis. After that come respirations, and shallow, rapid breathing makes sense; then hemoglobin and hematocrit, which we know are highly correlated; then blood pressure; and WBC, which I didn't point out before, is the white blood cell count, so that makes sense as well. Now we have a decision to make. BUN is obviously the most prominent; I don't want to say most important, we're not sure of that yet, but it's the most prominent. After that come temperature and creatinine, and then there's a large drop in the rankings and in the portion column. So let's start up there with blood urea nitrogen and pull in as much as we can, because JMP is going to make all the repetitive tasks and calculations easy for us by taking them away from us. Shift-click down to systolic blood pressure; that's probably going to play a role, because if you have sepsis you tend to have dangerously low blood pressure. We have some other measurements down here, but we'll skip those.
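JMP's Predictor Screening is driven by a bootstrap forest; a rough open-source analogue is a random forest, whose feature importances sum to one and can be read like the Portion column described above. A minimal sketch, continuing from the earlier hypothetical data frames and dropping the ICU length of stay for the reasons just given:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Continuing from the earlier sketches: 'both' is the combined table.
predictors = both.drop(columns=["IsSepsis", "Unit1", "Unit2", "SourceTable"],
                       errors="ignore").select_dtypes("number")
X = predictors.drop(columns=["ICULOS"], errors="ignore")
X = X.fillna(X.median())                 # the forest needs complete rows
y = both["IsSepsis"]

forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Importances behave like the Portion column: fractions summing to one.
importance = pd.Series(forest.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(10))
```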
By the way, two of those measurements down there are going to be correlated: one is the partial pressure of carbon dioxide in the blood and the other is the carbonate content of the blood, so those are related. It doesn't look like there's anything else of importance, and JMP puts in a handy link here that says Copy Selected, so let's do that. I copy the selected columns and leave that open for right now, and we go up to Analyze once again. What we want to do now is Fit Model; it says it fits a linear regression, so let's go there. Since we copied the selection in the previous window, JMP remembered it for us, and we click the Add button to put those columns into Construct Model Effects, meaning we want to use them as modeling variables. What else do we have here? Notice in the upper right-hand corner there's something called Personality; keep your eye on that corner for the next 30 seconds or so. I'll go over here, grab IsSepsis, and put it in the Y role. Looking at the Personality in the upper right-hand corner, I get some choices in the drop-down menu, along with an Emphasis drop-down, and we won't go over all of them right now, but what I probably want is Generalized Linear Model. If I hover over it, a pop-up window says it fits a generalized linear model, and I get to select the distribution and the link function and so on, which I'll go over in a couple of seconds. So let's start with the distribution. Remember, we expanded that top group and looked at the distributions, and there didn't seem to be anything weird, at least on a macro scale, so let's just pick Normal. And I want Logit for the link function, because we've got a binary variable that we're trying to predict. Let's see, it doesn't look like there's anything else left for me to do, so I'll hit the Run button, and here we go. Here's our generalized linear model fit. It gives us a summary up here of what we looked at, including a Chi-square, which I won't go over in any great detail, so let's scroll down a little more. Here we have an Effect Summary, and by the way, if we click on these triangles we can hide sections or make them appear again, depending on what we want to present; I'll leave them all open for right now. What we have up here is the source column, and there it is: blood urea nitrogen, BUN, is very high again, followed by temperature, creatinine, white blood cell count, heart rate, and some blood pressure measurements. All of this makes sense from what we know about sepsis. LogWorth is the contribution, and over here we have the p-values, and we see they're all very highly significant down to right about here, the white blood cell count. We also see a blue line here, and that is a LogWorth of 2. The reason we use the LogWorth is so we can display things on the graph on a log scale, which just makes it easier; otherwise the bar up here would be off the edge of my screen. That blue line marks a significance level: the 0.01 significance level corresponds to a LogWorth of 2, because the negative log of 0.01 is 2. So these are all significant down to HR.
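For comparison, here is a hedged sketch of a generalized linear model with a logit link fit outside JMP using statsmodels. It pairs the logit link with a binomial family, the conventional choice for a 0/1 target, rather than reproducing the exact distribution setting chosen in the Fit Model dialog above, and the shortlist of predictor names is hypothetical. The last line computes LogWorth values, so the blue-line cutoff of 2 discussed above can be checked directly.

```python
import numpy as np
import statsmodels.api as sm

# Continuing from the earlier sketches: X holds the screened predictors, y is IsSepsis.
top = ["BUN", "Temp", "Creatinine", "WBC", "HR", "SBP"]   # hypothetical shortlist
X_glm = sm.add_constant(X[top].astype(float))

# Binomial family with its default logit link -- the usual pairing for a 0/1 target.
result = sm.GLM(y, X_glm, family=sm.families.Binomial()).fit()
print(result.summary())

# LogWorth = -log10(p-value); the blue line at 2 corresponds to p = 0.01.
print((-np.log10(result.pvalues)).sort_values(ascending=False))
```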
And if we come down here, we have the Chi-square results, and we see some significance levels here too. Basically, we're looking at the respirations, the BUN, the creatinine, and the temperature; they're all highly significant, and I almost forgot the white blood cell count down here. I'm running out of time, so I won't explain a whole lot more about that, but if we scroll down a little further we get the parameter estimates for our predictor variables, along with more statistical detail on each. I'm starting to run out of time, so let me just minimize those windows, get rid of all the highlights, and recap what we did. Here is our original, highly cleaned and rearranged data table. We want to predict sepsis, which is binary. We ruled out the length of stay in the ICU, which is right here, because it didn't help us and it was circular logic. We've got our variables in three separate groups up here: we start off with the clinical measurements, then over here we have all the blood tests, and then the demographic data. We had two units, and we excluded the 25 or 30% of the data that wasn't assigned to either unit, because we don't know where it came from. Then we subsetted everything; remember, we actually got three separate subsets, because the third subset was the missing-data "unit," unit being in quotes. From there we went to Multivariate to look for correlations, then to Analyze, Screening, Predictor Screening, where we got what we figured would be our most valuable predictors of sepsis. And finally we went to Fit Model. Let me reiterate that: we went up to Analyze, Fit Model, clicked that, got this window, and put everything in there except what we wanted to exclude. We put IsSepsis right here as the Y variable, and remember, this is the area we had to focus on, up in the upper right-hand corner, where we had the Personality and a couple of other selections to make. We made those selections, and we got our results, which I went over a minute ago. That is the end of the presentation. I hope everybody enjoyed it and maybe even learned a little something from it. Thank you for watching, listening, and giving it your attention.
While basketball is a team sport, NBA analysts, coaches, and fantasy basketball enthusiasts are often interested in the performance of individual players. Optimizing or predicting individual performances can have an impressive impact on the outcome of a game. Individual player metrics can, for example, allow coaches to target a specific defender or avoid distributing the ball to certain offensive players based on their matchup. In this presentation, I demonstrate how we can quantify these matchups by using a Positional Matchup Model in JMP to predict offensive player performance, as summarized by an individual’s offensive rating relative to his primary defender’s statistics. In the model I have created, I focus on key indicators such as average blocks, steals, height, and defensive rating, using JMP to clean data accessed from basketball-reference.com. I then select the optimal predictive model using the Model Screening feature. The final model displays an offensive player’s predicted offensive rating based on his defensive matchup during a future game. These results can be compared to a player’s own average to determine how much better or worse he is likely to play, based on an individual matchup.       Hello, everyone. My name is Kevin O'Donnell. I'm a JMP Global Customer Reference Intern, and today I'll be presenting a personal analytics project of mine: an NBA player matchup model. To get started, I'd like to go over my idea and the motivation for building the model. I've always been incredibly interested in basketball; I've been a huge fan my whole life, and I was looking for a new analytics project to dive into, so I ended up asking this question: could we model a player's offensive success based on his own averages and the strength of his opponents? As most of us know from watching any sport, player performance varies greatly from game to game based on a variety of factors, one of these being a player's average points. That will be a very strong predictor of points in an NBA game, but it's far from the only influence on a player's offensive performance. So going into this project, I wanted to build a model that predicted a player's points in a game based on his points per game, but also on a lot of other variables, which we'll look into on the next slide. Ideally, this would be helpful for coaches and team analysts to determine, for example, which of their offensive players are likely to overperform based on their matchups, and vice versa, which of their players they might want to avoid going to on offense because they're being guarded by some of the best defensive players on the other team. This could inform play calling and game planning: if a particular defender is weaker, coaches may look to target that matchup or look to force a switch onto their best scorer. In the entertainment realm, fantasy basketball players could use this data to figure out who to start or who to pick up off the waiver wire, so it has a broad range of applications. Here's a quick overview of the data. I'll go through this quickly, and we'll see it in more detail as I get into the JMP demo, but we're going to predict points based on seven main categories of data.
So we're going to use player offensive averages for the season; this includes things like points per game, 3-point and 2-point percentages, and advanced metrics such as usage rate, which records how often an offensive player is used by his team. Player attributes such as height, wingspan, and vertical leap will also be included for both the offensive and defensive players, because these physical attributes, I would assume, also contribute to how many points a player will score. In terms of the offensive players, we'll also use career averages, the same averages we're using for the season, to add a little more robustness to the model. Overall team pace and offensive rebounding percentage will also be predictors, because they determine how many possessions there will be: if there are more possessions in the game, it's more likely that any given player will have more points. By the same token, defensive team pace and defensive rebounding percentage will also be used. In terms of individual defensive stats, things like steals, blocks, fouls, and advanced metrics such as defensive win shares, which measures an individual player's contribution to the team's wins on defense, are also going to be predictors, probably negating some of the points scored by an offensive player. And finally, the defender attributes will be used, as I mentioned earlier. Before we get right into the data and the modeling in JMP, I'd like to go over the matchup data that I used for the bulk of the model. The NBA recently implemented new personal matchup data collection based on detailed player tracking. It tracks the closest defender at every point, not just the primary defender on a play; it only tracks front-court time; and it tracks partial possessions. What this means is that a player could be guarded by as many as five different players on a single play, and each defender would be awarded the respective amount of matchup minutes for that possession. Here we have an example of Terry Rozier of the Hornets being guarded by four different players on different teams. In the first row, Steph Curry guarded him in only one game, for 2.7 minutes, and allowed two points. If we think of a hypothetical where the Hornets are playing Steph Curry's Golden State Warriors, and Curry guarded Rozier for 10 seconds and then Klay Thompson switched onto Rozier for another 10 seconds, both would be awarded those 10 seconds of matchup time. However, if Rozier scored a two-pointer at the end of that possession, the points would be marked against Klay Thompson. This is really cool tracking data and I love how specific it is, but it did cause some problems when I tried to model the points per minute for each player and defender combination, which was my original plan. I was originally going to use the offensive player stats and defensive player stats from the previous slide for every individual matchup. But since many defenders, as you'll see here, logged very small amounts of time (Stanley Johnson, for example, guarded Rozier for an average of under one minute a game), there's not enough time for some of these points-per-minute measurements to behave normally. Going back to that Klay Thompson hypothetical, if Klay only guarded Rozier for 10 seconds the whole game, and Rozier happened to score two points in that possession, then Rozier's points per minute against Klay in that row would be 12.
So obviously, extrapolating that to a full game, even excluding the back-court time as this data does, is very unrealistic. So I had to go with a slightly different, less ideal approach. Instead of using the individual defensive stats, I averaged them out for each combination of player and defensive team. So instead of using Rozier versus Curry and Rozier versus Thompson, I would use Rozier versus the entire Golden State Warriors average, weighted by the amount of matchup minutes each player defended Rozier. For example, if Steph Curry and Klay Thompson each guarded him for half of the possible matchup minutes, then the team average, let's say steals per game, would just be the arithmetic mean of Curry's steals and Klay's steals per game. The same goes for every defensive variable, and that's obviously a simple example, but the same applies for every player versus every team. This is not ideal, because it minimizes the individuality of the matchup, but I had to abandon the points-per-minute approach because the samples were too small and the response was heavily distorted. The model I've created with the aggregated data is more accurate than my initial attempts, even though it sacrifices some individuality. At this point, I'm going to switch into JMP, do a bit of a demo of how I built the model, and start with some exploratory data analysis. To begin, we're just going to look at some marginal relationships with points in Graph Builder. I could choose any variables I want, but for now we're going to look at three. The first is points per game. We see here a moderate, positive relationship between points and points per game, as we would expect: the average points someone puts up over the course of the season is obviously going to influence how many points they score in a game. Similarly, usage rate, the advanced statistic I was discussing earlier, also has a positive relationship, just slightly weaker than the relationship between points and points per game. Finally, if we look at career points per game, we again see a positive relationship, but a little weaker still, because it's averaged over the course of a player's career rather than the current season; hopefully, though, it can help adjust for some major differences in the points-per-game averages. Now that we have an idea of the data itself, we can move into the simple linear regression of points by points per game as a benchmark for this model. As I mentioned, you can predict based on just points per game and it will give you a decent prediction, but we're looking to improve on that by adding these other offensive and defensive variables. If we run this script here, it compares some runs of a simple linear regression on the training and validation sets, using k-fold validation throughout. If I run it using the hidden Validation 2 column that I'll be using for the remainder of these models, we can see the regression. Here's the regression plot: it looks pretty scattered, there's not a clear linear relationship, but the R-square is moderate, meaning that 46.5 percent of the variation in points is accounted for by this average points per game, which is pretty good. And the root mean square error is about 5.27.
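As a rough stand-in for this benchmark fit, here is a minimal sketch of a simple linear regression of game points on season points per game, with a 10-fold cross-validated R-square and root mean square error. The table name and column names are hypothetical placeholders for the matchup data, so the numbers it prints are not the ones quoted above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

game_df = pd.read_csv("matchups_2020.csv")   # hypothetical game-level matchup table
X = game_df[["pts_per_game"]]                # season average points per game
y = game_df["points"]                        # points actually scored in the game

cv = cross_validate(LinearRegression(), X, y, cv=10,
                    scoring=("r2", "neg_root_mean_squared_error"))

print("R-square:", cv["test_r2"].mean())
print("RMSE:", -cv["test_neg_root_mean_squared_error"].mean())
```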
The root mean square error is the standard deviation of the residuals, so it's essentially a measure of how well our model is predicting, and that standard deviation is around five. We can also look at other measures, such as the AIC, which is used to compare models for predictability. The AIC is another important measure: it measures how well the model will predict relative to the number of predictors in the model, just to make sure you're not adding too many. In this case the AIC is very high, and we can come back to this number when we compare it to the multiple linear regression that I'll show later. While I was trying to pick the model I was going to use, I decided to try the Model Screening feature in JMP, which lets you select your response variable and all the factor variables. I ended up putting in all of the numeric variables I had, as a full model, just to see which methods perform best initially. I was able to choose from a variety of methods, including XGBoost and Generalized Regression, and of course ordinary least squares. In the interest of time, since this would take forever to run because it's using some of these machine learning algorithms, I'm instead just going to pull up a quick screenshot of the model screening. Here we have the output. It shows the R-square (again, I was using k-fold validation with ten folds here) and the root mean square error for each run. We can see that the least squares fit actually had a very strong fit compared to some of the machine learning algorithms, which surprised me, but it actually helps with interpretability: some of these machine learning algorithms are more of a black box, and if I were to use them, it wouldn't be as easy to interpret the coefficients or see which variables are truly significant versus which are just being used for prediction. Because the fits of least squares and the Lasso regression, which I'll get into a little later, were so strong, it's actually a good sign that I can use them for better interpretability. As we can see, the most accurate model is this multiple linear regression with Lasso regularization. Lasso regularization is a statistical technique that regularizes the model and selects features to minimize multicollinearity. Multicollinearity is correlation between predictors, and it can negatively affect the model. Using this technique, we're able to take the full linear regression model shown here and remove some of the variables to reduce the multicollinearity and maybe better satisfy some of the linear model assumptions. With that information, I took the variables selected by the Lasso regularization and created a multiple linear regression using those variables. As you can see here, the Actual by Predicted Plot looks pretty similar; it might be slightly closer to the line of best fit, which is a good sign. We can see our effect summaries, but we're going to scroll a little past that to get to the summary of fit. Compared to the simple linear regression, the R-square and the adjusted R-square are very similar, and that's probably because points per game is such a heavily influential factor in this model, as it is in the simple linear regression. So the fit is not too much different.
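Before comparing the refit to the benchmark, here is a rough analogue of the Lasso selection step itself using scikit-learn's cross-validated Lasso, which shrinks coefficients and zeroes some out, leaving a reduced list to refit by ordinary least squares. This is a conceptual sketch, not JMP's Generalized Regression implementation; it continues the hypothetical game_df from the previous sketch.

```python
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Continuing from the previous sketch: game_df is the hypothetical game-level table.
X_full = game_df.drop(columns=["points"]).select_dtypes("number")
y = game_df["points"]

# Standardize, then pick the penalty by 10-fold cross-validation.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=1)).fit(X_full, y)

coefs = pd.Series(lasso.named_steps["lassocv"].coef_, index=X_full.columns)
selected = coefs[coefs != 0].index.tolist()   # variables to carry into an OLS refit
print(selected)
```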
However, adding these other variables shown here does improve the root mean square error a little: it went from about 5.3 to a little over 5.2, which is not a drastic improvement, but an improvement nonetheless. The real difference is in the AIC. Comparing these models in terms of predictability, the AIC dropped significantly from the simple linear regression to this multiple linear regression with Lasso regularization, which is definitely a good sign. We see the parameter estimates down here, and of course the cross-validation results for R-square and root mean square error. Looking at some of these parameter estimates: points per game is again very significant, at the .05 alpha level and well below, and that's to be expected, as we've already discussed. Turnover percentage is also significant, and it has a negative relationship with points. As one would expect, if you're turning the ball over more, that's fewer opportunities to shoot the ball and fewer opportunities to score, so that checks out with our knowledge of basketball. Here we have the defensive pace, the pace of the defensive team, which has a slightly positive relationship with points conditional on the other factors in the model. Again, that's to be expected, because the faster a team plays, the more chances there generally are, though it might not have as strong an effect as turnovers or points per game. And finally, defensive rebounding percentage: this is just how often the defensive team pulls in the defensive rebound. It prevents second-chance points for the offense, so it should have a negative relationship, which it does. All these things check out, and then some of these other variables are conditionally insignificant but included because they improve the predictability of the model. Something like usage rate might not be significant at the .05 level, but it nonetheless improves our predictions. Now, points per game might not be as reliable at the beginning of the season as it is near the end, because of the smaller sample size, so I looked to create an alternate model without it, to see if I could still predict better than the simple linear regression and similarly to the Lasso regression, but without depending on a season points-per-game measure. I created this alternate model using backward selection by AIC. Again, I left out the season points per game, but it still includes the other per-game measurements, like field goals attempted from three and from two, which along with some of the other variables can serve as a proxy for average points per game. So it's not completely robust against early-season fluctuations and small sample sizes, but it's possible these variables are a little more representative of how a player is going to perform in the long run. I'm thinking players might be a little more consistent in their attempt stats, the rate at which they're shooting the ball, than in the number of points they score, which could vary with a small sample size. These could vary as well, but I'm just using this model as an alternative, and it turns out that it actually predicts pretty similarly to the model with points per game in it. So it might not be favored if points per game is available and appropriate, but it provides a similar prediction.
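Backward selection by AIC can be sketched as a simple greedy loop: at each step, drop the predictor whose removal lowers the AIC most, and stop when no removal helps. This is a conceptual sketch with statsmodels rather than the JMP procedure, and it assumes the same hypothetical game_df, with the season points-per-game column (hypothetical name pts_per_game) left out as in the talk.

```python
import statsmodels.api as sm

def backward_aic(X, y):
    # Greedy backward elimination: drop the column whose removal most lowers the AIC,
    # and stop when no single removal improves on the current best AIC.
    cols = list(X.columns)
    best_aic = sm.OLS(y, sm.add_constant(X[cols])).fit().aic
    while len(cols) > 1:
        trial_aics = {c: sm.OLS(y, sm.add_constant(X[[k for k in cols if k != c]])).fit().aic
                      for c in cols}
        drop, aic = min(trial_aics.items(), key=lambda kv: kv[1])
        if aic >= best_aic:
            break
        best_aic, cols = aic, [k for k in cols if k != drop]
    return cols, best_aic

# Leave out the season points-per-game column, as in the alternate model above.
kept, aic = backward_aic(X_full.drop(columns=["pts_per_game"]), y)
print(round(aic, 1), kept)
```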
We can see that the R-square is pretty similar, and may even have increased a little, and the root mean square error is again similar. The AIC suggests that the other model is better for prediction given the number of variables included, so that is definitely something to note, but each model has its advantages. With this one in particular, we can see some of the conditional relationships of the other variables, particularly those on the defensive side, that we couldn't see as much in the other model because points per game was dominating so much. We see here that two-pointers and three-pointers attempted weigh in significantly for points, which makes sense: the more shots you're taking, the more likely you are to score more points. Then there are some other variables that make sense, and some that are maybe a little confusing at first sight. Offensive win shares being a significant variable makes sense; it measures how much a player is contributing to wins on the offensive side, so it makes sense that it has a positive relationship, as we can see down here. Defensive rebounding percentage down here decreases the estimate again, which is consistent with what we saw in the previous model. Average personal fouls per game down here has a negative effect, which I thought was interesting. A player or team with a higher foul total negatively affects the points scored by the offensive player. This might mean they're more aggressive: I would assume they're playing more intense defense, limiting points through steals, blocks, or heavily contested shots, and as a result they're getting more fouls called on them. However, this is not all good for the defense, because a player fouls out at six fouls, so there's a certain balance to strike on the defensive end: you don't want to be so aggressive that you give up easier points or lead your players into foul trouble. So that's an interesting variable to consider, and of course this is a conditional significance, so it might change slightly with the removal or addition of other variables. Finally, a couple of the defensive relationships are particularly confusing at first glance, specifically those involving blocks and defender height. We think of basketball as being very dependent on height: if you're taller, you're more likely to make the NBA; I think a seven-footer has something like a 20 percent chance of making the NBA just by being seven feet. So you would think that if you block more shots and you're taller, you're going to affect the offensive player's points negatively. However, these relationships are actually conditionally positive, which is very interesting: average blocks per game, defender height, and block rate all have positive relationships. Initially I was confused by this, but I think it says more about the players these defenders are guarding than about the variables themselves. What I mean is, when you consider that taller players with better blocking stats are big men playing power forward or center, they're guarding other big men, and then it makes a little more sense.
Guards tend to put up more points in the NBA with the emphasis on three-point shooting now, and a lot of offenses are run through these smaller players, who tend to be guarded by other smaller players, whereas the taller players with more blocks are guarding big men who often aren't the focal point of the offense, aside from certain players like Jokić and Embiid and Giannis. That produces a positive relationship, but it's really more a function of the position these players are playing. So I thought that was an interesting conditional relationship to highlight within the model; it involves a little deeper thinking about the relationship between blocking, height, and the points an offensive player puts up. In terms of the models overall, we can see that this one and the Lasso-regularized model are very similar in their predictions. We'll see that in more detail soon when I flip to the 2021 predictions, so the choice between them might not be too significant; both have their advantages. This one allows us to see the significance of more of the variables, specifically the defensive variables, whereas the first one has a slightly lower AIC and might be better for prediction if the data is available. Both leave a little to be desired in terms of predicting much more reliably than the simple linear regression. I would have liked to see the root mean square error decrease more, and that's something I would look into as I continue this project: gathering better data, trying to make the matchups more individualized without sacrificing the normality of the response variable, things like that. But with that said, these are the models I have now, and we can test them on the 2021 season so far. If I switch over here to the matchups for 2021, we have the same data table, just with this year's matchups. I'm on Josh Hart's matchups right now, because I go to Villanova and he's a Villanova great. Here we have him as a player, his team, the defensive teams, his stats, and their stats; as we've already seen, these are all the variables that could be included in the model. Apologies for the quick panning, but here we have the predictions at the end. These prediction columns are the model predictions. The first one is the optimal multiple linear regression, the one using the Lasso regression and including the points-per-game variable. As you can see, in this game he's predicted to put up 12.8 points, and the residual here is about five. The residual is the actual points minus the prediction, so with a prediction around 13 and a residual around five, he actually put up about 18 points instead of our predicted 13. That's obviously not a great prediction. Then we can look at the alternative model: its prediction is very similar, and we'll see this in greater detail as I flip back to the PowerPoint and show you a condensed version of this data table, because I know that in JMP it's a little overwhelming right now, with so many variables and so many numbers being thrown at you. So I'm going to switch back to the PowerPoint to show you some of the predictions for both Josh Hart and Kevin Durant. All right, now that we have our two models and the simple linear regression to compare them to, we can apply these predictions to games Kevin Durant has played this season.
Here are four games played against Atlanta, Charlotte, Chicago, and Cleveland. In the first one, he scored 31 points; both of our models predicted close to 26 points, so the residual is around five. However, we see that when Kevin Durant is close to his average of around 28 or 29 points per game, the predictions are very close, because points per game weighs so heavily in these models. For example, in this Cleveland game, our first model predicted 26.2 points and he actually scored 27, so the residual is less than one, and similarly the residual is less than one for the alternative model as well. So the models excel when players perform close to their average, which they will do most of the time. But there's obviously variation: in this game against Charlotte he had a particularly good game, scoring 38 points, a great game by any standard but particularly good by Kevin Durant standards, and so the model predictions are much farther off. We see the same thing with Josh Hart. His points average is a bit lower, because he carries a lighter load; Kevin Durant is a star player on the Nets, so he's getting a lot of touches and taking a good portion of the shots. Here we have Hart's production: his points average is closer to the 12-to-14 range, so the residuals here are very small. When he puts up 14, our models predicted around 13.5, and the residuals are around 0.5. So we have really good predictions when he scores close to an average amount of points, but when he scores much less or much more, the predictions tend to suffer a little. With all that said, I know I ran through these models quickly, so let's take a step back and look at the limitations in detail and some further study I could pursue. First of all, these models use averages by offensive player and defensive team, as opposed to offensive player and individual defender. As I've mentioned, the latter would be the ideal scenario, but it didn't work out that way. That could cause some problems, because it's duplicating some results: instead of a one-to-one prediction (if a player is guarded by this defender, this is how many points he will score), it's, if a player is guarded by some combination of players on this team, this is how many points he will score. So it's easier to predict, but maybe a little less precise. Additionally, the averages are for the entire season, which means predictions toward the beginning of the season may be less accurate, as I mentioned, which is something the career values try to remedy a little. It could be worth adding variables like points per game with a lag of up to five seasons, that is, the average points per game from last season, two seasons ago, three seasons ago, and so on, to try to capture a trend in player performance. Additionally, player performance depends on countless other factors, such as cold or hot streaks, how well a player has been performing lately, injuries on his own team, injuries on the opposing team that could increase or decrease his role, and minor injuries he is dealing with himself that could decrease or increase his role.
Then there are things like rest days, travel time, and a lot of other intangibles: NBA players are human, and we all know some days we're not feeling our best and other days we're feeling great and more energetic. Those types of things can lead to better or worse performance. These intangibles aren't something I can factor into the model, but it's important to recognize that they can still affect the points in a given game. In terms of further study, I would love to address these limitations and look to predict a more holistic variable such as offensive rating. Offensive rating measures a player's points contributed per 100 possessions, rather than just one aspect of the game in points. I would love to predict something like that, or flip it to the defensive side and predict a defensive rating based on who a player is likely to go up against on offense. Something like that would be really cool, and it would extend the application more toward coaches and team analysts instead of fantasy basketball players who are looking for a single measurement like points. With all that being said, I'm definitely going to continue working on this project. It was a lot of fun, and I love looking at these models and interpreting what's going on from a basketball standpoint. If you have any questions, please feel free to put them in the comments on my community post; I'll be happy to answer them. And if you have any suggestions for further study, I would be happy to take those on as well. Thank you so much for your time.
When you collect data from measurements over time or other dimensions, you might want to focus on the shape of the data. Examples can be dissolution profiles of drug tablets or distributions of measurements from sensors. Functional data analysis and regression-based models are alternative options for analyzing such data. Regression models can be nonlinear or multivariate or both. This presentation compares various approaches, emphasizing pros and cons and also offering the option to combine them. The underlying framework supporting this work is information quality, which permits us to consider the level of information quality provided by the two approaches and the possible advantages in combining them. The presentation combines case studies and a JMP demo.     Hi, I'm Ron Kenett. This is a joint talk with Chris Gotwalt on functional data analysis and nonlinear regression models. In order to examine the options and what we get out of each type of analysis, we will take an information quality perspective. In a sense, this is a follow-up to a talk we gave last year at the same Discovery Summit. I will start with simple examples to introduce FDA and nonlinear regression, and then Chris will cover a substantially more complex example of optimization, which includes a mixture experiment designed to match a reference profile. The story starts with data on tablets that are dissolved, with measurements taken at different time points: five, ten, 15, and 20 minutes, then ten minutes later at 30 minutes, and then 15 minutes later at 45 minutes. We have 12 tablets that are our product and 12 tablets that are the reference, and our goal is to have a product that matches the reference. With this type of data we have a profile, and we consider two options, FDA and NLR. In Chris's example, we will also talk about something called F2, which is a third option for analyzing this type of data. Here's what it looks like in Graph Builder: on the left we have the reference profiles, on the right we have the test tablets. This is an example from my book on modern industrial statistics with Shelley Zacks, which is now in its third edition. On the left you can see there is a tablet that seems a bit different; it's labeled T5R. And if we run a functional data analysis of this data, T5R does look different. We see that the growth part is different: it has a slow but consistent growth, and it does not have the shape we see in the other dissolution curves. This was done with a quadratic B-spline with one knot; the quadratic was, in this case, fitting the data better than the cubic, which is a bit unusual, but because of the shapes the quadratic B-spline was the better fit. If we look at T1R, the first tablet, it has yet another shape: it shoots up and then stays flat, so basically the tablet has dissolved, and beyond a high level of dissolution there is not much left to dissolve. So T1R and T5R seem different, with T5R standing out more than T1R. And yes, T5R does stand out in the cluster analysis on the functional principal components. The scatter plot of the first two functional principal components points to what we observed visually, and T1R, which is next to T2R, sits in a different cluster.
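A rough analogue of this functional analysis outside JMP is to smooth each tablet's dissolution profile with a quadratic spline and run an ordinary PCA on the smoothed curves; outlying tablets such as T5R then show up as extreme scores. This is only a conceptual sketch, not the Functional Data Explorer's method, and the data file is a hypothetical 12-by-6 matrix of dissolution percentages at the six time points.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.decomposition import PCA

times = np.array([5, 10, 15, 20, 30, 45])                 # measurement times in minutes
profiles = np.loadtxt("dissolution_reference.csv",        # hypothetical 12 x 6 matrix,
                      delimiter=",")                      # one row per reference tablet
grid = np.linspace(times.min(), times.max(), 50)

# Smooth each tablet's curve with a quadratic spline, then evaluate on a common grid.
smooth = np.vstack([UnivariateSpline(times, row, k=2, s=1.0)(grid) for row in profiles])

# PCA on the smoothed curves plays the role of the functional principal components.
scores = PCA(n_components=2).fit_transform(smooth)
print(scores)    # tablets like T5R appear far from the main cluster of scores
```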
We can also proceed with a nonlinear regression approach. Here we are fitting a three-parameter Gompertz model, with the asymptote, the growth rate, and the inflection point as parameters. This is the model, and when we fit the profiles, we again see that T5R stands out, so we have the same qualitative impression that we had with FDA. Now we have these three parameters listed, and because we now have a model, we can run a profiler on it. This is where T1R stands, and by running the profiler on the different tablets, we can also see how similar or different they look. This is the table that maps out the parameter estimates: T1R has a growth rate of 0.21; T5R, the tablet that stood out, has a growth rate of 0.075, very slow but consistent, and an inflection point of 11.5, way to the right. So we can see the difference through these parameter values. We can also pick out two tablets that stand out for growth rate: T2R at 1.77, and T8R with almost no growth. We'll get back to T2 and T8 in a minute. We can take the principal components of this three-parameter space, treating the parameters as if they were the measurements, and run a multivariate control chart. We can see T1, the first one, and T5, the blue one, the fifth one, which we already saw; they are within the control limits of the T-square multivariate statistical distance control chart. And T2 and T8, which I highlighted before, now stand out, and we can see qualitatively why. This is the model-dependent approach in the guidance documents that is used for modeling dissolution curves. In running such an analysis from an information quality perspective, the first question to ask is: what is the goal of the analysis? Then we can consider the method of analysis; here we are using nonlinear regression and functional data analysis, and Chris will get into how this is combined with data derived from an experimental design. We have a utility function, and the information quality is the utility of applying a method f to data X, conditioned on the goal. It is evaluated along eight dimensions, and Chris will talk about two of them, data resolution and data structure. So Chris, the floor is yours. Thanks, Ron. Now I'm going to give an example that is a bit more complicated than the first one. In Ron's example, he was comparing the dissolution curves of test tablets to those from a set of reference tablets. In that situation, the expectation is that the curves should generally follow the same path, and he showed how to find anomalous curves that deviate from the rest of the population. In this second example, we also have a reference dissolution curve, but we are analyzing data from a designed experiment where the goal is to find a formulation, in two polymer additives and the amount of compression force used in the tablet production process, that leads to a close match to the reference dissolution curve. The graph you see here shows the data from the reference curve that we want to match. To do this, I'm going to demonstrate three analyses of the data that use different methods and models to find factor settings that will best match the reference curve. In the first analysis, I'm going to summarize each of the DoE curves down to a single metric called F2, a measure of agreement with the reference that is typically used in dissolution curve analysis.
There, I'll use standard DoE methods to model that F2 response and then find the factor settings that are predicted to best agree with the reference. In the second analysis, I'll use a functional DoE modeling approach, where I model the curves using B-splines, extract functional principal component scores, and model those scores. I'll load the reference batch as a target function in the Functional Data Explorer platform and then use the FDoE profiler to find the closest match recommended by that model. These first two approaches use little subject-matter information about these types of tablets. In the third analysis, I'll model the curves using a nonlinear model that is known to fit this type of tablet well and use the Curve DoE option in the Fit Curve platform to model the relationship between the DoE factors and the shape of the curve. I want to credit Clay Barker for adding this capability to JMP Pro 16. I think it has a lot of promise for modeling curves whose general shape can be assumed in advance to come from one of the supported nonlinear models. At the end, verification batches were made using the recommended formulation settings for each of the three analyses, and we compared them to a new reference batch. What we found was that the nonlinear regression-based approach led to the closest match to the reference. What we see here is a scatterplot matrix of the four factors in the designed experiment. There was a mixture constraint between the two polymers, as well as a constraint on the total amount of polymer and the proportions of the individual polymers. Here's a look at some of the raw data from the experiment. At the top of the table, we have data from the reference that we wish to match. There are 16 DoE formulations, or batches, in the experiment; we can only see data from two of them in this picture. There were six tablets per formulation and four dissolution measurements per tablet. Here we see plots of the dissolution curves for each of the 16 DoE formulations, with the dissolution curve of the reference batch here at the lower right. Now I'm going to do a quick preliminary information quality assessment using the questions that you'll find in the spreadsheet you can download from the JMP user community page. The first part of the assessment relates to data resolution, and in this case I think we're looking pretty good. The data scale is well aligned with the stated goal because it's a designed experiment, the measuring devices seem to be reliable and precise, and the data analysis is definitely going to be suitable for the data aggregation level; we'll be illustrating different kinds of data aggregation as we extract features from these dissolution curves. As far as data structure goes, we're in pretty good shape: the data is certainly aligned with the stated goal, we don't have any problems with outliers or missing values, and the analysis methods are all suitable for the data structure, although we do see some variation in the quality of the results depending on the type of analysis. As far as data integration goes, this is a pretty simple analysis: we have multiple responses, and we're exploring different ways of combining them into extracted features. There's a common workflow to all three of the analyses I'm going to show. First, we have to get the data into a form that is analyzable by the platform we're using. Then there's a round of feature extraction. Then we model those features.
That's where there's a lot of difference between the methods, and then we use the profiler in different ways to find a formulation that closely matches the reference. First, I'm going to go over the F2 analysis. F2 is a standard measure of agreement of a dissolution curve relative to a reference dissolution curve. In the formula, the Rs are the means of the reference curve at each time point, and the Ts are the means of the non-reference curve. The convention is to say that the two curves are equivalent when F2 is greater than or equal to 50. It's important to point out that I'm including this F2-based analysis not just as an example of a dissolution DoE analysis, but more broadly as an example of how reducing a response that is inherently a curve down to a single number leads to a much lower quality analysis and results than a procedure that treats curves as first-class citizens. So now I'm going to share the F2 analysis of the dissolution DoE data. The first thing we have to do is calculate the batch means of the dissolution curves at the different time points. Then we create a formula column that calculates the F2 agreement statistic for each of these curves relative to the reference batch, and we model F2 using the DoE factors as inputs and use the profiler to find the factor settings that match the reference. Before the analysis, we use the Tables Summary feature to calculate the means of the dissolution measurements by batch and across each of the time points. We can save ourselves a little work by using all of the DoE factors as grouping variables here, so they'll be carried through into the subsequent table. Now we have a 17-row data set, and we hide and exclude the reference batch. Take note of the values of the dissolution means for the reference, because we're going to use those when we create a formula column that calculates the F2 agreement metric for each of the batches relative to the reference batch. Now we can use this F2 formula column as a response to be modeled. We use the model script created by the DoE platform to set up our model for us; we place F2 as our response variable, and we're going to analyze this data using the Generalized Regression platform in JMP Pro. When we get into the platform, we see that it has automatically done a standard least squares analysis, because it found that there were enough degrees of freedom in the data to do so, and it has given us an AICc of 155.6. I'm going to see if we can do better by trying a best subsets reduction of the model, and when we do that, we see that the AICc of the best subsets fit goes down to 136. Smaller is better with the AICc, and a difference of 20 is pretty substantial, so I would conclude that the normal best subsets fit is a better model than the standard least squares one. I'm going to try one more thing, though, and fit a LogNormal distribution with best subsets to the data. When I do that, the AICc goes down a little further, to 130.6. That's a modest difference, but it's good enough that I'm going to conclude we'll work with the LogNormal, especially because we know we're working with a strictly positive response, and the LogNormal distribution fits data that is strictly positive. From there the analysis is pretty straightforward, so I'm going to jump straight ahead to using the profiler. F2 is an agreement metric that we want to maximize.
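For completeness, the F2 similarity factor described above is commonly written as follows, with R_t and T_t the mean reference and test dissolution values at time point t and n the number of time points:

$$ F_2 = 50 \cdot \log_{10}\!\left\{ \left[ 1 + \frac{1}{n} \sum_{t=1}^{n} \left( R_t - T_t \right)^2 \right]^{-1/2} \times 100 \right\} $$

Values of 50 or more are conventionally read as the two curves being equivalent, which is the criterion mentioned above.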
So we get into the profiler, turn on desirability functions set to maximize, and then maximize desirability to find the combination of factor settings that this model says gives us the closest match to the reference; that would be at the combination of factor settings we see here. Now the F2 analysis is complete, and we move on to the second analysis, the functional DoE analysis. For this analysis, we work with the data in a stacked format, where all of the dissolution measurements have been combined into a single column and we have a time column as well. The first thing we do is go into the Functional Data Explorer platform. In the platform launch, we put dissolution as our response, time as our X, the batch column as our ID, and we supply the four DoE factors as supplementary variables. Once we're in the platform, we take a look at the data using the initial data plot. This particular data set doesn't need any of the cleanup or alignment options, but we are going to load the reference dissolution curve as a target function. For relatively simple functions like these, I typically use B-splines for my functional model. When we do that, we see our B-spline model fit, and the initial fit that comes up is a cubic model that is behaving poorly: it interpolates the data points well, but does crazy things in between them. So I'm going to change from the default recommended model to a quadratic spline model instead of the cubic one. We do that by simply clicking on Quadratic over here on the right of the B-spline model fit, and we see that this quadratic model fits the data well. A functional principal components analysis is automatically calculated, and we see that the Functional Data Explorer platform has found three functional principal components. The leading one is very dominant, explaining 97.9 percent of the functional variation, and it looks like a level shift up or down kind of shape component. The second one looks like a rate component, and the third one almost looks like a quadratic. Looking a little closer at this quadratic B-spline model fit, we see that it is fitting the individual dissolution curves pretty well. So now we're ready to do our functional DoE analysis. Each of our individual dissolution curves has been approximated by an underlying mean function common to all the batches, plus a batch-dependent FPC score times the first eigenfunction, plus another batch-dependent FPC score times the second eigenfunction, and so on with the third. What we're going to do is set up individual DoE models for each of these functional principal component scores as responses, using our DoE factors as inputs. The Functional Data Explorer platform, of course, makes all of this simple and ties it up in a bow for us, and when I say it ties it up in a bow, what I really mean is the FDoE profiler. This pane here shows our predicted trajectory of dissolution as a function of time, and then we can see how that trajectory would change by altering the DoE factors. That relationship with the DoE factors comes from these three generalized regression models, one for each of our functional principal component scores. If we want, we can open those up and look at the relationship between the DoE factors and that functional principal component score, and we could even alter the model by moving to other ones along the solution path.
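Conceptually, the functional DoE step fits one small regression per FPC score on the DoE factors and then rebuilds a predicted curve as the mean function plus the predicted scores times the eigenfunctions. The sketch below illustrates that idea only; it is not the Functional Data Explorer implementation, and the input arrays (factors, scores, mean_curve, eigenfuncs) are assumed to exist with the shapes noted in the comments.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed inputs (hypothetical): 'factors' is an (n_batches x 4) array of DoE settings,
# 'scores' is an (n_batches x 3) array of FPC scores, 'mean_curve' and 'eigenfuncs'
# are the mean function (n_grid,) and eigenfunctions (n_components x n_grid) on a grid.
models = [LinearRegression().fit(factors, scores[:, j]) for j in range(scores.shape[1])]

def predict_curve(x_new):
    # x_new: a single row of candidate DoE factor settings, shape (1, 4).
    pred_scores = np.array([m.predict(x_new)[0] for m in models])
    return mean_curve + eigenfuncs.T @ pred_scores

def distance_to_target(x_new, target_curve):
    # Mean squared distance to the reference curve; minimizing this over the factor
    # space mimics what the FDoE profiler does with the integrated target distance.
    return float(np.mean((predict_curve(x_new) - target_curve) ** 2))
```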
I just want to point out that it's possible to change the DoE model for an FPC score; in the interest of time, I'm going to move on and not demonstrate that. We have diagnostic plots, the most important probably being the Actual by Predicted plot. This has the observed dissolution measurements on the Y axis and the predicted dissolution values from the functional DoE model on the X axis. As always, we want the points to lie tight along the 45-degree line, and in this case the model looks pretty good. We don't want to see any patterns in the residuals, and I'm not seeing any bad ones here, so this model looks good and we're going to work with it. I've already explained how this pane represents the predicted dissolution curve as a function of time and the individual DoE factors. The other two rows appear because we've loaded the reference as a target function: this row is the difference of the predicted dissolution curve from the target reference curve, and the bottom pane is the integrated distance of the predicted curve from the target. When we maximize desirability in this profiler, it gives us the combination of factor settings that minimizes this integrated distance from the target. So I do that by bringing up Maximize Desirability, and now we see the results of the functional DoE analysis: 0.725 polymer A, 0.275 polymer B, a total polymer of about 0.17, and a compression force of about 1700 minimize the distance between our predicted curve and the reference. Now we've done two analyses. Both recommend going to the lowest setting of polymer A and the highest setting of polymer B; they differ in their recommendations for the total polymer amount and the compression force. The third analysis is the Curve DoE analysis. This is structured similarly to the functional DoE analysis in that we use the same version of the data, with the dissolution measurements all in one column and a time column. But we don't yet have a built-in target-function option in the Fit Curve platform, so the first thing we have to do is fit just the reference batch and save its prediction formula back to the table. Then we do a Curve DoE analysis, which is largely similar to a functional DoE analysis in that we're extracting features from the curves by modeling the curves, and then we go to the profiler under the Graph menu to find settings that best match the reference. The nonlinear model we're using is a three-parameter Weibull growth curve, which has a long history in the analysis of dissolution curves. Weibull growth curves have an asymptote parameter that represents the value as time goes to infinity; there is an inflection point parameter, which I think of as a scaling factor that stretches out or squeezes in the entire curve; and there is a growth rate parameter that dictates the shape of the curve. What I think is really valuable about this model, relative to the functional DoE model or the F2-type analysis, is that we're modeling features extracted from the data that have real scientific meaning, especially the asymptote and inflection point parameters. Now, the Curve DoE analysis doesn't have a target-matching capability like the Functional Data Explorer, so we begin the analysis by excluding all of the DoE rows in the data table.
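One common parameterization of the three-parameter Weibull growth model (my notation here; the platform's exact labels and form may differ slightly) is:

$$
f(t) \;=\; a\left(1 - e^{-(t/b)^{c}}\right)
$$

where a is the asymptote (the dissolution value as time goes to infinity), b acts as the inflection-point or scale parameter that stretches or compresses the curve in time, and c is the growth rate governing its shape.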
Those are the rows where the set column equals A, so I select a cell there, select matching cells, and then hide and exclude those rows so that only the reference batch remains unexcluded. Then I go to the Fit Curve platform, load it up, fit the Weibull growth model, and save that prediction formula back to the table. Once we complete the Curve DoE analysis, we'll compare the Curve DoE prediction formula to this reference predictor to find combinations of factor settings that get us as close to this curve as possible. So now we unhide and unexclude the DoE batches and go back into the Fit Curve platform. Just like in the Functional Data Explorer platform, we load the DoE factors as supplementary variables. Now that we're in the platform, we can fit our Weibull growth model. The initial fit looks pretty good; it looks like we're capturing the shape of the dissolution curves. One thing I like to do next is make a parameter table. This creates a data table with the fitted nonlinear regression parameters. I like to look at these in the Distribution platform to see whether there are any outliers or anything unusual, and I also like to look at the patterns in the Multivariate platform; it just gives you a better sense of what's going on with the nonlinear model fit. Once everything looks good, we can do our Curve DoE analysis, and this looks very much like the functional DoE analysis from before: we have a profiler that shows the relationship between dissolution and time and how that relationship changes as a function of the DoE factors, and we also have a Generalized Regression model for each of the three parameters that we can examine individually. The first thing I would do before using the model in any way is look at the Actual by Predicted plot, which is what we see here. These are the predicted values incorporating both the nonlinear model in time and the DoE models on the nonlinear regression parameters, and this plot looks pretty good. Because there is a fairly straightforward interpretation for the Weibull growth model parameters, it can be useful and interesting to open up the individual model fits for those parameters. For example, here are the coefficients for the inflection point model. Because the inflection point is a strictly positive quantity, a LogNormal best subsets model has been fit to the data by the Generalized Regression platform. We see that the mixture main effects have been forced in, and that the compression force by polymer A interaction is the only other term in the model. What this means is that if we hold the polymer proportions constant and increase the compression force, we would expect a larger value of the inflection point. One would observe this as a tablet that takes longer to dissolve, which is exactly what we would expect to happen. We can save the Curve DoE prediction formula back to the table and see, in all its gory detail, how the models for asymptote, inflection point, and growth rate are combined with time to produce the overall prediction of the dissolution curve from the DoE factors. Fortunately, with JMP we don't have to look at the formula too closely, because we have profilers that let us see the relationships visually rather than algebraically.
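One way to set up the comparison described next is a simple percent-difference formula column over the two saved prediction formulas. Here is a minimal JSL sketch; the column names are hypothetical placeholders, not the author's actual script:

```jsl
// Sketch: percent difference of the Curve DoE prediction from the
// reference batch's saved prediction, added as a new formula column.
// "Curve DoE Prediction" and "Reference Prediction" are assumed names.
dt = Current Data Table();
dt << New Column( "Pct Diff from Reference",
	Numeric, "Continuous",
	Formula(
		100 * (:Name( "Curve DoE Prediction" ) - :Name( "Reference Prediction" ))
			/ :Name( "Reference Prediction" )
	)
);
```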
To solve our problem of finding the combination of factors that gives us the dissolution curve closest to the reference, I created a formula column that calculates the percentage difference of the predicted curve, taking the DoE factors into consideration, from the reference. The last step of the analysis is to bring this percent-difference response up in the Profiler under the Graph menu, being sure to check the Expand Intermediate Formulas option. This leads to a profiler where we can see the percent difference from the reference as a function of time and the DoE factors; I've shaded the region where the difference is less than one percent in green. By manually adjusting the factors, I was able to find settings where the predicted curve is less than one percent from the reference across all time values. This looks really good, but in practice I suspect it is overly optimistic. Here we see the optimal factor settings from all three analyses. The Curve DoE analysis lands in the interior of the range for the polymers, its optimal value for total polymer is 0.16, which is close to the functional DoE result, and its compression force is in between the optimal values recommended by the F2 analysis and the functional DoE analysis. After this, we made new formulations based on the recommended factor settings from each of these models and measured their dissolution curves, and we also took a new set of measurements from the reference. Here we see a summary of the final results from the verification runs. The new reference dissolution curve is in black; the Curve DoE curve, in green, is the closest to it, followed by the FDoE curve in blue. The result of modeling F2 is in red, and it did the poorest overall. This should perhaps not be too surprising: the F2 approach was the simplest, reducing the data down to a single metric, and it did the poorest. The functional DoE model had to empirically derive the shapes of the curves and then model three features of those shapes, essentially using more of the information in the data. The Curve DoE led to the best formulation because it used the data most efficiently, via some prior knowledge about the parametric form of the dissolution curves. We also see that the result of the F2-based analysis is not equivalent to the new reference batch, while the approaches that treated curves as first-class objects are equivalent. What this means is that the F2 approach would have required at least another round of DoE runs, so an inefficient analysis leads to an inefficient use of time and resources. I'm going to close the presentation with a retrospective InfoQ assessment of the results. Overall, we found that the Curve DoE prediction generalized best to new data but was the most difficult analysis to perform. I want to note that if we didn't have a known nonlinear model that fit the data well, we could not have done that analysis; the functional DoE analysis and the F2-based approach can be used more broadly in other situations. The profiler leads to excellent communication scores for all three analyses, but the ability to see how the shape of the dissolution curve changes with the DoE factors in the functional and curve-based approaches leads me to give them better communication scores, with the Curve DoE approach highest by a little, because we're directly modeling more meaningful parameters than in the functional DoE approach. That's all we have for you today.
I want to thank you for your time, interest, and attention.
These days, nearly every type of process equipment is able to collect a lot of data and then make it available to export for further analysis. This is especially true in R&D labs, where the equipment replicates the manufacturing process on a smaller scale and makes the analysis of data more useful for understanding the process. However, many types of equipment use proprietary software, which can make it difficult to analyze the data coming from different systems. Exporting all the data into JMP data tables makes analysis easier, especially when it is in a familiar interface. It can also help link results from different process steps, driving us (hopefully) to discover unsuspected relationships. Over the last few years, we have imported data from several pieces of lab equipment into JMP, from the most automated solutions to the "do it yourself" ones. The result was always the same: better data exploration.       Welcome to my talk. I am Paolo. I work in a research and development laboratory in the pharmaceutical industry, at Menarini in Italy, and I will show you some of our work in [inaudible 00:00:16] with JMP. Nowadays, almost every piece of process equipment can collect a lot of data and make it available for export and further analysis. Especially in an R&D lab, where the equipment replicates the manufacturing process at a small scale, analyzing that data helps with process understanding. But each piece of equipment runs its own proprietary software, and data analysis can be uncomfortable on the onboard system, with its small screen or a touch screen that is not easy to use. Exporting all the data into a JMP data table makes analysis easier and more comfortable, and it can also help link results from different process steps, hopefully driving us to discover unexpected relationships between variables. We started using JMP with release 7, so we have a lot of examples to show you, and we start with the oldest one: bulk and tapped density. The bulk density of a powder is the ratio of its mass to its volume, including the contribution of the interparticulate void volume. The sample density is then increased by mechanical tapping. The interparticulate interactions that influence the bulking properties of the powder are also the interactions that interfere with powder flow, so a comparison of bulk and tapped density gives a measure of the flow properties. For this comparison we often use an index that describes the ability of the powder to flow: the Carr index, or compressibility index, calculated from the tapped and bulk densities, and there is a standard ranking of flowability related to the Carr index. We started by taking the volume readings by hand after 5, 10, 15 taps and so on, and recording them in a JMP data table. Later we tried a light sensor, like this one, that measures the distance of the powder surface from the top of the cylinder and stores the results in a CSV (comma-separated values) file. The results are essentially the same: whether the data come from automated or manual entry, they fit the Kawakita equation well. This is the Kawakita equation fitted with the Nonlinear platform in JMP. The equation describes how the powder settles during tapping and has three parameters. The first is the bulk density. The second is the Carr index, and here we see a value of 22, which indicates that this powder does not flow particularly well.
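For reference, the Carr (compressibility) index and a common textbook form of the Kawakita equation are shown below; the exact three-parameter version fitted in the talk is not shown, so the parameter mapping here is an assumption:

$$
CI = 100\,\frac{\rho_{tapped} - \rho_{bulk}}{\rho_{tapped}},
\qquad
C_N = \frac{V_0 - V_N}{V_0} = \frac{a\,b\,N}{1 + b\,N}
$$

where V_0 is the initial (bulk) volume, V_N the volume after N taps, a the maximum relative volume reduction (closely related to the Carr index), and b a constant related to how quickly the powder settles.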
The third parameter describes how quickly the powder settles during tapping. That is all for this first data-acquisition example. Now let's look at another instrument and talk about topical forms such as creams and gels. Rheological properties are important for topical dosage forms because viscosity influences production, but also the packaging and the use of a topical product; think of spreadability on the skin. So proper flow characterization is of fundamental importance during the development phases of a topical form. Nowadays, flow and viscosity [inaudible 00:05:17] at increasing or decreasing shear rate are easily obtained with automated equipment. Here is a picture of the 30-year-old rheometer that we had in our lab, which came with a dedicated computer system. Data could also be plotted manually on logarithmic paper, or, more simply, entered into a JMP data table. With Graph Builder we can reproduce the same output, the flow curve or the linear regression, using a logarithmic axis transformation. More importantly, using the Bivariate platform with a spline fit, we were able to estimate the shear stress as the shear rate approaches zero. This is the yield stress, or yield point: the maximum stress below which no flow occurs in the system, that is, the maximum stress below which the cream or gel does not move. This is important information when you plan volumetric filling of a fluid material, and it was not available from the old instrument, so JMP was very useful for understanding the behavior of our topical forms. Coming back to solid oral dosage forms, let's look at the single-station benchtop tablet press. This press is ideal for R&D because, very often, only small samples of active ingredient are available for the first tests. With it, we can independently control compression force and weight to meet the tablet requirements and specifications, and it works with the same tooling used on the manufacturing-scale press. We can plot tableting and formulation characteristics in order to eliminate or mitigate potential tableting deficiencies. On this model there is no automated data collection, so we simply enter the data in a file, preferably a JMP file. Here is the data table, and from it we get some important plots: the compressibility, compactibility, and tabletability plots, which describe the behavior of the formulation under compression. In this data table we also have equations that relate the compaction pressure applied to the formulation to the tablet characteristics. With a study covering 50 to 300 megapascals, we theoretically cover all the compaction pressures that can be applied to pharmaceutical tablets of every form and size. Now we close this one and move on to the moisture analyzer. The moisture analyzer is a balance that heats the sample with a halogen or infrared lamp and measures its moisture. The technique is also known as loss on drying, or LOD, because as the sample is heated it loses its moisture, and we can record the change in weight. It is important to know the residual moisture of samples, whether they are granules, powders, or other materials.
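As a reminder, loss on drying is usually expressed as a percentage of the initial sample mass (the standard definition, not specific to this instrument):

$$
LOD\,(\%) = 100\,\frac{m_{initial} - m_{dried}}{m_{initial}}
$$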
It can also be useful to see the rate of loss as a function of time or of temperature. The analyzer comes with software that collects the data in an XLS file, and importing that into JMP is very easy: we only need to set the row and column where the data start, click Next, and then Import. Here we have the data table with time and loss on drying. We have to tidy up the column names and so on, but here are the same data, cleaned. We repeated the measurement three times, so we have three replicates stacked in the same column, and with the Fit Y by X platform we get the function relating loss on drying to time. There is dedicated hardware and software, but I also tried to write a script to capture the data directly. It partially worked, but I never finished it, because the simpler and more effective way to collect the data is to import the XLS file. Roughly, though, you can do the same thing by opening a new data table, defining the columns you want, and having JMP wait for data from the COM3 port. The oral route of drug administration is the most convenient for the patient, so the tablet is the most popular solid oral dosage form, and that is why I talk so much about tablet presses. We saw a single-punch press, but in a manufacturing environment rotary tablet presses are used, and in the lab we also have a rotary tablet press with a large number of punches. Our press is equipped with strain gauges to measure the compaction force, the ejection force, and the force needed to detach the tablet from the punch surface. All these data are displayed and recorded by software and are very useful for monitoring and studying the tableting process. Normally the software displays the data and also uses them for real-time weight adjustment of the process, but our lab version also lets us analyze single events, get statistics, print and export reports, and so on. Here are some screenshots from the software. The raw data come as a text (txt) file, which is the reason to open it with JMP. Here is the txt file; I open it in JMP, select All Files, choose Data (Using Preview), and click Open. Here are the data coming from the software. The data start on the fourth line, so I correct that and click Next. This column is read as character, but we need numeric data. Column three is the same as column one, so I exclude it. This one is the compression force (sorry, the software is in Italian). This is time again, so I deselect it. This is the scraper force, which measures the force needed to detach the tablet from the punch surface. This is time again, deselected, and this is the ejection force. I click Import and get my data table. I have to correct a few things again, but nothing difficult: every column becomes numeric data. I have already prepared the JMP file. Here, using Graph Builder, I can show the process: this is the compression force for one tablet, this is the ejection force, and this is the scraper force. In the original software these are on two different pages; here I have all three variables on the same page, and nothing is lost. Now I have one tablet selected, but if I widen the X axis to the roughly 30 seconds that were recorded, I can see the whole process.
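A Graph Builder view like this one can also be produced by script. Here is a minimal JSL sketch with hypothetical column names; Position( 1 ) merges the extra force signals onto the first Y axis, and the exact element arguments may need adjusting:

```jsl
// Sketch: overlay the three force signals against time in Graph Builder.
// Column names are placeholders for the imported txt data.
Graph Builder(
	Variables(
		X( :Name( "Time" ) ),
		Y( :Name( "Compression Force" ) ),
		Y( :Name( "Scraper Force" ), Position( 1 ) ),
		Y( :Name( "Ejection Force" ), Position( 1 ) )
	),
	Elements( Line( X, Y( 1 ), Y( 2 ), Y( 3 ) ) )
);
```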
I also found in the software another report that records the peak values of each variable, and this can be useful too, so I import it into JMP. Here I have, for every punch, every station, and every rotation, the maximum compression, scraper, and ejection force. For example, here we changed the lubricant of the formulation, which decreases the ejection force, and we can see the effect of that change on the three variables; this is the ejection force, and the effect of the change is visible. Now I close this and we move on to NIR process monitoring. In the lab we have an NIR spectrophotometer with a small form factor; you can see it fits in the hand, and it has a Wi-Fi connection. Its main characteristic is flexibility: it can be installed on various pieces of equipment. The most common application is blend monitoring. A tumbling mixer for powder is generally a container rotating about its axis, like this; the most common are cube-shaped containers called bins. The NIR is mounted on the bin by a tri-clamp flange, and during the mixing operation the instrument collects a spectrum of the powder at each rotation. As mixing goes on and the system becomes more and more homogeneous, each spectrum becomes more and more similar to the previous one. To get the most from NIR data, it is essential to apply chemometrics, and of course we do that with the NIR software, but it is also possible to export the data to an XLS file and work in JMP. Importing the XLS file is very easy; we simply open it. Here is the raw data file: for every wavelength of the NIR spectrum we have the absorbance value at every rotation, and the whole process was 80 rotations of the bin. We can take a quick look at the spectra by selecting the wavelengths, putting them on the X axis, and choosing Parallel Merged; here we have the spectrum of each rotation. As I said, we normally apply chemometrics to NIR data to extract as much information as possible. A pretreatment commonly used in this type of analysis is the standard normal variate (SNV) pretreatment, which is a normalization of the spectra: each spectrum has its own mean subtracted and is divided by its own standard deviation. I wrote a script to do this pretreatment (a minimal sketch follows below). This is the raw data file; we run the script, select the wavelengths, and get the new, pretreated data. With Graph Builder we do the same Parallel Merged plot and can compare the raw spectra with the pretreated ones. We can see this better in a file where the spectra are colored by rotation, from red to green. The very first spectra are the red ones and the last spectra are the green ones, and the green ones are more and more similar to one another compared with the red and yellow spectra. We can see the same thing with a principal component analysis: this point is the first rotation, and so on, and the principal components become more and more alike as we approach the end of the process. Another way to see the end point of a process is the moving block standard deviation; here we see a plot from our NIR software, and I will show you more about it in another window.
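Here is the minimal SNV sketch referred to above. It is an illustration, not the author's actual script, and it assumes a table whose numeric columns are the spectral channels, with one row per rotation:

```jsl
// SNV pretreatment sketch: for each spectrum (row), subtract its own mean
// and divide by its own standard deviation. Table layout is an assumption.
dt = Current Data Table();
X = dt << Get As Matrix;          // rows = rotations, columns = wavelengths
snv = X;                          // matrix to hold the pretreated spectra
For( r = 1, r <= N Rows( X ), r++,
	row = X[r, 0];                // the full spectrum in row r
	snv[r, 0] = (row - Mean( row )) / Std Dev( row );
);
// snv now contains the standard-normal-variate spectra
```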
Now, the high shear mixer. Granulation is a really important step in pharmaceutical manufacturing. Granulation improves the physical characteristics of a powder mix, such as its flow properties and content uniformity. Granules can be used as they are in a delivery form such as a sachet or stick pack, or they can be pressed into tablets. High shear mixers are a key piece of equipment for wet granulation: the powder is granulated by a binding solution and by the shear force of a rotating impeller, and the wet granules are then dried in a separate step. In our high shear mixer, every process parameter can be controlled by software: the impeller speed, the chopper speed, the rate of addition of the binding solution. Moreover, the software records related variables such as product temperature and power consumption, and the data are stored in a CSV file, so it is very easy to import them into JMP. Here are the data imported into a JMP data table. The columns I colored yellow come from some calculations about the amount of water added and so on. The important point is that with Graph Builder we can see the whole process and its parameters; for example, we can see the torque measured by the software during the wet granulation and during the massing time, and how it changed. In this picture you can see that our NIR instrument was also fitted to this equipment to monitor the granulation process. Here are some results, again from the NIR software, where we gave each phase of the granulation a color. With principal component analysis we can see the start of the granulation; then, when we add water, there is a change in the physical properties of the granules, which become wetter and more agglomerated until the end of the process. Here we have the same data in a transposed matrix for better visualization in Graph Builder, and we can see the spectra of the different phases: the very first part of the granulation, when we are mixing the powder without adding water; the water addition; and the final massing time. We can see the peaks changing; here is the maximum absorption of water in the NIR spectrum, and here are the start, middle, and end points of the granulation. All these data can be summarized in a journal like this one, where we highlighted the variation of a peak depending on each step of the process. Next we have another two pieces of equipment, the fluid bed and the tablet coater, which are also very important in pharmaceutical manufacturing. We have a particular suite made of three units: a main control unit that is shared by the two processes, and two interchangeable units, one for each purpose. We start with the fluid bed. The fluid bed is another way to do wet granulation, because wet granulation is not done only in a high shear mixer; it can also be done in a fluid bed. Fluid bed technology means that the powder to be granulated is suspended and kept in motion by an upward flow of heated air; a binding solution is sprayed onto the suspended powder, and the air flow removes the solvent during the whole process. The onboard software gives us total control of the process parameters, every relevant variable is collected, and the reports are stored for future analysis and can be exported as a PDF file. Here is a PDF file, reduced for the purposes of this presentation, and I try to open it in JMP.
I know that on the first page there is no table I want to import, so I click Ignore Tables on This Page. On the next page, this small table is not of interest, so I ignore it, and I also ignore the very last one. Here there is a graph of the data, so again Ignore Tables on This Page. Here I have a preview of the table I am interested in; I click OK and I have my data in a JMP data table. There are a few things to fix, such as the data and modeling types, but that is not a problem. Now we move on to the coater system, which is used for tablet coating. Tablets can be coated for several reasons: the coating can have a specific function, for example delaying drug release; it may simply be needed to reduce dust during the packaging operation; or it can serve a cosmetic need, such as masking a bad taste. Whatever the reason, the coating is applied by spraying a coating suspension onto tablets rotating in a drum, while a flow of heated air removes the solvent during the process. Since the software is the same as for the fluid bed (they share the same control unit), the data and reports are the same as those we can get from the fluid bed. So here is a JMP data table obtained from the PDF file we saw before. Again using Graph Builder, we can get an overview of the process; here we see the spray rate, the product temperature, and so on. Here too we tried an NIR application: we simply put the instrument inside the pan during coating and took a measurement of the weight gain of the tablets at various time intervals. Here are the spectra and how they change during the process. The data shown here are pretreated with a first-derivative treatment of the raw data, which highlights the variation of the spectral peaks. Then we built a relationship between the sample spectra and the weight gain measured during the process, so for the next batches we will be able to predict the weight gain of the tablets simply by taking spectra during the process. Now I close this; that is enough on coating. Finally, we come back to topical dosage forms. The laboratory reactor that we have is useful for optimizing processes such as mixing, homogenizing, and dispersing at lab scale. The system can be adapted quickly and easily to a wide range of applications; our main use is to make topical forms such as gels and creams. It has an integrated scale and pH and temperature sensors. The onboard system allows process control and displays the process graphs, but it can also store every relevant process variable on a PC as an XLS file, so it is simple for us to import into JMP, and, always using Graph Builder, it is easy to see the whole process. For example, we can see the speed of the dispersing system, but also the torque or the viscosity trend of the gel during each phase. Again, we tried to use the NIR spectrophotometer to monitor the process. For this application we added a recirculation loop to the reactor, and the NIR was mounted on it with an appropriate flange. The data collected and processed with the NIR software can be imported into JMP, so here we have the spectra collected during the whole process and the principal component analysis. We gave a name to each phase, which we can see with this local data filter.
The first step is when we prepare an aqueous solution of the base; then we add the active ingredient; then the ethanol and the gelling agent; then we have the gelification of the system; and finally the finished product. Here we see 19 spectra that all sit at the same point of the principal component score plot. I spoke before about the moving block standard deviation. A moving block standard deviation is simply the standard deviation of a block of spectra; by comparing the standard deviation of the current block with the previous one, we can see how the variation of the system changes. As the system becomes more and more homogeneous, the moving block standard deviation becomes more and more stable. Here is the plot, and we see the same thing we saw with the principal component analysis: the five phases, with the aqueous solution, the second phase where the active ingredient is added, the additions of ethanol and of the gelling agent, the gelification, and finally the finished product, where the moving block standard deviation becomes very similar from block to block and very close to zero. Well, I think we have seen enough; we have looked at a lot of processes and a lot of equipment. Each of them, of course, has software specially designed to control the equipment and to collect and analyze process data. These programs cannot be replaced by another; they are needed to control and drive the equipment. But every system is standalone, so sometimes we cannot use the equipment for data analysis because it is busy with another project, or we need to merge data from different steps to get a more global overview of the product. We can do that easily with JMP: just import the file. I thank you, and goodbye.
In the journey of delivering new medicines to patients, the new molecular entity must demonstrate that it is safe, efficacious, and stable over prolonged periods of time under storage. Long-term stability studies are designed to gather data (potentially up to 60 months) to accurately predict molecule stability. Experimentally, long-term stability studies are time consuming and resource-intensive, affecting the timelines of new medicines progressing through the development pipeline. In the small molecule space, high-temperature accelerated stability studies have been designed to accurately predict long-term stability within a shorter time frame. This approach has gained popularity in the industry but its adoption within the large molecule space, such as monoclonal antibodies (mAbs), remains in its infancy. To enable scientists to design, plan, and execute accelerated stability models (ASM) employing design of experiments for biopharmaceutical products, a JMP add-in has been developed. It permits users to predict mAb stability in shorter experimental studies (in two to four weeks) using the prediction profiler and bootstrapping techniques based on improved kinetic models. At GSK, Biopharm Process Research (BPR) deploys ASM for its early development formulation and purification stability studies.     Hi, there. I'm Paul Taylor. I'm David Hilton. We'll be talking to you about predicting molecule stability in biopharmaceutical products using JMP, and we'll dive straight into it. One thing to mention is a little shout-out to the paper that's been published by Don Clancy of GSK, on which most of this work is based. Just to introduce what we do: we're part of Biopharm Process Research, based in Stevenage in the UK at our R&D headquarters. We're the bridge between discovery and the CMC phase (chemistry, manufacturing, and controls), which is the gateway into clinical trials and ultimately the release of the medicine. We have three main aspects. We look at the cells, working towards developing a commercial cell line; the molecule itself, by expressing developable, innovative molecules; and the process, in terms of de-risking the manufacturing, processing, and purification aspects for our manufacturing facility. The overview for today: why do we need to study the stability of antibody formulations, how do we assess product formulation stability, an overview of the ASM JMP add-in that we have, and the value of using such modeling approaches, with case studies. Just to reiterate about biotherapeutics: these are all drug molecules that are protein-based. The most common you'll probably know are the vaccines from COVID, which are based on antibodies and other proteins. These are very fragile molecules, and their stability can be influenced by a variety of factors during manufacture, transportation, and storage. Factors such as the temperature and the pH can cause degradation of the protein; the concentration of the protein itself can have a significant impact; the salt type and even the salt concentration matter; and exposure to light or a little bit of shear can have an effect. What that causes is aggregation and fragmentation of the protein, and it can cause changes in the charge profiles, which can then affect the binding and potency of the molecule. That can be driven by isomerization, oxidation, and a lot more.
They're fragile little things, but we also need to keep an eye on their stability to make sure they are safe and efficacious. One way of looking at stability is to subject the product to a number of elevated temperatures and take measurements at various time points. These long-term stability studies can run up to five years, taking up to 60 months, and give more real-time data, but the procedure is extremely resource intensive. At each time point we can use a variety of analytics, such as HPLC and mass spectrometry, essentially separating the impurities that could cause problems away from the main product and then quantifying them by mass spectrometry, light scattering, or simply UV profiles. We can separate by charge, size, or [inaudible 00:03:21]. A lot can happen in those five years while we're gathering that data: we could have improvements in the manufacturing or the formulation. Does that mean we have to repeat the five-year cycle to get the stability data? The short answer is no. The alternative is accelerated stability studies. These are shorter-term studies in which we apply more exaggerated, accelerated degradation temperatures over shorter time periods: instead of months and years, we can look at a matter of days, going from 7 to 14 to 28 days. This technique is commonly applied in the small molecule space but not so much in the large molecule, biopharmaceutical space, because small molecules mostly involve tablets and solid formulations; it is only starting to catch on in the biopharma industry with its mostly liquid formulations. In terms of the stability modeling, we fit our data using Arrhenius-based kinetic equations that can be either linear or exponential with respect to time. These are semi-empirical equations grounded in the physical behavior: an accelerating model suits the case where a nucleation point for aggregation causes exponential growth, while a decelerating model suits the case where a rate-limiting step causes the degradation to slow over time. All of these models are fitted and put through a fit quality assessment using the Bayesian information criterion, and we can also establish confidence intervals using bootstrapping techniques. This is where David Burnham at Pega Analytics comes in; he worked closely with Don Clancy on developing a JSL script, a JMP add-in, for us scientists to use in the lab. I'm going to give you a quick demonstration of the JMP add-in itself. During an ASM study you collect your data and put it into a JMP table. In this instance we're looking at size exclusion data, so we're looking at the monomer percentage, the aggregate, and the fragment. What we have is a JMP add-in that does the fitting for us. If I open it up, you go through a series of steps. You can select the type of product: in the small molecule space, dealing with solid tablet formulations, you could use a model based on aspirin, which is a generic approach, or a generic tablet model where the product is more novel and different from aspirin. But for us in biopharma, because these are all liquid formulations, we use the generic liquid option.
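As an aside, the Arrhenius relationship underlying the kinetic models mentioned above has the generic form shown below; the add-in's exact parameterization follows the Clancy paper and is not reproduced here, so treat this as an assumption:

$$
k(T) = A\,e^{-E_a/(RT)}, \qquad y(t) \approx y_0 + k(T)\,t \quad \text{or} \quad y(t) \approx y_0\,e^{k(T)\,t}
$$

where k is the degradation rate at absolute temperature T, E_a the activation energy, R the gas constant, and y the measured impurity, growing either linearly or exponentially with time.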
Now, looking inside the data, that data table I showed you earlier, you can also do a quick QC check. This is just a sanity check that everything matches up and is in order; if not, you can remove the table and replace it with a new one. The most laborious part of the add-in is matching up the columns. The monomer, aggregate, and fragment are the impurities we want to model, so we put those in the impurity columns, and we match up the other important variables such as time, temperature, pH, and batch, which can also matter. If you have a molecule with different lot releases and you want to see whether your lots are consistent, this is one tool you could use to check lot-to-lot variability. You can also enter specifications: say we have a target of no more than a 5 percent drop in monomer (we model 100 minus the monomer in this case), no more than 3 percent aggregate, and no more than 2 percent fragment. In the model options we can select all the models we want to fit and evaluate, and also select a generic temperature and pH to look at; you can change that later, so it is flexible, and there are different variants, such as temperature only, or temperature and pH. For fitting the data, you can either go for quick mode, a roughly two-minute fit that won't be as accurate, or the longer-running mode. To save you from watching the spinning wheel, I've already fit the data, so we can go straight into it. We fit all the models, and once it loads we have an overarching view of the prediction profilers for each model that has been fitted and evaluated. You can see that some have confidence intervals that are broader and some that are tighter, which can mean either that the model doesn't reflect the data very well or that it is overfitted. As scientists, we can then delve into selecting the candidate model, which is based on the Bayesian information criterion. You can look at those criteria and see which model is most appropriate, and you can use the drop-down to see how each model's fit compares with the actual values. Last but not least, when you select the preferred model, you can manually override which model you'd like to use. And here, in the prediction profiler, you can select the conditions you'd like and extrapolate: beyond the one-month study period, you can extrapolate all the way up to two years and see how the product fares in terms of stability. One last thing to add is the bootstrapping. If you want finer control of how the bootstrapping works and more accurate modeling of the confidence intervals, you can simply apply it to each of your impurities. There's the time to completion; it's done, and you can see that you can look into the results in great detail. Let's go back to the presentation. Our first case study looks at the stability of our formulations. Formulations play an important role in drugs in general: they not only help in biopharma by stabilizing the protein during storage and manufacture, but they can also aid drug delivery when the product is administered to a patient.
Formulations contain many components, called excipients. These are generally inactive components within the drug product, but they act as stabilizers; they include buffers, amino acids, stabilizers, surfactants, preservatives, metal ions, and salts. In formulation development you can screen many excipients to find the ideal formulation, and you can use design of experiments for that, but that's a different topic. One way to test and prove that your final formulation is fit for purpose is stability testing. In our case study we looked at three different pHs of formulation for this monoclonal antibody and stressed them at elevated temperatures. We ran our ASM study from time zero up to 28 days and analyzed the samples by size exclusion. Here you can see the same kind of snippet from the JMP add-in, showing how the models are fitted, and also the extrapolation from the prediction profiler to see how the monomer stability fares. When we look at the monomer and aggregate, we can take predictions from the prediction profiler at 5, 25, 35, 40, or other temperatures. Within that model we have an N1 value, which reflects how fast the degradation is in acidic or basic pH. What we found is a negative value, meaning faster degradation at acidic pH, so there was a higher risk at low pH rather than high. Our next case study, along similar lines, looks at a different monoclonal antibody, where we used an ASM stability study of up to 28 days. For that same molecule we had historical data with five years' worth of stability results. We took the data from both studies, put them into the JMP add-in, and compared them. Highlighted in green you can see the model prediction; at the bottom is the long-term, real-time study. In blue, you can see that the values compare quite well, and in red are the confidence intervals; they match up nicely, which is good. One of the downsides of ASM, because it is a short-term study, is that the fitted curves look quite linear, whereas the real-time data are a bit more curved and exponential. But in terms of getting a prediction back quickly, it is quite good, and that can support immediate formulation development work rather than waiting for long-term stability data. I'll pass over to David. Thanks a lot, Paul. Paul has given an example of how we can make long-term stability predictions based on a month's worth of data, where the intention is to design a formulation that hits a certain minimum threshold for stability. In this next case, we instead have a fixed formulation, and we're trying to use the technique to find out for what period of time the product stays within a defined threshold, and therefore how long we can hold the material. The material in question is generated during the manufacture of the biotherapeutic: essentially, you have different unit operations linked in series.
You complete one unit operation, and then, depending on shift patterns or the utilization of your facility, you may want to have holds in between different unit operations in order to regulate the timing of your process. One of the key things we need to know is that, if we're holding material between unit operations, what is the maximum period of time we can hold it for [inaudible 00:14:57]? The way we normally look at this is with a plot like the one on the lower right-hand side: we hold the material for a month in a small-scale study and make repeated analytical measurements of the product quality attributes of interest, to see how they change and whether they stay within tolerances. In this case, you can see the attribute plotted over time on the X axis; in all cases it falls within the red bands, so it meets the [inaudible 00:15:32] criteria we are after. What we looked to do in this study was, rather than testing exclusively at the standard conditions the material would be held at, which would be 5 degrees C, refrigerated, or approximately 25 degrees C, room temperature, to run parallel studies at 30, 35, and 40 degrees C, but only for a week, and then see whether that high-temperature data could be used to predict the low-temperature behavior. On this slide we have a few snapshots of the data we collected, visualized in Graph Builder. In the panel on the left-hand side you have the data collected at 5 degrees C, and the columns represent material coming from different unit operations. Each row corresponds to a particular analytic, for example a measurement of the concentration or of the level of [inaudible 00:16:38] species. On the right-hand side we have the equivalent data for the same unit operations and analytics, but at higher temperature. Just from a basic plot like this, the first thing we can see is that the general trends seem to be consistent. If we look at the purple plots, the first column, the first unit operation, shows a descending straight line, whereas the second and third unit operations show a slight increase. Qualitatively, it looks quite promising, in that an increase in temperature isn't changing the general trends we observe. For a more quantitative prediction, this is where we began to use the ASM add-in. What we did here was take the 30, 35, and 40 degrees C data and use that to train the model. In terms of the model fit quality, you can see from the predicted versus actual plot in the center that it appears to fit quite well, which is reassuring. If you look at the model fits in the table on the top left-hand side, the model with the lowest BIC score was the [inaudible 00:17:57] model. This study didn't have pH as a variable, which is why the BIC scores for those two models are the same: removing that parameter makes the linear kinetic model with and without the pH term essentially equivalent.
What we then did was use this high-temperature data to fit the kinetic model and determine what the kinetic parameters would be, the K1 and K2 in the kinetic equation shown on the bottom left of the slide. We then changed the temperature value to 5 and 25 degrees C and predicted what level of degradation we'd expect at those temperatures over a longer period of time. That's what is shown on the right-hand side: the red lines correspond to predictions from this equation based on the high-temperature data, overlaid on the experimental data at the lower temperatures to see how good the prediction is. In this case, you can see that the predictions appear to be quite good, which gives you some cause for comfort. Sometimes, however, we noticed that this wasn't the case. In this next example, again with the high temperatures, we were able to get good model fits, with a predicted versus actual plot that is quite strong. In this case, however, the model judged to be the best was the accelerating kinetic model, indicating that the reaction rate gets faster over time. When we applied the same procedure to this data set and tried to model what would happen at lower temperatures, we began to see that the prediction was a little erratic. In reality, the increase in the level of this particular impurity was fairly linear, while the model was beginning to overestimate it quite drastically at the later time points. One thing that's important to bear in mind is that you need to incorporate a level of subject matter knowledge when applying these kinds of techniques: you have to balance what is the best statistical model in terms of fit against what is most physically representative of the type of system you're dealing with. Another area where subject matter knowledge is important within this technique is the selection of the temperatures, and the temperature range, used in the study. There are two reasons for this, and they are competing forces. It's preferable to use as high a temperature as possible, because that means the reactions proceed at a faster rate. One of the issues we often encounter with this type of study is that at low temperatures, fortunately for us, we're often dealing with products that are quite stable; the inherent problem is that when you use quite short, narrow time series to measure these changes, you often end up caught in the noise, and your signal-to-noise ratio ends up being quite low. That's what is demonstrated in these plots. The plots on the left-hand side are the kinetic rate fits, or reduced plots. What you would typically expect with entirely temperature-driven behavior is a straight line, and if you look at the top left plot, you can see that that's the case: the first four blue points form a straight line, corresponding to 40 down to 25 degrees C, so across all of those temperatures we have entirely temperature-driven behavior. But then the 5 degrees C point on the right-hand side of that plot appears to be off.
But when we dove into the data to find out whether this was a genuine breakdown of the temperature dependence, and looked at the equivalent plot in the right-hand panel, we began to see that there is so much noise in the data that it is more of a fitting issue than a mismatch in the underlying behavior. If you look at the gradients of the red and blue lines, despite their intercepts being different, either one can fit that data, because there is simply too much noise to fully understand which one should be applied. In terms of general conclusions, I think what we've been able to demonstrate with this project is that JMP has a number of powerful built-in tools, and with some knowledge of the JMP scripting language, or someone who can do that for you, they can be compiled into a user-friendly package that makes quite complex analysis accessible to most users. It has also demonstrated that performing statistical fits to semi-empirical models brings a lot of tangible benefits: we're able to make predictions about the future that we were not able to make in the past, and potentially to significantly reduce our timelines for identifying liabilities with particular drug products. Finally, it demonstrates that in areas such as this you cannot rely exclusively on statistical models; you also have to incorporate your own subject matter knowledge to work out which statistical or kinetic model is most appropriate to the situation you have, and then which of those is the best fit. In terms of acknowledgements, a lot of this work has been based on an original paper that came out of GSK by Don Clancy, Neil, Rachel, Martin, and John, and it has been extended here to a biotherapeutic setting. We also thank George and Ana for supplying the data used in this project, and Ricky and Gary for the project endorsement [inaudible 00:24:20].
Resources in process development for fermentation processes are always limited, often because of time constraints or limited access to fermentation vessels. For this reason, optimal use of the data collected is of the utmost importance. This presentation shows how to maximise insights and understanding of the process by modelling data collected offline (like product titer and parameters of downstream processing) and online sensor data (like pO2) in the context of the Functional Data Explorer platform. The modelling approach is based on constant and functional factors and responses. Because of the limited design space, the new extrapolation control for the profiler plays an important role here.   Furthermore, this presentation highlights the advantages of using Graph Builder to lead a team and subject matter experts to a faster, easier and more efficient understanding of the produced data so that they may make better and more reliable decisions in the fields of bioprocess development.       How to make more from your online and offline fermentation data, and how to speed up your bioprocess development with statistical modeling. I am Benjamin Fürst from Clariant, and today I will show you how to do this in JMP. I will lead you through a hands-on presentation on how to combine different modeling techniques so that, in the end, you have a combined profiler where you can look at all the responses, online and offline, at one time and see what impact your process parameters have. I am a biochemical engineer by training. I work for Clariant in Group Biotechnology, and I am group leader of Bioprocess Design. The agenda for today is an introduction to Clariant and to the sunliquid process technology, and then I will dive into the topic of my talk, the statistical analysis of fermentation data. For modeling the offline responses, what I call standard statistical modeling, I will use the Fit Model platform and will focus especially on the extrapolation control, which came in with JMP Pro 16. The next level of analysis is modeling the online data; I will use the same data set and, for this, the Functional Data Explorer platform, also a JMP Pro feature. Let me share some key figures about Clariant. Clariant is a global leader in specialty chemicals. On the right you can see the Clariant business numbers: 3.9 billion Swiss francs in sales in 2020, with a 15% EBITDA margin and over 13,000 employees worldwide at 85 production sites. Clariant consists of three business units: Care Chemicals, Catalysis, and Natural Resources. Care Chemicals produces, for example, ingredients for the shower gels and shampoos you use in daily life. Natural Resources produces, for instance, bentonites; 40 percent of all vegetable oils are clarified with bentonites from Clariant. Catalysis covers all kinds of catalysts and also contains the business line Biofuels and Derivatives, which sells sunliquid. What is sunliquid? Sunliquid is a technology that came from the Group Bio R&D Center, where I work. Basically, it is a biotechnological process for producing bioethanol from non-edible biomass. How does it work? Just to stress this again, this is second-generation bioethanol production, so we use non-edible biomass feedstock, basically agricultural residues. This can be wheat straw, bagasse, corn stover, municipal waste, or forestry residues. We put that into the sunliquid process and first turn it into cellulosic sugars and then into cellulosic ethanol.
The nice thing about this process is that it can be used as a platform process for not only by bio-ethanol, but for other sustainable fuels or even bio-based chemicals. The development of this process started in the Group Biotechnology where I work. You can see if I'm not in home office due to corona, I'm working here at that corner of the R& D Center in Planegg. That's located just a few kilometers outside of Munich in Bavaria in Germany. The center was inaugurated in 2006 and we have over 100 scientists and technicians working there. The competence fields of Group Bio are biofuels and derivatives, sunliquid, industrial enzymes, and biobased chemicals. Within the Biotech R&D Center, we have all the expertise to develop new products and technologies under one roof, starting from small tiny microtiter plates over a shake flask to up to 100 liter bioreactors. Here in the picture, you have a small peek at our technical center fermentation. To the right here, you can see the sun liquid pre-commercial plant in Straubing, which is just one and a half hours from here. It was built in 2012 and can produce about 1000 tons of ethanol per year, and we were able to test the wide range of different biomass feedstocks there. Maybe some of you are aware that Clariant is commissioning a commercial- size bioethanol plant in Podari, Romania at this time. To say, as a biochemical engineer, seeing this plant, that is really, truly, impressive thing. The size of the buildings and the equipment there are, in biotechnological means, really tremendous. And I see this as a real flagship for biomass conversion here in Europe. And to my opinion, that is one of the major biotechnology projects we have in Europe at this time and it really makes me proud to see that we are turning our sun liquid technology from bench scale to production scale. The plant transforms a quarter of a million tons of wheat straw into 50,000 tons of bioethanol per year. And to give a more handful number, that means one huge bale of straw, about 500 kilos goes into that plant per minute. The sun liquid technology enables you to produce bioethanol in a very integrated way. Side products like lignin are reused in the CHP plant for energy supply.` So generating steam and power which goes back to the plant, of course. If you're curious to learn more about that technology, please look at the links or feel free to contact me or reach out in the contacts which are given on the home page. Now, why I'm showing you all of this? What do all the stages have to do with each other? At all stages of the development of our biotech process, you have data about the process. And it doesn't matter if it's a micro titer plate, a one- liter bioreactor, or a multi- cue production scale fermenter, you always have offline data like titers you measure, and you have online data coming from sensors, pressure, temperatures. And you always want to know: how can you achieve the most efficient process? How can you achieve the high yields? What the process influences in process parameters? And what sensor do you have to pay attention to to get the most of your process? All these points can be addressed with statistical modeling of your process, and JMP really gives you the opportunity to make more from your online and offline fermentation data and speed up your biotech process. Let's go. Since my talk is going to evolve about data coming from a fermentation process, I want to put everybody on the same page concerning how does a fermentation process look like. 
This is a basic overview of the fermentation process with DSP. I will focus on the main points here. So you come in with a test tube that's a few milliliters. You propagate your organisms in a so- called seed fermenter. And then you do, in the next scale, the fermentation, the main fermenter where the product is going to be produced and then you go to a downstream. The setup here really depends on what the specs of your product are. All along the process, you have different steps and you have factors, which are set for this step. On the top here, I show just some examples. So seed inoculum amount, how much biomass you put in here, pH, temperature setting of the fermenter, or even other humid operations, and the raw materials you put into the process: which type, at what concentration. And all of them here are constant, they have a fixed number. So I depicted them here with dots. And of course, you have responses. The responses are what the outputs of your process. This can be a harvest yield. So how much product you going to get in your fermenter or how good does your DSP perform in terms of some specific response you want to look at. Those are constant, again, because you just have a single point and you have online sensor data. For fermentation, you usually look at something like a dissolved oxygen called pO2, temperature, pressures. And the important point about that sensor data, it is functional. So you have your sensor data over duration. You have a curve of data. And I can really tell you if you're interested in knowing, how is everything connected there? How can I make the best of all of the data you have? And normally starts if you want to know from your fermentation parameters what is your response you're going to have, your offline response something like a yield product character. And maybe you want to know from a fermentation parameter how does a typical sensor curve look like so that you know how to go with your process in a good way. And last but not least, wouldn't it be great, here for the last point, knowing at one time point, online sensor has a critical impact on your harvest yield? Using standard software you will really not get far. You're probably going to lose your focus looking at all the points and JMP can really help you in getting those things done. In the following I want to show you, with a live demonstration then, how you can do all those things in JMP. There are several videos in the community really how to use the Fit Model platform or the FTE platform, but I'd like to combine these platforms to put all the results together. So standard statistical modeling with a Fit Model platform and with a focus on the extrapolation control. I want to use the fermentation parameters and model the harvest yield. That is where we're going to go for the first point. Normally the goal of doing this is, you want to know what process parameter give a high harvest yield. You want to find an optimum in the design space. Is that optimum stable? And what parameters have interactions? And what parameters are sensitive? And the way you do this, as I already told you, you're going to use the Fit Model platform. One important point here, I call it here detailed evaluation by and with subject matter experts. I believe that the real speed for process development comes from a mutual understanding among the experts you have. It doesn't matter if it's a fermentation engineer, data analyst, or a manager. Every expert must understand a little bit of the language of the other experts. 
So it's really crucial that the subject matters share a common language. And I believe using JMP enables you to speak that common language using the profiles JMP provides and the reports JMP has. Get back to the topic. This was called a real- life experiment, and the data was coming from a planned experiment and it was not intended for the usage of statistical modeling. That means the design space was not optimal. I put a plot of three parameters here. As you can see, of course, there was some kind of structured approach, but not in all dimensions, not in all parameters. You can see on this axis, we don't have so many points. So one of the challenges here was the limited design space in here, especially the extrapolation control. JMP Pro comes in very handy. Let's get into JMP. I'm going to load up JMP now. So that is my data table and of course already scripted the report. This is the profile for that model I chose. A t this point, I will not go into the details of the modeling. I assume you already looked at your plots. You've done, you chosen your models, the parameters for your model correctly. So how can you use this for your development? I think that's pretty obvious. You have here the harvest yield, which you're interested for your fermenter, and dependency of all of the parameters which I found significant for this model. And one way you can use this, I want to show you a very nice thing you can see with this model. If you look at the Seed Inoculum, that is basically the amount of biomass you put in the process. If you have a low biomass you put in the seed, it means you're not going to have so much biomass during your whole process. And you will see that this parameter alone doesn't have a huge impact on the harvest yield. But now if you see that moving, if you move that around, there is a strong interaction with component A in the main fermenter. So with less biomass in here you have, the more component A in the main fermenter you give in, the more products you will generate. You can basically say that's a limitation on component A that makes totally sense. Now the interesting thing is the interaction. If you going to put more biomass into the process, that turns around. So what happens here? I have to say you cannot see that here what happens there because you have a different limit. Now it gets to the point where it's interesting. You have to put more data into your model, for instance, online data. You're going to see that in a different sensor. That is something I will show you then if we go into the online data analysis. You can exactly see this behavior there. And just seeing this was quite obvious for me; an explanation of that was clear. But just the other day I talked to a monitoring and fermentation engineer and he just said, "Having that behavior for that other parameter and temperature, that's totally clear and we have to look at that." That's something that I talked about, that common language that I will be able, as data analyst, to understand some things but not to the same extent as the fermentation engineer does. So it's very crucial that everybody has access to this kind of looking at that. I wanted to talk about Extrapolation Control. Extrapolation Control comes with JMP Pro 16 and it can be found here. There are different criterion you can set. I set it here to the first one that basically is JMP to stay within the context of the data you have. If you want to learn more about that, there's a talk about control extrapolation by Laura Lancaster and Jeremy Ash. 
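For readers who want to see the shape of this modeling step outside JMP, here is a minimal sketch of a model with main effects and the kind of seed inoculum by component A interaction described above. The column names and numbers are invented for illustration; this is not the presenter's data or JMP's Fit Model output.

```python
# Minimal sketch: harvest yield vs. process factors, including the
# seed_inoculum x component_A interaction discussed in the talk.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 40
df = pd.DataFrame({
    "seed_inoculum": rng.uniform(0.5, 2.0, n),   # relative inoculum amount (invented)
    "component_A":   rng.uniform(10, 50, n),     # g/L in the main fermenter (invented)
    "temperature":   rng.uniform(28, 34, n),     # degrees C (invented)
})
# Synthetic response with a sign-changing interaction, purely for illustration
df["harvest_yield"] = (
    50 + 0.4 * df["component_A"]
    - 0.3 * df["seed_inoculum"] * df["component_A"]
    + rng.normal(0, 2, n)
)

# '*' expands to both main effects plus the interaction term
model = smf.ols("harvest_yield ~ seed_inoculum * component_A + temperature", data=df).fit()
print(model.summary())

# The interaction coefficient describes how the effect of component_A changes
# as seed_inoculum increases -- the 'turnaround' seen in the profiler.
print(model.params["seed_inoculum:component_A"])
```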
That talk and the related videos in the community explain in detail how you can use it and the statistical details behind it. As you have probably already noticed, when you move the parameters around, the traces here move in a way they didn't before, and that really gives you an advantage in using the model: you will not go outside of your design space, and you're not going to draw a wrong conclusion from your data. You can even see that JMP cuts off the traces where it thinks there is not enough data to extrapolate from. So for a limited design space, Extrapolation Control is a very neat way to go, and we're going to use it later. Next we need to save our prediction formula. I save it here: Save Columns, Prediction Formula, and then it turns up in the data table down here, because we're going to need it. Let me get back to my presentation. Just to give a short summary: the models shown revealed expected behavior and, most importantly, unexpected behavior that we can use for the next runs. The extrapolation control comes in very handy, especially with a limited design space, and we were able to find optimized parameter settings within the design space. So we identified potential parameter settings for a higher yield in the next runs. A nice side note on how to assess stability: I normally prefer to use the simulator here; you can simply add it to the profiler. I also like to use the contour plot, especially when you have multiple responses. So not only the titer, but imagine you also want to put some DSP performance into your model; then you put them all in a contour plot, limit the areas where you want the process to be, and you get the target region where you need to run your process to stay within all the specs. Especially here, because of the limited design space, you have to verify the results with additional experimental runs, because we were on the limit: the optima we chose with the profiler lie on the boundary of the model, so you have to make sure they are correct. In this case, we just used them to guide us in the right direction for the parameter settings. At this point, we have the profiler with the harvest yield as a response over the fermentation parameters. Now, I already hinted at it: wouldn't it be great to have all your online sensors in there too, so you can see other effects as well? Let's go there. The next step is online data analysis with the Functional Data Explorer platform. There you are able to use the same set of fermentation parameters, model the functional data, the online response data, and combine that with your model. To put it in a graphic: we have the fermentation parameters and a whole bunch of sensor data; I depicted all the batches here. From that data, we want to get to a profiler where the lower row is the model I just showed you and the other responses are there as well, so we can see the typical response for the set of parameters we've chosen. I drafted a short workflow of what we're going to do, and I'm going to show it in JMP, of course. We basically use what we did already, the Fit Model platform, and save the prediction formula. Then we use the FDE platform, fit splines, do the functional DOE, save that as well, and then put everything together in one place. Let's get back to JMP. Here is the data set; I normalized all the sensor data to a zero-to-one range.
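As an aside, that zero-to-one scaling is nothing JMP-specific. A minimal sketch of the idea, using an invented long-format table with two sensors, looks like this:

```python
# Sketch of zero-to-one scaling of online sensor columns (invented data and layout).
import pandas as pd

# Long format: one row per batch, time point and sensor reading
online = pd.DataFrame({
    "batch_id": ["B1", "B1", "B1", "B2", "B2", "B2"],
    "hours":    [0, 12, 24, 0, 12, 24],
    "pO2":      [95.0, 40.0, 20.0, 98.0, 55.0, 35.0],
    "CO2":      [0.1, 1.8, 2.5, 0.1, 1.2, 2.0],
})

def scale01(s: pd.Series) -> pd.Series:
    """Scale one sensor column to the 0-1 range (per-batch scaling is an alternative)."""
    return (s - s.min()) / (s.max() - s.min())

online[["pO2", "CO2"]] = online[["pO2", "CO2"]].apply(scale01)
print(online)
```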
I want to show you an example for the pO2, that is, dissolved oxygen. This is an important parameter in fermentation because the microorganism in this case needs oxygen, so having sufficient oxygen at all times is important. Essentially, if you want to get people to use JMP, I always like to stress Graph Builder, because it's just the easiest way to put data together and make nice graphs. Using other software, this would probably have taken me much longer than the 30 seconds it took here, and it's a very nice way to get an overview. Where do we want to go? We want to have another response here, the pO2. We want to collect all the information we have across all those batches and see what the typical curve looks like for the set of fermentation parameters we chose. The Functional Data Explorer platform is here under Specialized Modeling. You need your outputs, which are functional; I'm just going to take two here, pO2 and CO2. On the X, you put the duration, and you have to tell JMP how to distinguish between the batches, so you put your Batch ID here. Now, it's really important to add as supplementary variables everything you will need later: the fermentation parameters, or even other things like the harvest yield. Anything you want to do analysis on has to go in as supplementary at this point; if you forget something here, you have to redo a lot of work later. Then you get the first report of the Functional Data Explorer. JMP puts everything together here in one graph, which is not so nice, but here you can see all the data by Batch ID. I just want to point out that there are nice clean-up tools here; have a look at them and use them. You don't have to clean up your data beforehand, and you can even fill in your data here very easily. Now that we have the data, we have to build a functional model of it, and JMP does that with splines. There are different kinds of splines you can use. I'm first going to go to the simplest one, the B-spline. That is pretty fast, and you can see what it does. First, you see here in red the splines that were fitted over the actual data. Second, it does a functional PCA down here. PCA is basically about reducing dimensionality: JMP produces a set of eigenfunctions and a set of FPCs, functional principal components. If you multiply each FPC with its corresponding eigenfunction and add these up, you end up with a functional model, so you can use that to model each fermenter batch individually. The FPCs are the individual characteristics of a batch, and the eigenfunctions are valid for all of them; multiplying them gives you each individual curve. It's important to understand that the FPCs are the characteristics of the functional data you have. Now, looking into this, you basically see that this spline doesn't capture the data very well. I know that the ups and downs here are important, and this fit doesn't suit me. So I'm going to remove this fit and try something else. There are other models: a P-spline is a penalized B-spline model, and that will do the job here. I just want to point out that there are other ways as well. A direct functional PCA does it without first fitting a basis function; if you have a huge data set, that works faster. In my case, I already know from my data set that the P-spline is what I'm going to need. You can already see that this takes some time.
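While that fit runs, here is a rough conceptual sketch, outside JMP, of what the spline-plus-FPCA step is doing: smooth each batch curve, then run a PCA across batches to obtain eigenfunctions and per-batch FPC scores. All data below are simulated, and JMP's actual algorithms differ in detail; this only illustrates the idea.

```python
# Conceptual sketch of functional PCA on batch sensor curves (simulated data).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
t = np.linspace(0, 48, 97)                       # common time grid, hours
n_batches = 8

curves = []
for i in range(n_batches):
    shift = rng.uniform(-4, 4)
    raw = 100 / (1 + np.exp(-(t - 24 - shift) / 4)) + rng.normal(0, 2, t.size)
    smooth = UnivariateSpline(t, raw, s=len(t) * 4)(t)   # spline smoothing per batch
    curves.append(smooth)
X = np.vstack(curves)                            # rows: batches, columns: time points

mean_curve = X.mean(axis=0)
Xc = X - mean_curve                              # center on the mean curve
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenfunctions = Vt                              # rows: eigenfunctions over time
fpc_scores = U * S                               # one row of FPC scores per batch

# Reconstruct batch 0 from the mean curve plus its first two FPCs
recon = mean_curve + fpc_scores[0, :2] @ eigenfunctions[:2]
print(np.round(fpc_scores[:, :2], 2))            # batch-level characteristics
print(np.round(recon[:3], 1))                    # start of the reconstructed curve
```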
So think about maybe some data reduction that you can get faster through that process if it's important. But be aware that if you reduce your data, you will be losing information about your process and maybe those will be the important ones. So maybe just take your time and just wait for some minutes, grab a coffee. So you can see here already that those lines do fit very good. So this is something I was concerned with to use. Of course, you can look at the diagnostic plots and see if everything suits you well. I will focus on that where I wanted to go originally. Keep in mind, where did you want to go? Wanted to add a pO2 curve here depending on the fermentation parameters. JMP is going to give the option to do a functional DOE analysis. That is exactly where we want to go. Basically in the background, it does a generalized progression with a two- degree factorial model. And the estimation method, it usually depends on the amount of parameters you have. It's either best subset or I think it's a forward selection. Let me make that smaller. Don't need this. But we do need this. That's where we wanted to go. So exactly what we need. We have the pO2 , the online response, dependency of our parameters. Just as a side point, if you don't see this modelization fitting, you have to do that maybe more by using the fit model platform. But if it's fine for you, you can go with that. Again, you have to save your prediction formula as you want to put that together. Be aware that I'm now in different data tables than I used before. So you can hit save prediction formula, click, and then you 're going to end up with your prediction formula here in that table. And my original model was in another data type. So you have to just combine them whatever you like, however you like. I just basically copied the formula and that was it. Then, let me show you if you want to put them together. I put them together in my original table. I have them both here now: the prediction formula of the harvest yield and the pO2 of the prediction formula. And how to put that together? Very easy. Use the profiler, put the function, and very important, tick the expand immediate formulas because you have that Eigen function there, so you need to expand the immediate formulas. Click okay. I, again, prepared a strip because the original evaluation, that's not so colorful. So I prefer coloring at some point. So I just put those two in here. So this is where we wanted to go. We have our harvest here and we have our pO2, and we have all our parameters in here for the dependencies. Now I promised you that we're going to see more. That is the seed inoculum in the component A we looked before, if you remember, and we saw that for low biomass we have a positive impact of component A in the main fermenter. And having more biomass, that behavior turns and now we are able to see what's happening. You see here that is the typical pO2 response to that set of parameters. And already going down here, you can see what happens. I'm going to go here a little bit more down. The pO2 goes down. So in the fermenter, we don't have a limit. Now of component A, we have a limit of oxygen. That is just the reason why that is not good for the process. Now with modeling your online data and your offline data, you can go one step deeper to understand what exactly causes all your behavior. That is a really great example how you can approach the understanding, especially the speed- up of your process development. Going back to my presentation. 
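The functional DOE step just shown can be sketched in the same conceptual spirit: fit one small regression per FPC score against the process factors, then rebuild a predicted sensor curve for a chosen factor setting. The stand-in arrays below take the place of the FDE outputs, and plain least squares is used for brevity, whereas the talk describes generalized regression with a two-degree factorial model in JMP.

```python
# Conceptual sketch of the functional DOE idea (not JMP's implementation).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_batches, n_time, n_fpc = 8, 97, 2
t = np.linspace(0, 48, n_time)

# Stand-ins for the FDE outputs (in practice these come from the spline/FPCA step)
mean_curve = 100 / (1 + np.exp(-(t - 24) / 4))
eigenfunctions = np.vstack([np.sin(np.pi * t / 48), np.cos(np.pi * t / 48)])
fpc_scores = rng.normal(0, 1, size=(n_batches, n_fpc))

factors = rng.uniform(0, 1, size=(n_batches, 3))  # invented process factors per batch

# One regression per FPC: score_k ~ process factors
score_models = [LinearRegression().fit(factors, fpc_scores[:, k]) for k in range(n_fpc)]

# Predicted curve for a new setting = mean curve + sum(predicted score_k * eigenfunction_k)
new_setting = np.array([[0.5, 0.8, 0.2]])
pred_scores = np.array([m.predict(new_setting)[0] for m in score_models])
pred_curve = mean_curve + pred_scores @ eigenfunctions
print(np.round(pred_curve[:5], 1))
```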
Basically, that is just what I showed you, just with more parameters. I added more online responses and also some DSP responses in here. You can put that at whatever extent you like. But be aware, here is no extrapolation control. There are different models behind each response. So JMP cannot put that together into extrapolation control. So here you might limit your factors over the factor settings of the profiler that you stay within in your design space. You can use JMP Standard to look at the profiles. You just have to do the analysis with JMP Pro, the FDE analysis. Okay, one more thing. Wouldn't it be great now to know at which time points an online sensor has a critical influence on your yield? So that you basically have the sensor as an input parameter and your yield on this side. So you can exactly say, "Okay, at this point, it's very critical to have my pO2 up or down to have a good harvest yield." Let's go. Next level. Now, I want to use the online data to model the harvest yield. I wanted to, before we start, put that graphically again and on this side, we have all the pO2. The graph I already showed you before. That's the pO2 of all the batches. And you can see that in the end here, we have this one's going up; the pO2 sensor here, the pO2 is very low; and here it's somewhere in the lower region of the sensor. And in each of these cases, the harvest yield has different behavior. So somehow the individual curves at this time point have influence on the harvest yield. That is exactly what I want to model so you can say which sensor profile really leads to a good yield. This is where I want to go. So I can have all the responses here and my harvest yield of course depicted and modelized over, this case, the FPCs. Remember that I told that the FPCs are the individual characteristic of the response. That is exactly what you do. So I'm going to show that to you in JMP as well. We can start working through that workflow and basically, it's the same but you just use the data in a different way. You use your online data, the summaries of the FPCs to model the generalized regression. Then you put everything again together, save the to prediction formulas in one data table, and there you go. Okay, where did we leave? So this is where we stopped the last time. We had our pO2 over the fermentation parameters. We already have the model of pO2 over the FPCs. Now what we need is the individual function summaries. So we need to modelize the FPCs we have for that response with the harvest yield because that is what I want to know. How do the F PCs contribute to the harvest yield? Now you can set here what you want to export. Basically, tick all of this. You need to save formulas because you want to have the formula of the pO 2 dependent of the FPC, and you want to have all the FPCs. Now you have to extract that data because you want to modelize that, and either save data here or if you have modelize more than one response, you can do that up here in save data or save the summaries. Then you get everything neatly arranged in one data table. Click that and then you will end up with a table like this. So you basically have all your FPCs in here and the formulas you will need here are there. Now we have to model the FPCs to get dependency to the harvest yield. You just go to the fit model platform, then you will choose what you want to model and that is your... It's supplemented. That is why I hinted that you really have to be aware of what you supplement in the very first step of the FTE. 
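Conceptually, the modeling step described here, using the per-batch FPC scores of an online sensor as predictors for an offline response such as harvest yield, looks like the following sketch. The values are simulated; in the talk this is done with Generalized Regression in JMP Pro rather than the plain least-squares fit shown here.

```python
# Sketch: offline response (harvest yield) modeled on FPC scores of an online sensor.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
n_batches = 20
fpc_scores = rng.normal(0, 1, size=(n_batches, 2))            # FPC1, FPC2 per batch
harvest_yield = (80 + 5 * fpc_scores[:, 0] - 3 * fpc_scores[:, 1] ** 2
                 + rng.normal(0, 1, n_batches))

# Response-surface style terms: main effects, squares and the interaction
rsm = PolynomialFeatures(degree=2, include_bias=False)
X = rsm.fit_transform(fpc_scores)

model = LinearRegression().fit(X, harvest_yield)
print(dict(zip(rsm.get_feature_names_out(["FPC1", "FPC2"]), model.coef_.round(2))))
```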
Now we're going to need the harvest yield here. It has to be supplemented. And then, depending on how you want to approach it, I normally take the FPCs. I use the response surface model. That might be sufficient for you. If you think you need another model, you're free to use whatever modelization you seem necessary to. In this case, I've put them individually. Sometimes, even maybe just modeling the mean would be enough. The mean here, maybe just modeling the mean is enough. I know that in my case it isn't. By the way, at this point, I would like to say thank you to Imanuel Julio, JMP engineer, who helped me out in going through that procedure as well. Thanks, Imanuel, for giving me a heads up for FTE. We need a generalized regression here. Hit run. And then, that dialog opens where you can choose different estimation methods. Basically, you could just try them. Then, the model is going to be done. The nice thing is that up here you're going to have a model comparison. So if you put in more than one, you will directly see the comparison of them. So how they behave and which has the best information criterion. Then, you have to choose whatever you want to go with. Be aware that something like best subset can take some computational time. So maybe go to the see how far that should go. You chose, then, your model. You do, then, save the prediction formula of that estimation method you chose. You hit that and then JMP will put that here. Then you have everything together that you need, this time in one place. You're going to take the profiler, put in your prediction formula for the harvest yield, and of course, of your online responses as well, add them here depending on how many you have modeled. Expand immediate formulas, as well, and click okay. I, again, scripted that, having it a little bit nicer. And then this comes up. You have your prediction formulas for pO2, your online responses and your harvest yield; dependency of the FPCs down here. I will just show you a short... If you turn on the FPC 1, you can see what happens. So if you have a lower harvest yield, then it shows you that here, you are basically on a rather low level. We already learned that low levels of oxygen are not so good for fermentation process. So that can be seen here as well. And if you have high yield you have no more harvest, more tighter, more product than your fermenter, you're going to see basically, this goes up. That's good. Same behavior here. And you can already see here that here goes up. So it's positive for your process that the pO2 goes up in the end. Now this is the point where, at least, I have to go to the subject matter expert and to give him that data because he is the one who understands that this level of analysis may be not good for a manager. But here you can really go into detail with all the subject matter expert on the, let's say, fermentation level. And having this option, this really speeds up your process, understanding very much. And you can see the impact with the plots JMP gives you very easily. And sharing that profiler with a process engineer gives you a real head start in what is important during the course of the process, and gives you really the opportunity to save a lot of time in finding that out. And if you do that by chance, find it out by chance, that' s going to take so much longer, so don't do that. I'm heading towards the end of my talk. 
For the statistical analysis of fermentation data, JMP really gives you the power to explore and visualize these complex processes very easily. You can deepen your process understanding: which process parameters are important, which interact, and which do I have to look at, at which time points of the process. And with the profilers and the different setups, JMP gives you the possibility to speak that one mutual language across all levels, from technician to manager, so that really everybody can make more from the online and offline fermentation data and speed up your biotech process development. Thanks for listening. I'm Benjamin Fürst from Clariant; feel free to comment and ask questions through the beta channels.
After nearly two years, our experience during the COVID-19 pandemic has made us all experts in knowing the risk factors for developing a severe case. For example: male, advanced age, obese, hypertensive -- right? Well, it depends. When we analyzed data from the subgroup of the most endangered patients -- those who were already hospitalized and in critical condition -- we discovered some surprising differences with respect to the common risk factors. With a binary response (recovered/dead), fitting a logistic regression model seemed to be a reasonable approach. Due to the high dimensionality of the data, we used a penalized (Lasso) regression to select the most relevant risk factors. In our talk, we briefly introduce penalized regression techniques in JMP and present our results for critically ill COVID-19 patients.     Hello everyone. Thanks for tuning in to my talk. I'm David Meintrup, Professor at Ingolstadt University of Applied Sciences. Today, I will talk about A Lasso Regression to Detect Risk Factors for Fatal Outcomes in Critically Ill COVID-19 Patients. Over the last two years, something started to increasingly bother me, and it is not what you probably think now, the pandemic, or at least not only. The topic is connected to this. This is me giving a talk on deep learning and artificial intelligence at the Discovery Summit 2019 in the wonderful city of Copenhagen. Since then, it looks like AI has become the universal tool for everything, so let me give you an example. In May 2021, the Director-General of the World Health Organization said the following: "One of the lessons of COVID-19 is that the world needs a significant leap forward in data analysis. This requires harnessing the potential of advanced technologies such as artificial intelligence." To me, this feels a bit like: forget about the scientific method, defining the goal, specifying the tools, stating the hypothesis, et cetera. Just drop the magic words artificial intelligence and you are on the good side. So therefore, I decided to give this talk in the form of a dialogue between an AI enthusiast on the left side and a statistician on the right side. I'm going to talk a bit more specifically about statistical models, artificial intelligence, and penalized regression. Then, in the second part of the talk, I'm actually going to present the case study about the critically ill COVID-19 patients. So let's get started. Here's the first question from our AI enthusiast: in the era of artificial intelligence and deep learning, who needs statistical regression models? Here's my short answer, which I borrowed from Juan Lavista, Vice President at Microsoft: "When we raise money, it's AI, when we hire, it's machine learning, and when we do the work, it's logistic regression." I love this tweet because, in my opinion at least, it condenses a lot of truth into a very short statement. AI has become a universal marketing tool, but for the real problems, we still use traditional advanced statistical methods. A slightly longer answer could be the following. If I look at the typical tasks of engineers and scientists, they include innovate, understand, improve, and predict. Deep learning and artificial intelligence is mainly a prediction tool; for everything else, we still need advanced statistical methods like traditional machine learning, statistical modeling, and design of experiments. Okay, but you have to admit that there are very successful applications of AI and deep learning.
Well, there's absolutely no doubt about that. For example, predicting the next move in a game like chess and Go. The deep- learning algorithms do this way better than any human being. Or I would like to introduce my favorite artificial intelligence application, which is solving the protein folding problem. The protein folding problem has famously been introduced 1972 by the Nobel Prize winner, Christian Anfinsen, who said in his acceptance speech that a protein's amino acid sequence should fully determine its 3D structure. And over the last 50 years, this problem has basically been unsolved. And there was very little progress until DeepMind by Google developed an AI- based algorithm called AlphaF old that to a very large extent solved the protein folding problem. And you see two examples of this on the right side. This is very impressive and beautiful work. And I included it here because I wanted to clarify that this is the perfect deep- learning AI problem. We have a vast amount of data, we have a combinatorial explosion of options, and the result we are looking for is a prediction, the actual 3D structure of the protein. So for predictions, we should always use AI. Not at all. For example, in the dataset that I will present about the critically ill COVID-19 patients, we want the model to predict if the patient will survive or will die. But the pure prediction doesn't really help. If you know someone is going to die, what you want is you want to treat, you want to prevent, you want to know the risk factors, you want to be able to act and not just simply predicting death or survival. So what we need is a really interpretable model that will hint these things that we need like treatment, prevention, and risk factors. Another way of looking at it, let's have a look at typical data- driven modeling strategies. What is very typically done in deep- learning AI environment is that you take all available data, you throw it in the deep- learning AI algorithm. And it might predict very well the outcome that you're looking for, but you're getting a fully non-interpretable model. What's an alternative? An alternative is to already in the data collection process think carefully what data do you need with advanced statistical methods. Then apply a statistical model, and as a result, you get a fully interpretable model. Okay, says our AI enthusiast, But you are missing an important point here. Statistical models might be nice for small data sets, but for big data, they can't be used, right? Well, no. For large data sets, there are several intelligent ways to reduce the dimensionality before you start with the fitting process of the model. And I would like to introduce, at least shortly, three of these intelligent ways to reduce the model dimensionality. Number one, redundancy analysis, something you might have heard about. If you have a large data set, the price you pay is typically that the factors are highly correlated and you can measure the amount of correlation within the set of factors by the value that is called variance inflation factor. And then you can actually eliminate the factors with the highest variance inflation factors. Why? Because they don't add additional information to the set of factors that you are already looking at. So this is one classic way of reducing the dimensionality of your data. If you have categorical data, for example, you look at an X- ray of the lung and you can see different symptoms. 
Then you can have variables that describe, this symptom was there, this symptom was there, or another symptom was there, X 1, X 2, X 3. Maybe for your analysis, it's enough to distinguish a normal- looking lung and a lung that has some symptoms somewhere. In other words, you convert a row only with zeros to a zero. And if there's at least one one, you change it to a one, and you create a new variable catching this information. Or to give you an alternative, you could sum X 1, X 2, and X 3, and count the number of symptoms that you see on an X- ray. This procedure is called scoring and is a very efficient way of reducing dimensionality. Principal component analysis has exactly the same spirit. It recombines continuous variables. It takes a linear combination of continuous variables with the idea of catching the variation in one newly created variable. I call these dimensionality reduction methods intelligent, because when you apply them, you already learn something about your data. And that's the whole purpose of statistics, isn't it? Learning things from your data. So let me summarize the advantages of statistical models. First, they can be used for all kinds of tasks, not only for predictions. Second, the model itself is useful and fully interpretable. And third, you can start in a large dataset with intelligent dimensionality reduction before you actually fit the model. I'm still not convinced. Can you give me an example of a statistical model that you applied to a large dataset? Okay, so let's introduce logistic Lasso regression. Lasso is an abbreviation for least absolute shrinkage and selection operator, and why it is called like that, I will explain in a few moments. Let's introduce this Lasso regression in four steps. Step number one is to remind ourselves of the logistic regression model. In a logistic regression model, we have a categorical, in the easiest case, two- level factor... Sorry, a categorical response with two levels, and we have a goal to model the occurrence probability of the event. This is typically done with this S- shaped function that corresponds to the probability of the event actually occurring. The functional term is given here to the left, but the good news is that with an easy transformation that is called logit transformation, you can turn the original values into log it values, and then the result is a simple linear regression on the logit values. So the bottom line is logistic regression, is simply linear regression on logit values. Step number two. This is a classic situation of a two- factor linear regression model. And how do we fit this to a data cloud? Well, we do this with the help of a loss function. For example, we take the sum of squared errors, and then we look for the minimum of this function. This is the very famous and standard ordinary square estimator that is the result of minimizing specifically this loss function given by the sum of squared errors. Thirdly, something that is maybe less known, I would like to introduce concept f rom mathematics the norm of a vector that is actually just a representation of the notion of the distance of a length of a vector. Let's look at three examples. The first one that you see here in the middle is the classic distance that you all know and use. This is the Euclidean distance. It's calculated by taking the squares of the coordinates and then taking the square root. The unit circle as you know represents all points that have a distance one, from the center. This is the classic Euclidean norm. 
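Written out for reference, using the two-factor notation of the slides, the pieces introduced so far are the following (the notation is added here; it is not shown in the transcript):

```latex
% Logistic model and logit transform: p is the occurrence probability of the event
p(x_1, x_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}},
\qquad
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2

% Ordinary least-squares loss function
L(\beta) = \sum_{i} \bigl(y_i - \hat{y}_i\bigr)^2

% Euclidean (L_2) norm, whose unit circle is the classic circle
\lVert x \rVert_2 = \sqrt{x_1^2 + x_2^2}
```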
We can simplify this calculation by simply taking the sum of absolute values. So instead of taking a square and taking a square root, we simply sum the absolute values. This is called the L_1 norm. And what you see here, this diamond is the representation of the unit circle of this L_1 norm. In other words, all the points here on this diamond have distance one , if you measure distance with the L _1 norm. Finally, the so- called maximum norm where you continue to simplify. You just take the larger value of the two absolute values, x1 and x2 . If you think about what the unit circle is in this case, it will actually turn into a square. This square is the unit circle for the maximum value. So in summary, we can measure distance in different ways in mathematics, and what you see here, the diamond, the actual circle or square are unit circles. So points with distance one, just measured with three different norms, three distances, three different notions of what a length is. Finally, let's combine everything we've done so far. So we start with the logistic regression model. We add the loss function. And now, instead of taking the ordinary square loss function, we add an additional term. And this term consists basically of the L_1 norm of the parameter. You see that we add the absolute values, Beta_1 and Beta_2? So this is the L_1 norm of the parameters that we add to the loss function. Of course, this is just one choice. You could also square these. Then, you would get what is called a Ridge regression if you take the square of the parameters. This first one here, the top one that we are going to continue to use is called the Lasso or L_1 regression because this term here is simply the L_1 norm of the Beta, of the parameter vector. Now, overall, what this means is that you punish the loss function for choosing large Beta values. And this is why this penalty that you introduce leads to the term penalized loss functions. So if you have a punishment, a penalty for large Beta vectors, then instead of doing ordinary least squares, you do a penalized regression. Now, let's look a little bit closer to the effect that penalizing has. So this is once again the penalized loss function with this additional term here. Now, the graph that you see here is independent of the so- called tuning parameter, Lambda. The larger Lambda is, the more weight this term has, and the more it will force the Beta values to be small. This is why you see that the parameters shrink. And this is why this whole procedure is called absolute shrinkage. Secondly, in this graph, you can consider this area here, this diamond as the budget that you have for the sum of the absolute values of Beta. And on these ellipses, the residual sum of square is constant. So you're looking for the smallest residual sum of square within the budget. This in the case drawn here leads to this point here. And due to the shape of this diamond, these two will typically connect in a corner of the diamond. And what this means is that the corresponding parameter is set precisely to zero. And this is why this method is also good for selection because setting this parameter zero means nothing else than kicking it out of the model. So this is in summary why we call the L _1 regression Lasso. It has a shrinkage element and a selection element due to these two described features. One last practical aspect of Lasso regression is about the tuning parameter, this Lambda here. How do you choose it? 
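Before that question is answered, the penalized loss functions just described can be written out for two parameters as follows (strictly speaking, in the logistic case the sum of squares is replaced by the negative log-likelihood, a detail the slides gloss over):

```latex
% L_1 norm and maximum norm from the slides
\lVert \beta \rVert_1 = \lvert\beta_1\rvert + \lvert\beta_2\rvert,
\qquad
\lVert x \rVert_\infty = \max\bigl(\lvert x_1\rvert, \lvert x_2\rvert\bigr)

% Penalized loss functions: Lasso (L_1 penalty) and Ridge (squared penalty)
L_{\text{Lasso}}(\beta) = \sum_i \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \lVert \beta \rVert_1,
\qquad
L_{\text{Ridge}}(\beta) = \sum_i \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \lVert \beta \rVert_2^2
```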
Well, one very common approach is the following: you use a validation method, for example the Akaike information criterion, and you plot the dependency of the AIC on Lambda. Then you can pick a Lambda value that gives you a minimal AIC. On the left side, you see how the parameters shrink, and you can see the blue lines that correspond to the parameters that are still non-zero, while the other ones have already been forced to zero. I'm still not convinced. Can you show me a concrete case study? Of course. The data that I'm going to present here consist of 739 critically ill ICU patients with COVID-19, collected at the beginning of the pandemic between March and October 2020. We have one binary response with the levels recovered and dead, and we have 43 factors: lab values, vitals, pre-existing conditions, et cetera. This is the data that we are going to analyze now. Here you see the dataset. It has 44 columns, as you can see down here, and 739 patients. Now, let's familiarize ourselves a little with the data. We have the last known status, recovered or dead, age, gender, and BMI. Then we have additional baseline values, comorbidities, vitals, lab values, symptoms, and CT results. Let's look at some distributions. This is the distribution of the last known status; you can see that, unfortunately, 46 percent of these patients died. We see here the skewed age distribution with an emphasis above 60, and you can see that roughly 70 percent of the patients are male. Have a look at the additional baseline values. You can see here the body mass index, and you will notice a tendency toward a high body mass index above 25. We have quite a lot of ACE and AT inhibitors, and also of statins, so treatments for blood pressure and cholesterol, and some immunosuppressives. Next, comorbidities. You see that almost two-thirds are hypertensive, we have quite a significant amount of cardiovascular disease and of pulmonary disease, and about 30 percent of our patients have diabetes. Now, for the remaining four groups, vitals, lab values, symptoms, and CT, I'm going to show you one representative from each so that you can get a feeling for how these values are distributed. This is the respiratory rate; these are the numbers of lymphocytes. So that's a vital parameter and a lab parameter. Here you have a symptom, severe liver failure, which can occur in the ICU. And this is a CT result, areas of consolidation that can be seen on the CT of the lung. Okay, now we are ready and we are actually going to fit the model. I go to Analyze, Fit Model. I take the last known status as the response, I throw in everything else as factors, and I go to Generalized Regression, which is going to perform the Lasso regression. You can see here that the Lasso estimation method is already preselected. If I click on Go, the procedure finishes very quickly, and this is the result; this is the screenshot that you already saw on the slide. Now, I'm not going to work with this model, for the following reason. If I go up to the Model Comparison section, I see that I have 30 parameters in this model, so this model is still very big. If I go back down here, I can see that my AIC doesn't change much if I move further to the left. Instead of doing this manually, I'm going to change the settings in JMP so that it doesn't take the best fit, the minimal AIC, but instead the smallest model within the yellow zone.
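Outside JMP, the same "fit a Lasso path, then pick the tuning parameter" idea can be sketched roughly as below. The data are simulated, and the AIC uses the common approximation that the degrees of freedom equal the number of non-zero coefficients; this is only an illustration of the concept, not of JMP's Generalized Regression platform.

```python
# Sketch: L1-penalized logistic regression over a path of penalties, picked by AIC.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [1.2, -0.8, 0.6]                 # only three real effects
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

results = []
for C in np.logspace(-2, 1, 20):                 # C = 1 / lambda
    m = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    prob = np.clip(m.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    k = np.count_nonzero(m.coef_) + 1            # non-zero terms plus intercept
    results.append((2 * k - 2 * loglik, C, np.count_nonzero(m.coef_)))

aic, best_C, n_terms = min(results)              # smallest AIC along the path
print(f"best C={best_C:.3f} (lambda={1 / best_C:.3f}), AIC={aic:.1f}, "
      f"{n_terms} factors kept out of {p}")
```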
Changing this is something I can do in the Model Launch. I open the Advanced Controls and change Best Fit to Smallest in Yellow Zone. I click on Go, and now I actually have a very nice model with 16 parameters. This is the model I'm going to use. To be able to show you which factors are in this model, I'm just going to select them with Select Nonzero Terms. Now I have these 16 selected, and I can put them into a logistic regression and activate the odds ratios. At the top, we see the 16 effects in our model, and below, we can see the odds ratios. For example, the odds ratio for age is 1.07 per year. If you take this to the power of 10, it gives you roughly a value of two, which means that with 10 more years of age, your odds ratio approximately doubles: your chance of dying is twice as high as before. And below here, we have the odds ratios for the categorical variables. For example, coronary cardiovascular disease has an odds ratio of 1.62. Now, I would like to point out some of these results. Let's first look at factors that are well known from the general population. Here, you see the dependency of the last known status on age, and you can see how increasing age increases your chance of dying very significantly; as I said before, the odds ratio over 10 years is roughly two. On the left side, you see pulmonary disease and cardiovascular disease, which both also have a significant effect on the risk of dying from COVID-19. These factors are also valid in the general population. Now, more interestingly, we find that these three factors are not in our model: gender, BMI, and hypertension are not part of our model. How is that possible? Well, it's critical to remember that our population consists of ICU patients; they are already critically ill. We have 72 percent male patients, almost 80 percent with a BMI over 26, and two-thirds are hypertensive. So these factors do highly increase your risk of a critical course of the COVID-19 disease, but once you are critically ill, at that point they don't matter anymore. That was a very important result for us: which factors carry over from the general population, and which factors lose their importance once you are already critically ill? Finally, I would like to point out one more aspect, which is statins. Statins were entirely insignificant, with a p-value here of 25 percent, when you look at them univariately. But in our multifactorial model, they were highly significant, and as you can see here, the odds ratio is below one, reducing the risk of mortality. So this is a very important lesson. Sometimes people choose risk factors in a large dataset by first looking at them univariately. If you did that, you would be guaranteed to miss statins, because univariately they are completely irrelevant. But in our multifactorial model, we could show that statins have a protective effect against dying from COVID-19 once you are critically ill. This finding was later confirmed by others; just as an example, I included a meta-analysis from September 2021 that indeed confirms that statins reduce patient mortality, a very large meta-analysis with almost 150,000 patients. If you're interested in more details, we published our work in the Journal of Clinical Medicine. I would like to take the opportunity to thank my co-authors, in particular Stefan Borgmann and Martina Nowak-Machen from the clinic in Ingolstadt.
And I would also like to thank you very much for your attention, and I'm looking forward to your questions. Thank you very much.
An ultra high performance liquid chromatography (UHPLC) measurement system analysis (MSA) optimization and validation study in a quality lab is presented. Process settings of the analysis method are established in order to maximize measurement accuracy and resolution of two organic compounds. There are seven control factors, and optimal DOE is used to specify how the experiments take into account the specified model and experimental criteria. It demonstrates why OFAT is not appropriate and how to decide between a custom DOE and a DSD based on DOE diagnostics such as power, effect correlation and variance profiles. Using stepwise regression, good predictive models are obtained that are supported by validation experiments. The Profiler desirability function is used to determine the optimal and robust UHPLC settings for measuring both compounds. The particular importance of the sensitivity indicator for improving robustness is shown.     Hello everyone. This presentation will be about measurement system analysis and optimization of a UHPLC measurement system. The presenters are myself, Frank Deruyck from HoGent University College of Applied Sciences, and Volker Kraft from the JMP Academic Program, who will take care of the demos showing the JMP tools for data analysis. Okay, the problem statement and description. This presentation was inspired by a student internship at a chemical company, and of course the material had to be kept confidential, so I will just talk about "the chemical company", and the figures have been modified a little, but that is no problem. What is the problem statement? SPC revealed significant batch-to-batch variation in a raw material, which has caused problems in product quality. There was an issue with the supplier, so it became necessary to analyze all supplied batches. One problem, however, was that the existing GC analysis procedure was too slow. A fast UHPLC analytical method was in development, but it was not ready for validation because of too much measurement variation. The goal of this study is to specify robust and optimal settings of the UHPLC method so that validation of the new method becomes possible. Thanks, Frank. Working with the JMP academic team for more than ten years now, we have helped many university professors worldwide get access to JMP licenses for teaching, but also to teaching resources like the case study library. At the link jmp.com/cases, professors get free access to more than 50 cases, each telling a story about a real-world problem and a step-by-step solution, including the data sets and exercises. What we present today is available as a series of three case studies, focusing on statistical process control, measurement systems analysis, and design of experiments. While Frank will talk about the problem and the solution they developed for a pharma company in Belgium, I will demo some of the analysis steps using JMP Pro. Let me say thank you to Frank for sharing these cases with the academic community, who really welcome such real-world examples coming from practitioners in industry. I also want to thank Murali, from our academic team in India, who plays a key role in enhancing our case study library, including the development of these cases together with Frank. Okay, here in this plot you can see the problem illustrated very clearly.
What is shown here is a plot of the measurements of the new, non-optimized UHPLC method as a function of the measurements of the standard GC method, which was very accurate and precise. You can clearly see that there are some problems. Different operators made measurements on different batches, and you can see that the prediction intervals are quite large; you sometimes see a range of over 100 milligrams per liter. You can also see that it is sometimes not clear whether a measurement is within specification, as on the left graph, and on the right graph you can see that there are also issues with accuracy, meaning there is a serious problem. First of all, we will explore the variation root causes using measurement system analysis, and then we will use DOE for optimization, according to the statistical thinking concept illustrated in the next slides. Here you see the statistical problem-solving process flow. For the cause of the problem, the UHPLC measurement error, we will tackle this with measurement system analysis, and to address the variation root causes we will of course use DOE, also to optimize the process settings of the UHPLC system. The method we will use for quantifying the variation sources is measurement system analysis, and I will show some theory. It is about quantifying the components of the total variance. Total variance means the variance across all measurements, by different operators on different products. We have two components: the product variation, sigma squared product, and the measurement variation. The measurement variation, very importantly, is itself decomposed into two components: the repeatability, the variance due to lack of precision in repeated measurements, and the variation between operators, sigma squared capital R, which is the reproducibility. A very important criterion stating that the measurement system is suitable for detecting variation in the process, the process variation, is that the percent Gauge R&R, which is the measurement error divided by the total error, should be less than 10 percent. So if the fraction of measurement error is below 10 percent of the total variation, then we can use the method for process follow-up. If that is not the case, if it is higher, then we run the risk that we will control our process on measurement variation, which is of course not a healthy situation. So it must be lower than 10 percent; that's the main criterion. Okay, let me go to the next slide. For this we will use a Gauge R&R study. A Gauge R&R study is essentially an experimental design: we select three random operators, John, Laura, and Sarah, who each perform measurements twice on four different batches. With that, we are able to quantify the within-operator variation, the repeatability; the between-operator variation, the reproducibility; and also the product variation, the variation between batches. So, Volker, I'll leave the floor to you now. Okay. Thank you, Frank. Before I come to MSA, I would like to briefly cover what's included in the first part of the series, namely control charts and process capability. This is one of the data sets, measuring the two compounds, our continuous responses, compound one and compound two, using the good but slow GC method.
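For reference, the decomposition Frank just described can be written compactly as follows. The percentage is written here with standard deviations, one common Gauge R&R convention; the talk itself leaves the exact convention open.

```latex
% sigma^2_r = repeatability, sigma^2_R = reproducibility, MS = measurement system
\sigma^2_{\text{total}} = \sigma^2_{\text{product}} + \sigma^2_{\text{MS}},
\qquad
\sigma^2_{\text{MS}} = \sigma^2_{r} + \sigma^2_{R}

\%\,\text{Gauge R\&R} = 100 \cdot \frac{\sigma_{\text{MS}}}{\sigma_{\text{total}}} < 10\%
```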
Okay, thank you, Frank. Before I come to the MSA, I would like to briefly cover what's included in the first part of the series, namely control charts and process capability. This is one of the data sets, measuring the two compounds, our continuous responses, compound one and compound two, using the good but slow GC method. The data have been collected over eight days, with two batches per day and for two different vendors, A and B. When the team started, the first activity was to check and confirm the normality of the data. For this, they looked at normal quantile plots, and they also fitted a normal distribution followed by a goodness-of-fit test, and there was nothing critical from that analysis. Exploring the distributions, the problem became clear: looking at data from different days, we can see a huge batch-to-batch variation, even for batches coming from the same day. This means that process monitoring is not possible, because the variability of this GC method was high and the method was too slow; it needed a lot of time to monitor the batches. Therefore one team activity was to work together with the vendors and others to reduce that variation. Another activity, described in the other parts of the study, investigated a faster method, as mentioned by Frank: the new UHPLC measurement method. Looking a bit more into the old GC method, the team looked at one-dimensional control charts and process capability for both compounds, and they also looked at multivariate and model-driven control charts. Here we see that there were some extreme points in the multidimensional analysis, and they could also see the contributions of the different compounds, or responses. This is the conclusion of the first investigation using the old method: looking at the process performance, both processes, for compound one and compound two, were incapable but stable, and for vendor A and vendor B we see that compound one was even unstable, following these colors here. All of this motivated the team to improve the measurement process, and this brings us to part two of the series, which is about analyzing the measurement system. These data were collected for all the combinations, repeated twice, between four batches and three operators, using the new UHPLC measurement method. The goal was to measure all batches of raw material using this faster method and perhaps to allow some inline monitoring of the raw material in the future. To get started with the new data, the team looked at a two-way ANOVA, and this output may fool you. For both compounds, compound one and compound two, the batch effect is highly significant, so that's good news. The operator effect and the interaction between batch and operator are non-significant, but the RMSE is quite high. That means we may be looking at data where the effects we are interested in are simply hidden by noise. So before you look at such an analysis, the first question should be: where is the variation in our data coming from? And second: are we really measuring the signal, or are we just measuring noise? To get an idea about this, a perfect visualization of the patterns of variation is a variability chart. For these two sources of variation, batch and operator, we see all the data points for all operators and all batches; we have two measurements per batch per operator. For each pair we see the mean, we also see the group mean for each operator, and we see the overall mean, which is the dotted line for all our measurements.
We can look at this for both compounds, of course. Here, for instance, for compound two, we see that Laura has quite high variation, at least compared to the other two operators. That's a visual analysis. The analysis method to use for better insight into the measurement system's performance is an MSA, or measurement systems analysis, and this was also done for the non-optimized UHPLC method. Here, for instance, for the first compound, we see the average chart, which shows the data together with the control limits, or noise band. What we see here is not good news at all, because our data fall within this noise band, so it will be really hard to detect any signal with that noise level. Another output is the parallelism plot. Here we can check interactions between batches and operators; an interaction would be indicated if some of these lines are not parallel. And this is the EMP method, which stands for Evaluating the Measurement Process; you probably know this as Gauge R&R output, which is what Frank mentioned. Here we see the signal, the product variation, but we also see the measurement variation split into repeatability and reproducibility. For the first compound, we seem to have an issue with repeatability, that is, the same operator doing the same measurement again. For the second compound, there is a slight issue with reproducibility as well, that is, measurements between different operators. So the conclusion here is that the measurement process is unable to detect any quality shift caused by significant systematic variation between our batches. And with that, I hand back to Frank.
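As a rough JSL sketch of the kind of variability chart Volker describes, assuming hypothetical column names standing in for the confidential data (this is not the case study's actual script):

    // Variability chart for one response, crossed by operator and batch.
    // :Compound 1, :Operator and :Batch are assumed column names.
    dt = Current Data Table();
    dt << Variability Chart(
        Y( :Compound 1 ),
        X( :Operator, :Batch ),
        Connect Cell Means( 1 ),
        Show Group Means( 1 ),
        Show Grand Mean( 1 )
    );
    // The EMP / Gauge R&R report (repeatability vs. reproducibility) is then
    // requested from the platform's red-triangle menu, as shown in the demo.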
Okay, thank you, Volker. I think we can move past the next slides, because that's what we just discussed; it's what you showed, Volker, and it's also in the slides. Okay, yes, here we can start, and we start with the design of experiments: the optimization study, the process improvement study. First of all, in order to better specify our experimental goals, we should do a root cause analysis for the high measurement error. After a brainstorm with the lab team, two main root causes came out, one linked to the equipment and one linked to the method. The one linked to the equipment was the main source of poor repeatability, because there was a very strong issue with unstable column temperature and eluent flow rate. The UHPLC uses a column, since it is a chromatographic technique, and an eluent to carry the compounds through the column and make separation possible. As a matter of fact, there was drift between different experiments and even within one experiment, and this of course results in poor repeatability. So the first task of the lab team was to stabilize the column temperature and the eluent flow rate, because for any experimental design we need fixed settings of temperature and eluent flow rate, which are of course two important factors. The second issue concerned the method: besides the stability problem, which was fixed, there was also an issue with low resolution because of non-optimized analysis process settings. The resolution was quite low and also unstable, meaning that small shifts in flow rate and temperature sometimes produced huge shifts in resolution, indicating not only an optimization problem but also a robustness problem. So the goal was to specify not only optimal settings but also robust settings: the variation of the resolution should be minimal under some variation around the settings of the analysis process. Let's now go to the DOE. The goal of the DOE is to model the response variable Y, the compound concentration in the standard samples, as a function of the UHPLC control factors. Once we have the equations, we can go to optimization. What is the optimization criterion? We will use the P over T ratio criterion from quality practice, meaning that the fraction of the measurement error in the tolerance range should be lower than 10 percent. The tolerance ranges of our compounds are specified: compound one in standard sample one has a target of 300 milligrams per liter, plus or minus 200 milligrams per liter, and compound two in standard sample two has a target of 450, plus or minus 150 milligrams per liter as spec limits. If we want to reach 10 percent of these specification ranges, that gives us our desirability targets for optimization: Y should match the target compound concentration to within 10 percent of the tolerance, meaning that for standard sample one the result should be 300 plus or minus 20 milligrams per liter, and for standard sample two the compound concentration should be 450 plus or minus 15 milligrams per liter. That's the criterion for optimization. As for our model, put together with the lab experts, the factors are the main effects and all quadratic effects. The main effects are the temperature of the column, with a range of 25 to 35 degrees Celsius; the eluent flow rate, 5 to 15 milligrams per milliliter; and also a gradient. What is the gradient? There is an additive, acetonitrile, in the eluent, and the concentration of this acetonitrile increases as a function of the volume that has flowed through the column: between volume zero and one milliliter its range is 5 to 20 percent, and once the volume is five to six milliliters, the range is 35 to 70 percent acetonitrile in the eluent. Another important factor is the UV wavelength; detection is by UV, and it should be controlled between 192 and 270. Brainstorming with the lab experts, who had already done quite a few preliminary experiments and had experience with the UHPLC, only two interaction effects were selected: temperature with eluent flow rate, and eluent flow rate with the gradient factors specified above. The design chosen to meet these goals and to estimate the model parameters was the custom design, and Volker will illustrate what that design was about.
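The precision-to-tolerance reasoning Frank uses can be written out; this is the standard P/T formula, applied to the tolerances quoted in the talk:

    P/T = \frac{6\,\sigma_{\mathrm{Gauge\ R\&R}}}{\mathrm{USL} - \mathrm{LSL}} \le 10\%

With a tolerance of plus or minus 200 mg/L for compound one (width 400) and plus or minus 150 mg/L for compound two (width 300), matching the target to within 10 percent of the tolerance gives exactly the working windows quoted above: 300 plus or minus 20 mg/L and 450 plus or minus 15 mg/L.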
Okay, thank you, Frank. Talking about the third part of this case study series, which is about designed experiments: what we learned so far is that we have to reduce the measurement variation caused by this non-optimized UHPLC method. The method can be described, as Frank pointed out, by these control or process settings, like temperature and so on. We also want our responses to remain within their limits, following the 10 percent rule given by Frank, and these limits are also added to our data. To design such an experiment, the team looked at a definitive screening design, this one here, and also at a custom design, both with 25 runs. To compare both designs, they used the Compare Designs platform, and you can see several reasons in favor of the custom design. For the main effects we see slightly better power for the DSD, but for the higher-order effects we see a really strong benefit for the custom design. The same holds when looking at the fraction of design space plot: the custom design is doing better. You can also look at the correlation maps, and finally, the efficiency is also in favor of the custom design. For those reasons, the team used a custom design for these studies. Here we have the completed data for the custom design, completed with both response measurements, and for these we also have the corresponding linear models, one for the first response, compound one, and one for compound two, both with their profilers. And here is a combined profiler with both responses at the initial, mid settings. By maximizing the desirability, I get to the optimal settings, and we see that we are matching both targets, 300 and 450 respectively, perfectly. However, we also see quite large sensitivity indicators; these are the purple triangles, and they tell us that at the optimal point our response surface is quite steep in some dimensions, which reduces the robustness of our process in the case of random variation of the process settings. This can be further analyzed by adding the simulator to the profiler, which is done here. The simulator defines the random variation, which was specified by our process experts. Just keeping the mid settings plus this random variation and simulating 10,000 response values, we see that all of our response values are out of spec; these are all defects. Of course, we are just at the mid settings, so nothing better is to be expected here. Switching to our optimal settings and simulating again, we now see that the expected defect rate is above 12 percent. From here the robustness can be further improved, either manually, using the profiler, the simulator, and the sensitivity indicators, or automatically, by running a simulation experiment, which is also built into these profilers. The team used a manual approach, and these red settings here are the robust settings they came up with. If I simulate again, we see that the defect rate now drops below one percent. This is a Monte Carlo simulation; it's all random, so the defect rate changes slightly with each new simulation, and you can also see how the histograms of our simulated response data behave quite well: they stay within our limits, which support the 10 percent rule, at these robust settings.
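To make the defect-rate idea concrete, here is a toy Monte Carlo in plain JSL that mirrors what the profiler's simulator does: perturb the factor settings with random variation and count out-of-spec results. The prediction formula, settings, and variation below are made-up placeholders, not the fitted model or the experts' values from the study.

    // Hypothetical response surface for compound one (placeholder coefficients).
    pred = Function( {temp, flow},
        300 + 2 * (temp - 30) + 5 * (flow - 10)
    );
    n = 10000;
    defects = 0;
    For( i = 1, i <= n, i++,
        t = 30 + Random Normal( 0, 0.5 );   // assumed random variation around the chosen temperature
        f = 10 + Random Normal( 0, 0.3 );   // assumed random variation around the chosen flow rate
        y = pred( t, f );
        If( (y < 280) | (y > 320), defects++ );  // 10-percent-of-tolerance window, 300 +/- 20
    );
    Show( defects / n );   // estimated defect rate at these settings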
So, going back to the other profiler: here I also have some contour plots, using the contour profiler, and they can be used to better understand the best regions for configuring the process; these are typically the white regions, which show the in-spec regions for a combination of two control factors. I hope you liked this journey, and with this I hand back to Frank to discuss the outcome. Okay, thank you, Volker. Now we can go to the validation experiment. Once we have optimized the settings, we can run an experiment to check whether the measurements of the new UHPLC system and the GC analysis are really equivalent. For this we again set up a Gauge R&R study, quite similar to the one discussed before, but now we also make measurements with the GC in order to compare the two measurement methods. You see that the instrument is included here as one extra factor in the Gauge R&R. Here are the results, and you can see that we made quite an improvement. In the Gauge R&R results, the main variation is now product variation; the batch variation is no longer obscured by measurement noise. The gray noise band is quite narrow compared to the one we had before, and that is very good news: it is mainly product variation. The Gauge R&R ratio now matches our target, meaning the precision-to-tolerance ratio here is about eight percent: the precision-to-tolerance ratio is six times the Gauge R&R figure divided by the tolerance of the compound, 400, which gives eight percent. So the precision is okay, and this measurement system is suitable for use in quality control for compound one. Nice. On the parallelism plot you see just a little crossing for Sarah, indicating perhaps a small interaction between batches and operators. For compound two, we see the same thing, even better: a precision-to-tolerance ratio of only five percent, the same very narrow noise range, and no major crossing of the lines. They are quite parallel, indicating no operator bias and no interactions. Modelling the compound one analysis, we see that it is mainly influenced by batch, with a small batch-by-operator interaction effect, and the same for compound two. We can now see this very small interaction effect because we have reduced the measurement noise so much that very small effects become visible. Before, we could not detect it, because of very poor experimental power; by reducing the experimental noise we have seriously increased the experimental power, so this interaction effect now becomes visible. It is a small one, linked to Sarah, the green line, but there was also a little problem in the GC analysis, not only in the UHPLC. That's an issue to be tackled later on. These two graphs illustrate fairly clearly that both measurement systems are nearly equivalent, the UHPLC results versus the GC results. There is a very good correlation, with all points near the midline: the slope is nearly one and not significantly different from one, and the intercept with the y-axis is not significantly different from zero.
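A minimal JSL sketch of this method-agreement check, assuming hypothetical column names for the paired UHPLC and GC results (not the presenters' actual script):

    // Regress the new method against the reference method and fit a straight line.
    dt = Current Data Table();
    biv = dt << Bivariate(
        Y( :UHPLC Result ),
        X( :GC Result ),
        Fit Line
    );
    // Equivalence is then judged from the parameter estimates: the confidence interval
    // for the slope should contain 1 and the interval for the intercept should contain 0.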
So the method was ready for validation: the UHPLC is accurate, and the difference from the standard analysis is non-significant, which is a very nice result. We can say that tackling the problem with MSA and DOE was very powerful, leading to a very nice solution that could now be implemented in production. Thanks for your attention, and if there are any questions, please let us know.
Abstract: Managed, non-persistent desktop and application virtualization services are gaining popularity in organizations that wish to give employees more flexible hardware choices (so-called "BYOD" policies), while at the same time exploiting the economies of scale for desktop software management, system upgrades, scale-up and scale-down, and adjacency to cloud data sources. In this presentation, we examine using JMP Pro in one such service: Amazon AppStream 2.0. We cover configuration, installation, user and session management, and benchmark performance of JMP using various available instances.

JMP analytic workflow steps demonstrated: Data Access (database, Query Builder, PostgreSQL), Basic Data Analysis and Modeling, Sharing and Communicating Results, JMP Live.

Products Shown: JMP, JMP Live, Amazon AppStream 2.0.

Industries: General (can be applied to semiconductor, consumer packaged goods, chemical, pharma, and biotech workflows).

Notable Timestamps:
0:18 – Background of the case study for the talk. Red Triangle Industries, a fictitious manufacturing organization, has been growing its JMP use over the past six years and faces new challenges from growing data sizes, remote and flexible work environments, and its "BYOD" (bring your own device) laptop policy.
2:19 – Results of the JMP workflow assessment and interactive visualization using a custom map in JMP. Visualize the results of your own workflow assessment by downloading the custom map shapes and sample data here.
3:29 – Situation for Red Triangle and why they are considering non-persistent application virtualization technologies.
4:29 – Pains for Red Triangle and problems they are trying to solve.
5:37 – Implications for Red Triangle of adopting non-persistent application virtualization technology.
7:09 – Needs analysis and technology requirements.
8:00 – Introduction to the demo.
8:13 – Creating an image in the Amazon AppStream 2.0 UI.
8:50 – Naming images, choosing instance type, sizing, and picking IAM roles.
9:20 – Configuration of network access: VPC and subnet as well as security group; enabling internet access.
9:52 – Launching the image builder.
10:01 – Instantiating the image.
10:38 – Configuring jmpStartAdmin.jsl with server settings for the JMP Live instance and database access; IT configures this so that users don't have to.
11:36 – AppStream 2.0 Image Assistant: configuring JMP Pro to launch in AppStream for end users.
11:54 – Configuring the template user within AppStream.
12:10 – Testing the JMP Pro configuration using the template user.
12:30 – Testing the PostgreSQL configuration and access from JMP Pro.
13:14 – Testing the final app within the browser as an admin.
13:54 – Image configuration and naming.
14:19 – Creating the final production image.
14:45 – Creating a fleet of compute resources for end users.
14:59 – Always-on vs. on-demand vs. elastic fleet type options.
15:09 – Configuration of the on-demand fleet.
15:36 – Fleet capacity.
16:03 – Fleet VPC, subnet, and security group settings.
16:58 – Fleet creation.
17:20 – Stack configuration.
18:05 – Creation of storage and home folders within a stack.
18:17 – Configuration of user settings: clipboard, file transfer, etc.
18:44 – Configuration of the user pool.
19:01 – Creating a new user.
19:14 – End-user experience: launching JMP from Amazon AppStream 2.0.
19:32 – Application Catalog.
19:40 – Instantiation of an on-demand session.
20:05 – Launching JMP and using it within a browser.
20:23 – Retrieving data from a PostgreSQL database using the Query Builder in JMP.
21:06 – Data analysis of the resulting data set in JMP.
21:52 – Publishing results to Red Triangle's JMP Live instance.
22:22 – Verifying published results on Red Triangle's JMP Live instance.

This is JMP in the Cloud: configuring and running JMP for non-persistent application virtualization services. I'm Daniel Valente, the Manager of Product Management at JMP, and I'm joined by Dieter Pisot, our Cloud and Deployment Engineer, also on the Product Management Team. Today we're going to talk about how an organization called Red Triangle Industries, which has been a JMP user for the last several years, with JMP use growing in its R&D departments, its quality departments, and IT, is considering adapting to new, remote, and flexible work environments and hopefully solving some problems with new technology for virtualizing JMP in a non-persistent way. We're going to play the roles of two members of this organization, and we'll go through the configuration, the deployment, and ultimately the use of JMP in this way. So this is how Red Triangle has been growing. We started in 2015 with a core set of users, and every year we've been growing the JMP footprint at Red Triangle to solve problems, to visualize data, and to communicate results up and down the organization. Most recently, we've added JMP Pro for our Data Science Team to look at some of our larger problems. We've got an IoT initiative, and we're doing some data mining and machine learning on the bigger data sets from the sensors and equipment in our manufacturing plant. In the last year, we also added JMP Live to our product portfolio; you can see a screenshot of JMP Live in the back. What we're trying to do is automate the presentation of our JMP discoveries and our regular reporting in one central database, so that everyone in the organization has access to those data to make decisions about things like manufacturing quality, revenue, and other key business metrics, all shared in one single place with JMP Live. So how is JMP being used at Red Triangle? One thing we did in the past year in our IT organization, to which Dieter and I belong, is survey all of our users, and we put together an interactive visualization looking at which parts of JMP they use by department. This is called the workflow assessment. It's something we can send out, we get information back, and it gives us opportunities to look for growth and training opportunities. This is also how we found out that some of our users want to have JMP presented to them in different ways, which is why we're considering application virtualization. We've adopted a Bring Your Own Device policy, which lets our employees purchase their own laptop, and we want to be able to give them JMP on it. So this has produced a set of situations, pains, implications, and needs that we're considering for using JMP from an application virtualization standpoint. All right, the situation for us after this workflow assessment: we're profitable, we're a growing business in the manufacturing space, and we're adding more JMP users every year in different departments. As I mentioned, our core JMP use is growing year on year, we've added JMP Pro for our Data Science Team, and in the past year, JMP Live for enterprise reporting and sharing of JMP discoveries.
I'm playing the role of the CTO, Rhys Jordan, and I'm joined by Dieter, who's playing the role of Tomas Tanner, our Director of IT. We've been charged with finding ways of getting JMP more efficiently to remote employees. We want to be able to analyze bigger problems and also to support employees who want to take advantage of our BYOD, or Bring Your Own Device, policy in 2022 and beyond. Historically, our standard laptop deployments have used between eight and 16 gigs of RAM, and in some cases, especially with our larger manufacturing problems and sensors being put on much of our manufacturing equipment, we've got data sets we want to analyze that are simply bigger than that standard deployment can handle. We also want to support our employees in their flexible work environments, which means that if they purchase their own personal laptop, we want to be able to get JMP and other software onto it without physically being on site with them; we want to look into delivering that software by alternative means. Also, when new versions of JMP and other desktop software come out, we want to be able to apply those updates seamlessly to our entire workforce, in a way that minimizes the latency between the release and when our employees actually get the update. And finally, when an employee leaves Red Triangle or moves to another part of the organization that doesn't require JMP or another piece of software, we want to be able to retain those corporate assets with minimal operational burden. The implication is that we've been given a mandate, like many other organizations, to reduce our corporate technology spend, and we feel the biggest potential for reducing that spend is through automation. Looking at these non-persistent application virtualization tools should speed up the entire workflow of getting software to our end users efficiently. We want to lower the total cost of resource and computer ownership, which is why we've adopted the BYOD policy, but we also need to right-size the assets, even the virtual ones, to the needs of the users: our power users who analyze our biggest data sets will need more RAM and more speed, and for the casual users we can right-size accordingly. With employees in three different time zones, simply standing up a fleet of virtual machines for everybody at the same time doesn't make a whole lot of sense. Because we work around the global clock, we can design a fleet of virtual assets sized for the total number of concurrent users accessing it at once, and that's what we'll get to in the demo. Finally, a better rollout of software updates and transparency of usage to our Executive Team, who's using the software, how much they're using it, and so on, are further implications for us investigating this technology. As far as needs, we want to go with a cloud provider. We're not going to build this tool in-house, so we want to use one of the cloud providers and the out-of-the-box capabilities they have for application virtualization. Since we've moved a lot of our data sources to the cloud, to Amazon Web Services for example, we'd like to put our analytic tools close to those data sources to minimize the cost of moving data around.
Our IT department wants to centralize the management of JMP setup and license information, and also have seamless version control, so that as soon as a new version is released we can push those updates as efficiently as possible, and then look at usage tracking through things like cloud metrics and access controls. With this, I'm going to hand it over to Dieter to give a demo of running JMP in a non-persistent application virtualization tool like Amazon AppStream. Dieter. Thanks, Dan. The first thing we have to do is go to the image builder and launch it. We have to pick a Windows operating system; Windows Server 2019 is what we want here. There are several available; we just pick a generic, basic one like this one, move on, and give it a meaningful name and display name. Because we're Red Triangle, we use Red Triangle for this one. We have to pick a size for the image we want to configure, so we pick a standard one, medium. I'm going to add an IAM role because I want to connect to S3, where I keep all my installers; to make sure I can connect there, I add a role with access to S3. Then I have to define which private network I want to run my image builder in. I pick the public subnet so that I can actually connect to it from my local desktop, and a security group to make sure only I can connect and not everybody else. We're not worrying about the Active Directory setup, but we do want internet access so we can download things from the internet, like, for example, a browser. We check all the details, they're fine, so we launch our image builder. This is going to take a while; AWS AppStream is basically setting up a virtual machine for us that we can connect to and set up our application on. After it has started, we connect to the machine as an administrator. To save some time, I downloaded the JMP Pro installer already and installed JMP Pro just as you would on any other Windows desktop machine, and we have the application icon here. In addition, I created a JSL script, jmpStartAdmin, in the ProgramData/SAS/JMP directory with a few settings that make it easier for our users to do certain things. It contains a connection to a database and the JMP Live connection to our Red Triangle JMP Live site, so the users don't have to remember and type that in. So that's perfectly fine here. Then we go to the image assistant and configure our image. First, we add an application to our image; that's going to be the JMP Pro we just installed. We pick the JMP executable, give it some properties and a more meaningful display name, and save that. And here is our application that we want to make available to the user. The next thing we can do is test it and set it up as the user would see it. We have the ability to switch to a template or test user. The template user defines how the user will actually run the application: whatever we do here is remembered, and the user will have the same experience as our template user. So we can do a few things in the setup. We can also make sure here that our database connection is working; we could do this as the test user as well, but I'll just do it as our template user. So here we are: the application is perfectly connected to our database.
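As a hypothetical sketch of what such a jmpStartAdmin.jsl might contain, the names, DSN, and URL below are placeholders and not the presenters' actual settings; the real script's contents were not shown:

    // Hypothetical jmpStartAdmin.jsl, run when JMP starts in the AppStream image.
    ::rt_db_connection = "ODBC:DSN=RedTrianglePG;";              // placeholder PostgreSQL data source preconfigured by IT
    ::rt_jmp_live_url  = "https://jmplive.redtriangle.example";  // placeholder JMP Live site users publish to

    // Optionally verify that the database is reachable when JMP starts.
    Try(
        dt = Open Database( ::rt_db_connection, "SELECT 1 AS ok;", "connection check" );
        Close( dt, No Save );
    ,
        Write( "Red Triangle startup: database connection check failed\!N" )
    );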
And with that, we're fine with our setup, so we go back to the image assistant. I'm not going to use the test user; I switch users again, go back to the administrator, and continue with the setup of our image. So I switch here, again not going to the test user. Now we have to optimize and configure the application: we launch JMP, and once it's running and we're happy with all of this, we continue the setup of the image by clicking the Continue button. What AWS AppStream does now is optimize the application for the user. We just wait for that to finish, and then give our image a name and a display name as well; again, we're using Red Triangle here. We also make sure to use the latest agent so that we always have an up-to-date image. Next, review, and we disconnect and create the image. With that, we get disconnected from the image builder; we lost the connection, obviously, and our session expired. We return to the AppStream 2.0 console and see that our image has been created. It's pending right now; it also takes time to create it the way we want it, so we have to wait for that to finish. We're done, it has finished, and the next step is to create the fleet the images are going to run on. So we create the fleet and pick which type of fleet. We're going to go with an on-demand fleet because that's much cheaper for us: the instances only run when a user actually requests one, whereas always-on instances run constantly. Here we give it a name and a description, and then pick the instance type we want to give to our users. A bunch of other settings are available, like timeout settings and capacity; for now we just go with the defaults, and we can adjust to the needs of our users at any time if necessary. Click Next. We pick the image that we just created to run on our fleet. We define the VPC, the virtual network, and the subnet that our fleet should run in, picking the same ones we used before, and of course a security group to make sure that only the users and hosts we want can access our fleet. Again, we want to give the fleet internet access, so we check that to make sure users can publish to our JMP Live site. We could integrate Active Directory authentication here, but we don't want to do that; that would take some time, so we're going to go with some local users I have already created. We click Next, we see a review of what we did, and it's all fine, so we create the fleet. There is some pricing information we have to acknowledge, and with that, the fleet is starting up. Once that has happened, we can move on and create the stack. The stack helps us run that fleet and lets us define persistent storage for the fleet, for example. So here we create the stack and give it a meaningful name; since this is a Red Triangle site, we go with a very similar naming convention. We pick the fleet that we want to run in our stack. All looks good, we move on. Here we define the storage; we go with the default, which is an S3 bucket available to each of the users. We could hook up others, but S3 is fine for us at the moment. Then there are just a couple of settings on how we want to run our stack; all of them seem fine, so we go with the defaults. A quick review, everything's fine, and we create our stack. That's it, the stack has been created.
What we now need to do is go to the user pool I mentioned earlier, since we're not using Active Directory. In here, I have defined three users that can access our stacks, but we need to assign the stack to each of the users. In my case, I'm going to pick myself and assign the stack we just created. We could send an email to the user to make sure they are aware of what just happened and that the stack has been assigned to them. That's all we have to do to set this up. So if I now go to the link that was emailed to me, I can log into that AppStream session. I use the credentials my admin defined for me. Here are my stacks; I use the Red Triangle one, and here's the application that stack provides for me. This is going to take a while: as I said, it's on-demand, so it's like pulling a PC out of a pool and running JMP on that machine, so it takes a few minutes. The always-on option would be much faster, but again, it costs money because those instances run constantly, whereas on-demand runs only on demand. And here, in my browser, JMP has started and is running perfectly fine. So let's do some work. I'm going to connect to a database, and because my administrator has already set this up for me, there's not much for me to do: the connection to my database is already there, and my tables are available to me. I'm going to pick one of the tables in my database, which is a Postgres database, and import it right away, and here's my table. I've written a nice script to build a wonderful report, so I'm going to quickly create a new script. I cut and paste it from my local machine to my AppStream image using the menu that's available to me, paste it into the scripting editor, run it, and here's my report. That report I'm now going to publish to our Red Triangle JMP Live site. So I go to File, Publish, and because, again, my admin has set this up for me, the information about my Red Triangle site is already there, so I'm just prompted to sign in to make sure it's really me. In this case, I use our imaginary identity, enter the username and password, sign in, go through the publish dialog without changing anything, and just hit Publish. The report has been published. Now I can go to another tab in my browser and verify that the report has actually been published to our Red Triangle JMP Live site. I switch over, go to All Posts, and here's the report that Tomas posted a minute ago, and it looks exactly as it did in my virtual machine. Thank you very much.
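For reference, a hypothetical JSL sketch of the user-side steps Dieter just demonstrated, reusing the placeholder connection global from the jmpStartAdmin sketch above; the table and column names are assumptions, not the actual demo data:

    // Import a table from the preconfigured Postgres connection and build a quick report.
    dt = Open Database(
        ::rt_db_connection,                               // placeholder connection string set at startup
        "SELECT * FROM production_measurements;",         // hypothetical table
        "production_measurements"
    );
    gb = dt << Graph Builder(
        Variables( X( :Date ), Y( :Measurement ) ),       // hypothetical column names
        Elements( Points( X, Y ) )
    );
    // Publishing the report is then done through File > Publish, which uses the
    // JMP Live site information the administrator preconfigured.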
JMP is an all-in-one, self-service platform that offers a vast menu of easy-to-use analytical capabilities that can be assembled into an end-to-end analytic workflow. This presentation walks attendees through a case study where a workflow is built to solve a manufacturing challenge and to increase the value and understanding of analytics in an organization. The proposed workflow begins with data access components, then includes a combination of analytical methods to uncover discoveries in the data. It concludes by sharing these insights using the JMP Live platform.

Today we're going to cover the JMP Analytic Workflow, so that you can see firsthand how all these capabilities come together. Anybody who's doing data analysis has a shared objective: they're trying to take raw information, data, and turn it into a shareable or actionable insight. The only thing that differs is the steps we take as we move from one end of this process to the other. Whether you're new to the field of analytics and statistics and have simpler needs, or whether you're a more advanced practitioner with more sophisticated needs, JMP software offers the flexibility to meet those needs wherever you are in your analytics journey. The JMP Analytic Workflow is a quick and easy set of analytical capabilities to bring you from data to insights. We're going to cover a few workflows so that you can see how this can be implemented in practice. I want everybody to picture that we're responsible for this machine, and this machine produces product for a business. Recently, the performance of our product has been outside of our expectations, and we believe the answer to what is going on can be found by analyzing some of the data that's available on this machine. So we're going to build an analytic workflow to see if we can figure out what's happening with this issue. Building the workflow involves three steps. The first step is having an understanding of the data; the data we'll be using in this case are machine logs that are saved on the machine as Excel files. The second step relates to the analytical capability: in order to know which analytical capability we need in the workflow, we have to understand the question we're trying to answer, and the question here is a simple one: what is happening with this machine? The third element is a shareable insight: after we get an answer to this question, we'll have to share that insight with others. In this case, we'll be sharing our findings with management as a Word document. More specifically, when we look at the workflow, we'll be working with Excel files, we'll leverage JMP's data access capabilities to bring the data into JMP, we'll perform some data exploration and visualization, and lastly we'll share and communicate those results as a business document, which in this case will be a Word document. So to begin the process, we open our Excel file. When you open Excel files in JMP, a special tool called the Excel Import Wizard opens, and it allows us to do many things: we can access different worksheets in the Excel file, we can perform some very simple data cleaning steps before importing, and we can also preview the data.
As I look at the preview, I can see that I have information from June of last year, and I can see that I'm correctly capturing the measurements from our machine. I can now import this data into JMP, where we have a JMP data table. Now that we have our data, we can perform a visual exploration. I'll use the Graph Builder tool, available under the Graph menu, and plot our measurements over time for our piece of equipment. As I plot the measurements over time, I begin to see something quite surprising. The performance of our machine was meeting expectations initially, but over time the performance has slowly drifted, and now we're in a region where we're producing bad material. This is the first time that we've used data and analytics to understand what's happening in our process, and what we're seeing is that the machine has been running for quite some time without a calibration. If we calibrate the machine, we can get it back to the original performance we need in order to produce stable material. So this is quite a significant finding, and now we want to share it with our management so that we can take the additional action, which is to perform the calibration. When it comes time to share this insight, we can simply export it, in this case as a Word document, and share it with our management. So here we have the Word document. We've captured that visual, where they can see exactly what we saw in JMP: the performance of the machine has been drifting over time, and a calibration needs to be performed. This also represents the first time that management is starting to use analytics; they're now starting to see the value of data in their organization and how it can help them improve their business decision making. And they have a new ask for us: they want to know what else can be done with their data, and with analytics, to improve their manufacturing processes.
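Before the journey evolves, here is a rough JSL sketch of the simple Excel-to-Word workflow just walked through; the file names, worksheet name, and column names are hypothetical placeholders, not the actual machine logs:

    // 1. Import the machine log from Excel; Worksheets() picks the sheet.
    //    Opening the file interactively instead brings up the Excel Import Wizard.
    dt = Open( "machine_logs_june.xlsx", Worksheets( "Sheet1" ) );

    // 2. Plot the measurements over time in Graph Builder.
    gb = dt << Graph Builder(
        Variables( X( :Timestamp ), Y( :Measurement ) ),
        Elements( Points( X, Y ), Line( X, Y ) )
    );

    // 3. Export the report as a Word document to share with management.
    Report( gb ) << Save MSWord( "machine_drift_report.docx" );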
So now the analytics journey has evolved, and JMP is very much a part of that journey. It's not just a tool that gives you access to individual analytical capabilities; it's also part of the process, so that you know how and when to implement certain strategies. So you spend some time reviewing a variety of JMP resources. You review white papers to learn about best practices. You read through customer success stories to see how others in your industry are leveraging analytics and how that's improving their business. You participate in a complimentary statistics course where you learn about many things: predictive modeling and how it can help you root-cause production issues, and reliability analysis and how it can help you understand how your product will perform in the field over time. One of the most informative things you learn about is the field of quality analytics, and you apply that learning to exactly what you're responsible for in your process. With this new learning you now have a more advanced analytic workflow, with some new iterations. Data is no longer being stored and accessed as Excel files; all the data is centralized in a database, so that the integrity of the data is never affected and so that everybody can access the data without having to work with individual files. In terms of analytical capabilities, you now have a better understanding of what statistics and analytics can do, so your questions are more refined and specific. The question you want to answer now is: is the machine experiencing special cause variation? You've learned about the difference between common cause variation and special cause variation, and you know that it's special cause variation that ends up being problematic for your processes. The last element is the shareable insights. Before, when you were sharing your reports as Word documents, it was creating a lot of additional work for you: as people consumed those reports, you were inundated with requests to modify graphs and with questions about the location of the most recent outputs. What you want now is a better tool, one that lets you centrally store all those reports in one location and offers the people consuming them additional capabilities, so that they can perform their own exploration without having to come back to you with more requests. So the analytic workflow we're preparing now involves these steps. Our data is accessed from a database; we leverage JMP's database utilities to get the data imported into JMP. We continue our data exploration and visualization, but we also incorporate some quality and process engineering elements that we've recently learned about from JMP resources. And when we share analyses, we want to both manage the content and share the analyses with a wider audience in a way that offers them greater capabilities than their Word documents did, so we'll be using the JMP Live platform to do this. We begin the process by accessing our data. We use JMP's built-in Query Builder tool to access our data connection. Once we're connected to the database, we can access any of the data tables; here, we've selected the table that contains the data we're interested in. Now, unlike before, we're able to pull data from everywhere in the factory, not just from an individual piece of equipment. We import the data into JMP and can now perform our new analysis, which leverages some new tools we've learned about under the Quality and Process menu in JMP. Getting an answer to our question requires building a control chart. The control chart gives us a visual that looks very similar to what we created before, but it also gives us capabilities that we couldn't get from a plain graphical visualization. Built into the control chart are rules we can leverage to determine whether we're experiencing special cause variation, and that's the question we're trying to answer. So we enable some warnings, which are special customized tests, to signal to us if we are experiencing special cause variation.
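A minimal JSL sketch of this step, assuming a hypothetical ODBC data source, table, and column names rather than the presenter's actual setup:

    // Pull the factory data from the central database and build a control chart.
    dt = Open Database(
        "DSN=FactoryDB;",                                      // hypothetical ODBC data source
        "SELECT batch_id, measurement FROM machine_logs;",     // hypothetical table and columns
        "machine_logs"
    );
    ccb = dt << Control Chart Builder(
        Variables( Subgroup( :batch_id ), Y( :measurement ) )
    );
    // The special cause tests (warnings) are then enabled from the chart's
    // red-triangle menu, or saved into this script once configured interactively.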
Now that we've turned on that test, we can see that there are many batches where we're facing special cause variation. Had we been monitoring our equipment with this tool, we could have detected very early on that there was an issue and taken the appropriate action. So this is quite a significant finding, and it's something we want to share with a wider audience. This time, we share the report with JMP Live. We're going to publish it to JMP Live, so I connect with my account, create a new post, and share the post with everybody on my equipment team who's interested in these results and needs to know the new insights we've just discovered. I publish the report to our JMP Live, and now we can look at it there. JMP Live is a web-based tool that allows anybody to access the report from their browser, so here we can see the report in JMP Live. JMP Live also allows all the reports to be centralized, so we no longer have to pass around static documents and Word docs, where people can sometimes be consuming old results and not be up to date with the latest findings. Because everything is centralized in JMP Live, there's one version of the truth, and you always have access to the most recent files. You can also do things you would not be able to do with static versions of the analyses: JMP Live is still very interactive, and anybody consuming the report can perform their own exploration and answer their own questions without having to come back to you, the analyst who prepared the report, for additional modifications. And as management consumes these results and gets additional value, something very common happens: their needs change. Instead of seeing this information once a week in a weekly report, they want to see it more rapidly, daily or even hourly, and they don't want a chart for just one piece of equipment; they want a chart like this for every piece of equipment in the factory, because they recognize how powerful this analysis is. They ask us, "Is there a way we can do this?" JMP offers the flexibility to do this, because a critical part of the JMP Analytic Workflow is the ability to automate. As we were building these analyses, JMP was capturing the JMP Scripting Language in the background to automate all of these steps. By simply saving the script, we can stitch together all of these actions: the action to connect to the database and import the data, the action to generate the chart, and the action to upload the analysis to JMP Live. At the click of a button, we can have those analyses automatically created by JMP. In our case, we want these analyses produced every hour, so we can use the Windows Task Scheduler to run the script on our behalf automatically, so that we don't even have to do it manually; a rough sketch of such an hourly script is shown below. So very quickly, you've seen a variety of examples of how the JMP Analytic Workflow can be leveraged to solve a variety of problems, depending on where you are in your analytics journey. We can put together the workflow to save both time and effort. We can easily access data from a variety of sources and share discoveries with other team members. We can get more from your investment.
We can increase your efficiency without increasing head count, and eliminate the need for multiple tools. We can remove barriers and complexity. We can tackle problems of any size, as we've seen today, by using JMP's extensive suite of analytical platforms. And we can accelerate process improvement by leveraging automation to reduce time spent on repetitive tasks and get to those actionable insights faster. As we've seen firsthand today, your analytical needs might start off being very simple, but when you're ready to grow, we'll be ready for you.
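As a rough illustration of the hourly automation script mentioned above (a sketch under assumptions, not the presenter's actual script): the platform scripts that JMP saves can be collected into one JSL file, and a JSL file whose first line is //! runs automatically when JMP opens it, so the Windows Task Scheduler only has to launch JMP with that file on a schedule.

    //! Run automatically when opened by JMP (e.g., launched hourly by Windows Task Scheduler).

    // 1. Import the latest data from the central database (hypothetical DSN and table).
    dt = Open Database( "DSN=FactoryDB;", "SELECT * FROM machine_logs;", "machine_logs" );

    // 2. Rebuild the control chart from the saved platform script.
    ccb = dt << Control Chart Builder( Variables( Subgroup( :batch_id ), Y( :measurement ) ) );

    // 3. Publish the refreshed report to the team's JMP Live site.
    //    The publish step is the script JMP saves after publishing once interactively;
    //    it is omitted here because it is specific to the site and JMP version.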
This presentation showcases the design of a special music hearing test to assess a musician's ability to hear melodies. The Definitive Screening Design (DSD) platform in JMP was used to consider six music script input variables (step, speed, notes changed, note level, repeat, difficulty), and two more center points were added for evaluating Gage R&R performance. Each DSD run is a multiple-choice question that asks respondents to pick their answer from four available choices.

The JMP Hierarchical Clustering platform was used to group similar music scripts from the 20 scripts provided by the DSD runs and to assign the similar scripts to the three non-correct choices; the correct choices were then added to make each hearing question more challenging. Next, a stratified cluster hybrid sampling method was adopted to select 30 candidates to participate in the survey. Once the scripts were determined, a commercial music synthesis software program was used to create this DSD melody hearing test. After collecting the survey results, the Fit Definitive Screening platform in JMP was used to analyze them. The goal was to determine the best rater (the one with the highest propensity for accurate rating of musical melodies) to serve as the judge for the next project phase.

All right. Well, thanks, everyone, for joining us. The title of our project is Design a Digital Music Melody Hearing Test. I'm Patrick Giuliano, and my co-presenters are Charles Chen and Mason Chen, who couldn't be here today, so I'll be presenting on their behalf. This is a high-school STEM project inspired by the ESTEEM methodology, which is basically STEM with AI, math, and statistics well integrated. Just to introduce the project in project-management terms, with the project charter: the purpose of the project, in effect, is to design a test of a musician's hearing capability. The experimental design methodology we use is JMP's powerful definitive screening design capability, and we designed the test based on six music melody variables in order to test hearing capability, where each question starts with a short melody followed by four choices, of which only one is a repeat and the other three melodies are similar but not identical. For each question, the listener has to pick the best choice among the options available. Once we designed this test, we analyzed the survey results, built a sensitivity model in consideration of the six music hearing variables, and then screened the listeners to determine which ones performed best in the music hearing test. In that screening process, we analyzed the strengths and weaknesses of their hearing capability, in the service of ultimately creating an orchestra with a panel of highly capable listeners to evaluate it. In the service of the science, we have an introduction to the mechanism of hearing: the ear is basically a frequency-receiving apparatus that collects sound, the ossicles in the ear vibrate, and that mechanical vibration is converted into an electrical stimulus, which is carried by the auditory nerve and ultimately interpreted by the brain. Before we get into the experiment and the variables we analyzed, let's talk a little bit about the frequency range of hearing among individuals depending on their age.
People of all ages without hearing impairment should be able to hear at a frequency of approximately 8,000 Hertz, and gradual loss of sensitivity to higher frequencies with age is a normal occurrence. What the science tells us is that the auditory structures of younger people are typically more capable of absorbing and interpreting higher-frequency sounds, which is relevant in terms of which instruments people play: the violin has a higher pitch than the cello, so perhaps a younger person might be better suited to playing the violin than an older person. This gives you an idea that people in their fifties may only be able to hear up to about 12 kilohertz, that is, 12,000 Hertz, whereas people in their twenties can hear up to perhaps 18 kilohertz. To give some context, the average frequency range of the sounds we hear most often every day is between 250 Hertz and 6,000 Hertz. So what are some challenges associated with hearing sounds of different frequencies? People typically miss high-frequency sounds more often than low-frequency ones, and people with high-frequency hearing loss have trouble hearing higher-pitched sounds, which often come from women or children and are in the upper two to eight kilohertz range. What is also typical with high-frequency hearing loss in many people is the presence of a phantom sound, the condition called tinnitus, and that competing sensation of sound can further inhibit a person's ability to distinguish other high-frequency sounds. So clearly, age is an important factor in designing an effective hearing test and developing an effective panel of listeners who are attuned to music. Although we didn't explicitly consider age in our experiment, as you'll see in the subsequent slides, it definitely could be a factor we explore further in our sampling strategy in terms of the survey respondents we choose. The basic measure of hearing performance is called an audiogram. The graph on the right is a plot of hearing threshold level in decibels on the vertical axis versus frequency on the horizontal axis, and you can clearly see that as hearing loss progresses, the threshold level of sound in decibels starts to increase; the degradation in performance is shown by the lines, split by year, moving down and to the right. Just a little more background before we launch into the design of the survey and the analysis: the intent here is to emphasize that frequency interference can be a problem in producing a melodious harmony, in an orchestra in particular or in any musical composition. What we're basically showing here is the difference between what are called fundamental frequencies and harmonics in the context of a piano, at the note scale indicated at the bottom. So what do we know about the music note frequency spectrum? Each note has, not surprisingly, a particular frequency. As an example, middle C is at around 262 Hertz; higher notes have higher frequencies and lower notes have lower frequencies, and this slide gives you context for which frequencies the notes correspond to. Note A is much higher, around 440 Hertz, than note C at 261, shown in the second set on the right, in the lower portion of the slide. So there is a relationship between frequency and the number of notes: the frequency needs to double every 12 notes, and we have 12 notes in each octave, seven white and five black. You can see that the relationship frequency follows as a function of the note number n is a power-law type relationship.
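The doubling-every-12-notes relationship is the standard equal-temperament formula, stated here for reference rather than taken from the slides: with A4 fixed at 440 Hz and n counting semitones above or below A4,

    f(n) = 440 \times 2^{\,n/12}\ \mathrm{Hz}

so middle C, nine semitones below A4, comes out at 440 \times 2^{-9/12} \approx 261.6 Hz, matching the 261 to 262 Hz quoted in the talk.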
So note A is much higher, around 440 Hertz, than note C at about 261 Hertz, in the second set on the right in the lower portion of the slide. Okay, so there's a relationship between frequency and the number of notes. Frequency doubles every 12 notes, and we have 12 notes in each octave, seven white and five black. So you can see that frequency, as a function of the note number n, follows a power-of-two type relationship. All right, so taking us back now to the project, the implementation, and the analysis. The project plan has three phases. The first phase, which is the analysis I'm going to cover today, is effectively the process of identifying which people are the best hearing performers from the collection of survey results we get back from the survey that we designed. The second phase takes those best hearing performers from the survey results and has them serve as judges. In this phase, basically, we work on forming the orchestra prior to phase three, where we actually do the forming. In this instance, we're thinking about things like which instruments have any potential limitations. We may give the same melody to different test instruments, and not every instrument can play every melody, obviously. So the idea is, how do we know that the individuals playing these instruments are playing them accurately? Well, we need judges who have good listening capability, so the judges that we curate from phase one will provide that expert evaluation in phase two. Once we have that in place, in phase three we can actually form the digital orchestra. We'll think about things like how many players should be involved and who should play where, and we'll have a good understanding of how the melodies could be difficult for certain instruments. This is why we need phase two in the middle. Okay, so here's our survey question design. We've identified six variables for this hearing test related to the parameters of music: step, speed, notes changed, note level, a repeat variable, and a difficulty variable, which is categorical, easy or difficult. The experiment, as I mentioned before, uses JMP's DSD: we generate a default DSD and then, in effect, augment the design by adding two more center points. So we're doing an 18-run DSD, which includes one center point, row number three in this table, indicated with zeros and an arrow highlighting row three. Then we're adding two more center points at row 10 and row 20, respectively. The idea in placing these center points is that we want to get an idea of how consistent the results are throughout the experiment, so we try to put a center point roughly at the beginning, in the middle, and at the end of the experiment. This is analogous to getting a sense of whether a measurement process is stable, if you're in a manufacturing environment. The other important thing about our design is that we're randomizing the test sequence, which is something we can do in JMP through the generation of the design, and I'll show a little bit about that briefly in the next few slides. That randomization is really important because it helps eliminate any bias due to factors that aren't in the experiment when we run the test (a rough sketch of that randomization idea, outside of JMP, follows below). 
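Here is a minimal sketch of that idea, assuming a 20-question test and using numpy rather than JMP's design randomization: shuffle the run order, and draw the correct-answer position from a balanced pool of letters so that A, B, C, and D each end up as the correct choice equally often. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_questions = 20  # 18-run DSD plus two extra center points

# Randomize the order in which the 20 melody questions are presented.
run_order = rng.permutation(n_questions)

# Build a balanced pool of answer positions (five each of A-D), then shuffle it,
# so no letter is over-represented as the "correct" choice.
letters = np.repeat(list("ABCD"), n_questions // 4)
correct_position = rng.permutation(letters)

for q, (order, letter) in enumerate(zip(run_order, correct_position), start=1):
    print(f"question {q:2d}: presented as run {order + 1:2d}, correct answer = {letter}")
```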
That kind of bias is sometimes referred to as lurking variation, or variation due to lurking variables. Okay, so there's another consideration I touched on. It's related to randomization, but in a slightly different context, one that's more unique to this particular application and experiment. Basically, we generated an initial random variable, assigned a random sequence of one, two, three, four, and then did a recoding on that: we labeled one as A, two as B, three as C, and four as D, and that's what we see in terms of identifying the correct answer. So in the two columns at the right of this 20-row table, we're identifying what the correct answer should be in terms of the letter, which is associated with a random value of one, two, three, or four, where one corresponds to A, two to B, three to C, and four to D. We're doing this to ensure a uniform distribution of the location variable. In practical terms, that means A, B, C, and D all have an equal chance of being the correct position. This avoids the biasing situation where a student picks the same answer over and over again to possibly increase his or her chances of performing well, or where the survey respondent isn't paying attention or isn't engaged in the survey. All right, so here is where we come to an evaluation of the performance of the DSD. We approach this through three things: the statistical power of the experiment, shown in the panel on the left; the confounding pattern, or the extent to which factors are correlated in the experimental design, shown in the panel in the middle; and what we call the uniformity of the design, which is simply, what does the structure of the design look like in multivariate space? Have we covered all of the design points in an approximately uniform way, so that we're able to predict across the entire range of the experiment with the same degree of precision? Going back over to the left, the overall power for each of the factors in the experiment is greater than 90 percent, which is good. It shows us that we have good sensitivity to detect effects if they're actually there in the population. The panel in the middle shows that the risk of what we call multicollinearity, or excessive correlation among the experimental factors, is low, because most of the pairwise correlations in this correlation matrix are blue, where a more bluish square corresponds to a lower correlation and solid blue indicates zero correlation. Squares closer to a red shading indicate a higher extent of correlation among factors or terms in the experiment. Overall, we look for correlations that don't exceed 0.3, and that holds for all the squares in this plot with the exception of those slightly reddish squares where the correlation is a little higher. That's because we have at least one categorical factor in this experiment; if we didn't have a categorical factor, this plot would look even bluer (a rough illustration of this kind of pairwise correlation check, outside of JMP, is sketched below). 
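As a minimal sketch of that check, assuming a stand-in factor table rather than the real DSD generated in JMP, you could compute the pairwise correlations among the factor columns and flag anything above the 0.3 guideline. The column names and the randomly generated settings below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for 20 runs of factor settings (-1/0/+1); the real design
# table would come from JMP's Definitive Screening Design platform.
rng = np.random.default_rng(7)
design = pd.DataFrame(
    rng.choice([-1, 0, 1], size=(20, 5)),
    columns=["step", "speed", "notes_changed", "note_level", "repeat"],
)

corr = design.corr()                              # pairwise factor correlations
high = (corr.abs() > 0.3) & (corr.abs() < 1.0)    # flag anything above the 0.3 guideline
print(corr.round(2))
print("factor pairs above |0.3|:", int(high.to_numpy().sum() / 2))
```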
So we say that in a DSD, we don't recommend adding too many categorical variables to the experiment, because if we do, we increase this correlation problem, which affects our ability to produce precise estimates in our model and leads to inflation of the variance of our estimates. The final plot on the far right, which is an indication of the uniformity of this design, is a scatterplot matrix in JMP, and it shows each variable versus every other variable. What we're looking for is for white space to be minimal in this plot. I've drawn a little circle here which your eye can easily pick out: there's a little bit of extra white space at the intersection of Repeat and Step. That, again, is because we have a categorical variable in our experiment. So truthfully, there's no perfect zero in the main effects, no true center point in the main effects, due to the presence of that difficulty variable, the categorical variable. That's reflected in the slightly non-symmetric pattern of the scatterplot matrix on the right, where the asymmetry shows up as the white space I've circled. Okay, so before I discuss this slide, I just want to quickly show you how I got to these design diagnostics. What you're seeing here is the table that I just showed you. I generated this design using the DSD platform under the DOE menu, under Definitive Screening and Definitive Screening Design. After I complete the design table generation process and fill in the results, JMP generates a DOE dialog script and saves it to the data table, so I can relaunch the DOE dialog and I can also evaluate the design. So I'm going to go ahead and quickly click on Design Evaluation. This is just an overview of the design, and right here under Design Evaluation is where I get the diagnostics related to power, which I showed you in the left panel on that slide, and the diagnostics that indicate the extent to which factors or terms in the experiment are correlated, which is shown here in the color map on correlations. To generate the plot looking at the uniformity among the factors, I actually have to go to the Graph menu and use Scatterplot Matrix. So that's just some context for you. Now I'm going to quickly bring up the next slide and then come back to JMP to dynamically show you what we're doing. So here's probably the most interesting part of this experiment: how do we increase the survey test difficulty, and do it in a smart way? Well, we can use hierarchical clustering analysis to do that. Now, we already know the correct answer. It's indicated here in the corresponding Choice column, and the four columns on the right indicate the choices drawn from the 20 melodies. So we know, for example, that in the first row the correct answer, C, corresponds to melody one, where the C ID number is one. So we already know the correct answer, which we've assigned in terms of row order based on a random number, but how do we pick the other three answers? Well, based on hierarchical clustering, we can get a sense of how close each of the other candidate melodies is to the correct answer, and in this way we can make the test a little more difficult. So all the answer choices are drawn from the 20 melodies (a sketch of the same clustering idea, outside of JMP, follows below). 
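Here is a minimal sketch of that idea, assuming hypothetical melody features in place of the real DSD settings and using scipy's Ward linkage rather than JMP's Hierarchical Cluster platform. Melodies that land in the same cluster as the correct answer are natural candidates for the distractor choices.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical melody features standing in for the DSD variables of the 20 melodies.
rng = np.random.default_rng(3)
melodies = pd.DataFrame(
    rng.integers(-1, 2, size=(20, 5)),
    columns=["step", "speed", "notes_changed", "note_level", "repeat"],
    index=[f"melody_{i + 1}" for i in range(20)],
)

# Ward linkage groups melodies that are close together in this feature space.
Z = linkage(melodies.to_numpy(), method="ward")
melodies["cluster"] = fcluster(Z, t=8, criterion="maxclust")

# Melodies sharing a cluster with the correct answer make natural distractors.
print(melodies.sort_values("cluster"))
```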
So how do we pick the closer melodies for each question, or the closest melodies if you will, or maybe melodies that are relatively close together based on the clustering criterion without honoring that criterion strictly? This might seem a little nebulous, but in effect, all we're really doing is telling JMP to assign a clustering scheme by row based on some clustering criterion that we specify, and by default that criterion is Ward. So I'm just going to show that dynamically here. I have the table open, and all I did was run Hierarchical Cluster under the Clustering menu. Once I ran this, I went ahead and invoked Cluster Summaries, which I've turned on here. Then watch what happens when I click on each of these clusters. These are the clusters: seven and 18 are associated with each other, 14 and 17, eight and nine, rows two and 13, and so on. So this is the idea: we're using the power of JMP to identify rows that are associated with each other, and by arranging answer choices that are relatively close to each other, following a schema like this, we make the test more difficult. All right, let me launch back into the slides here. Okay. So basically the last step in completing this experiment is that, in addition to a passive criterion for increasing the difficulty of the test, we want an active criterion. We want to be able to separate, in effect, the beginner level from the advanced level. Think of it like this: if every question were super difficult, or if all the choices were very hard to discriminate among, then you wouldn't be able to distinguish between an advanced-level respondent and a beginner-level respondent, because everybody would miss all of the questions. Similarly, if you made all the questions too easy, then you'd have all experts and no beginners, and so you'd have no differentiation. Based on the science, we have a hypothesis that step and speed are the most important factors for hearing performance, for discriminating between a good musical composition and a bad one. So are we sure about that? Well, one thing we can do is recode step and speed with a 50 percent reduction whenever the difficulty level equals difficult. By doing that, we still have five variables, and those are indicated in the shaded columns: the recoded step and recoded speed are the two shaded columns, and then we have notes changed, note level, and repeat. So the DSD is still orthogonal. We still have three levels and five variables, but we could actually incorporate up to six in the DSD. So how do we increase our value, in effect, by increasing that variable count to six? Well, we can add the difficulty variable, the categorical variable, which indicates either easy or difficult. So we decided to use step and speed combined with the other three variables, and the total sample size is still 18 plus 2, or 20, with the two added center points and the one center point by default. But now we get five levels for speed and step, not three. So with this little transformation, we smartly create five levels on two variables instead of just the three levels we would typically have in a DSD (a small sketch of that recoding follows below). 
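As a minimal sketch, assuming a hypothetical slice of the design table rather than the real one, the recoding amounts to scaling step and speed by 0.5 wherever difficulty is "difficult", which spreads those two factors across five levels:

```python
import pandas as pd

# Hypothetical slice of the design table; the real recoding is done in JMP.
design = pd.DataFrame({
    "step":       [-1, 0, 1, -1, 1],
    "speed":      [ 1, 0, -1, 1, -1],
    "difficulty": ["easy", "easy", "difficult", "difficult", "easy"],
})

# 50 percent reduction of step and speed whenever the question is "difficult".
scale = design["difficulty"].map({"easy": 1.0, "difficult": 0.5})
design["recoded_step"] = design["step"] * scale
design["recoded_speed"] = design["speed"] * scale

print(sorted(design["recoded_step"].unique()))   # [-1.0, -0.5, 0.0, 0.5, 1.0]
```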
So I think this is a unique approach that's quite specific to this problem context and gives us more levels in our design. Okay, so this is our design. What software are we going to use to actually generate the hearing tests? Well, this is just an overview of the music software synthesizer that we used, a soft synth. We utilized it to create 24 multiple-choice music melody hearing tests. It's obviously convenient, portable, and fast. All right, so how do we distribute this survey smartly? Many people use just one sampling method, but our approach is to integrate different sampling methodologies, cluster sampling, stratified sampling, and some additional clustering within those strata, in order to distribute the survey to the right audience and make it the most useful. So when you're ready to send out the quiz, how do you do it? Well, I have some examples here. Who should play the music? There are people who know music and people who don't, and we only want to send the surveys to people who are already familiar with music, because ultimately we want to use these people to evaluate the performance of an orchestra. In the stratified sampling sense, we have different kinds of instruments: we may have five students in a particular pool who know how to play piano and two who know how to play violin, and we may want to sample smartly so that we only pick a certain number within each stratum of players, people who play particular instruments. So we may pick randomly within each of these strata at a certain sampling rate. And with respect to clustering, we can think of location, practice location or geography, as a selection from many different geographies. In this case, we cluster and limit our selection criteria to only the San Francisco Bay Area, because practicing in person is much easier than practicing virtually. Okay, so really the point is that this survey dissemination and data collection process is very holistic and increases our chances of producing an effective set of evaluators to help us form the most high-performing orchestra. Okay, so quickly, to wrap everything up: we studied the human hearing frequency range, the instrument frequency spectrum, and the music frequency formula, and we designed an innovative music melody hearing test using a DSD. We also implemented two interesting approaches to increase the difficulty of the test: hierarchical clustering, as well as rescaling the levels of the most important predictors for the test answers. We used the music synthesizer software to disseminate the hearing test across the six music melody variables, and in our strategy for dissemination we used a holistic sampling methodology. So, in closing, some of the approaches we used and the science we developed could be used to develop a hearing aid, a music melody hearing aid. In the current market that we're aware of, hearing aids are really designed for people with hearing loss, but the idea here would be, how about making a hearing aid that's about amplifying a certain signal out of noise? That would, in effect, increase music melody hearing and detection. 
And so the main objective here would be to block out noise that's extraneous, for example, noise from the audience, and then amplify the signal portion for the particular frequencies that are important for playing a particular instrument, or even using this type of technology to even out the pitch, to amplify the transition between melodies. And so in future work, a similar DSD design can be implemented in terms of developing this kind of technology. So thank you very much for listening and let us know if you have any questions.
A picture is said to be worth a thousand words, and the visuals that can be created in JMP Graph Builder can be considered fine works of art in their ability to convey compelling information to the viewer. This journal presentation features how to build popular and captivating advanced graph views using Graph Builder. Based on the popular Pictures from the Gallery journals, this seventh installment highlights new views available in the latest versions of JMP. This presentation features several popular industry graph formats that you may not have known can be easily built within JMP. Views such as dumbbell charts, word clouds, cumulative sum charts, advanced box plots and more are included, helping you breathe new life into your graphs and reports!     Welcome, everybody, to Pictures from the Gallery 7. My name is Scott Wise. I'm a Senior Systems Engineer on the US West Coast, and I'm joined by my daughter Samantha. So Sammy, I wanted to ask you, as a 16-year-old growing up during a pandemic, still going to school and trying to find your path in the world, what are you most concerned about for the future? Well, to start, I'm pretty worried about how we're affecting the environment, like deforestation, soil depletion, climate change. Additionally, I'm worried about sexism in the workplace, the gender wage gap, and things like that. Okay, well, that's a lot to think about. So you got me thinking as well about what we can all do to help make this a better place, and I thought I'd dedicate my presentation to emphasizing what we can do to save the planet. You used to say to me, "Be curious... and do something about it," right? So we can use our curiosity, our time, and our skills. I'm going to challenge everybody that's seeing this video, and myself as well, to share meaningful data. A good place to do that is the Data for Green JMP website. It's a good place to see what sources of data are out there for understanding our environment, as well as to share any meaningful data that we have or any meaningful graphs or reports that we're able to generate. So please do check that out; I'll leave the link in my Journal for you. Next, use your time. Here's a cool use of your spare time: instead of maybe playing games on your cell phone, you could go out to this IIASA website, where they give you satellite images of the rainforest and ask your help in identifying where there are roads and structures, but also where there are untouched portions of the rainforest. This feeds their artificial intelligence models to help them make a better estimate of the rate of deforestation in the rainforest. So use your time, and then definitely use your skills. If you're here, you've picked up some good JMP skills to analyze, graph, and explore your data, and you can get inspired by our friends at WildTrack, where Sky and Zoe are doing a great job, not only helping protect endangered species with their non-invasive wildlife tracking methods, but also using analytics and JMP in very novel and creative ways. So definitely read up on those stories to see how we can do better and inspire ourselves to use our skills to do better. Thank you, Sammy, for all that help. I'm going to continue on and show the pictures from the Gallery, and instead of just showing them on fun data, I'm going to show our advanced pictures on data that reflects some data for good, Data for Green topics. 
So what we've got keyed up: number one is equality, some data on the gender wage gap, for which we're going to do the interval chart, also called a dumbbell chart. That was rated number one in terms of selections this year. Number two is pandemic data, where we look at how teachers have been affected by teaching in the pandemic, and we're using a word cloud. Bet you didn't know you could do word clouds within Graph Builder. The third one is going to look at tree cover loss across our planet with a new feature, a smoother line that uses moving averages. Next to last is going to be some safety data, which uses the points cumulative sum chart; this is a new way of quickly looking at cumulative figures within your graph, even when you're just looking at the points element. Another thing you can now do in Graph Builder is, I think, better-looking and more advanced box plots, and we're going to do that on climate change, looking at some projections for city risk going out to 2050. And then if we have time, I will show you a little bit about summarizing vector data, such as wind direction and speed, which comes into play a lot when we get a lot of adverse weather due to climate change; a wind rose chart is what we're going to feature there. So these are the pictures from the Gallery, and I will go through each of them individually as time allows. Now, just so you know, this is a Journal I'm giving to you. It is already out there at the link, and you can download it. You will have full pictures of the graphs, you will have tips and tricks, and you will have full steps on how to create each graph. I will also leave you with the data, and you will not only have the data but also scripts in the data to regenerate the graphs, so you can build your own and compare to see if your graphs look like the ones we created. All right, so let's get going. Our first chart here is the gender wage gap, and it's a dumbbell chart, because we have these interval charts. With these interval charts, you might see that we have large points on the ends of the interval: you're trying to make a comparison between two points, and you can see the distance between them as a bar. Some people thought these looked a lot like weight lifting: you go to the gym, you're lifting free weights, and you pick up a barbell or a dumbbell with weights on the ends and the bar in the middle that you grasp. That's what they think it looks like; some people also call this an interval chart. So you can see it's easy to see that males make more than females in France, and there seems to be a large wage gap that doesn't seem to be closing as you go through the years. So let's see how to create this. I'm going to go back to the data table we just opened up from our Journal. The secret to making this is that you want two columns of information to compare, and you want them to be on the same scale. Female and male monthly nominal salaries are what we're comparing here; they normalized all the countries' currencies into US dollars, so I should be able to use those in our graph. So I'm going to open up Graph Builder, start with a blank Graph Builder, take female monthly and male monthly and put them both on the X axis, and put year on the Y axis. 
Now I'm going to copy both female and male monthly and move those into the Interval zone. Now it's actually doing the job, but it's hard to see, because we have so many countries represented for each year and their intervals are all on top of each other. So let's go under the red triangle where it says Graph Builder and add a local data filter, and let's just look at a certain country; we're going to look at France. Now I've got France here and it's starting to look like my interval chart. I'll say Done in Graph Builder to close the control panels. Most of the things I'm going to change here are directly on the graph. For instance, I'd like the ends of these intervals, the points, to be bigger. To do that, I'm just going to right-click on the legend, go to Marker Size, choose Other, and make it a big ten; you can see how much bigger it looks now. I'll do the same thing with male monthly: Marker Size, Other, 10. Pretty cool. Now, what about the bar? I'm going to right-click, go to Customize, and it's the second error bar you see that shows up on top here. The line style is fine with me; I'm just going to make it gray and increase the line width to a two or three. I'll do this one at a two, and now I get the view that I like. Now, a couple of things I'm going to do on the X axis. I'm going to right-click on the axis settings and put a reference line at the average salary of $3,500 monthly, and say OK. Now I have that line drawn in, so I can see whether things are increasing or decreasing over the years. But it looks like my years are running from the bottom up; I'd like them to go from the top down. So I'm going to right-click, go to axis settings, and just reverse the order on the scale, and you can see now I'm going from 2010 down to 2019. I click on that, and now I can truly judge what's going on. Now, one other thing I thought was pretty cool: if I switch the colors here, if I make the female red and the male blue, then I can bring in a picture. I've got this great picture, you can see a small icon of it, that I think will look cool in the background, so I click on it. You can't really see what's going on yet, but it's there. All I have to do is right-click, go to Images, Size and Scale, and size it to fill the graph, and there it is. I'm going to right-click again, go back to Images, go to Transparency, and make that a 0.3. Now you can see it's more muted in the background. You don't want to bring in background pictures if you're going to create a complicated chart, switch between a lot of different filter options, or have many panes open, because it can get tedious to keep resizing the picture, but it's great for a standalone chart. So here we go, we've got our standalone chart. I'm going to close this one down and show you that these are available to you directly as a link in your data table. The one I was looking at was without the picture, but I did a couple side by side, and this lets me look at France versus Germany versus Sweden. I can see that in France the gap is not necessarily as bad as I thought it was; there's a bigger wage gap in Germany, but in Germany everybody seems to be making more money (for anyone working outside JMP, a rough matplotlib version of this dumbbell view is sketched below). 
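As an aside for readers who want the same dumbbell view outside of JMP, here is a minimal matplotlib sketch with made-up salary numbers; the figures below are purely illustrative stand-ins, not the ILO data used in the demo.

```python
import matplotlib.pyplot as plt

# Rough non-JMP equivalent of the dumbbell view, with made-up monthly salaries (USD).
years = [2016, 2017, 2018, 2019]
female = [2800, 2850, 2900, 2950]   # hypothetical values
male = [3300, 3340, 3400, 3450]

fig, ax = plt.subplots()
for y, f, m in zip(years, female, male):
    ax.plot([f, m], [y, y], color="gray", lw=2, zorder=1)   # the bar of the dumbbell
ax.scatter(female, years, s=80, color="tab:red", label="female", zorder=2)
ax.scatter(male, years, s=80, color="tab:blue", label="male", zorder=2)

ax.invert_yaxis()                                  # earliest year at the top, like the reversed JMP axis
ax.axvline(3500, ls="--", color="gray", lw=1)      # reference line at an average salary
ax.set_xlabel("monthly nominal salary (USD)")
ax.set_ylabel("year")
ax.legend()
plt.show()
```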
Getting back to the comparison: if you go to Sweden, it's the females who make more than the males, and I thought that was very interesting. So here's some good information and data to play with. This data came from the International Labour Organization, and I've put all the links into the data table so you can find where they are housed publicly. All right, that was our most popular view. Let's take a look at our second most popular view: how do you get a word cloud out of Graph Builder? A lot of people asked about this. All you have to do is create some ordering columns that let JMP figure out how to display the words within a cloud-like shape. I'll show you what's going on. I'll open up this educator top five COVID words table. An education article went out in the state of Kentucky, and they asked school teachers, "What are the top five words you think about when you hear COVID?" School teachers have been by far adversely affected by the pandemic, having to teach as schools close, as they reopen, and as they use mixed media, some classes live, some classes on site, mixed classrooms. So it's been very difficult. You can see some of the words that popped out captured here in this column, and you can see how many respondents put those words in their top five. The word anxious, for example, showed up for 20 of the respondents as a top-five word associated with the pandemic. So I have that information, and I created two columns. The first column is a random column, just random normal values assigned to the rows. I got it just by creating a column, right-clicking, going into the column info, and initializing the data as random; you can pick random normal. I figure if you have a mound-shaped, bell-shaped distribution, it's kind of like a cloud: most of the stuff is going to be in the middle and less is going to be out toward the tails. So I did that, and that's how I came up with the random data. I'll go ahead and delete that one. For the order column, it was pretty easy: I just sorted by weight, and you can see anxious was 20, then 17, then 17. So the order is just a row-sorted order; anxious is number one because it was biggest, and that's going to help me later make the word cloud. So let's go ahead and click on Graph Builder. Before we do anything with the words, let's actually go to the random column and put random on the Y, and it's doing what we expected: it's distributing the rows out randomly on the Y axis, just separating them out. What if we swapped those points for the words themselves? If you go to Points under the red triangle and click on it, let's select Set Shape Column. When I click on that one, I'll click on the words, and there are the words. I'm going to take the weight of the words and use it for Size. Oh, that's starting to look like a word cloud, right? Now I'm going to take that weight again and maybe give it some scale coloring, so the more red, the more frequent that word, which is kind of cool. Now, what's going on is it's doing jittering, center grid jittering, which is actually the automatic default, so that looks pretty good. And look, when you're done, as you move it around, 
it will try to adjust the words within that shape and try to hold that scale. That's really easy to do. I'm going to go under the red triangle for Graph Builder, and under Legend Position I'm going to go Inside Bottom Right... it may be Inside Bottom Left; I think I like that one better, so I'll put it at the inside bottom left. Then I'm going to go to Legend Settings and just keep the bit that talks about the color. So that's pretty cool, and that's how you would do the random word cloud. How I would do the ordered one is: I'll go back to the red triangle under Graph Builder and say Show Control Panel, and I will swap out random for order. Now all the big words are on the bottom and the smaller ones are at the top. That makes sense. Maybe I'll right-click on this axis setting, as you saw me do earlier, and reverse the order, so it's going from 0 to 30 with the ones and twos and threes more at the top. There we go, that's exactly what I wanted to see, and now I have an ordered one. Now, with the ordered one, a lot of times it might make sense, if I really want it in ordered order, to use a certain jittering that justifies the ordering. So I'm going to show the control panel again and change the jitter style from center grid to positive grid, and now it really is ordered, going from left to right, top to bottom, with the number one item, then number two, then number three, and then on the next line number four, and so on. Sometimes people prefer this kind of word cloud because it makes it easier to judge which word is bigger than another when the size and the color are similar; now you can make that judgment. Now it's looking good in terms of what I wanted it to do, but it's not looking good on the graph. What are we going to do about that? I really would like to move this whole thing over to the right. To do that, I'll open my control panel back up and go under the X axis, and even though there's nothing there, I'll right-click, say Axis Settings, and put a negative 0.3 for the minimum. Now it's moved over, and this looks a little more like an ordered word cloud. And of course, you can bring in pictures as well. I did that on this one: I brought in a nice apple picture with a little bit of transparency to make the words pop out on top of it. All right, so the instructions are there if you want to try this one. I can see a lot of people having a bunch of words, categories, or phrases, and if you've got a weight or a count on them, you can make a word cloud. All right, so let's see where we are. We are ready to go to the moving average smoother chart, and this one's pretty cool. It's a new smoothing line option that lets you look for trends, and a popular way to look for trends is a moving average. When there's a lot of noise in the things you're plotting over time, sometimes you want to smooth it: you can see that blue line here is smoothed in between the ups and downs of the points collected over the years. The other thing it's doing here, as you can see, is that I don't have a legend; I've actually labeled the lines, and I'm going to show you how you can do that as well. All right, so here I'm going to click on this tree loss table. What we're looking at here are reasons for loss of tree cover, or deforestation. 
This came from Global Forest Watch, and they have tree loss in hectares. So we'll go ahead and show that one: here's tree loss in hectares, and I'm going to throw my year down here. I've got both points and smoothers showing automatically, so all I really have to do is take the drivers and overlay by them. That's pretty cool. Now I've got just the spline method, and I'm not as big on that format for showing a trend. So if I click on the options, here's where you have more options in JMP 16, and I can do the moving average. I'm going to do the moving average, move back this little local width toggle so it fills out the graph, and even include this confidence fit. This is looking pretty good, so I'm going to say Done. Now I'm going to go under the red hotspot for Graph Builder, go to the local data filter, add drivers, and just look at the first three drivers. Okay, these are looking pretty good. Now I've got this legend way over here, and it's not really adding to the graph; I would like to do something better. So I'm going to right-click right on the legend line, in this case for agricultural shift, and go to Label. This is brand new in 16: I can do min and maximum values, first values, but I can also do a name, and look where it puts it. It puts it to the right, in the graph. I'm going to do the same thing with each of the ones I have open. Now I'm going to go under the red triangle for Graph Builder, go to Show, and turn off the legend; I don't need it anymore and I don't want the labels off on the side. Look what happens when I start to move one back: you can see I can move it into the body of the graph. I'm going to stretch this out a little bit on our screen, and you can see agricultural shift; if you put the label close to the line, it tries to hug the slope of the line, which is really cool. I'm going to put agricultural shift there, maybe put commodity driven just over here, and forest driven right in this area here. And now you don't need the legend. You can just see what's going on with the lines and let people's eyes go where they have interest. In the agricultural shift, here was the real story: this was more of an assignable cause of loss of tree cover, but it has gone down a little bit in recent times. All right, so a pretty cool graph as well, the smooth line moving average. Next, the points cumulative sum chart. This was cool data, actually data on driving safety, and the idea was, if we point out over the years the times when there have been major releases in car safety, are we getting safer? As more vehicles are on the road, we expect there to be more crashes just from the volume increasing, but are we actually having fewer injuries as our cars help us with airbags and anti-lock brakes and these things? We did this with a points chart, and we really benefited from using cumulative sum options that are new in JMP 16, right off the points chart. You can always create a cumulative sum column very easily in JMP, but we think this is better, and I'll show you why. So if I go to this motor vehicle safety data, which came from the US Bureau of Transportation Statistics, I have crash rates and injury rates, and the rates basically take the total divided by the vehicle-miles in millions, so you get an index. So I'm going to go open up Graph Builder. 
I'm going to take crash rate and injury rate and put them on the Y axis, put year on the X axis, pull off the smoother line, and just leave the points. You can see why this is not so great: what's going on right now is I see a trend, a trend of crash rates going down over time, which is good news. But is it different from the slope of the rate of change for the injury rate? Scale can make it hard to make that judgment; you're just trying to compare patterns, and they're not next to each other, they're one above the other, so it's hard to see. With one selection here under Summary Statistic, there's a new Cumulative Sum. Now tell me which one you think has a steeper slope. The crash rate is going up at a pretty good cumulative growth rate, but you can see the injury rate is leveling out a little bit; it is going up, but not as steeply. So I could summarize that the advances are definitely helping: we're still getting into a lot of crashes because there are more cars on the road, but it seems like we are saving ourselves some injuries, and that's a good thing. And the fun thing I was able to do, as you can see back in the data, is put in some safety innovations. Like, in 1996 we had side impact testing, and we had dual front airbags in 1998. So we can go to our axis settings, and this is a good place for a reference line: at 1998 we'll put a dashed line, maybe a dashed gray line, and we'll label it air bags. We'll add those, and there you go; now we can see where the air bags came in and what the performance was. The other thing we can do, instead of using plain points, is go to the red triangle where the points are and use the shape column here as well, for something that's not even categorical, something continuous like year, and it will still bring in those values. I'll say Done. I'll make these a little bigger with Marker Size, make them pop out a little more on our screen, and maybe even tuck in the legend position, put it on the inside left. By the way, you see where these highlighted areas are on my screen: you can put the legend in the corners or back out on the side of the graph, just drag it there. But there we go, and now we can see how these early advancements have really done the job to protect us, so even though there are more of us, more traffic, more cars, and there are going to be more accidents, hopefully they're less serious ones. Very cool. All right, so hopefully you're enjoying these like I am. Remember, I'm giving you this Journal so you can go and recreate them at any time. Next we'll show you the fifth most popular chart, which is the advanced box plot. This chart uses some new JMP 16 features that give the Graph Builder graphics a lot of ways to visualize box plots, and you're going to see that this is also a good way to do some interactive labeling. I took this data that I found from Nestpick; it's a climate change city index, and it was quite interesting. They came up with a total climate change risk that went from 1 to 100, and they based that off of climate shift, temperature shift, potential sea rise, and water stress, and that last one was the one I really hadn't thought of: will there be water around your town in 30 years? 
So in 2050, it said that these were going to be the cities, based on total risk, that were going to have the most problems. Bangkok was number one and Marrakesh was number ten, going on down. So I turned on the labels for these rows so they will pop up in our graph. Let's take a look at it. I'm going to go to Graph Builder and take the total climate change risk on the bottom; you can see some things already starting to get labeled from having asked for labels on those rows in my data. I'm going to look at it by region, so I've got three areas here, looking at it by region. Now I can say, well, let's make this a different color, let's make it red. I'll right-click there, right on the points in the legend, and make them a little bigger. Now the points are standing out a little bit. Now let's hold the shift key down and add in a box plot. This is a typical box plot, and it's not that interesting. You might notice Amsterdam and Bangkok showing up twice; that's because the box plot is labeling the outliers. Well, the points are already represented, so we can turn that outlier option off. I don't love this view, though. I've always kind of liked the solid views, but I've got points showing up well with this open box plot. If I change the box style to solid, all of a sudden it's gray, it covers the points up, and I don't see the fences of the whiskers, where my normal spread of points should be; I lose that. Well, not exactly. If I go to the box plot's red triangle, I can do Notched and I can do Fences. The notch actually notches the figure right where the median is; my boss likes to call this a torpedo, because it looks like a torpedo. Then I can add the fences back to see where the ends of my whiskers are from my box plot, which really does help. Now I'm going to say Done. Now you're going to say, "Well, Scott, it's still covering up the points." When I make this really big on our screen, all you have to do is right-click on the graph, go to Points, and say, you know what, move that forward. And there you go; now I can see where all my top ten at-risk cities fall on this list. And it was interesting, there were some surprises I didn't expect. Like, I thought that New Orleans here in the US, because it's so low-lying and so prone to flooding, especially from natural disasters like hurricanes, would come out higher. And in 2050, maybe Boston is not a great place to be. So that was really interesting to me, fun data to play with, and these are some fun things to do with points and box plots. All right, so we are nearly out of time, and I will quickly show you the last one, which is the wind rose chart. The wind rose chart is a way of looking at vector information. In this case, you see all these little arrows; those are all measured wind speeds and directions that came out of looking at the Great Lakes area around Chicago. They wanted to summarize it, so they came up with a special type of pie chart called a coxcomb chart that lets you mimic a compass. The compass rose is what you're used to seeing on the face of a compass, right? You know, you get north, south, east, west. Well, we're doing the same thing with the wind rose, so it summarizes all this data. To make it, it's pretty easy: we've got our positions for our wind. 
We have the speed of our wind, but we also have, in this case, the direction of the wind, and this one points to 230 degrees on the compass, so that would be west-southwest. So when we go to the graph, all we have to do is take those sections we've identified, pop them down here, ask for the bar chart, ask for the coxcomb style, and then take the speed and put it in the overlay. Now it's easy to see that a lot of my data is coming out of the western section, the northwest in particular, especially with the larger orange segments where there's a higher count. And of course, I brought in a nifty background map, so you can make it look really cool. If you want to learn how to draw those arrows, I've included that as well; that is something you would do under Points, Set Shape, Expression, and I'm showing you how you can put in just a little JSL scripting to draw these wind directions, where the length of the arrow is the strength of the wind. That's pretty cool. All right, so we are right at time. I'm going to include in your Journal where to learn more, so you can learn from the other galleries, the other blogs and journals, the other presentations, as well as other tutorials on the JMP Community. So please do learn more about Graph Builder, and please do share your data at JMP Data for Green. Please email me or contact me if you'd like to talk more about Graph Builder or see any of these views differently. I thank you, I hope you enjoy your discovery, and please do go help save the planet: get curious and share your results.
Do you want to build an analytics culture within your organization? In this presentation, we discuss how to develop an analytics strategy and advocate. There are facets to an analytics culture that require significant change that must begin with leadership and advocates within the organization who can set the tone and lead from the front. The analytics advocate must work to promote data as a strategic asset.    This presentation addresses how advocates can facilitate change, overcome resistance, promote collaboration, and educate and empower their workforces. They must find additional stakeholders to help execute a unified vision for change within the organization and adopt a plan that can result in an educated workforce to foster a successful analytics program.   As a tool to help upskill your organization’s workforce, this presentation also outlines and highlights unique ways companies can use content from Statistical Thinking for Industrial Problem-Solving (STIPS), a free, online statistics course available to anyone interested in building practical skills in using data to solve problems better.       I'm Biljana Besker, and I'm a JMP Account Executive for Global Premier Accounts. My colleague Sarah Springer and I will introduce you today on how to become an analytic advocate in your company and what resources we have available at JMP to help you get there. So let's start. What defines an analytics advocate? The analytics advocate must be an advocate and a change agent who spreads the analytic strategy and fosters an analytic culture that everyone is comfortable using data -based insights to improve the quality and effectiveness of their decisions. Five characteristics to look for in an analytical advocate are, first, credibility. They are trusted and well respected because of a proven track record of managing difficult projects to successful completion. And they have empathy because they listen to and addresses fears and resistance to change as new steps are taken on this unfamiliar path. And of course, they are problem solvers. They are willing to roll up their sleeves and work to overcome technical and cultural challenges that arise through each stage of implementation. They always show commitment. They support the analytics strategy and promote consistent interpretation of the goals for analytics. And they are flexible. Data-driven decisions require ongoing evaluation of their effectiveness. An analytics advocate must recognize when the part of the analytic strategy is not working, and work with all parties to redefine the solution. What is considered to be the analytics advocate's role? As an analytical advocate, you must promote data as a strategic asset and you have to address resistance and promote collaboration. And there is a big need to promote a culture of evaluation and improvement and to educate and empower the workforce. So assets do not necessarily have essential value, and assets are associated with liabilities. So how to promote data as a strategic asset? Analytic is about having the right information and insight to create better business outcomes. Business analytics means leaders know where to find new revenue opportunities and which product or service offerings are most likely to address the market requirement. It means the ability to quickly access the right data points to find key performance and revenue indicators in building successful growth strategies, and it means recognizing risks before they become realities. So how can you address resistance? 
There are three levels of resistance you must overcome. The first and most important level is C-level resistance. Preparing the technical infrastructure for an effective analytics program may require significant resource investment for an unknown return. This should be addressed, where possible, with a small project requiring minimal infrastructure, to secure a quick win with a positive expected ROI. If this is not possible, then show examples where others in the same industry have benefited. Second is department-level resistance. Process owners may resist the perceived effort associated with the data governance processes needed to make data clean enough to support analytics. The analytics advocate must find ways to show how such efforts will result in recurring long-term benefits to the organization that will turn into rewards and recognition for the department. Again, quick-win projects can help. However, the analytics advocate should not stop there; important tasks are best accomplished with a dependable ally with shared interests. And last but not least, we have front-line worker resistance. As with business process owners, front-line workers are not interested in extra work, as we know, if it's not reflected in the metrics used to assess their performance. A smart analytics advocate addresses the question, "What's in it for me?" Integrating analytics solutions into existing workflows reduces incremental effort and empowers front-line workers to make more informed decisions and improve job performance. So how do you become an effective advocate of analytics? As an analyst, you are obviously aware of the power of data analysis. You know that the application of appropriate analysis techniques to a well-constructed, meaningful data set can reveal a great deal of useful information, information that can lead to new opportunities, improvements in efficiency, reductions in costs, and other advantages. While many organizations have adopted analytics on a wide scale, several others still employ it only in certain areas, and some, believe it or not, rarely use it at all. If you often get excited thinking about new ways of applying analytics in your organization and are eager to share your excitement with people you think would benefit from analytics, you are in a good position to become an analytics advocate in your company. So first, focus on the person's greatest challenges and most burdensome tasks. Everyone has something about their job that is a source of frustration, no matter how much they love what they do. For the person you are working with, a meaningful application of analytics is one that relieves his or her frustration or minimizes it as much as possible. As long as the application is also important to the overall business, this is a great way to begin to show someone the true value of analytics. It's also a good idea to start small and then work your way up to bigger projects later, so that you are not overwhelmed and don't run the risk of not being able to deliver. Second, incorporate their knowledge and expertise. You may be an expert on the application of analytics, but you are most likely not an expert on every functional area of your organization; not even the CEO can make that claim. Therefore, you must rely on the insight of others to help you understand all of the complexity that cannot be contained within the data set, including any legal, ethical, or other considerations that must be taken into account. 
What's more, you are demonstrating respect for their specific knowledge, which will help build trust and make them more eager to work with you. Third, learn to speak their language. Being able to understand and communicate in the terminology used by the people you are working with will demonstrate that you are willing to meet them on their terms. It's not a two-way street, however: avoid using analytical and statistical terminology as much as possible. If necessary, practice finding ways to explain difficult or complex concepts in an easy-to-understand manner; metaphors often work well for this. Fourth, publicize your victories and share the credit. Once you have successfully completed the project, be sure to tell your boss, and ask him or her to spread the word throughout the organization, and externally if possible. But make absolutely sure that the credit is shared with those who assisted you in the project. This will help build attention to the power of analytics within the organization, as well as make the people you've just worked with feel rightfully appreciated and respected. If you look closely at these four recommendations, you'll notice they all have one thing in common: they put the focus on what you can do to help others. Whether you follow these specific tips or not, as long as you promote the use of analytics as a service that can help a person solve a problem that is important to them, you will go a long way toward fostering a positive attitude toward analytics throughout your organization. But how do you become a successful advocate of analytics? Put user experience first. For companies, it can be tempting to overlook the role of the end user and focus solely on business outcomes, which is why the analytics advocate must ensure that the focus remains on the value and overall experience for end users, in addition to the positive business outcomes the company wants to achieve. To bring us back to our earlier discussion of low adoption: an analytics strategy that does not consider the user's position and needs is at risk of becoming a strategy that is technically capable but not valuable enough to keep users engaged. To mitigate this risk, the analytics advocate must be able to explain the benefit of the analytics strategy to the business, but also ensure that the strategy is beneficial for the end users who need to make business decisions. And push the analytics strategy to evolve. Of course, user and business requirements change over time, so once the strategy is launched, the analytics advocate must ensure that the strategy evolves to meet those demands. Without iteration, the strategy runs the risk of outliving its usefulness and driving adoption rates down as a result. Instead, the analytics advocate must monitor, manage, and drive the strategy forward to ensure ongoing utility and maximum business value. Companies that want to introduce an analytics strategy can make themselves much more likely to achieve success by putting that strategy in the hands of someone who can understand end users, push the project to improve experiences and business outcomes, and who understands that analytics represents a journey and not a destination. Successfully appointing an analytics advocate is the first step in this process. Let me summarize what we just learned. At the most strategic level, analytics allows organizations to unlock latent value from their data to gain insights, accomplish business objectives, and improve profits. 
While these insights should empower everyone in the organization, many organizations resist the cultural changes needed to benefit from an analytics program. As a first step, executive leadership must establish and support the analytics strategy. Then, designate an analytics advocate to engage stakeholders to unify that vision, understand and address pain points, overcome resistance to adoption, and demonstrate the value from analytics through quick-win projects. All organizations can better accomplish their mission by leveraging analytics within a data-driven decision process. Using analytics to achieve a sustainable competitive advantage and generate significant return on investment begins with a well-conceived analytics strategy and roadmap for success that is aligned with and supports the overall business strategy. And with that said, I would like to hand over to my colleague Sarah Springer, who will show you how JMP can help you to become an analytics advocate in your organization. Thank you. Hi, I'm Sarah Springer. Biljana, thank you so much for providing that great overview of what makes a good analytics advocate within an organization. I'm going to look a little more closely at a couple of those areas that Biljana touched on, and we're going to talk about a process and some tangible resources that will assist you and your organization in building a culture of analytics. So how can JMP support your organization in becoming more analytical, and how can we support an analytics advocate? We've outlined a process here to help you accelerate your organization's analytics growth curve. That process is going to go through a couple of steps. First, we're going to talk about how to build a team of data ambassadors. We're going to talk about the best way and some resources to identify key use cases and define success. We're going to talk about how to establish an efficient data workflow, how to educate and what resources we have available to educate and upskill yourself and your colleagues, how to socialize your analytics successes, and then how to democratize data and the process. So Biljana touched on this in her presentation. But what is in it for you? If you're an analytics advocate within your organization, what can this do for you as an individual? You can be a vision setter and a change agent throughout your organization. This is an opportunity for you to make a real impact on the lives and the well-being of the people in your organization. You can be a subject matter expert. If you identify a specific problem or a specific area of need and upskill yourself in that area, you can really be looked at as an SME within your organization and gain some recognition for yourself within your organization and within the JMP community. You'll be gaining credibility. You'll become a leader in teaching others the skills that you've learned and upskilled on. And then ultimately, we've seen a lot of our analytics champions throughout all of our organizations really build a strong resume and advance in their careers because of the great work they've been doing at their organizations in building a culture of analytics. Data is everywhere. Analytics is an important competitive tool, and it's really past the point of being able to not have analytics embedded in your organization. And so what we've seen is that individuals who have been an analytics advocate within their organization have been quite successful, and so this is a real opportunity for you.
But what Biljana mentioned is that it's really also about helping others and making an impact on your organization and the world around you. And so what's in it for your organization? Advocates play a key role in demonstrating value and ROI. You're able to pick a project, a real challenge that you or your organization is having, and show the true value of the impact that analytics can have on your organization. Adoption of JMP, an analytical tool, or an analytical culture can really help, again, bring the organization into the digital transformation age. We are at the point where we can no longer afford not to take advantage of all of this data that we have. And so in this role you have a real chance to make an impact at your organization, and again impact your organization's bottom line: save your organization money by improving processes or producing less waste. Other use cases we've seen are securing time-to-market, and you're really helping your company to stay competitive. So the first part of the process, as you're thinking, "How do I make an impact at my organization?" is to think about, as Biljana mentioned, who can come along with me on this journey? Who else is feeling the same pains? Who else can benefit from a strong analytics culture? So look beside you into other departments, but then also up. Who at the executive level? What executive sponsor might be interested in some of these pain points I'm having? How can I get stakeholders to support this movement? How can we get buy-in early? And so the goal is to find reliable, passionate, accountable people who are maybe having some similar challenges as you, to walk through this journey with you and to help, as Biljana mentioned, show the value and prove to leadership and to stakeholders that this work is valuable and deserves attention and investment. Once you have your colleagues, you have buy-in, and you have your team, the next step of the process is really looking at identifying key use cases and defining success. What success looks like is an important part of defining an analytics strategy. Think about some common use cases within your organization: maybe you have too much data, maybe there's data not being used, too many systems to get the job done, not a good way to share decisions, maybe there's a lot of wasted time sifting through all of that. So figure out: what is my organization's challenge? What would success look like? How can I move the needle? And as Biljana mentioned, starting small is important. We want to think maybe of something that's not a huge undertaking, but maybe has a broad impact. So as you're thinking of the right place to start, these are some things to consider. Some great resources that I would recommend as you're thinking through this: you can look at your organization's annual report or 10-K, if you're a public organization, or maybe there are some internal documents that outline for the year what your risk factors are as an organization. You can get a strong overview of some of the concerns that executive leadership has about the risks they could face this year. Often, we see some risks across R&D and manufacturing that might be very relevant, problems that could be solved with analytics. Maybe it's time-to-market. Maybe it's reducing defects in the manufacturing process, right? Those are things outlined in your organization's 10-K or your annual report.
And that could be a really great win for you, right? If you can pull some folks together who want to solve one of those problems or improve one of those processes. The other resource I would recommend taking a look at is going to the JMP website and looking at our customer success stories. There's a whole library across industry and challenge that could really get your wheels turning and give you some great ideas about some possible use cases and what success might look like. So once you have your team and once you have the goal, next you want to think about how to create and establish an efficient data workflow. In order to do great analysis, you want to have good data access. You want to be able to streamline that process. How are you pulling, analyzing, and sharing that data? How are you getting the right information to the right groups? Where is your data? Is it accessible? Is there anything you can automate? Can you make anything easier? Can you use JMP Live to share information? So there are a lot of things to take a look at. Tangible resources for this include conversations with IT. Maybe you can look at possible scripting or automation within JMP or within your analytical tools to really make an impact, to make this as easy as possible so that it's in the hands of the right people who can solve these real problems and contribute to your success and the success of your organization. So next, we're going to talk about a step that is very close to my heart: training and upskilling colleagues. I spent some years at SAS within SAS Education, helping JMP users do just this. And so I wanted to touch on, once you have your team, you've defined your goals, and you've got your data access in a good spot, how do we give our team and employees and users the tools and the knowledge to execute this plan? There was a survey done by HR Dive studioID and SAS that was conducted in October of 2020, and they found that this is a huge need. 88 percent of managers said they believed their employees' development plans needed to change for 2021. A lot of this was coming out of us shifting into a pandemic world and people working remotely. And folks are really asking for training and for development and for help. Out of the survey, 50 percent of managers said employees needed more upskilling, more reskilling, and more cross-skilling. And 41 percent of the employees themselves said the same thing. And when considering the types of skills employees should focus on, employees needed more technical skills. They really, really want to build their skill set. And so as you're thinking about how to build an analytics culture, training and upskilling are really important. And people want those more technical skills so that they can make a contribution in this age of digital transformation. The survey also brought to light five major learning and development trends, so I just wanted to highlight these as something to be thinking about as you think about your strategy. The trends were that companies are now expected to take on more responsibility for employees and society, making sure that they're getting what they need and that people are being taken care of. Another theme was that companies need to match their technological investment with the learning and development of their people. Learning and development are much more universal, and they are really strong recruiting and retention tools.
And again, these hard skills are really in demand. And so as you're thinking about your strategy, it's so important to think about how you can help your colleagues and your organization get the right knowledge and training in their hands so that they can really be impactful with all this data that we have. So here at JMP, we have a couple of resources I want to point out that can be really powerful to help upskill your organization. The first that we're going to touch on is the Statistical Knowledge Portal. Then, we're going to take a deeper dive into STIPS, which is our statistics course. It's a free online course called Statistical Thinking for Industrial Problem Solving. It is award-winning, it is self-paced, and it is a wonderful resource that many of our customers are using to provide analytical development to their employees. And then finally, we do have some formal SAS training and some resources that I do want to point out as we go through this process. So the Statistical Knowledge Portal is a great site. I've put the link there for you; it has information in all of these different areas that I've listed. So it's a great way, if you have somebody who needs to know a very specific skill, they can go on here. They can pull some resources and skill up fairly quickly. It's a great way to get them started, to get their feet wet, to develop some knowledge, to get some tips and tricks. I would highly recommend spending some time on this website. And then you, as the analytics advocate, can really help drive the person to the skill they might need based on your project. So I think there's a lot of collaboration that can happen here. But I do want to point out that this is a phenomenal free resource on the JMP website and has a lot of great statistical information for you. The next one I want to point out is STIPS. All of these modules that are on the right are self-paced. They're deep dives into the topic area. They have hands-on exercises. And it's really going to help you get up to speed and understand that statistical concept. So as you're working on your project, as you're working towards your goal, think about different areas of this course that might be helpful. There is a great overview module at the beginning as well that I would recommend; it talks about what different processes you can use to begin to think statistically throughout your organization. So it starts at the beginning, and then it goes all the way down to advanced modeling, so it can really meet you where you are. And we'll talk a little bit more about this course. What I love about this course is that this is really something that JMP has put out because they want folks to be strong in analytics and they want folks to understand statistics and to understand the why behind what we're doing. And so we've put out some additional resources to help companies upskill their teams. So you can take this course in a self-paced format. But we've had many, many customers want to use these materials in a different way. We have customers doing lunch and learns throughout their organization, having sessions where they'll take a specific concept from the course and have discussion groups. We have professors at universities using STIPS or some of this material as prerequisites or even within their statistics courses. And so what we've done is we've made teaching materials that put some of this material into PowerPoint slides for you to use at your organization for some internal training.
And we wanted to make that easy and accessible for you. So go to jmp.com/statisticalthinking; there's an online form on the right-hand side of the page where you can fill that form out and get these materials to use to help upskill your team. And then finally, the third training resource I wanted to touch on is formal SAS training. SAS has incredibly strong, relevant, hands-on trainings that provide real depth and understanding of different concepts. There are lots of different formats for individuals, large groups, and small groups. And I've put the link up here so you can go check out those courses as well. It's a really great way to upskill your team and make sure they have the right tools. And if you don't know where to start, I definitely wanted to highlight a tool that SAS Education offers called the Learning Needs Assessment. It's a data-driven survey that can be distributed to your team, organized by learning area into the major areas that we often see our customers needing, maybe the major courses that SAS Education offers around design of experiments, scripting, or ANOVA and regression. These are great resources. And if you don't know where to start, we want to be able to survey your team, identify what their preferred learning style is, identify their competency in these areas, and then put it in a report that's easy for you and managers and executive leadership to understand. And then we would work together with you to make great recommendations. Those recommendations could be use of STIPS, use of formal training, or complimentary resources from the Statistical Knowledge Portal, but it helps give you an idea of where to start if you're not quite sure, according to your project and your goals, what your skill gaps are; sometimes you need a little help identifying those. So that's what this is for. So once you've got your team upskilled, once your team is trained in the areas and you're working on your project, it's time to document those successes, right? Biljana touched on this as well. You want to be able to document that as a proof of concept, show the value to your organization, continue to get that commitment and investment in your work, your team's work, and the power of analytics, and continue to help your organization move towards digital transformation. And so being able to document these successes is important. A couple of resources that I've found helpful to do this: I think the main one is our customer success program. We do have a great program, I mentioned earlier, where you can get these stories published on the website, but we've also helped some organizations with internal stories. Ask your account team for help. We want to help you document these successes, and so we can certainly help you do that. And if you want to tell a story in JMP, we would love to help you show the impact that you've made for your organization. And then finally, being able to democratize data and the analytics process. This is one of those steps: how can we take what you've done as a group and then spread this further throughout your organization? Once you've done this and you've documented your successes, I'm sure you're seen as an effective leader, and you're probably well respected. So now you get the opportunity to make an even broader impact on your colleagues, bring them along with you in that success, and make an impact on your organization.
So what you can do from here is really empower more people to be more data-driven. And I think that means using things like I mentioned with some of the STIPS tools: maybe you're leading lunch and learns, maybe you're creating a user group, maybe you're doing an internal newsletter about the power of analytics, maybe you're working with the JMP team to run sessions around features that have been very helpful to you with your project. So this is your opportunity to help others, help others at your organization make an impact, and help your organization shift to be more data-driven in today's digitally transformed world. So I want to leave you with some tangible next steps. We've gone through a process of how to build a culture of analytics. And as the next step, JMP has a great resource, jmp.com/advocate, where you can go and learn more about each of the steps that we've outlined today and what resources are available to you that correspond to each step. Today has been a great overview, but if you do want to take a tangible step to move towards an analytical culture within your organization, I would highly recommend that you go here and check it out, and then don't hesitate to reach out to your account team. We're all here to help and we want to support you in making an impact at your organization and around the world. Here are sources for your viewing pleasure. And I just want to thank you very much for your time and attention today. Be well.
Monday, March 7, 2022
JMP provides many data importing methods, allowing you to retrieve data from large databases down to simple text files. But what if your data is in a unique file format that JMP cannot directly import? By using the powerful but little-known JSL Blob (Binary Large OBject), you can extract that valuable data for use in your analysis. This presentation offers a case study in developing an import function for a common but obscure data source: the metadata locked within a digital photograph. Modern cameras record much more data than just the image. Metadata -- how, when, and where the photo was taken -- is locked away within its binary EXIF tags; with JSL Blobs, we can import that data into JMP. From there, we can use JMP's analytic tools to measure metrics of our photo library, and even produce a map-based display of our photos.     Hello, I'm Michael Hecht, and I'm here today to talk to you about importing binary data in JMP using JSL. I'm on the team that develops the software for JMP, and let's get started. JMP can import lots of different file formats, everything from plain text to Excel spreadsheets. When those are imported into JMP, they're shown to you as a data table that you can then use for further analysis with all of JMP's capabilities. But what if there's a data type that you can't import? JMP doesn't know how to do it. In this case study, I'll be looking at the JPEG image format that is pretty common amongst all digital cameras and smartphones. I'm sure everyone's familiar with it. In fact, you might be saying, wait a minute, I thought JMP can open JPEGs, and in fact, it can open a JPEG. But when it does, you get an image like this. But there's more in that JPEG than just the image. For example, if I get information on this one, I see what kind of device the image was taken with, what lenses were used, and even the GPS coordinates of where I was standing when I took the photo. Now, how can we get that data imported into JMP? Well, we can do it through JSL, which I'll get to in just a minute. But if we open this file in a text editor, we see that it's not human readable text. It's a series of bytes that are shown as these unprintable characters, and we call that binary data. It's data that's outside the range of normal alphabetic text. It has a structure, though, and the data is locked inside there. If only we could determine how to unravel it. To do that, we need to know the specification of how this data is laid out. In the case of JPEG, that's defined by a specification called Exif or Exchangeable Image File Format, and we can download the spec for it. It's a document that's been around for about 20 years, and it's in use by all the devices that produce JPEGs. Not only hardware devices like cameras, but even Photoshop puts metadata in a JPEG in this Exif format. To access it from JMP, we need to use a JSL object known as the BLOB or Binary Large Object. This is just a JSL object that holds a sequence of bytes. Like the name says, that sequence of bytes can be large. We can actually create a BLOB by loading the contents of any file on your hard disk into it using this Load Text File function. Normally, that would return the contents of a file as text, but if we add this BLOB keyword as a second parameter, then the function returns a BLOB. We can take one BLOB and subset a part of it into another BLOB using Blob Peek like you see here. This is taking 50 bytes from b, starting at offset 100.
Now the offset for BLOBs always starts at zero, so the first byte in the BLOB is at offset zero. Now we could do both of those operations in a single function call by passing the offset and length as parameters to the BLOB keyword when we call Load Text File. You see here that says: open the file at this path, skip 100 bytes in, then read 50 bytes, and return that as a BLOB. Once we have a BLOB, we can convert it to a character string using the Blob To Char function. Here we're taking b2 and converting its bytes into a character string, assuming that those bytes are in the "us-ascii" charset. If we don't specify a charset, JMP assumes it's UTF-8. We could also consider that BLOB to contain a series of numeric values, all of the same type and size. Using Blob To Matrix here, we're taking b2, which we read in and set to be a length of 50 bytes, and interpreting it as an array of unsigned integers, each of which is two bytes long. We should get back a matrix with 25 numbers in it. Now that fourth parameter, the string "big," says that those unsigned integers are in big-endian format, meaning the first byte is the most significant, the highest part of the int, followed by the lowest part of the int. We could also specify the string "little" to specify little-endian format. Binary files have both of these kinds of representation of integers and other numbers. In fact, the Exif format uses both big-endian and little-endian. Let's take a look at these operations in action. I'm going to switch over to JMP, and I'm going to open this demo script here, Demo number 1. Now we see some of the code we just looked at. Here we are loading a text file, this file named Beach.jpeg. This is the same file that I used in my slide. It's right here and you can see it. You can also see that it has a size of about 3.4 meg. When I run this one line of code, the log tells me that b was assigned a BLOB of some 3.45 million bytes, or 3.4 meg. It doesn't show me all those bytes, but I can see how big it is. I can get that length using the Length function, just like you use for character strings. But when I use Length on a BLOB, it gives me back the number of bytes. I can get a sub-BLOB of the first six bytes in b using Blob Peek. We'll do that here, and I see a BLOB of six bytes was assigned. I can actually look at the value if I want by just submitting the name of the variable. I can see here are those six bytes in this "ascii-hex" format: FF-D8-FF-E1, et cetera. I can take these six bytes and convert them to a matrix, and I'm going to convert them to a matrix of two-byte unsigned ints, or shorts, in big-endian format. Given that there are six bytes here, I should end up with a matrix of three numbers. When I run this, sure enough, there's my three numbers. We can see those three numbers in hex, just to verify, with this little for loop that I wrote, so let's do that. There they are, just the same as before: FFD8, FFE1, 0982. So now, let's look at the next four bytes following those six in the file, and we'll get them in a sub-BLOB all by themselves so that we can then convert them to a character string using Blob To Char. When I run this, I get the character string "Exif." You may have noticed in the slide showing the binary file contents that that little string was up there near the top; it's part of the Exif file specification and identifies it as such. Let's go back to the slides. Those functions are powerful.
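For readers who want to try this themselves, here is a minimal sketch that strings together the calls just described; the file name and offsets are illustrative rather than taken from the demo script.

    // Minimal sketch of the BLOB functions discussed above (file name and offsets are examples)
    b = Load Text File( "Beach.jpeg", BLOB );                      // whole file as a BLOB
    Show( Length( b ) );                                           // number of bytes in the BLOB
    b2 = Blob Peek( b, 100, 50 );                                  // 50 bytes starting at offset 100
    m = Blob To Matrix( Blob Peek( b, 0, 6 ), "uint", 2, "big" );  // first three big-endian shorts
    s = Blob To Char( Blob Peek( b, 6, 4 ), "us-ascii" );          // the four-byte "Exif" marker string
    Show( m, s );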
They let us do what we need to do to manipulate and read data from a BLOB, but they're a little cumbersome to use. Let's write our own utilities to make them a bit more manageable. I'm going to start with a function that I've named Read Value, and it takes a BLOB, then some offset within that BLOB, then the numeric type I want to read, and the size of that type. It's going to read one value out of the BLOB. I pass my BLOB and offset and size into Blob Peek, get back a sub-BLOB of just those bytes, and then call Blob To Matrix, passing in the type. I use the same size, so the size of the BLOB and the size of an element are the same. I should get back a matrix of one value, and I pass in "big" because I'm just always going to use big-endian format. But I don't want to return a matrix. I want to return that one value, so I pull it out of the matrix and return that. This is called like so: I call Read Value, I pass in b, I read one unsigned int starting at offset zero that's two bytes long, and I get back that value FFD8. There's a problem with this code though, and that's in this parameter b. B is that BLOB that's 3.4 meg in size. The problem is that JSL, when passing a BLOB to a function, always passes it by value, meaning it makes a copy of it. For every single number I want to pull out of my BLOB using this function, it will make a copy of that 3.4 meg just to pull out two bytes, or four bytes or whatever, and then throw it away when the function returns. That's inefficient, wasteful, and probably really slow, so we don't want to do that. How can we get around that? Well, instead of passing it as a parameter, let's put it in a global. We'll make a global that we load with our BLOB, and then we can call the function and it'll just refer to the global instead of a parameter. In fact, we can make a bunch of globals. We can record the length of the BLOB, the offset to where we are currently processing data in the BLOB, and maybe even one for the endianness. The problem with globals, though, is that they are in the global symbol table, and they might interfere with other code that you have. In fact, we'd like to write our importing code as something that can be used by other clients, and those clients might have their own variables by these names, or they might be using other code libraries that would interfere. How do we get around that? Well, I've done it by creating a namespace. I call my namespace "EXIF Parser." Now, instead of globals, I put them all as variables inside my namespace, and now they're namespace globals with that prefix. Before I call my function, I need to initialize them. I'll load the "Beach.jpeg" file into the EXIF Parser BLOB, and I'll record its length. I'll start off my offset at the very beginning, at zero, and I'll set the endianness to "big," and then I can change my function, simplifying it a bit like this. Now, I've actually put my function in the same namespace, Read Value as part of the EXIF Parser namespace. Now, all I need to do is pass a type and a size. Blob Peek now uses the global BLOB that's stored in the namespace and the global current offset that we're reading from, and Blob To Matrix even uses the endianness, so we can parameterize that, we can change it. Once I've retrieved the result I want, I'll increment the offset by the number of bytes we just processed, and then I'll return it as before. Let's see what that looks like in action. We'll look at Demo 2, and here's my namespace. Here are my globals in that namespace.
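The sketch below condenses the namespaced reader just described; the member names inside the namespace are my own guesses, and the complete version is in the paper that accompanies the talk.

    // Condensed sketch of the "EXIF Parser" namespace and Read Value (member names are guesses)
    New Namespace( "EXIF Parser" );
    EXIF Parser:blob = Load Text File( "Beach.jpeg", BLOB );   // the shared BLOB
    EXIF Parser:length = Length( EXIF Parser:blob );
    EXIF Parser:offset = 0;                                    // current read position
    EXIF Parser:endian = "big";
    EXIF Parser:Read Value = Function( {type, size}, {m},
        // read one value of the given type and size at the current offset, then advance
        m = Blob To Matrix(
            Blob Peek( EXIF Parser:blob, EXIF Parser:offset, size ),
            type, size, EXIF Parser:endian
        );
        EXIF Parser:offset += size;
        m[1]
    );
    EXIF Parser:Read Value( "uint", 2 );   // for example, returns the first marker, FFD8 in hex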
Here's the Read Value function, just like we saw. I've got some more functions that I've put in the EXIF Parser namespace. Here's Read Short, which just calls Read Value, but it always passes unsigned integer of two bytes. Similarly, Read Long reads an unsigned int of four bytes. I've also got Read Ascii, which you pass a size in bytes; it makes a sub-BLOB of that many bytes at the current offset from the global BLOB and then calls Blob To Char to convert it into a string. It's using the "us-ascii" charset, because that's the charset that the Exif specification says all of its character data uses. Then, just like with Read Value, it increments the offset past the bytes that were already processed and returns the string. Let's submit all of this code so that those things are all defined, and then we can try to use them. First, we'll initialize our EXIF Parser globals, and then I'll read the first three shorts from the file just like we did before. But now, I'm going to call Read Short. We're starting off at zero, so I'm going to call it three times in this loop. It will read in each successive short, advancing the offset as it goes, and then print them out just like before. There they are. Now, our offset is sitting at offset six, just past the last thing it read. I can call Read Ascii for four bytes and I get back that same string. Okay, so let's go back to slides. Now, we have some tools we can use to start building our EXIF Parser. We need to dig into the specifications to see what the EXIF data format in this JPEG file looks like. Well, at the top level, it looks something like this. It starts with two bytes, which are what's called the start of image marker. We've already seen those two bytes. They're the value FFD8. If your file doesn't start with that, it's not a JPEG. Then there's a series of blocks of data, and each block starts with two bytes, which is a marker, then two bytes, which is a size, and then some data, which is however many bytes the size said there were. Now the size also includes itself, so really the data is size minus two. You can see there are a bunch of different block types defined, but some of them are optional, and some of them can be repeated. The ones that we care about are APP1 and APP2. That's where the EXIF data will be. Then there's a bunch of others that we don't care about. Eventually, we see one called Start of Scan, or SOS. When we hit that, we know that the next part of the file will be the actual image data, which is the pixels. When we hit that, we can stop. Then after the image data is end of image. We need an algorithm to read this data. Here's what we'll use. First, we'll read the first two bytes and see if it matches the start of image marker; then we know we have a JPEG. Then we'll have a while loop, where within the while loop, each time through, we'll process a single block. To do that, we will save the current offset position, read the two bytes for the next marker, and if that marker is SOS, we can break out of the loop. Next, we read the two bytes for the size. Now we have all the information we need to process the block. Whatever that entails, we'll do it. Then we can skip past the data. In case processing the block didn't change our offset at all, we'll explicitly move our offset to whatever it was at the beginning of the loop, plus the two bytes for the marker, plus the value of the block size. When we get out of the loop, we either ran out of data in the file to process or we hit that SOS marker, so we're done. Let's see what that looks like.
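Sketched out, the convenience readers built on Read Value might look like the following; again, this is only a condensed approximation, and the full definitions are in the paper.

    // Sketch of the convenience readers described above
    EXIF Parser:Read Short = Function( {}, EXIF Parser:Read Value( "uint", 2 ) );
    EXIF Parser:Read Long = Function( {}, EXIF Parser:Read Value( "uint", 4 ) );
    EXIF Parser:Read Ascii = Function( {size}, {s},
        // take size bytes at the current offset and decode them as "us-ascii" text
        s = Blob To Char( Blob Peek( EXIF Parser:blob, EXIF Parser:offset, size ), "us-ascii" );
        EXIF Parser:offset += size;
        s
    );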
Demo 3 has this code. You see, we have our namespace, and all this is the same as before. I'm going to run it just to make sure everything is defined. Now, we're adding a new function, which I've also put in the namespace, and it's called EXIF Parser:Get EXIF Data Raw, and I'm passing in the path to the file, that JPEG file that we want to process. Now, I've defined an associative array here that maps those magic marker codes to their abbreviations so that we can print them out in the log. I load up my EXIF Parser globals like before, only now I'm passing in the file path that I was given, and then I start interpreting what the data is. First, I look at the very first short and make sure it's the start of image marker. If not, I just return, because it's not a JPEG file. I'm going to write to the log that I saw it at offset zero. Then here's my while loop to walk through those blocks. At the top of the loop, I'm going to reset my endianness to "big," because some of the blocks, when we process them, will have their own endianness and change it to little. We want to know that the endianness is big at the top of the loop because the block structure always uses big-endian data. Then I'm going to save whatever the current offset is, and then I'm going to read the next marker. It's a short, and I look to see if it's equal to SOS, which is that magic number. If it is, I can break out of the loop after logging that I saw it. Next, I'll read the two bytes for the block size, and then I will process the block. Now, in this example, my processing consists of writing a message to the log, so I'll do that. Then I'm ready to skip past the block. I do that by changing my offset to be whatever it was at the beginning, plus the two bytes for the marker, plus the block size. When I break out of the loop, I reset my globals and I'm done. Let's define that function by submitting this. I'll run the script, and then I can call it passing in "Beach.jpg"; let's see what we get. It printed out to the log: at offset zero, there's start of image; then at offset two, there's APP1, and it has this size, 2,466 bytes. Then we get APP2, which has about 30K of data. That's most of it right there. Then we have a bunch of blocks that we don't really care about, but eventually we see SOS, so we break out of the loop. That's all working well. Let's go back to slides. Now, I'm going to skip ahead in processing some of this file format just for the sake of time. But if you download the paper that's associated with this talk, the full code is there in much more detail. I highly recommend that you do that, but I'm going to give you a flavor of it here. What we do next is we process each of those blocks that we have read, and some of them we can ignore. We want to filter out the blocks that do not contain Exif data. Then for the ones that do, we need to do our own parsing. What we've discovered when we look into the Exif specification is that these blocks contain their own set of blocks of data called Image File Directories, or IFDs. Those then contain individual metadata items, with tags saying what the data is, what format it's in, and then the data itself. We want to collect all of those things together into these lists. There'll be lists of lists of lists, a somewhat complicated data structure. But the list data structure is very generic in JMP and can hold all kinds of data, so that's what we want to use.
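A stripped-down version of that block-walking loop could look like the sketch below; the marker constants come from the JPEG/Exif specification, and the real Get EXIF Data Raw function in the paper does more per block (logging each one and collecting the APP1/APP2 contents).

    // Stripped-down sketch of the block walk (marker values per the JPEG/Exif spec)
    SOI = Hex To Number( "FFD8" );   // start of image
    SOS = Hex To Number( "FFDA" );   // start of scan; pixel data follows
    If( EXIF Parser:Read Short() != SOI, Throw( "Not a JPEG file" ) );
    While( EXIF Parser:offset < EXIF Parser:length,
        EXIF Parser:endian = "big";              // block headers are always big-endian
        block start = EXIF Parser:offset;
        marker = EXIF Parser:Read Short();
        If( marker == SOS, Break() );            // stop before the image data
        block size = EXIF Parser:Read Short();   // size includes its own two bytes
        // ... collect or log the block here ...
        EXIF Parser:offset = block start + 2 + block size;   // jump to the next block
    );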
It'll have those metadata items tagged with these numeric values that we call the raw tags, but we want to replace those with actual human-readable labels that identify what they are. Let's look at this in JMP, and I'll look at Demo 4. Now, at this point, I have taken all my code in the "EXIF Parser" namespace and put it into its own file. Demo 4 is a client of my code, which is in "EXIF Parser.jsl," so I can just include that. Now, the function that gets the data, I can call it passing in the name of the JPEG file. I've extended this function in here to actually process those blocks, break them down, filter out the ones that are not Exif, all the things we just talked about, and give us back that data structure. Let's run this and see what we get. Well, we get a lot of numbers, some strings. You see, this is lists within lists here; there's an outermost list, and then it contains different items which are lists. Each of those has these pairs of values. There's a number, which is the raw tag, and then the value. This one's a string. This one is also a string. This one's a number. This one's a matrix. The data can be different types, but the tags are all numbers. What we want to do next is convert those numbers into human-readable items. In fact, this whole list, we want to convert it into an associative array that indexes the data by keys. The first thing I'm going to do is define a mapping from these numbers to the human-readable keys we want to use. That's in this long associative array right here. Now, I'm actually using the hex values for the keys, because that's how they're specified in the Exif specification, and it makes it easier to follow along when you're looking at the spec. There's a bunch of those, and I'm going to start at the bottom and work my way up. Down here is this function, Label Raw Data. I pass in this whole data structure that we got back from parsing the BLOB. Here's the definition of Label Raw Data. I'm going to return a list as my result, so I'm going to walk through the list as my input and use it to build up the result list. I use this For Each construct, which is a new, modern JMP function that walks through a list, and for each element of the list, it pulls that out into this variable, raw exif, which I'll pass to this function. Then I want to append it to my result. I do that using this Insert Into line. I'm inserting it at the end of result, and I have to use Eval List to overcome something that JMP is doing to be helpful with list creation. Again, there's more detail about this stuff in the paper. It's worth downloading and checking into. But for here, we're just going to look at the call to Label EXIF, and that's right up here. Label EXIF is going to do a similar thing, where it's going to walk through each of these tag-value pairs. Instead of returning a list, it's going to return an associative array. Here, we are initializing result to be an associative array. That's what this token means: it's an empty associative array, and then we'll return it at the end. We'll also use For Each to walk through the list, and we know that each raw item is going to be a list of two elements. We get the first element, which is the raw tag, and then the second element is the data. We simply build up our associative array by adding an item keyed by the raw tag with the value of the data. Pretty simple, except we don't want the key to be the raw tag. We want to transform it using our lookup table. That's what Get Tag does, and that's defined next up here.
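The following is a simplified sketch of that tag-labeling step; only two Exif tags are shown, and for brevity it keys the lookup table by the numeric tag rather than by the hex strings used in the actual demo.

    // Simplified sketch of Get Tag and Label EXIF (two tags only; the real table is much longer)
    EXIF Parser:ifd tags = Associative Array(
        {Hex To Number( "0132" ), Hex To Number( "8769" )},
        {"DateTime", "ExifIFD"}
    );
    EXIF Parser:Get Tag = Function( {id},
        If( Contains( EXIF Parser:ifd tags, id ),
            EXIF Parser:ifd tags[id],
            Char( id )            // leave unknown tags as their raw number
        )
    );
    EXIF Parser:Label EXIF = Function( {raw ifd}, {result},
        result = Associative Array();
        // each raw item is a {tag, data} pair; key the result by the human-readable label
        For Each( {pair}, raw ifd, result[EXIF Parser:Get Tag( pair[1] )] = pair[2] );
        result
    );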
It simply takes the tag id we pass in and converts it to hexadecimal. This gives us back a hexadecimal string; we need the rightmost four characters from that, and then we look it up in our ifd tag array up here. I'm going to submit all this code to define it, and then we'll have it call Label Raw Data. Here's the result. You can see that it's similar to before, except now the topmost-level list doesn't contain another list; it contains an associative array. This topmost thing is a list of associative arrays. In this first one, we can see that the raw key, whatever the number was, got converted to DateTime, and there's its value, and so on. But we notice the second one looks like it has tags that didn't get converted. Why is that? Well, it's because this key, ExifIFD, has as its data actually another IFD. Yes, this is a recursive data structure that's defined in terms of itself. If we want to label the things inside here, we have to change our code to label recursively, and we'll get to that in a minute. But before I leave this, I want to show that I'm going to actually combine these two steps into a single function that I call Get EXIF Data, where I first get the raw Exif data out of the BLOB, then I label it, and then I return the result of that. Let's define and run that, and it should be exactly the same as what we just saw. Sure enough, it is. I'm going to close this and go back to slides. Skipping ahead again, as I mentioned, we have to do our labeling recursively. Some entries in our metadata, or IFDs, have as their data another IFD. That means we have to call our labeling routine recursively. The way that I do it is to use this JSL built-in Recurse, which calls the current function, and you can pass in separate parameters for the recursive call. There are more details on that in the paper, which I'm sure you've already downloaded at this point. Now, the one thing to be aware of is that with these embedded IFDs, most of them use the same lookup table that we already defined, but some of them have their own lookup table. We have to make sure we're passing the correct lookup table, with its own definition of tags, to our recursive call as we're going through the different levels of recursion. Then once we have a fully labeled data structure returned, we can extract pieces from it to get the things we're interested in. We can run that over a whole series of images and collect all that data into a data table or some other format. Let's look at that in JMP. We'll pull up Demo 5, where I've rolled all of that recursive labeling into my Get EXIF Data function. I'm going to include my namespace code and then run that function. Now we're getting back our fully labeled data structure. You can see that now this ExifIFD has labels in it. There is this big block of numeric stuff, and we look and we see that it's in this thing called Maker Note. Maker Note is a special extension to the Exif specification that allows the maker of a particular device, in this case Apple, the maker of an iPhone 12, to embed their own proprietary data. In some cases, a camera manufacturer might reveal what they've embedded there. In other cases, people have sort of guessed at it and come up with their best guess. That's the case with Apple. There are some things that are known and other things that aren't known. You see a lot of this is just untagged. But some things we can see: acceleration vector, and runtime, and whatnot in there.
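Building on the earlier sketch, the recursive version might look roughly like this; the test for whether a value is itself a nested IFD, and the handling of per-IFD tag tables, are simplified compared to the real code in the paper.

    // Rough sketch of recursive labeling with Recurse (nested-IFD test and tag tables simplified)
    EXIF Parser:Label EXIF = Function( {raw ifd}, {result, key},
        result = Associative Array();
        For Each( {pair}, raw ifd,
            key = EXIF Parser:Get Tag( pair[1] );
            result[key] = If( Is List( pair[2] ),
                Recurse( pair[2] ),   // treat a list value as a nested IFD and label it too
                pair[2]               // plain value: keep as is
            );
        );
        result
    );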
Anyway, we're going to ignore that for the most part and look at what this thing contains. I can see that it's an associative array in that first element, and that's where most of the things I want to deal with are, so I'm going to pull that out into its own variable right here. It has 14 elements, so we can see what those are, what their keys are, like so. There are those keys. If we want to pull something out like "Model," I can do that simply by subscripting into exif1 with "Model," and I can see it's iPhone 12 Pro. I can do the same thing to get the date time, and there it is. But this is in the date time format that the Exif specification defines. That's not a format that JMP recognizes, but I can use JMP's Informat function to convert it, using this format pattern option, which is a modern JSL thing that lets us specify the pattern of the date time data, and JMP will convert it to a numeric date time, which it recognizes as such and formats for the log. That worked. Now, I'm also interested in the GPS coordinates, and those are in this GPS IFD part of the Exif data. It is one of those entries that is itself an IFD. Let's access it, and then we can see what it contains. It has information about the altitude, differential, image direction, latitude, longitude, and speed. What we care about is the latitude and longitude, which is these four things. There's latitude, there's longitude, and then they have these associated Ref values. We need all four of those to compute the coordinates. Let's start with the latitude. We'll pull that out into its own variable, and we see it's a list of three elements. If we look at what those three elements are, we see that there are three vectors with two numbers in each one. The Exif specification refers to these as rationals, and it uses them a lot. But what we want is actual numbers instead of these numerator/denominator rationals. We can convert them using this Transform Each function, which loops across this list and processes each element after putting it into a local variable r. We want to process that by dividing the denominator into the numerator, and then Transform Each builds a new list of those results and puts it in our variable, like you see there. Now these three numbers are the degrees, minutes, and seconds, but we want to combine them all into a single value for JMP to use. We have to add them together, scaling each component appropriately. Do that there and we get the value. Now, if that's in the Northern Hemisphere, we're fine. But if it's in the Southern Hemisphere, it needs to be negative. If we look at that "GPSLatitudeRef," it's either N or S, which tells us if it's S, we need to negate it. We can do the same thing for longitude. Its "GPSLongitudeRef" will be either E or W: E is positive east values and W is negative west values. Here we see we had to negate the longitude because it was in the West. If we dump those two numbers out, we can use JMP's built-in formats for latitude and longitude, and we can verify that they match what the Finder can pull out of the file as those coordinates. We can see that it's North and West, so it's in North America. Now I'm going to skip to Demo 6, where we put all this together. I'm going to use that code to pull information out of a whole bunch of photos. In this folder, I have 16 of them, and they're photos that I've taken at previous JMP Discovery Summits in Europe in years back when we used to do them in person. It'll put all the info it finds into a data table.
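For reference, the degrees-minutes-seconds conversion just described can be sketched as below; gps ifd is assumed to be the labeled GPS IFD, and the key names follow the Exif tags mentioned in the talk.

    // Sketch of the GPS conversion; gps ifd is assumed to be the labeled GPS IFD
    lat parts = Transform Each( {r}, gps ifd["GPSLatitude"], r[1] / r[2] );    // rationals to numbers
    latitude = lat parts[1] + lat parts[2] / 60 + lat parts[3] / 3600;         // degrees + minutes + seconds
    If( gps ifd["GPSLatitudeRef"] == "S", latitude = -latitude );              // south is negative
    lon parts = Transform Each( {r}, gps ifd["GPSLongitude"], r[1] / r[2] );
    longitude = lon parts[1] + lon parts[2] / 60 + lon parts[3] / 3600;
    If( gps ifd["GPSLongitudeRef"] == "W", longitude = -longitude );           // west is negative
    Show( latitude, longitude );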
I'm going to run this and here's our data table, and we can see that I've captured the names of the image files and the timestamps. You can see it goes from 2016 to 2018, and the lats and longs are North and East, so that's Europe. I can even see the progression of various iPhone models I had across those years and how their lenses improved over time. I've set this up so that I can select a row and click this Get Info table script, and it opens a window for me that shows me the photograph and the metadata for it that I've captured. I even have a button here for showing Google Maps, so I can click that, and up pops a Google map of that location. It's right there with the red marker. I can see if I zoom in that this is the Hilton Amsterdam. That's where we had the conference in 2016. That all seems to work well. In this case, I'm going to add the photos themselves to my data table as an expression column. That's what this Add Photos table script does. For larger collections, I would not want to do that because it's actually making a copy of those photos into the data table. But for 16, it's fine. For thousands or even hundreds, you'd probably not want to do that. It also sets my new column to be a label column. Now I can run this Explorer script, which opens a Graph Builder of those latitudes and longitudes. I can see some points here. Here's Amsterdam. There's the photo we just saw, and here are some more. That's definitely Amsterdam. These are Brussels. Yes, that's Brussels. Over here, we have Frankfurt. That year, we got to go to the cool supercar museum. That was pretty neat. Over here we have Prague. I'm going to use the magnifier to zoom in on Prague a couple of times. At this point our detailed Earth background is not really helping us much, so I'm going to switch to the street map service. We can see, yes, that's definitely Prague, and here is where we rode the historic streetcars up to Prague Castle. Here are some JMP attendees crossing the bridge to Prague Castle. I can see John Sall there in the distance. Over here, we have a very nice reception that we held in the municipal hall. Here's me checking the time on my Apple Watch against the Orloj to make sure that it's right. That all seems to be doing what we want. In conclusion, I want to touch on the things that we learned. We learned about the JSL BLOB object, which is a good tool to have on your tool belt for manipulating arbitrary binary data. We used that to build up a little application for importing files. Along the way, we learned some things about namespaces, JSL recursion, specialized list handling, and some modern JSL things like Transform Each and For Each. Again, those things are covered in much more detail in the paper. But most importantly, I think, is that we saw a case study of how to take a difficult problem, like a complex file format, and break it into smaller subtasks that we could conquer. That's a skill that we all have and need to make use of in our professional work. But I think it's many times helpful to observe someone else doing it and pick up tips and tools and techniques that we can then use in our own work. Now I want to turn it over to you to take these tools, use them to write the code to import your own binary files, and solve your own problems. Thank you very much.
In industrial development, especially if experiments are conducted on production scale, the number of DOE runs required to cover the actual problem is always too high, either in terms of costs or time. At the 2021 Discovery Summit, the introduction of SVEM (self-validated ensemble modeling) caught my attention, due to its power to build accurate and complex models with a limited number of experiments. Especially in the area of costly experiments, SVEM opens a way to fit complex models in DOEs with a considerably reduced number of runs, compared to classical designs.   Three case studies are presented. The first two case studies deal with designs conducted on a production scale, first on a five-factor, 13-run design and then on a four-factor, 10-run design, each with nonlinear problems. Another use case shows a Bayesian I-optimal 15-factor, 23-run design for a nonlinear problem. Especially within the first use case, the excellent predictive accuracy of the models obtained by SVEM led to the discovery of faulty measurement equipment, as measurement results started to deviate from the predicted results. I'm convinced that SVEM has the potential to effectively change the way DOE will be applied in product development.     Hello, and welcome, everybody, to my presentation today. Thank you for joining. I will give you a short talk about the concept of SVEM, or self-validated ensemble modeling, and how this concept can be a path towards DOEs with complex models and very few runs. I hope I can show you, on some use cases, that this concept is actually the solution for these DOEs with complex models and very few runs. But before I start going into the details of my presentation, I want to give you a short introduction to the company I'm working for. It's Lohmann. Lohmann is a producer of adhesive tapes, so mainly pressure-sensitive adhesive tapes, but also structural adhesive tapes. The Lohmann Group made a turnover in 2020 of about €670 million. It consists of two parts: the Lohmann Tape Group, that's who I'm working for, and the joint venture Lohmann & Rauscher. The Tape Group had a turnover in 2020 of about €300 million, and last year, we celebrated our 170th birthday. Basically, the Tape Group is divided into two major parts. One is technical products, which is where I work. We offer mainly double-sided adhesive tapes, pressure-sensitive adhesive tapes, and structural adhesive tapes. We are solely a B2B company. And the other part is our hygiene brand. Ninety percent of our products are customized products. We are active worldwide. We have 24 sites around the globe and about 1,800 employees. Who are we? We are the bonding engineers, and as I mentioned before, 90 percent of our products are customized, so our main goal is to offer our customers the best possible solution, a customized solution very specific to their problem. In order to do that, we have a quite large toolbox. We are able to make our own base polymers for pressure-sensitive adhesive tapes. We can do our own formulation, to formulate either pressure-sensitive or structural adhesives. We can do our own coating to form the final adhesive tape. But we can also do all the lamination, die cutting, processing, and making of spools and rolls, offering the customer the adhesive solution in the form he actually needs or wants, in order to best satisfy his needs.
But we can also do all the testing and also help the customer in integrating our adhesive solution into his process. Having this quite large toolbox, these many tools to fulfill the customer's needs, also comes with a lot of degrees of freedom. This huge number of degrees of freedom, and the large value chain which we have, means dealing with a lot of complexity. One of the solutions for us to deal with this complexity is to use DOE, or design of experiments, in the development department, mainly for polymerization, formulation, or coating. That brings me back to the topic of my talk. We use DOE to tackle this large complexity and to be able to use all the tools in the best possible and most efficient way to fulfill our customer's demand, but we want to do that as efficiently as possible, tackling complex problems with the lowest amount of experimental effort. That, I think, is where SVEM as a concept comes in. As you all know, during process development or product development, you go through the same stages. We start in the lab, doing our experiments on a laboratory scale, switch to the pilot scale, and then take the final scale-up step when going to production in order to produce the final product. Along the way, the effort per experiment, be it time or cost, dramatically increases. In order to minimize these costs, we use DOE to minimize the experimental effort. But also, as you go through the steps from the lab over the pilot to production, the higher the effort per experiment is, the more critical the number of experiments becomes. Situations where the number of experiments is critical might be, for us but also for other industries, if you have to do experiments on a pilot or a production scale. Or even if you do experiments on a laboratory scale: if you have, for example, long-running experiments, or the analysis of an experiment takes very long, that might be a situation where the number of experiments is very critical. But also in combination with that, if you have complex models or complex problems which you have to address, if you, for example, need a full RSM model or if you want to do an optimization, you will run into this problem where you might always need a large number of experiments in order to model this complex problem. The best situation would be that you can generate DOEs which allow you to tackle a very complex problem, or to apply complex models, but at the same time keep the number of experiments as low as possible. Just to give you an example in our case, the production of adhesive tapes: if I do experiments on a laboratory scale, I need less than a kilogram of adhesive per experiment, and I definitely don't need more than one or two square meters of adhesive tape in order to do the full analysis which I want. If I then move to the pilot scale, depending on the experiment we want to do, you might need between 10 and 50 kilograms of adhesive, and you coat maybe 25-100 square meters of adhesive tape per experiment. If you go even further, to the production scale, you might need even more than 100 kilograms of adhesive per experiment and coat maybe 1,000 square meters of adhesive tape per experiment. But at the same time, I still only need about one or two square meters to do the full analysis of my experiment.
In that unfortunate situation, 99.9 percent of my product is basically waste. That's a lot of material for one experiment, and it also comes with an enormous cost per experiment. Just to give you an illustration, here is a picture of our current pilot line; for scale, that's the size of a door. Even for a pilot line it's quite large, so you can imagine that the amount of material you need is quite large, and those are the numbers per experiment. But even on a laboratory scale, you might run into situations where the number of runs is critical: either you have complex models or a large number of factors, as I'll show you in the last use case, where it's a chemical reaction in which we want to vary the ingredients as well as the process parameters; or you have long-running experiments that take more than one day, or the analysis of the experiment takes very long, while at the same time you have very limited project time or budget. In all these situations, it is very desirable to minimize the experimental effort, that is, to decrease the number of runs, as much as possible. At the 2021 Discovery Summit, I came across two presentations, one by Ramsey, Gottwalt, and Lemkus, and the other by Kay and Hersh, talking about the concept of SVEM, or self-validated ensemble modeling. For me, that raised the question: can that concept actually be the solution for DOEs with few experiments that at the same time keep the complexity of the model your problem requires? With that said, I want to switch over now to my use cases, to hopefully show you that this concept is, or might be, a solution to that problem. I want to switch over to JMP. The first example I want to show you today is a design of experiments we had to run on actual production equipment, at production scale. The second one was done on a pilot plant, and the third one was done in the lab. One piece of context: for the first two examples, the designs were created before I knew about the concept of SVEM, but the analysis was done after I had learned about it. The third example was done after I knew about SVEM, so I designed that DOE specifically with the SVEM analysis concept in mind. So, on to the first example. Here, we wanted to do product development by means of a process optimization. We could only do that optimization on the actual production equipment, so all the experiments had to be done at production scale. As you can imagine, you always have to find production slots and compete with actual production orders, the normal daily business. So the number of runs you can do on production equipment is always very limited and very costly, and you always have to fit in around normal production. For this first example, we had five continuous process factors, we expected nonlinear behavior, so we knew we needed the full quadratic terms and the expected interactions, and we also knew we had a maximum of 15 runs due to limited production time and money. What did we come up with? Again, as I said before, this was done before I knew about the concept of SVEM.
When we created the design, we set all the linear and quadratic effects we needed to "necessary" in the Custom Design platform and set the interactions to "optional," and that gave us the 15-run design we ended up with. Then we started the experiments. Unfortunately, only 12 runs could be accomplished, because we had limited time and capacity and ran out of time for the last three remaining runs. After 12 runs we basically had to stop and do the analysis, but fortunately we could do the remaining three experiments after the analysis was done. That put us in the position where these three extra experiments, conducted after the analysis, were effectively a test set on which we could check the validity of the model we had created using the concept of SVEM. The actual data analysis is shown in those two presentations from last year's Discovery Summit, and I've included the codes so you can find them easily on the Community page. The data was analyzed using SVEM: I did 100 iterations, used a full quadratic model with all the interactions possible between these continuous process factors, and used the Lasso as the estimation method. I've included the script showing how this SVEM analysis is done, but I also want to refer you to those two presentations, where it is all written down and explained in great detail. Coming to the results, I want to show you the profiler. What you can see is that we really do need second-order terms, so we have a quadratic dependency, and we also have interactions. Everything we expected was basically present. But let's check the validity. You might say, "Okay, you just overfitted the model. You used too many model terms, it's too complex; that's why you have second-order terms and interactions." Well, let's look at the predicted-versus-actual plot. For this example, we were in the good position of actually having a test set: the three remaining experiments, which the model has never seen, alongside the 12 training runs, shown as the red dots. As you can see, the predicted-versus-actual plot is not too bad; the prediction of the three remaining runs is pretty good. If I look at another response, the prediction is very, very accurate. As I mentioned, the model has never seen these runs. For us, that was quite amazing: having only 12 runs of what was in a sense a crippled experiment, because three of the runs were missing, and still getting this very good predictive capability from SVEM. So in general, the prediction was very good. We could predict the remaining three runs very accurately for every response except one, which didn't fit at all. We thought, "That can't be. We can't predict 10 of the 11 responses almost perfectly and have the 11th one not fit. Something has to be wrong, and it can't be the experiment, because the other responses fit and our prediction was very good; something in the measurement has to be wrong." So our experts dug a little deeper into the measurement equipment, and they actually found deviations in the measurement: the measurements for that last response were handled a little differently than for the first 12 runs.
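For readers who want to try the same recipe outside of JMP, here is a minimal sketch of the SVEM idea as described in those presentations: anti-correlated fractional weights for self-validation, a Lasso base learner over the full quadratic model, and averaging over 100 iterations. This is an approximation in Python/scikit-learn, not the FIT add-in or JMP Pro's Generalized Regression implementation; the file name, column names, and alpha grid are assumptions.

```python
# Minimal sketch of the SVEM idea (self-validated ensemble modeling) with a Lasso
# base learner. NOT the JMP implementation; file, columns, and settings are assumed.
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
df = pd.read_csv("doe_runs.csv")             # hypothetical: the 12 completed runs
X_raw = df[[f"X{i}" for i in range(1, 6)]]   # five continuous process factors
y = df["Y1"].to_numpy()                      # one response

# Full quadratic (RSM) model: main effects, two-factor interactions, squares.
expand = PolynomialFeatures(degree=2, include_bias=False)
X = StandardScaler().fit_transform(expand.fit_transform(X_raw))

alphas = np.logspace(-3, 1, 20)
n_iter = 100                                  # 100 SVEM iterations, as in the talk
coef_sum = np.zeros(X.shape[1])
intercept_sum = 0.0

for _ in range(n_iter):
    u = rng.uniform(size=len(y))
    w_train = -np.log(u)                      # anti-correlated fractional weights
    w_valid = -np.log(1.0 - u)
    best_sse, best_coef, best_int = np.inf, None, 0.0
    for a in alphas:
        m = Lasso(alpha=a, max_iter=50_000).fit(X, y, sample_weight=w_train)
        resid = y - m.predict(X)
        sse = np.sum(w_valid * resid**2)      # self-validation on the same runs
        if sse < best_sse:
            best_sse, best_coef, best_int = sse, m.coef_, m.intercept_
    coef_sum += best_coef
    intercept_sum += best_int

# Ensemble model = average of the per-iteration fits; use it for prediction.
coef_avg, intercept_avg = coef_sum / n_iter, intercept_sum / n_iter
```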
After the deviations were corrected, the prediction was again very good. In other words, the prediction was so good that it even led us to deviations in the measurements. From that first example, I would say SVEM works perfectly and gives you great insight and a very accurate model with very limited experimental effort. You could say, "Okay, you could also have used a different screening design with, for example, 11 runs, and you would have gotten exactly the same result." That might be true, but then I refer you to the third example. The second example is very similar. We wanted to do a DOE on a pilot plant, and we had four continuous factors. Two of those factors required specially produced ingredients, because those two represented a process variation of the operating window of our current process. We also expected nonlinear behavior, and we were told not to exceed 10 runs; in fact it was strongly recommended that fewer runs would be more than welcome, because of capacity issues. So we ended up with a hybrid production/laboratory approach: we only needed six runs on the pilot or production scale, which we boosted to 10 runs in the lab. That's basically the design we ended up with, and this is how I created it: I set the linear and some interaction terms to "necessary" in the Custom Design platform, and the remaining interaction and quadratic terms to "optional." Again, the creation of the design was done before I knew about the concept of SVEM, but the analysis was then done using SVEM. In this particular case, factors 1 and 2 represent the process variability. What we wanted to achieve was to minimize a certain response, response number 7, and its variability, without changing factors 1 and 2, because changing the process variability which we [inaudible 00:17:56] have is very difficult. But we had two additional factors, parameters of the coating equipment, which we could change very easily. The goal was to minimize response number 7 while keeping everything else within the window we wanted. The analysis was done pretty much the same way as before. We used SVEM as presented in last year's presentations: 100 iterations, the full quadratic model with all the interactions, and the Lasso as the estimation method. Looking at the profiler again, we had second-order terms and we had interactions, as you can see here. Everything we expected to be there, the quadratic dependencies and the interactions, actually was there. Our goal was to minimize this response without changing factors 1 and 2, because that's the process window we're currently operating in. Across that process range, if you look at this response, it changes quite a bit. Doing the optimization with this model fully operational, we found that if you just change, for example, factor number 3, the variability basically vanishes completely without changing the rest of the factors. So with only six runs at production scale, we found optimal settings [inaudible 00:19:36] to enhance or optimize a process quite considerably. You might again say, "Okay, four factors. You could again have used a definitive screening design."
Or you might say, "Ten runs is not too few runs for that." But that brings me to my last and final example. We were doing a design of experiments on the laboratory scale, and we wanted to optimize, or understand in more detail, a polymerization process. We wanted to cover all factors and variations within one single designed experiment, and we ended up with 15 factors: eight ingredients and seven process parameters. We did not use a mixture design; we handled the eight ingredients in a way that avoided ending up with one. As the experiment itself was very time-consuming, the number of experiments had to be low, ideally below 30, because we had limited project time and 30 was about what we could afford. By that point I already knew about the concept of SVEM, so I designed the experiment specifically with SVEM in mind. I chose a Bayesian I-optimal design. We knew there would be nonlinear behavior, so all the second-order terms and interactions were included. What we ended up with is the design you can see here: only 23 runs, with 15 factors of ingredients plus process parameters, so basically a full RSM model with only 23 runs. For me, it was pretty amazing to be able to do that. Let's look at the results and see whether it actually worked, because at the time we weren't so sure that it would. The analysis was again done the same way. The only difference is that I won't show you an interactive profiler for this example, because with 15 factors and quite complex models, the JSL implementation makes it very, very slow. But I was told this is going to be built into JMP 17, which will be much faster, so I'm really hoping to get JMP 17 as soon as possible. What we found is that nonlinearity was present: we again had interactions and second-order terms that were significant, so again we needed the full model. The experimental verification is currently ongoing; the first results look very promising, but I haven't included them here. In contrast, a classical approach would have required many more experiments than just 23. And again, to show you that we had very good prediction capability, here are some predicted-versus-actual plots. For one response it's very good; another is not quite as good as the first, but still good enough for us. So the predictions here are very good, and for the first verification runs it looks very promising. Here is an image of the profiler; it's unfortunately non-interactive. Second-order terms were present, interactions were present, so everything we expected was basically in the model. With that, I'm at the end of the use cases, and I want to switch back for a short summary. Kay and Hersh gave their presentation at last year's Discovery Summit the title "Re-Thinking the Design and Analysis of Experiments" with a question mark. From my point of view, you don't need the question mark; it is simply so. I'm convinced that SVEM will change the way DOE is used, especially if you go for very complex designs, a large number of factors, or very costly experiments. At least, that's my opinion.
It opens a new way towards DOEs with minimal experimental runs and effort, while at the same time letting you gather the maximum amount of information and insight, without having to sacrifice the number of factors or the complexity of your models. I'm convinced SVEM will change the way that at least I use DOE, and for me, it was pretty amazing to see it in action. With that, thank you very much for watching, and I hope I could show you that this concept is worth trying. Thank you very much.
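As a back-of-the-envelope check on why the 15-factor case above is out of reach for a classical one-model-per-term approach: a full quadratic (RSM) model in k factors has 1 + 2k + k(k-1)/2 terms, which for k = 15 is 136 terms against only 23 runs. A tiny sketch of that count (my arithmetic, not a figure from the talk):

```python
def rsm_terms(k: int) -> int:
    """Terms in a full quadratic model: intercept, k main effects,
    k squared terms, and k*(k-1)/2 two-factor interactions."""
    return 1 + 2 * k + k * (k - 1) // 2

for k in (4, 5, 15):
    print(k, rsm_terms(k))   # 4 -> 15, 5 -> 21, 15 -> 136
```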
Monday, September 12, 2022
For many applications, JMP is like a blank slate, waiting for you to decide what to do. However, sometimes it’s better to start with more than a blank slate. Sometimes you want a guided experience. For some time now, JMP has had guided experiences and wizards for such tasks as opening spreadsheets, importing data and accessing databases. But it’s time to take it to the next level for more general applications: Action Recording, introduced in JMP 16, has been a great enabler to bridge the two worlds of the interactive and the automated. It allows you to do the work interactively and then use the recording to make a script for the same work the next time. JMP 17 takes that to the next level, wrapping the recorded actions into a workflow environment where you can edit, customize, debug, generalize and save without ever needing to see the underlying code. Julian Parris demonstrates the new Workflow Builder in JMP 17. For engineers new to designing and analyzing experiments, JMP can now guide you through all the steps of the experiment. Joseph Morgan explains how Easy DOE makes this process easier than ever before by providing a guided workflow in a wizard-like unified environment that helps make all the best decisions for you, sparing you from the burden of having to figure it all out. There’s also a middle path for applications such as medical review of clinical data. The standardization of the data model, with CDISC, enables a semiautomated path to the analysis. JMP Clinical begins a new era with a much faster, more flexible, more full-featured environment for analyzing clinical data. Once you have accessed and refined your data, and produced a compelling analysis, what is the natural next step? Sharing your findings with others in your organization. JMP Live makes this natural next step easy and powerful. You can share an interactive JMP analysis with your colleagues, even if they don't use JMP themselves. In this way, you can use JMP Live to help drive decisions in your organization. In a JMP Live space, collaboration permissions can easily be turned on to allow you and your trusted colleagues to see, download, or even edit and replace each other's data and reports. These flexible collaboration spaces can increase the speed and the ease of learning in your organization. Whether your work can be planned ahead or is a wild path of exploration and serendipity, JMP has the modes to make your work speedy, efficient and productive, while keeping you in an environment of discovery.
Monday, September 12, 2022
According to Wikipedia, an unconference is a participant-driven meeting designed to encourage attendee involvement in topic selection and knowledge sharing. Join us for these lightly structured discussions to share your ideas and learn more about JSL scripting. We had some great discussions and much-needed time to reconnect and network with peers and colleagues from around the world. We mentioned many resources during the sessions for learning JSL, parsing report outputs, and more. Below are many of the content items that were mentioned or utilized during the sessions. Like-minded folks have a question, know the answer, or just want to hang out, learn, and follow the discussions. Others want to be part of a group where they can teach and share their knowledge and resources. For you, we recommend the JMP Scripting Users Group. The rich JMP Scripting Language (JSL) lets you work interactively and save results for reuse. It even allows you to develop new functionality to solve problems that core JMP does not address. We'd like to help you take full advantage of JMP and its automation capabilities. To do so, we've created the JMP Scripters Club: a community of JMP users who are eager to learn and leverage the JMP Scripting Language and to share their knowledge with fellow JMP users. Do you enjoy cooking? Or are you like me and love an appetizer? Maybe you are a fan of the full-course meal, or just want a sweet snack? Then hop on over to the JSL Cookbook, where you'll find tasty and delightful recipes using the extraordinary ingredients available in the JMP Scripting Language. @Wendy_Murphrey and @_jr showed us how to FLASH (our reports) and then parse them with XPath queries. We learned that JMP uses version 1 of XPath, and we saw on the fly how to implement a query to dynamically pull estimates from the report output. Namespaces, oh my! @drewfoglia shared options for launching JMP platforms with pre-populated selections. He also mentioned namespaces, and I refer the reader to the Namespaces online documentation for more information on the topic. Drew also shared his paper Object-Oriented JSL – Techniques for Writing Maintainable/Extendable JSL Code, and another great resource mentioned during the session was Essential Scripting for Efficiency and Reproducibility: Do Less to Do More (2019-US-TUT-290). Workflows for sharing shortcuts and developing repeatable analyses are available in the upcoming JMP 17 release. We heard from @Mandy_Chambers about Workflow Builder; I bet you, like me, are really excited to get your hands dirty with this great new capability. Looking to learn JSL? Head on over to the Learn section of the community, where you can find free on-demand courses, discount codes, and more. JMP Education shared some great news about our first free on-demand course, Introduction to the JSL Scripting Language, available immediately. We also heard about certification exams and how to receive a special discount code for 55% off the certification exam. Looking for syntactic sugar, or do you have literal-value dilemmas? Check out the information provided by @Joseph_morgan2. He shared a brief refresher from the JSL tutorial that he delivered as part of Discovery Summit Europe in Prague. He also reminded us that the JMPer Cable has excellent articles available for consumption in short snippets, for example, Expression Handling Functions: Part I - Unraveling the Expr(), NameExpr(), Eval(), ... Conundrum.
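To make the XPath idea concrete outside of JSL, here is a generic XPath 1.0 query sketched in Python with lxml against a made-up XML fragment; the element and attribute names are hypothetical, and this is not the JMP report structure or the exact workflow demonstrated in the session.

```python
# Generic illustration of an XPath 1.0 query, in the spirit of the report-parsing
# session above. Python/lxml on a made-up XML snippet; names are hypothetical.
from lxml import etree

report_xml = b"""
<Report>
  <Table name="Parameter Estimates">
    <Row term="Intercept" estimate="2.31"/>
    <Row term="X1"        estimate="0.87"/>
    <Row term="X1*X2"     estimate="-0.42"/>
  </Table>
</Report>
"""

tree = etree.fromstring(report_xml)
# Pull every estimate from the "Parameter Estimates" table.
rows = tree.xpath("//Table[@name='Parameter Estimates']/Row")
estimates = {r.get("term"): float(r.get("estimate")) for r in rows}
print(estimates)   # {'Intercept': 2.31, 'X1': 0.87, 'X1*X2': -0.42}
```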
Joseph also shared a link to Using JSL to Develop Efficient, Robust Applications. If you were lucky enough to attend the in-person or virtual event, please leave a comment below and let us know what you enjoyed most.
We lose approximately 920,000 shelter animals to euthanasia requests every year. Instead, these animals could have made 920,000 families happier. We would like to explore the current data from Austin's animal center to understand which conditions lead to a euthanasia request and whether measures can be adopted to prevent them. The data is sourced from Austin's Open Data Portal and consists of two tables, intakes and outcomes, dating from Oct 1, 2013, to the present. Intakes represent the status of animals as they arrive at the animal center, while outcomes represent their status as they leave. Each animal is identified by a unique Animal ID. Each table consists of 136K data points and 12 features. We first explore the distribution of data by categories such as breed, gender, age, and intake condition. Finally, classification models such as logistic regression and random forest classifiers are used to predict whether an animal will be euthanized. Understanding key factors like intake condition, sub-type of euthanasia, breed, and age could unveil crucial insights into why these animals are put down and consequently advise where to target funding for research and facilities. Hi, my name is Shalika Siddique. My name is Anand Manivannan, and we're both students from Oklahoma State University, currently pursuing a business analytics and data science degree. Today we are presenting a poster in which we explore euthanasia in animal shelters, and we hope to understand why cats and dogs are being put down. Every year, we lose about 920,000 animals. Using JMP Pro, we would like to identify the key factors that lead to euthanization of cats and dogs. Once we identify these key factors, funds can be channeled to relevant sectors to prevent the euthanization of animals that could have been saved. In addition to this, we aim to make predictions to identify which animals are most likely to be euthanized. A little information about our data set: we sourced the data from Austin's open data portal, and the animal shelter we used for the analysis is located in Austin, Texas. Overall, we had about 130,000 records. After cleaning and filtering, we focused on about 67,000 records that were specific to cats and dogs. Prior to our analysis, we explored the data set and attempted to derive insights, using JMP's Graph Builder to create visualizations such as bar graphs. Of the 67,000 records, about 3,171 animals were euthanized, which is about 4.7% of the animals at the shelter. In comparison to animals surrendered to the shelter by their owner, stray animals were most prone to euthanasia. When we compared the ages of the animals, we noticed that kittens under the age of 15 months contributed 25% of euthanizations, while pups contributed 13%. This bar graph is an example of one of the visualizations created using JMP's Graph Builder; the lavender bars represent cats, while the purple bars represent dogs. We can see that intact males, followed by intact females, are more prone to euthanization compared to neutered animals. Next, Anand will go over the modeling in detail. Thank you, Shalika.
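The exploratory summaries described above were built with Graph Builder; as a rough non-JMP sketch of the same kind of breakdown (euthanasia rate overall, by intake type, and by sex), here is some pandas code. The file and column names are assumptions, not the actual Austin Animal Center schema.

```python
# Rough pandas sketch of the exploratory summaries described above.
# File and column names are assumptions about the merged intakes/outcomes table.
import pandas as pd

df = pd.read_csv("austin_outcomes_intakes.csv")
cats_dogs = df[df["animal_type"].isin(["Cat", "Dog"])].copy()
cats_dogs["euthanized"] = (cats_dogs["outcome_type"] == "Euthanasia").astype(int)

# Overall euthanasia rate (~4.7% in the poster's 67K cat/dog records).
print(cats_dogs["euthanized"].mean())

# Rate by intake type (stray vs. owner surrender) and by species and sex.
print(cats_dogs.groupby("intake_type")["euthanized"].mean().sort_values())
print(cats_dogs.groupby(["animal_type", "sex_upon_outcome"])["euthanized"].mean())
```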
Yes, I'd like to talk a bit more about our approach to modeling using JMP. Before we could start modeling, we performed a few data preprocessing steps to prepare our data. We standardized units for certain variables, such as age, which was recorded in weeks, months, and years; we converted that to just months. We binned the age variable so we could convert it into a categorical variable with age ranges like 10-15 and 15-25. We grouped rare breeds and colors to reduce the number of categories. Additionally, we filtered down to just cats and dogs from all the other animals that went through the shelter. During the modeling phase, we noticed something very peculiar: a class imbalance in our target variable, which indicates whether an animal was adopted or euthanized. About 64,000 of the 67,000 records were adopted animals, and only 3,000 were euthanized. Since our model was to focus on predicting euthanasia, we had to resolve this issue, and hence we used JMP's Bootstrap Forest and Boosted Forest models, which use the concepts of bagging and boosting. Since bagging and boosting models don't leave a lot of room for interpreting what the variables do, we also used logistic regression to interpret these variables. After modeling, we tuned the parameters to get the best results and chose a few metrics to select the best model based on its performance on validation data. We used a 70:30 validation split, and prior to modeling we also tested the assumptions for logistic regression. At the top right, you can see that we tested for multicollinearity and independence among variables using JMP's contingency analysis, which produces a mosaic plot and gives a p-value and a correlation value that told us which variables were correlated with each other. Now I'd like to dig a bit deeper into each model and how we selected our models. At the top left, you can see that we chose metrics like specificity, misclassification, area under the curve, and R-square to decide which model performed best. These metrics were chosen for a particular reason that aligned with our goal: to predict which animals are likely to be euthanized. The cost of our model incorrectly predicting a euthanized animal as a non-euthanized animal would mean that animal would probably die and not be saved. Hence, we wanted to focus on increasing the accuracy on euthanized animals and reducing the misclassifications, and these particular metrics were chosen for that reason. First we ran the nominal logistic regression model, which you can see at the bottom left. The LogWorth immediately showed us which variables were the most important in predicting euthanasia: the sex of the animal, intake condition, intake type, and outcome age. A lot of these are not surprising, and they matched what research shows.
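For a sense of what this modeling setup looks like in code, here is a hedged scikit-learn sketch: a 70:30 stratified split, a random forest standing in for the Bootstrap Forest, and a logistic regression for interpretation, with class weighting as one simple way to handle the roughly 64,000-to-3,000 imbalance. Column names and settings are assumptions and differ from the exact JMP Pro options used in the poster.

```python
# Sketch of the modeling setup described above; not the exact JMP Pro workflow.
# A random forest stands in for Bootstrap Forest; class weights handle imbalance.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("austin_outcomes_intakes.csv")            # hypothetical file
cats_dogs = df[df["animal_type"].isin(["Cat", "Dog"])].copy()
cats_dogs["euthanized"] = (cats_dogs["outcome_type"] == "Euthanasia").astype(int)

X = pd.get_dummies(cats_dogs[["animal_type", "sex_upon_outcome", "intake_type",
                              "intake_condition", "age_group", "breed_group"]])
y = cats_dogs["euthanized"]

# 70:30 validation split, stratified so both classes appear in each part.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.30,
                                          stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                random_state=42).fit(X_tr, y_tr)
logit = LogisticRegression(max_iter=5000, class_weight="balanced").fit(X_tr, y_tr)
```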
The whole model turned out to be significant as well, with a p-value less than 0.001. Following that, we ran the Bootstrap Forest model, which was tuned to have a hundred trees and a feature-selection criterion value of three. We used the receiver operating characteristic, or AUC, curve to determine which classification threshold gave us the best classification results, and we ended up using 0.1, or 10%, as our classification threshold. Over to the right, you can see that we ran the Boosted Forest model with 87 layers and a learning rate of 0.179. At the bottom, we used the confusion matrix for all three models to calculate the specificity of each model, which tells us how accurately the euthanized animals were being predicted. We also used the misclassification rate and R-square from the overall statistics tab in JMP. On every metric, we found that our Bootstrap Forest model outperformed the other models, and hence we chose it as the winning model to make predictions on euthanasia. Next, I would like to go over some important results from the logistic regression. With regard to sex, we found that intact cats and intact dogs were far more likely to be euthanized than neutered or spayed animals. With regard to breed, we found that mixed cat breeds and Pit Bull mixed dog breeds were more likely to be euthanized than all other breeds. With regard to age, we found that cats aged 4.5 to 6 years are more likely to be euthanized than younger cats, and dogs under 1.2 years are the least likely to be euthanized. This was wildly surprising because it contradicts what we found during our data exploration phase. Similarly, with regard to intake type, we found that owner-surrendered animals are twice as likely to be euthanized as stray animals. This, again, is completely contradictory to what we found in the data exploration phase, which goes to show the power of statistical analysis in uncovering the true facts. Next, I will hand it off to Shalika again to go over the recommendations we can make to these animal shelters. Thank you, Anand. Based on our analysis, we have a few recommendations that animal shelters could use to lower euthanizations. We believe that animals taken into the shelter should be neutered or spayed; this is in accordance with medical research showing that intact animals are more prone to diseases. Animal shelters could also use our Bootstrap Forest model to prioritize which animals need to be saved, in case a difficult decision has to be made. In support of that, here are some recommendations for Austin's animal shelter. This particular shelter would need to prioritize cats over dogs, as they are more prone to euthanization. With regard to age, cats aged between 4.5 and 6 years and dogs over 1.2 years would require more attention. Owner-surrendered dogs need to be prioritized over stray animals. Finally, when it comes to breeds, Pit Bull mix dog breeds and mixed cat breeds are more prone to euthanization and would likely require more attention.
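The evaluation step above, choosing a probability cutoff of 0.1 and reading specificity, misclassification rate, and AUC off the confusion matrix, can be sketched as follows. The toy arrays stand in for validation-set labels and predicted probabilities; with euthanasia coded as the positive class, the "euthanized cases caught" metric is the sensitivity (the roles swap if the other level is treated as positive).

```python
# Sketch of the threshold-based evaluation described above; toy data, not results.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def summarize(y_true, p_euth, threshold=0.1):
    y_hat = (p_euth >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),          # euthanized cases caught (coded 1)
        "specificity": tn / (tn + fp),
        "misclassification": (fp + fn) / len(y_true),
        "auc": roc_auc_score(y_true, p_euth),
    }

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
p_euth = np.array([0.02, 0.20, 0.05, 0.35, 0.01, 0.08, 0.40, 0.03, 0.60, 0.07])
print(summarize(y_true, p_euth))
```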
That brings us to the end of our presentation. We hope that animal shelters can use this analysis to reduce the need for an animal to be euthanized. Thank you.
The crime rate in the United States is increasing every year, and this is something that needs to be addressed. A key objective of this paper is to identify factors that statistically impact the crime rate in each state and leverage that information in order to reduce crime rates. For our analysis, we make use of two datasets: U.S. Census data and Uniform Crime Reporting Program data collected by the Federal Bureau of Investigation in 2014. We were able to get crime statistics for all 50 U.S. states, along with a detailed breakdown of crimes and input variables such as income and literacy, to test their impact. Additionally, we intend to identify correlations between the types of crimes so that we can understand the core issues and identify crimes that may influence others. As a result, the allocation of resources and optimization of efforts for crime reduction could be improved. The findings of this paper will help us understand the various socio-economic and locational factors that influence crime and possibly break certain stereotypes. This could be valuable for government bodies in constructing rules to combat crime in their areas. We found that weapon- and drug-based crimes had a high correlation with the other crimes. After testing the various factors that determine the crime rate in any given state, the top three were weapons owned, literacy rate, and the percentage of people who follow a religion. Good afternoon. Today we're going to be talking about understanding crime rate and crime prediction. Before we go into the data, let me introduce the team. We have Karanveer, our data scientist and data modeling expert, and myself, Grant Lackey, as data researcher and data visualization specialist. Before we get into the data, let's do an overview of the entire presentation. We're going to begin with the background, which covers the initial data sets and why we chose our data. Then comes the data overview, which again goes into the reasons we chose our data and what we're trying to answer. The business problem covers the problems we had with our data, why we're trying to answer certain questions, and the overall idea of the entire project. Next are our methods and plans, which describe our procedure for answering the business problem, and then our results, which are the results of those methods and plans. Our applications are real-life implications of our results, and the post-analysis covers what we could add to our results to improve on this in years to come. Beginning with background: why should we care about crime rate? Crime matters to everyone, and it is everywhere in the United States. So what is crime rate, and how do we define it? We define crime rate as criminal activity divided by the population of each county or state; we're mainly going to be looking at it per state. So throughout this project, how are we going to identify factors that can reduce crime rates? Here, we're going to be talking about how certain crimes are more influential in certain states than others, and whether certain crimes influence other crimes.
So, for example, if there was a murder, would guns or theft be more influential in that murder, or would other crimes be more influential? Looking at our data overview: we started with our initial data set, the crime statistics data, and later added other data variables to it. The initial data set is 2014 data, and it was given to us by the Federal Bureau of Investigation, the FBI. We looked at 42 criminal activities, ranging widely from murder and theft to drug possession and other drug activity. We looked at about 3,200 counties across 48 states; we had to exclude Florida and Illinois because they did not provide crime data to the FBI. If you look at later 2018 data or earlier 2012 data, it's the same issue; they just don't seem to provide data to the FBI. With all of this, the 2014 data contains about 180,000 data points. Regarding the FIPS codes: this is how we identified criminal activity in specific counties. For example, we have state codes, such as 01 for Alabama; the states are numbered alphabetically, so Alabama is the first one. Then each county within a state has its own number; for example, Baldwin is 003. So Baldwin, Alabama would be 01003, and so on for every county in each state. Looking at our extra variables, we used the census data. Census data is always great for checking the age, population, and income per county or per state, and we also looked at other data sets covering gender, immigration, religion, marriage, unemployment, and literacy rates. These other data sets give statewide rates and aren't directly related to criminal activity, but we wanted to include them in our data set to see if there was any correlation. Going into our business problem: we want to answer which states in the United States have the highest and lowest crime rates, and why that is so. To answer the business problem, we have to answer these business questions: How can we identify variables that influence crime? Which are the most important factors? Are there crimes that influence other crimes? I'm going to hand it off to Karanveer to talk about plans and methods. Thank you, Grant. Our approach to solving this business problem was to build a regression model, and we used JMP to make it. First, as Grant mentioned, we connected the various databases, that is, the crime data set along with the extra variables such as religion, income, and so on, and we made sure the data looked clean. After that, we ran our regression model, which predicts the crime rate for us. With this, we know the various variables and their importance in determining the crime rate, and we can rank them by importance.
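As a small illustration of the FIPS construction and the crime-rate definition used here (state code 01 plus county code 003 giving 01003, and rate as arrests divided by population), here is a pandas sketch; the file and column names are assumptions rather than the actual FBI UCR or Census schemas.

```python
# Sketch of building 5-digit FIPS codes and a state-level crime rate.
# File and column names are assumptions, not the real FBI/Census schemas.
import pandas as pd

arrests = pd.read_csv("ucr_2014.csv", dtype={"state_code": str, "county_code": str})
census = pd.read_csv("census_2014.csv", dtype={"fips": str})

# 2-digit state code + 3-digit county code, e.g. "01" + "003" -> "01003".
arrests["fips"] = arrests["state_code"].str.zfill(2) + arrests["county_code"].str.zfill(3)
county = arrests.groupby("fips", as_index=False)["arrests"].sum()

merged = county.merge(census[["fips", "state", "population"]], on="fips")
state = merged.groupby("state", as_index=False)[["arrests", "population"]].sum()
state["crime_rate"] = state["arrests"] / state["population"]
print(state.sort_values("crime_rate", ascending=False).head())
```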
At the end, we'll also show you visualizations based on it. As Grant mentioned, we had 42 criminal activity variables. Some of these variables were very small, such as drug possession, drug consumption, and drug sales; in those cases, we simply grouped them so we could reach a conclusion, since the data was otherwise too thin for the subgroups. We look at them statewise, as we didn't have the extra variables on a county basis, but I feel that this is a good starting point for the project. Our target variable is the crime rate, which we defined as the number of arrests within a given population. Now, coming to the variables we are using: most of them have been normalized, and we used percentages, for example for immigration. For gender, we used two levels, male and female, and then religion, unemployment, marriage, and literacy. Most of these are normalized so that the analysis is not misleading. Coming to the final equation of our regression model: this is the equation of the model, with the values rounded off. As you can see, there are a lot of variables with a positive influence, meaning they increase the crime rate, and there are certain variables with a negative sign, which decrease the crime rate. Using this, we can see how to estimate the crime rate in any state or county. Coming to the results. Finding number one: we really wanted to see which states have the highest crime rates. The top five are Tennessee, Wyoming, Mississippi, Wisconsin, and New Mexico. The states with the lowest crime rates are New York, Alabama, Vermont, Massachusetts, and Michigan. Here is a visualization showing how the crime rate varies across the United States; as you can see, there is no clear pattern, and it's all over the place. Finding number two: using JMP and doing a log [inaudible 00:07:57] on the variables, we could see which variables have more importance. Number one was weapons owned, followed by literacy rate, then religion percentage, immigration, population density, and the unemployment rate. I think this is a great finding for any government body or organization that wants to allocate resources when trying to reduce or analyze the crime rate. Finding number three is something really interesting. Our goal was to see whether there are certain crimes that could help us address not just that crime but also other crimes they influence. Drugs and weapons were among them: we could see that drug and weapon crimes have a very high correlation with, say, theft, robbery, and murder, and using a chi-square test, we saw that the association is very strong. So if any organization wants a place to focus on and start with, I think drug and weapon crime is a great category for reducing the crime rate in any state or county.
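A hedged sketch of the two analyses described above, the state-level regression equation and the chi-square test of association between crime categories, might look like this in Python with statsmodels and SciPy; the variable names are placeholders, and binning arrest counts at the median is just one simple way to set up the contingency table.

```python
# Sketch of a state-level crime-rate regression and a chi-square association test.
# Column names are assumptions; an OLS model is one plausible reading of the
# "regression model" described in the talk.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

states = pd.read_csv("state_level_2014.csv")

model = smf.ols(
    "crime_rate ~ weapons_owned + literacy_rate + religion_pct"
    " + immigration_pct + population_density + unemployment_rate",
    data=states,
).fit()
print(model.params)                  # signs show which factors raise or lower the rate
print(model.pvalues.sort_values())   # smallest p-values ~ most important factors

# Association between drug/weapon arrests and theft, via a high/low contingency table.
tab = pd.crosstab(states["drug_weapon_arrests"] > states["drug_weapon_arrests"].median(),
                  states["theft_arrests"] > states["theft_arrests"].median())
chi2, p, dof, _ = chi2_contingency(tab)
print(chi2, p)
```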
This map shows the religion rate, weapons owned, and literacy rate, and how they vary across the United States. If we put it alongside the crime rate, we can see a certain pattern, which is actually explained by our regression model. Now coming to the implications: how we can turn our analysis into a real-world solution. Given the data set we used and the variables we connected, we would definitely want to work with governments, towns, and communities, because crime is a universal problem and something everybody wants to reduce. Resource allocation can be guided by this analysis, which in turn would result in a decrease in crime rate and an increase in happiness in the community. Post-analysis: there are a lot of things we would want to add to this project, and it offers great future scope as well. First, we would want to include more variables, such as weather and ethnicity, and the list goes on; we could even listen to government bodies and take their input on these variables. County-level, or at least city-level, detail: it's great to start with state-wise data, but we would definitely want to move to a more detailed level of analysis, so that we can apply these conclusions to the real world more clearly and precisely and have a greater impact. The data time frame: right now we have used data from 2014, which is an eight-year-old data set. We would definitely want to use a more recent data set, and one spanning several years, to give us more clarity. COVID has affected us in a lot of ways; it has changed how our lives work, and the crime rate and the way crime happens have changed as well. We would definitely want to focus on the post-COVID period, the last two to three years, for a post-analysis. That's all, and thank you.
Heart disease and strokes are two major diseases that have been around for years without a cure. Heart disease is the leading cause of death in the United States, resulting in one death every 36 seconds. Of these deaths, one in six is due to a stroke, which is also the leading cause of long-term disabilities. For our research project, we explore whether these two major diseases have common factors that can predict each other. First, we built a logistic regression model for each disease. Next, we made a new variable, which returns 1 if the person has both diseases and 0 if not. Finally, we did a final analysis to see which variables in these two models can predict both diseases in one equation. From our research, we identified that the variables general health, diabetes, and health coverage are the most useful in determining whether or not a person will suffer from heart disease or a stroke in their lifetime. Hi, my name is Brittany Burlison, and my co-presenter is... I'm Kailey Wilson. We are both second-year master's students at Oklahoma State University, pursuing a master's in business analytics and data science. Today, we are going to present our research on what is most important in determining heart disease and stroke. We will go over our research overview, the methods we used, our data overview, our data analysis, results and implications, and what we've done in JMP. Heart disease and strokes are two major diseases that have been around for years, and there is still no cure for them. Heart disease is the leading cause of death in the United States; a person dies every 36 seconds from heart disease. Of these deaths, one in six is due to a stroke, and strokes are the leading cause of long-term disabilities. For our research, we are looking to see whether these two major diseases have any common factors that can predict each other. We are interested in seeing which factors are most important in determining whether a person will suffer from stroke or heart disease in their lifetime. We want to take variables that correspond to the social determinants of health to see which play a bigger role in determining these major health issues. The data we will be using for analysis comes from the Behavioral Risk Factor Surveillance System, or BRFSS, from the CDC. This is a phone survey that collects data from citizens on a wide range of topics. We will be using data from 2016 to 2020, which contains over 500 fields and over 2 million observations. Some of the fields contain information about households, current health conditions, behaviors, and demographics. Additionally, some states have the option to ask more specific health questions, and those are considered too. We will be looking at the variables people are asked about. For our methods and plans: our data set contains over 500 variables, as we mentioned, so we narrowed that list down to 11 that we deemed the most important in determining heart disease or stroke. We referenced the social determinants of health to help us decide which variables to keep.
We determined a few that we'll go over on the next slide. We are using JMP, specifically the Fit Model platform and Graph Builder. The factors we are considering are a person's sex, age, and race. For our variable selection, we determined that income, housing, education, mental health, health coverage, overall general health, smoking status, diabetes status, divorce, and medical costs were the most important variables to look at. We will use stroke and heart disease as our response variables, and we will look at these variables by gender using the sex variable. We then concatenated all five years of our data in JMP and ran a Fit Model analysis to determine which preselected variables are the most important in determining heart disease and stroke. Kailey will go over our data analysis and what we found. Thank you, Brittany. The first response variable we looked at is heart disease. When sex is 1, that means it's a male. As we can see in our output, the most important variables, based on their LogWorth, were general health, diabetes, and whether the person was a smoker. Even though the R-square is pretty low, meaning only 8% of the variation is explained by these variables, the p-value is very small, which means that the variables we selected are very significant. We see the same for females: the most important variables are general health, diabetes, and smoking. We come to a similar conclusion that the R-square is very low, which makes sense since there are 500 variables, but the variables we selected are still very significant. Next, we wanted to look at heart disease by general health. General health is a variable split into nine buckets, one being excellent health and nine being very poor. When heart disease is one, that means they had heart disease, and when it's two, they did not. As we can see, when general health is two or three, meaning very good or good general health, those two buckets had the highest number of heart disease cases. Next, we wanted to look at stroke. For females, the most important variables among those we selected were diabetes, general health, and education. Again, we have a very low R-square, but the p-value is very small, so all of these variables are still very significant. For males, the most important variables are diabetes and general health. The R-square for this one is the smallest we have seen, but we still have a very small p-value, which means it is still very significant. As we did for heart disease, we built a graph based on general health to see where stroke falls, and again it falls into buckets two and three, meaning people who report very good or good health are the groups most likely to have had a stroke.
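A rough non-JMP sketch of this Fit Model step, a binary logistic regression for heart disease run separately by sex with a LogWorth-style ranking of the predictors, is shown below; the column names are simplified stand-ins for the actual BRFSS variable codes, and the response is assumed to be coded 1 = has heart disease, 0 = does not.

```python
# Sketch of a logistic regression per sex with a LogWorth-style (-log10 p) ranking.
# Column names are simplified stand-ins for the real BRFSS codes; heart_disease
# is assumed coded 1 = yes, 0 = no.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

brfss = pd.read_csv("brfss_2016_2020.csv")
formula = ("heart_disease ~ C(general_health) + C(diabetes) + C(smoker)"
           " + C(health_coverage) + C(education) + income + mental_health_days")

for sex, grp in brfss.groupby("sex"):
    fit = smf.logit(formula, data=grp).fit(disp=False)
    logworth = -np.log10(fit.pvalues.drop("Intercept"))   # larger = more important
    print(sex, logworth.sort_values(ascending=False).head(3))
    print("pseudo R-squared:", fit.prsquared)
```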
Then we created our own variable: when someone had both heart disease and stroke, it returns a value of one, and when they didn't have both, it is zero. Here we can see that for heart disease and stroke together, the most important variables are general health, diabetes, income, smoking, and education. This R-square is our highest, which is really good; it means more of the data is represented by this model, and the p-value is still very small, which means all of these variables are significant. Again, we made a graph to see where this falls in general health, and we can see that when someone has both heart disease and stroke, it falls at three, which is good general health. Our conclusions: we found that the most important variables in determining whether a person will have heart disease are general health, diabetes, smoking, and whether their parents are divorced, and that was for males; for females, it's general health, diabetes, smoking, and income. Looking at stroke, for females it's diabetes, general health, and education; for males, it's diabetes, general health, and health insurance. For both combined, the most important variables are general health, diabetes, income, smoking, and education. Drawing to a close, our overall implications: to help prevent heart disease, people should improve their overall general health, monitor their diabetes, decrease their nicotine use, and so on. To help prevent stroke, people should improve their general health, monitor their diabetes as well, and think about improving their health care plan. Overall, people should focus on their general health to prevent heart disease, stroke, and other diseases. We believe that if doctors and healthcare providers take these factors into consideration, since they are very important in determining whether a person will suffer from heart disease or stroke in their lifetime, they will be able to provide better health care options to their patients. Additionally, we feel that if the general public takes these factors into consideration, it can help reduce the overall risk of stroke or heart disease. We thank you for listening to our research, and if you have any questions, please let us know. Thank you.
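The combined target described above can be sketched in a few lines of pandas, assuming a cleaned extract where both conditions are coded 1 = yes, 0 = no:

```python
# Tiny sketch of the combined heart-disease-and-stroke indicator described above.
# File and column names, and the 1/0 coding, are assumptions about a cleaned extract.
import pandas as pd

brfss = pd.read_csv("brfss_2016_2020.csv")
brfss["heart_and_stroke"] = ((brfss["heart_disease"] == 1) &
                             (brfss["stroke"] == 1)).astype(int)
print(brfss["heart_and_stroke"].value_counts())
```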
With the rise in internet usage during the COVID-19 pandemic, it is no surprise that online chess also grew in popularity. In this study, we investigated and analyzed low to moderately rated online chess players and the games they participated in. We utilized data sets from Chess.com, which provided data on individual players, clubs, tournaments, teams, countries, daily puzzles, streamers, and leaderboards. We utilized JMP and Python to complete our analysis. We took a random sample of low to moderately rated players from October 4, 2020 to March 4, 2021 and noted the portable game notation and specific moves completed by each player. When beginner-level chess players utilize certain moves consistently, they are more likely to see consistent wins, thereby increasing their status from beginner to moderate player. When these moves are looked at on an individual basis, their impact on success is unclear; however, when move combinations were examined, the prediction of success was much more accurate. The results of our analysis allowed us to identify series of moves that most moderately rated players employ leading up to a game-losing move. Because the COVID-19 pandemic occurred during the data collection, the data may be skewed; external environmental factors such as the pandemic may lead to inaccurate results and findings. This research and analysis aims to help chess trainers and coaches better formulate strategies and training exercises to help beginner to moderately rated players improve their skills. Introduction: Chess is one of the oldest and most widespread sports in the world. With the introduction of new technology and increasing internet accessibility, people have been given the opportunity to play chess in virtually any area of the world. As popularity and access continue to increase, it is important for players to understand the best way to improve their game. In this study, we investigate chess players with a low or moderate rating. Games these players participated in are examined in depth to allow us to better understand the blunders and mistakes that determine the results of games and, in turn, change player ranking. To first grasp the reach of our study, we set out to understand external factors that have affected the population of the chess community, such as advancing technology and the COVID-19 pandemic. The research provided will help players and readers reach a better understanding of the openings and tactics that are most beneficial to low and moderately rated players when navigating online chess. Here, low-rated players are defined as players rated between 800 and 1000, while moderately rated players are defined as those rated between 1000 and 1300. Our research is limited to six countries: Canada, Australia, the United Kingdom, the United States, India, and Bangladesh. The overall purpose of this study is to pinpoint consistent blunder and mistake patterns in moderately rated players and use them to devise strategies that will increase competitiveness and wins. This study points toward a direction in which an optimal winning strategy can be determined, ultimately helping online chess players change their player ranking. Data Overview: The data collected for this research was provided by Chess.com, one of the top online chess communities, which offers players online chess games for free.
We accessed the website's public API, where we gathered data about individual players, such as their profile, titled-player status, stats, and online status. We also had access to specific games, including current daily chess, to-move daily chess, the list of available archives, monthly archives, and multi-game PGN downloads. To complete a more accurate and in-depth analysis, we also downloaded country-level data, including the country profile, the list of players in each country, and the list of clubs within each country.

Access to data: https://www.chess.com/news/view/published-data-api / https://lichess.org/api

Mined data:
Username: username of both players
Elo: Elo rating of both players
Result: result of the match
ECO Code: unique code indicating the opening employed in the game
PGN (Portable Game Notation): the entire series of moves in the game, in text format

Generated data:
Blunder PGN: PGN of the moves leading up to a blunder
Mistake PGN: PGN of the moves leading up to a mistake

Method: Our approach began with cleaning and mining the accessed data. The data were mostly clean when received, but minor edits and changes were needed before we could move forward with the analysis. After the data were cleaned, reviewed, and processed, we took a random sample of low to moderately rated players in the United States, United Kingdom, Canada, Australia, India, and Bangladesh from October 4, 2020 to March 4, 2021. For each randomly selected player, we investigated five rapid games that the player participated in. After our initial assessment and investigation, we used Python code to merge, join, and compare the data sets compiled for each country and its selected players. A random sample of 1,000 games was selected from the pool of users in our target rating range of 1000-1400. These games were then analyzed with Stockfish at a depth of 20. Using the Stockfish evaluation of the position after each move, we computed a score quantifying which player has the better position. The unit of this score is the centipawn: a score of +100 centipawns signifies an advantage of one pawn for the white player over the black player. After each move, a new score was calculated along with the change in score from the previous move. We define two classes of moves: a blunder is a move that has cost the player at least a 500-centipawn disadvantage, while a mistake has a threshold of 300 centipawns. Blunders create a worse position for the blundering player, leading to higher losing chances. By identifying blunders and mistakes, we generate the variable Blunder PGN, a PGN string with the series of moves leading up to the blunder; a sketch of this step is shown below.
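The centipawn bookkeeping described above can be sketched with the python-chess and Stockfish tools cited in the references below. This is a minimal illustration under stated assumptions, not the exact pipeline used in the study: the PGN file name, the mate_score convention, and returning the move prefixes as space-separated SAN strings are choices made here for brevity; only the 500/300-centipawn thresholds come from the text.

```python
import chess
import chess.engine
import chess.pgn

BLUNDER_CP, MISTAKE_CP = 500, 300          # thresholds described in the Method section

def classify_moves(engine, game, depth=20):
    """Return the move prefixes ("Blunder PGN", "Mistake PGN") for one game."""
    board = game.board()
    blunders, mistakes, san_so_far = [], [], []
    # Evaluation of the starting position, always from White's point of view.
    prev = engine.analyse(board, chess.engine.Limit(depth=depth))["score"].white().score(mate_score=10000)
    for move in game.mainline_moves():
        mover_is_white = board.turn            # who is about to move
        san_so_far.append(board.san(move))     # record SAN before pushing the move
        board.push(move)
        cur = engine.analyse(board, chess.engine.Limit(depth=depth))["score"].white().score(mate_score=10000)
        # Centipawns lost by the player who just moved.
        drop = (prev - cur) if mover_is_white else (cur - prev)
        if drop >= BLUNDER_CP:
            blunders.append(" ".join(san_so_far))
        elif drop >= MISTAKE_CP:
            mistakes.append(" ".join(san_so_far))
        prev = cur
    return blunders, mistakes

if __name__ == "__main__":
    # Assumes a local Stockfish binary on PATH and a PGN file of the sampled games.
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    with open("sampled_games.pgn") as pgn:
        game = chess.pgn.read_game(pgn)
        while game is not None:
            blunder_pgns, mistake_pgns = classify_moves(engine, game)
            print(len(blunder_pgns), "blunders,", len(mistake_pgns), "mistakes")
            game = chess.pgn.read_game(pgn)
    engine.quit()
```

In practice, running depth-20 evaluations for every position of 1,000 games is slow; the analysis limit can be tuned, but the classification logic stays the same.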
Results: Using the Blunder PGN and Mistake PGN, we were able to identify the series of moves that most moderately rated players employ leading up to a game-losing move. We identified the 3 blunder and 4 mistake PGNs that players struggle with most among all move combinations at our target rating level.

Pic 1: Scandinavian Defense and its success rate
Pic 2: Blackmar Gambit and its success rate
Pic 3: Center Game and its success rate
Mistake-prone openings

Implications: Black players should refrain from the Blackmar Gambit and the Scandinavian Defense. White players generally have an advantage but tend to struggle with the Center Game openings. While different openings present different problems, a general trend of weak opening principles is observed in blundering players, specifically: pawn sacrifices without compensation, queen safety, and development of pieces.

Conclusion: Moderately rated players play most accurately when they employ standard openings such as the London System and the Giuoco Piano Game, and hence should be trained on these fundamentals first before moving on to complicated openings.

References:
https://www.chess.com/analysis
https://python-chess.readthedocs.io/en/latest/pgn.html
https://stockfishchess.org/

All right, good afternoon. Today I'm going to be talking about the predictive analysis of online chess outcomes and success. My name is Allison Clift, and I had the opportunity to work on this project with another student in my business analytics program, Calbe Abbas Agaria; however, he is not with us here today. To begin, we analyzed low and moderately rated online chess players. Since the COVID-19 pandemic there has been an increase in internet usage, and with the advancement of technology, people have switched over to playing online chess as it is more readily available to users. We wanted to look at the effectiveness of different game strategies, specific moves, and individual techniques, and their impact on potential wins or losses in the game of chess. Player data were pulled from chess.com, where we were able to view the player's profile, titled-player status, statistics, and online gamer status. We utilized JMP and Python to complete the study. We noted the Portable Game Notation, also known as the PGN. This was used to determine the openings, blunders, and mistakes that occurred during each game. We learned that looking at individual moves on their own was not as predictive as looking at move combinations as a whole; the prediction of success was much more accurate when we looked at different move combinations. We were able to identify the moves that moderately rated players employ leading up to game-losing moves such as blunders, as well as the opening moves that led to more success. The analysis aims to help chess trainers and coaches find weak points in beginner to moderately rated players and help them increase their player rating. They will also be able to formulate better strategies and training exercises to help these players improve their skills. Like I said, the increasing popularity of virtual chess really encouraged us to complete this study. We wanted to investigate and understand the differing game strategies employed by beginner and moderately rated players, determine the optimal winning strategy for these players to help them increase their rating on the online platform, and learn how to help these players settle on a specific strategy to use moving forward. To begin with our methods, we started by sampling the data we received from chess.com.
After cleaning and mining the data, we collected a random sample of players from the United States, the United Kingdom, Canada, Australia, India, and Bangladesh. From our own research, we found that these are the countries where chess has been most popular in the past few years, so we really wanted to look at that data in particular. Specifically, we looked at data from October 4, 2020 to March 4, 2021. We did this in order to avoid potential implications from looking at data that occurred during the COVID-19 pandemic, when internet usage was at its highest. We also did some feature generation: we generated two features that let us determine the move combinations that led up to blunders or mistakes. Here we created the Blunder PGN and the Mistake PGN. The Blunder PGN is simply the record of moves a player made leading up to a blunder in chess, and the Mistake PGN is the collection of moves a player made leading up to a mistake. This is what allowed us to complete our analysis. Next, we used Python code to merge, join, and compare all of the data we collected. The data comprised five games per player from players rated about 1,000 to 1,400, and we randomly selected 1,000 games from this pool. We evaluated these games using Stockfish at a depth of 20. To describe these measures a little more: positions are measured in what we call centipawns in chess. A score of plus 100 centipawns signifies an advantage of one pawn for the white player over the black player. A blunder means that a move made by a player has cost them a 500-centipawn disadvantage, and a mistake is equivalent to a 300-centipawn disadvantage. A blunder is normally what occurs in a game-losing move. Down at the bottom you can see some of the analysis we conducted in JMP. The graph here shows the top ten most-used openings in blunders. As you can see, the number one opening that leads to blunders is the Queen's Pawn Opening: London System. Second is the Scandinavian Defence, which is often used and can also lead to blunders. I will mention these again later in the results and conclusions of our presentation. At the bottom you can see two graphs that show the number of wins occurring per level of player, from the lowest-rated players up to the highest-rated players, along with the average number of losses for comparison. Over on the right you can see the blunder flag, which is just the white player versus the black player, and at the bottom is the list of frequencies that occur for the moves listed on the left. For example, when we look at the London System opening, it is about half and half between white players and black players in the win-loss ratio. However, when we look at the Scandinavian Defence, we can see that white players make blunders more often than black players.
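The tallies behind this slide, how often each opening shows up among blunders and which color blundered, can be reproduced outside JMP with a small pandas aggregation. This is only a hedged sketch, not the authors' script: the flat table and its column names (opening, blunder_colour) are assumptions about how the analyzed games were stored.

```python
import pandas as pd

# Hypothetical flat table with one row per detected blunder; column names are
# assumptions about how the engine output was tabulated before loading into JMP.
blunders = pd.read_csv("blunders.csv")      # assumed columns: opening, blunder_colour

# Top ten openings by number of blunders (the first bar chart described above).
top10 = blunders["opening"].value_counts().head(10)

# White-versus-black split of blunders within each opening (the "blunder flag" panel).
by_colour = (blunders.groupby(["opening", "blunder_colour"])
                      .size()
                      .unstack(fill_value=0))

print(top10)
print(by_colour.loc[top10.index])
```

These counts are the raw material that a bar chart like the one described above would be built from.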
When we look at our results using the Blunder PGN and Mistake PGN features we developed, we were able to identify a series of moves that most moderately rated players employ leading up to a losing move. We identified three Blunder PGNs and four Mistake PGNs that players struggle with the most among all combinations. First, black players should refrain from the Blackmar Gambit and the Scandinavian Defence. The Blackmar Gambit results in only about 29.3% wins for black players, and the Scandinavian Defence yields only about 27.7% wins for players playing the black pieces. White players generally have an advantage here, though they do struggle with center openings: when we look at the moves and openings white players use when they push straight forward in the center, they tend to lose games more often. Lastly, weak openings and blundering players: we identified a few weak opening principles that consistently led to blunders for both colors. These were pawn sacrifices without compensation, queen safety, and the development of pieces. Looking at all of this data and all of our results together, we were able to come to a conclusion: moderately rated players are most accurate and successful when they employ standard openings. They should be trained on the fundamentals of chess before moving on to complicated openings. Some of the openings we suggest beginner players start off with are the London System and the Giuoco Piano Game. At this time, I would just like to thank you, and I will be happy to take any questions you have about the report.
JMP has been used by our interdisciplinary group at the NIH Clinical Center for the analysis of clinical research data to test and develop data-driven hypotheses supporting our bench-to-bedside-to-community-and-back translational model. We will present a workflow exemplar: a visualization of correlates between antibiotic use and patient-specific oral microbiomes. Starting with a spreadsheet of more than 2,000 entries of antibiotic medication use in patients with a rare disease, and a separate spreadsheet of bacteria present in the oral microbiome of each patient, we created a visualization of longitudinal antibiotic use through the course of the treatment program and correlated the use of the antibiotics with oral microbiome diversity metrics. It is well known that the use of antibiotics can perturb the normal human microbiota, yet its global effect on the oral microbiome remains unclear. We describe how the JMP Graph Builder tool was used to further explore whether antibiotics may have affected the oral microbiome in this rare-disease patient cohort. The graphical nature of JMP has been used as a tool for data analytics within our group for years and has facilitated the publication of frequently cited, peer-reviewed translational clinical research articles.

Hello. My name is Jennifer Barb, and I'm a research scientist at the National Institutes of Health Clinical Center. I'm going to talk to you today about how I used JMP to manipulate research clinical medication data and how I was able to create a publication-quality figure showing how patient medications were used through the course of a treatment protocol at the Clinical Center. Clinical data, especially in a research setting, can be extremely noisy. There are a lot of staff and personnel involved in research protocols, and the collection and storage of pertinent research data are not always streamlined. I will talk about how I used the JMP Graph Builder tool to visualize patient medication prescriptions through the course of a six-month treatment protocol, and how we were able to visualize what is called a Shannon Diversity Index in relation to the antibiotics that were prescribed to the patients [inaudible 00:00:55] in the clinical research setting. I will go through how I created the illustration in JMP using four patients with a very rare disease who were enrolled in the treatment protocol. As part of the treatment regimen of this protocol, the four patients were prescribed a range of antibiotics, totaling up to 21 different types of medications. The data were provided to me in a long format, including a start and stop date of medication administration. As you can see here, I have zoomed into the first figure of the poster, which is a snapshot of what some of the research data look like. In the long format, you see that there are repetitive rows of the patient ID and repetitive rows of the different medications that the patient received during the treatment protocol. There's a lot of redundancy here. In addition to that, we have a start-of-medication date and a stop-of-medication date for each prescription that each person received.
One of the first steps I had to take with the JMP data manipulation tools was to edit the medication name so that it did not contain so many words or include the dosage information, letting us use it as one of the axes of the graph I was going to make. In addition, I had to check the date of patient consent into the treatment program and see whether the start and stop dates of that person's medication administration fell within the treatment protocol. From that point, I had to normalize each person's medication start and stop so that everybody had a day one and everything corresponded to the same point in the treatment protocol. All of this information would be used to create the figure that I will show at the end. Once I had edited the medication names and created the normalized medication start and stop, I could then use the Graph Builder tool. I also wanted to talk about one other aspect of this particular research protocol: the fact that we wanted to look at the oral microbiome of the patients in the treatment program. What this means is that we took tongue brushings from each patient and then converted those into counts of the specific bacteria found in their mouths. We wanted to look at how the antibiotic treatment over the course of the treatment protocol might have affected the oral microbiome. As we know, antibiotics can drastically change your gut microbiome and can cause increases and decreases in different microbial diversity in the gut. But one question that has not been elucidated is whether antibiotic use would also affect the oral microbiome. What I'm showing here is that we have built a set of scripts within JMP, installed on the toolbar, that calculate the Shannon Diversity of the bacterial counts in the table associated with the medications I showed on the previous slide. Back to the medication table: the first step I took was to open up the JMP Graph Builder tool and drag and drop the medication start and stop dates onto the X-axis, as shown here. Then I went to the bar graph tool and clicked it to turn the data into a bar graph. The third step was to drag and drop the shortened antibiotic medicine name onto the Y-axis. Finally, in order to create the graph so that I could visualize the longitudinal duration of medication administration, I changed the bar type into stock. And as I mentioned earlier, because we coded the treatment time of the protocol based on the medication start and stop, we were also able to stratify the antibiotic use into the different time points of the treatment protocol, as shown here. If you are familiar with the JMP Graph Builder tool, you know there are many different possibilities for manipulating data to get the particular graph you want.
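For readers without JMP, the same interval-style medication timeline can be approximated in a few lines of Python. This is only a rough analog of the Graph Builder steps described above, not the authors' workflow; the file name and column names (patient_id, med_name, start_day, stop_day) are assumptions, with start and stop already normalized to protocol days.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical long-format table, one row per prescription, with start/stop
# already normalized to protocol days (column names are assumptions).
meds = pd.read_csv("medications_normalised.csv")   # patient_id, med_name, start_day, stop_day

fig, ax = plt.subplots(figsize=(8, 6))
order = {name: i for i, name in enumerate(sorted(meds["med_name"].unique()))}

# One horizontal bar per prescription, spanning start_day..stop_day,
# mimicking the interval-style bars built in Graph Builder above.
for _, row in meds.iterrows():
    ax.barh(order[row["med_name"]],
            width=row["stop_day"] - row["start_day"],
            left=row["start_day"],
            height=0.6)

ax.set_yticks(list(order.values()))
ax.set_yticklabels(list(order.keys()))
ax.set_xlabel("Protocol day")
plt.tight_layout()
plt.show()
```

Coloring each bar by patient, as described next, would simply mean mapping patient_id to a color and passing it to barh.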
And finally, one last thing we did was take the patient ID from the medication table and color each bar on the graph by patient. The final figure looks like this. What you see here is all of the different antibiotics that were prescribed in the treatment protocol. You also see time point B, which is the time point between baseline and the treatment phase of the protocol, and time point C, which is the intervention point, running from time point C to the end of the treatment protocol. Each longitudinal bar indicates the amount of time a person was on a given antibiotic, and each of these bars is stratified by patient color. This particular figure did end up going into the publication, and it was another way to take a large table of medications downloaded from our research database and put it into a graphical form to visualize all of the different medications that each patient received during the treatment. Now, finally, you might ask: why do we want to look at this? One thing of importance for us was to look at oral microbial diversity. As I mentioned, we were able to take a separate table that corresponded to the patients in the treatment protocol and calculate what is called a Shannon Diversity metric. A higher index indicates higher oral microbial diversity, and a lower index indicates lower microbial diversity. From within JMP, we were able to superimpose the treatment leg between time points A and B and the change in the diversity metric from the start of the treatment to the end of the treatment. We were also able to look, within one patient, at how the different antibiotics corresponded to this. Then, in the second leg of the protocol, we were able to see a slight rebound of the diversity index in correlation with the number of antibiotics that were used in that treatment leg. In conclusion, we were able to visualize patient-prescribed antibiotics through the course of a treatment protocol using the JMP Graph Builder tool. We took a table of 1,289 rows of medications employed in the protocol and created a simplified visualization. We were also able to calculate a Shannon Diversity Index on the bacteria data associated with each person's oral samples. We superimposed these two graphs, which allowed us to draw conclusions on how the antibiotics prescribed to each patient might have affected the oral microbiome of individuals in the treatment protocol. Finally, our group has used the graphical nature of JMP for many years as a way to translate complex medical research data into data-driven discovery and investigation. The use of JMP has facilitated many publications in highly cited research journals for our group. Thank you for your time today.
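For readers who want to reproduce the Shannon Diversity Index outside the JMP toolbar scripts described above, a minimal Python sketch of the same calculation follows. The counts file and its layout (one row per oral sample, one column per bacterial taxon) are assumptions made for illustration; the index itself is H = −Σ p_i ln(p_i), computed over the proportions of each taxon within a sample.

```python
import numpy as np
import pandas as pd

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over the non-zero taxon counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

# Hypothetical layout: one row per oral sample, one column per bacterial taxon.
otu = pd.read_csv("oral_bacteria_counts.csv", index_col="sample_id")
diversity = otu.apply(shannon_index, axis=1)   # one Shannon value per sample
print(diversity.head())
```

The natural log is used here; some groups report the index in log base 2, which only rescales the values.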