The request for quotation (RFQ) process can be optimized using machine learning to predict the price of pre-engineered metal buildings (PEMBs) from historical data. The workflow involves collecting PEMB data, preprocessing it by removing missing values and handling outliers, selecting relevant features through feature engineering, and splitting the data into training and testing sets. Common algorithms such as linear regression, decision trees, random forests, or neural networks can be used to build the model, and performance metrics such as mean squared error (MSE) and root mean squared error (RMSE) are used to assess it.

Python and JMP are two tools that can be used for this predictive modeling. Python is a popular programming language with data analysis and machine learning libraries, while JMP is a commercial software package for statistical analysis and data visualization. Choosing between the two depends on project requirements, user skills, and available resources.

By building a predictive model for PEMBs, organizations can more accurately estimate the weight of future projects, leading to better budgeting, planning, and resource allocation.

Hello. My name is Najmeh Dehghani. I'm a Data Engineer at JLL, and I have a Master's degree in Analytics from Texas A&M. My capstone project is on RFQ process optimization for pre-engineered metal buildings. I got to know JMP and SAS in school at Texas A&M and found JMP a wonderful tool for performing analysis, but my capstone project was done in both JMP and Python. The objective of the project is to predict the weight of a pre-engineered metal building from various geometric and geographic requirements.

Let's start with an introduction to metal buildings. Pre-engineered metal buildings are structures that use steel frames to support the roof and wall panels. They can be customized to the customer's requirements and can be commercial or residential. Because they are pre-engineered, the dimensions are precise, and they are flexible: the design can be changed to meet customer needs.

What we have to do in this project is estimate the weight in order to predict the price. The steel structure is a critical factor in metal building pricing, so if we can predict the weight, we can eventually predict the price and improve the RFQ process.

Let's take a look at the RFQ process we want to optimize and what the current system does. The business challenge in this project is to improve the RFQ process for metal buildings: improve response time, reduce server cost, and mitigate missed opportunities for the company. To send an RFQ back to a client, the company goes through several steps. They review the incoming inquiries and specify the purpose, size, layout, and design requirements of the building.
They check all of the technical specifications, load capacity, roofing, insulation, and other factors to prepare an accurate quote that covers all of the materials, labor, and cost; calculate the price; and send the RFQ back to the customer.

Their current system for processing RFQs looks like this. They have three servers, and each server processes about 480 RFQs per day. Each RFQ takes about 180 seconds to process, and the cost per RFQ is approximately 75 cents. Dividing these numbers gives roughly $0.004 per second to process an RFQ. At 480 RFQs per day per server, that is about $360 per day per server, or roughly $1,000 per day across all three servers. Over 365 days a year, that comes to about $400,000 in server and maintenance cost to process all of the RFQs. We want to see whether we can improve this process by predicting the weight from the historical data.

The methods I used on this project follow most machine learning prediction projects: data collection and preprocessing, feature engineering and selection, and model selection.

We started by defining the business problem: predicting the weight of the metal building. For data collection and preprocessing, about 182 SQL tables were provided to me as a subset of historical data from the manufacturer, containing around 500,000 records and more than 500 features, US data only. I performed the ETL process, cleaned the data set and found abnormal data in SAS, Python, and JMP, and integrated all of the data into one table with the help of these three tools.

My next step was feature selection and extraction. For preliminary feature selection, I started with an industry study and expert judgment to eliminate features that were clearly unrelated to the target. I then ran correlation analysis to examine correlations and find multicollinearity among my features and eliminate more of them, followed by descriptive analysis and visualization to check the relationships in the data, which I did in JMP, Python, and Tableau.

The next step was to explore models and algorithms based on the characteristics of my data. Since my target is a continuous (interval) variable, I looked into linear regression, stepwise selection, random forest, decision trees, and a few other algorithms. I trained and evaluated different models in both Python and JMP, compared their performance metrics, and did cross-validation and k-fold validation for all of my models. In the middle here you can see the models I trained my data on; some of them are in Python, some of them are in JMP.
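As a concrete illustration of the Python side of that workflow, here is a minimal sketch: train/test split, a decision tree and a random forest, and MSE/RMSE/R² on the hold-out set. The file name and column names are placeholders, not the project's actual data, and the features are assumed to already be numeric.

```python
# Minimal sketch of the Python modeling workflow described above.
# "pemb_history.csv" and the column names are placeholders for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("pemb_history.csv")      # integrated historical table
df = df.dropna()                          # drop rows with missing values

X = df.drop(columns=["steel_weight_lb"])  # geometric/geographic features (numeric)
y = df["steel_weight_lb"]                 # target: building weight

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "decision tree": DecisionTreeRegressor(random_state=42),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    print(f"{name}: MSE={mse:.1f}, RMSE={mse ** 0.5:.1f}, "
          f"R^2={r2_score(y_test, pred):.3f}")
```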
To see which model is best, I compared the models based on ASE (average squared error) and also checked the R-squared to see how close the predictions would be to the actual data. This is the residual plot for all of the models. Out of all of the models I ran, two are the best: a decision tree in Python and a boosted tree in JMP. Looking at both of them, most of the features they selected are the same, and even the importances are close to each other. Looking at the ASE, the decision tree is at about 4,000, while the boosted tree is at about 2,000, which is much better. The R-squared for the boosted tree is also a little higher. The residuals from both tools look good: predictions are close to the actual values, and the residuals are centered around zero. Both models predict well, which means we can predict the weight of the building from the historical data we have.

Let's see how this impacts the business. We have a good model that performs well; if the business implements it, what are the benefits? The first chart is the one I already presented as their current system: three servers, each processing 480 RFQs per day, at a cost of $360 per day per server. Suppose we process the same 480 RFQs, but with the model implemented we can lower the processing time to one second, because the model returns a price almost instantly. With the cost per second the same, but a processing time of one second instead of 180 seconds, the total cost drops to about $2 per day per server, or $6 per day for three servers. Over 365 days, that is about $2,000 per year. Compared with what they currently pay for servers and processing time, that is almost a 99% cost saving for the company.

Surprisingly, that is not the only saving; they can also capture more opportunities. Right now one server processes 480 RFQs per day, but if processing time drops to one second, it could process about 80,000. With a win rate of about 4%, they currently win about 19 quotes per day; by processing more RFQs with the model, they can win more and bring in more opportunities. In effect, they are currently missing about 3,000 opportunities per day.
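The cost comparison above is simple arithmetic; here it is worked through with the figures quoted in the talk, rounded the same way.

```python
# Worked version of the cost comparison quoted above (all inputs from the talk).
rfqs_per_day = 480          # RFQs per server per day
cost_per_rfq = 0.75         # dollars
seconds_current = 180       # current processing time per RFQ
seconds_model = 1           # processing time with the predictive model
servers = 3

cost_per_second = cost_per_rfq / seconds_current               # ~$0.004/s
current_daily = rfqs_per_day * cost_per_rfq                    # $360 per server
model_daily = rfqs_per_day * cost_per_second * seconds_model   # ~$2 per server

print(f"current: ${current_daily * servers * 365:,.0f}/year")    # ~$394,000
print(f"with model: ${model_daily * servers * 365:,.0f}/year")   # ~$2,200
print(f"saving: {1 - model_daily / current_daily:.1%}")          # ~99%

# Missed-opportunity estimate: at 1 s per RFQ a server could handle ~86,400/day.
win_rate = 0.04
capacity_model = 24 * 60 * 60 // seconds_model
print(f"extra wins/day/server: {(capacity_model - rfqs_per_day) * win_rate:,.0f}")
```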
That is probably more than their capacity, but it is a point they can consider for business development. That is my project: we were able to build a good predictive model, and we can see it would have a huge business impact for the company. I hope you enjoyed it, and thank you so much to Texas A&M and to Dr. Jones as my capstone coach.
Disease often shows a predilection for specific site locations. To demonstrate the body regions involved, researchers may use infographics or tables. While tables may be more descriptive and universal to any statistical software, heat maps rapidly communicate disease involvement across a cohort. At our institution, software to portray data within the confines of a human body is not readily available. Additionally, we needed the human body to be segmented into specific regions. While a graphic editor can be used, it is labor intensive, limits generalizability, and cannot be repeated with different data sets.

To create a scalable solution, we used the Custom Map Creator in JMP, an add-in to Graph Builder that allows users to create custom shapes. With this program, we outlined a human body, created anatomic regions, and colored each region based on the percentage of occurrence. The selected color range is scaled to directly correlate with the data. Each anatomical subunit is then filled and shaded by the data in the respective column. Heat maps provide easy interpretation and clearly portray the most involved areas for rapid communication to readers.

Today, I want to give a short presentation about using the JMP Custom Map Creator and how we used it to develop a human heat map. I'd like to give a shout-out to my mentor, Dr. Benjamin Caffender, who helped push this project along.

A little bit about the Custom Map Creator: it allows you to create custom shapes that can be displayed in Graph Builder. This lets you take statistical data, percentages in particular, and present them in a color-coded, color-referenced way. JMP comes with built-in maps of the United States and other places in the world, so you can show that in a given state a certain percentage of whatever you are measuring is occurring; that gives you the heat map approach, but the add-in also allows you to create different objects of your own.

For us, this was a dermatology project with cutaneous disease, so we wanted to show lesion involvement: which parts of the body were affected by the disease. We had the face at 90%, the scalp at just under 20%, and so on, and we wanted to show this dynamically. We ran into the question of the best way to go about it, did some digging, and found this Custom Map Creator.

What we did is take an outline of an image, and you can see all these dots: we traced the image and created our different body areas, and JMP sets those as specific shapes into which you can input data that will appear in those areas. You can take whatever picture you want; this is an example of one that we used. The data then gets incorporated into those areas. On the left of our image you can see that the face has significant involvement with the darker red, correlated with the 90% involvement, versus lighter areas such as the trunk or upper extremities. The Custom Map Creator lets you create those areas and then put those percentages in very easily.
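Outside JMP, the same idea, named regions filled on a color gradient scaled to the data, can be sketched in a few lines of Python. The rectangles below are stand-ins for traced body regions, and the percentages are illustrative, not the study data; this is only a conceptual analogue of what the add-in does, not the JMP workflow itself.

```python
# Conceptual sketch only: rectangles stand in for traced anatomic regions,
# and the percentages are made up for illustration.
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from matplotlib import cm, colors

involvement = {"face": 90, "scalp": 18, "trunk": 35, "upper extremities": 12}
positions = {"face": (0, 3), "scalp": (0, 4.2), "trunk": (0, 1.5),
             "upper extremities": (1.3, 1.5)}

norm = colors.Normalize(vmin=0, vmax=100)   # scale colors to 0-100% involvement
cmap = cm.Reds                              # light -> dark as involvement rises

fig, ax = plt.subplots()
for region, pct in involvement.items():
    x, y = positions[region]
    ax.add_patch(Rectangle((x, y), 1, 1, facecolor=cmap(norm(pct)),
                           edgecolor="black"))
    ax.text(x + 0.5, y + 0.5, f"{region}\n{pct}%", ha="center", va="center")

ax.set_xlim(-0.5, 3); ax.set_ylim(0.5, 5.8); ax.set_aspect("equal"); ax.axis("off")
fig.colorbar(cm.ScalarMappable(norm=norm, cmap=cmap), ax=ax, label="% involvement")
plt.show()
```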
Rather than talking about it, I'd like to show you a little demo of how it works. I already have JMP open. Once you install this add-in, it shows up under Add-Ins, Map Shapes, and then Custom Map Creator. When we open the Custom Map Creator, we get these different data tables that pop up, as well as the map creator itself.

What you do is paste an image in. I have a file over here that I will just drag and drop in; you can copy and paste, however you like. You can adjust and zoom to get it exactly how you want, and give it a name. Now you start drawing, you start tracing. I'll just quickly click through here; that's one shape. We hit Next Shape. You can change the names of your shapes to whatever you like; when we were using it, we had different body areas.

If you look on the right at the data tables, in this example X, Y table we're seeing the very specific data points, the dots that are being graphed, and over here the shape IDs are being placed. You can see up here that I'm working on shape five. You hit Next Shape, and as you do, you get the next shapes; these are linked between the two data tables.

Now let's say I've finished my image and have everything traced out. You can hit Finish, which just saves it, or you can Finish With Random Data, which I think is an easy way to show it and also to have data actually show up in the right places. Hit Open, and now we have our graph. We can see the data it gave us over here: shape one I'll drop down to one, and shape five I'll put at 100. You can see that shape five, at 100, is our red and has the most involvement, while shape one down here at the lower value is the darker blue. You're able to get that stratification. You can either use your own data table or just type your own data into these shapes, and it adjusts in real time.

What's really nice about this add-in is that you can go right here and change the gradients. Right now we have red to blue; you can change it to yellow to red, so the higher values get the darker colors and lower involvement gets the lighter colors, really easily.

The two big things that are important to know as you're doing this: if you are using your own data table, the shape name column needs the Map Role column property, and we see here that this map name data table is connected to our example name .jmp file, which is this file over here. It is super important that those are linked. If you hit Generate Random Data, it does this for you; if you have your own data table, you have to set it manually. You'll also notice that if you manually create your own data table, shapes 1, 2, 3, 4, 5 here match up perfectly with shapes 1, 2, 3, 4, 5 there. Those names have to be exact, and those roles have to be exact.
If you do that, then you can have this data pop up really easily. You can also get these data tables through Graph Builder as well. That's just a quick overview of how to use this add-in. It's very versatile; as I said, we were able to use it for our project. I hope this gave you a brief bit of guidance. I've uploaded some instructions on how to use this as well, which you can also find. I hope you have a great day and enjoy the rest of the presentations you get to watch.
Screening experiments often require both continuous and categorical factors. Engineers generally prefer continuous factors with three levels over those with two levels because of concerns about curvature in the relationship between a continuous factor and the response.   This presentation provides a method for constructing designs for roughly equal numbers of continuous factors with three levels and categorical factors with two levels. If there are m three level continuous factors and m-1 two-level categorical factors, our designs have 2m runs. Note that m may be odd or even. As for any saturated main effects plan, our designs depend on an assumption of effect sparsity and negligible two-factor interactions (2FIs) to provide accurate estimates of the main effects. With substantial effect sparsity or by not making use of all the columns of the design, it can be possible to identify a large quadratic effect or 2FI. The columns of the two-level factors are orthogonal to the columns of three-level factors. This fact can be used to provide a design-centric analytical approach. When the number of runs is a multiple of eight, all the design columns are mutually orthogonal. This talk provides examples of the designs and shows how to create them.     Hello,  I'm  Ryan  Lekivetz,  Manager of  the  DOE  and  reliability  team  at  JMP.   Today  I'm  going  to  talk  to  you  about a  class  of  saturated  mix- level  main effects  designs  for   even  number  of  runs. That  sounds  like  a  mouthful, but  we'll  get  an  understanding  of  what that  is  by  the  time  we're  finished. If  you  see  in  my  JMP  journal  here, here's  a  rough  idea  as  to  what the  outline  is  going  to  look  like. Now,  at  the  beginning  of  though, I  do  have  to  call  out, so  I  have  three  collaborators   on  this  project. Bradley  Jones,  who  should  be  familiar to  many  of  you  at  JMP, Dibyen  Majumdar  and  Chris  Nachtsheim. Just  some  history  and  preliminaries . I'll  say  usually... If  you  see  in  there, we  talked  about  main  effects  designs. Now,  usually,  when  we  think  of  screening  designs, we  often  think  of  factors  being  all  at  two  levels. Your  standard  fractional  factorial  designs  that  you  see  in  textbooks, but  really  there's  this  big  question, what  about  nonlinear  effects? Of  course,  it's  great  we  want  to  find  out those  most  important  main  effects, but  what  happens   if  there  is  some  nonlinearity? In  particular,  in  2011, Definitive  Screening  Designs   or  DSDs  hit  the  scene. The  big  thing  with  those  DSDs, here  we  were  assuming   all  the  factors  were  continuous. Each  factor  and  DSDs are  going  to  be  at  three  levels. That  gave  some  hope  of  being  able  to  detect  quadratic  effects when  they  were  large. I'll  say  that  was  one of  the  big  popularity  of  DSDs. The  designs  we're  going  to  talk  about  here. Our  main  effect,  if  you  think  of  DSDs, not  only  were  they  good  at  picking  up  main  effects, we're  also  looking at  quadratics  and  interactions. The  designs  we're  looking  at  here are  really  main  effects  screening as  our  primary  goal. By  main  effects  screening, I'm  saying  we  have  our  list  of  factors and  we  want  to  find  out  which  of  those are  the  primary  drivers. We  want  to  find  out  those significant  main  effects. If  we're  really  lucky, we  may  get  some  quadratic  interactions. 
But again, main effects screening is the big thing. What you're going to find here is a mix of three-level and two-level factors. Whereas the DSDs had everything at three levels, here we have more of this mix of three and two levels. One thing to pay attention to, just like Definitive Screening Designs: when we're talking about these three-level factors, we're going to assume they're continuous. In particular, that means we're not looking for balance. You often see titles that talk about mixed levels, meaning three- and two-level factors, but traditionally, when we think of orthogonal arrays, the three-level factors are categorical, so we want to see each level an equal number of times. What we're going to find here is that these three-level factors are just going to have a few zeros.

One of the other big things is that we're going to have almost as many factors as we do runs. That's where the idea of saturated in the title comes from: saturated effectively means I have just as many factors as I do runs. That's also why this is main effects screening: with so many factors, if you start considering quadratics and interactions, that's an awful lot of terms. Because we're looking at so many factors, we're just hoping we can detect the significant main effects.

If you think about other designs with this mix of three and two levels, what might immediately come to mind is some of the classical Taguchi designs, the L18 or the L36. Those of you who are familiar with JMP may know that you can create a Definitive Screening Design with added two-level factors. I would almost look at the designs here as an extension of those, but with a lot more two-level factors than you might want in a standard DSD. This is also an area that has picked up steam a lot in the past few years. You see some references here to other authors thinking about this same problem, including a paper that I'll come back to at the end, that last Jones, Lekivetz, and Nachtsheim paper, which is related to this work as well.

Hopefully we're all on the same page now if we go back to that title. Saturated just means we have a lot of factors relative to the run size. Mixed level means we have some two-level and some three-level factors; the two-level factors could be continuous as well, where we're just not interested in their quadratic effects. Main effects design says our most important goal is finding the significant main effects. And even number of runs: if you recall the outline, all the designs we're looking at have an even number of runs. Now, when we think about building these designs, in these preliminaries I should really talk about the building blocks.
These designs are built from other matrices, other designs in the literature. If you're familiar with Definitive Screening Designs, there's this idea of a conference matrix. A conference matrix is just an M by M matrix that we use to construct Definitive Screening Designs. One of the nice things about conference matrices is that, in general, they exist for even numbers of rows and columns; there's a conference matrix for every multiple of two from 2 to 30, except for 22. In the cases where a conference matrix does not exist, or if you have an odd order, you can use something called a pseudo-conference matrix, which tries to look as close as it can to a conference matrix.

What property does a conference matrix have? If I take C transpose C, or conversely C C transpose, I get M minus 1 times the identity matrix. What that really means is that the columns of the matrix C are orthogonal: if I take the cross-product of any two columns, I get zero. But notice the M minus 1, because another feature of the conference matrix is that each row and column has exactly one zero.

If you have JMP 17.2 (I'll admit this JSL first shows up in JMP 17.2), there is a Conference Matrix command. Here I can just ask for a 6x6 conference matrix. Again, what's the special property? If I take C times C transpose, or conversely, I get 6 minus 1, which is 5, times the identity matrix; each of those columns is orthogonal. Starting in JMP 17.2, you can create your own conference matrix just by giving it the order: here I wanted a six-run conference matrix, so I put that in. Let's try it with eight, and you'll see an 8 by 8 conference matrix. That's one of the building blocks we need.

Another building block is a similar kind of structure called a Hadamard matrix. The difference is that a conference matrix has values of negative one, zero, and one, while a Hadamard matrix has only plus and minus one. A Hadamard matrix exists for most multiples of four. When you hear Hadamard matrix, another thing you'll often hear about is an orthogonal array. A Hadamard matrix has that same kind of property: if I take it times its transpose, I get N times the identity matrix. Recall that for the conference matrix it was M minus 1 because of the one zero in each row and column; here it's N times the identity matrix. The idea is that any pair of columns is orthogonal. Similar to the conference matrix, we have a special command in JSL for constructing a Hadamard matrix. Let's take a look here.
If you notice, Hadamard(8) gives me an 8-run Hadamard matrix, and if we take that Hadamard matrix times its transpose, we get eight times the identity matrix. Again, that means any pair of columns is orthogonal, which you can actually already see: just pretend this first column is an intercept, and you see all of these columns are balanced, so I get that orthogonality.

We're almost there with our building blocks. The last piece we need is the Kronecker product. Throughout, we may not really need to think of it in terms of a Kronecker product, but often when we create these designs, that's the way we like to think of them. The Kronecker product is denoted by a symbol that looks like a multiplication sign with a circle around it; in JMP, another name for the Kronecker product is the direct product. It just happens to be a convenient way to construct designs. All the Kronecker (direct) product is: if I take a matrix A and the Kronecker product with B, I take each element of A and apply it to the entire matrix B, over and over again.

Where this comes in handy, let's see where you may have seen something like this before. Again, we have Direct Product as a JSL command. Say I start with that 6x6 conference matrix, and my matrix A is a 2 by 1, two rows with a plus one and a minus one, and I take the direct product with C. What do I get? Effectively, if I added a center run, this would be a 13-run DSD in six factors. With the plus one in the Kronecker product I get the conference matrix C on top, and with the minus one I get negative C below. The Kronecker product is just a convenient way to think about things like what we might call a foldover.

With those preliminaries done, we can start talking about the different constructions. The first method is, I'll say, the nicest of all of them. This is where our run size is a multiple of eight. What are we doing here? We're going to start with a conference matrix of order M equals 4k. Again, remember, conference matrices tend to exist as long as the order is even; for the Hadamard matrix, we're looking for multiples of four. I'm going to take a conference matrix of order M and a Hadamard matrix of the same order, assuming both exist. What I'm doing is folding over; you can express this as a Kronecker product, but I find it more convenient just to write it this way. The first part looks like a Definitive Screening Design: I'm taking a conference matrix and folding it over. Then on the other side, I'm replicating the Hadamard matrix, taking a copy of it. What do we get with this design?
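Before looking at the columns, here is the direct-product (Kronecker) step just described, sketched in Python rather than JSL. The 6x6 conference matrix is written out by hand from the Paley construction; it may not be the exact matrix JSL's Conference Matrix(6) returns, but it has the same defining property.

```python
# Sketch of the fold-over via a Kronecker product with a 6x6 conference matrix.
import numpy as np

C = np.array([[ 0,  1,  1,  1,  1,  1],
              [ 1,  0,  1, -1, -1,  1],
              [ 1,  1,  0,  1, -1, -1],
              [ 1, -1,  1,  0,  1, -1],
              [ 1, -1, -1,  1,  0,  1],
              [ 1,  1, -1, -1,  1,  0]])

print(C.T @ C)                                 # (6 - 1) * identity: orthogonal columns

fold = np.kron(np.array([[1], [-1]]), C)       # C stacked on top of -C
dsd = np.vstack((fold, np.zeros((1, 6), dtype=int)))  # add a center run
print(dsd.shape)                               # (13, 6): a 13-run DSD in six factors
```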
Well,  these  first  M  columns  that  are formed  from  the  conference  matrix, that's  going  to  give  us M  three-level  factors. The  remaining  M  minus  1  are  all  going  to  be  two-level  factors. This  C  part  is  going  to  be   for  three-level, this  H  part  for  two-level. What  do  we  end  up  with? Well,  what  we  did, we  basically  doubled  these  here. We're  going  to  have  two  M-runs and  two  M  minus  1  factors. Let's  take  a  look  at  what this  might  look  like. Let's  see  an  example  here. Let's  take  my  C. I'm  going  to  just  create an  8  by  8  conference  matrix. First,  let's  construct these  three-level  columns. I'm  going  to  take  that  direct  product. Again,  in  this  case,  I  want  that  foldover, I  want  C  and  minus  C. Let's  take  a  look  at  what  C  looks  like. Again,  this  is  just that  foldover  structure  on  that. One  thing  here,  you  noticed  I  said  the  remaining  M-1  column for  the  two-level  factors. The  reason  for  that... Let's  take  a  look  at  the  Hadamard  matrix  of  Order  8. If  I'm  going  to  replicate  this, if  I'm  going  to  copy  this, well,  this  first  column  here   is  going  to  be  for  the  intercept. I  don't  want  to  put  that as  one  of  my  design  factors if  it  never  changes, if  it's  constant   throughout  the  entire  thing. If  you  notice  here,  I'm  going to  just  drop  the  intercept. Now  I  have  an  8  by  7  design. If  you  notice,  I'm  going  to  use the  direct  product  again  here. But  instead  of  with  the  conference  matrix where  I  was  using  one  and  minus  one, I  just  want  to  make  a  copy of  that  Hadamard  matrix  H  without  the  intercept. Let's  give  that  a  look. We  can  take  a  look  here  on  this  matrix where  you  see  that  the  one,  one,  one. You  can  actually  see  where it  gets  just  copy  it  again. It's  just  the  same  matrix stacked  on  top  of  itself. If  I  concatenate  all  of  those  together, you  can  see  I  have  a  16  by  15. I  can  actually  just  create  that  data  table. I  have  this  design  now, a  16-run  design  with  15  factors. The  first  eight  of  those  are  three  level and  the  remaining  seven  are  at  two  levels. Let's  just  take  a  look. Let's  go  to  design  diagnostics and  let's  see  what  this  looks like  and  evaluate  design. You  can  see  I  just  created  a  main  effects. This  might  be  hard  to  see, so  this  looks  pretty  messy, but  you  can  already  see  there's a  special  structure  with  these  designs. One  thing  I  want  to  point  out, let's  get  rid  of  the  alias  terms. Let's  just  look  at  the  correlations here  with  these  main  effects. You  can  see  in  this  case, actually,  all  my  main  effects are  orthogonal  to  each  other. One  thing  to  point  out  here, because  I  was  using  that  Hadamard  matrix as  a  building  block, those  two-level  designs,  we  have all  that  nice  orthogonality  there. You'll  notice  this  fractional  increase  in  confidence  interval  length is  a  little  bit  higher. Why  is  that? Well,  that's  because we  have  these  three-level  factors. Those  first  factors  are  at  three  levels. What  that  means  is  that if  there  is  a  quadratic  effect, now  this  is  giving  me  some hope  of  detecting  that. Again,  don't  forget,  we're  already at  16  runs  and  15  main  effects. 
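Putting the whole Method 1 example together, here is a numpy sketch of the 16-run, 15-factor design just described. The talk builds it with JSL's Conference Matrix, Hadamard, and Direct Product commands; the paley_conference helper below is my own stand-in (Paley construction, q an odd prime), so the conference matrix may differ from JMP's, but the construction and the orthogonality check are the same.

```python
# Method 1 sketch: fold over an 8x8 conference matrix, replicate an 8x8 Hadamard
# matrix with its intercept column dropped, and check orthogonality.
import numpy as np
from scipy.linalg import hadamard

def paley_conference(q):
    """Order q+1 conference matrix from the Paley construction (q an odd prime)."""
    residues = {(i * i) % q for i in range(1, q)}
    chi = [0] + [1 if r in residues else -1 for r in range(1, q)]
    Q = np.array([[chi[(j - i) % q] for j in range(q)] for i in range(q)])  # Jacobsthal core
    border = 1 if q % 4 == 1 else -1
    top = np.hstack((np.zeros((1, 1), dtype=int), np.ones((1, q), dtype=int)))
    bottom = np.hstack((border * np.ones((q, 1), dtype=int), Q))
    return np.vstack((top, bottom))

M = 8
C = paley_conference(7)            # 8 x 8 conference matrix
H = hadamard(M)[:, 1:]             # 8 x 7: drop the intercept column

three_level = np.kron(np.array([[1], [-1]]), C)   # C on top, -C below
two_level = np.kron(np.array([[1], [1]]), H)      # the same H twice
D = np.hstack((three_level, two_level))           # 16 runs, 15 factors

# With the intercept added, all columns are mutually orthogonal.
X = np.hstack((np.ones((2 * M, 1), dtype=int), D))
print(np.count_nonzero(X.T @ X - np.diag(np.diag(X.T @ X))))   # 0 off-diagonal terms
```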
If  we  start  thinking   about  quadratic  effects  and  interactions, we  really  have  to  hope  for  those  large  effects when  it  comes  to  doing  any  model  selection. If  you  think  of  a  traditional  design where  we  only  had  two  levels, in  that  case,  we  would  have  no  hope of  being  able  to  detect any  quadratic  effect. That's  that  first  construction. In  some  sense,  that's  where everything  works  out  nicely. I  have  a  conference  matrix  and  I had  a  run  matrix  available  to  me. That  gives  me  a  run  size that's  a  multiple  of  eight. Now  what  happens   if  I  don't  have  one  of  those? Let's  say  my  design  is  going to  be  a  multiple  of  four. The  run  size  is  going to  be  a  multiple  of  four. Let's  assume  now  that  I  have  a  conference  matrix  available  to  me, but  maybe  I  don't  have  a  Hadamard  matrix. If  you  recall  before, where  for  my  Hadamard,   I  need  it  to  be  a  multiple  of  four, the  original  run  size,  so  that  when  I doubled  it,  it  was  a  multiple  of  eight. Instead,   maybe  if  I  don't  have  a  Hadamard  matrix, I  could  use  something  like a  D-optimal  main  effects  plan. This  would  just  be if  I  went  into  custom  design. Let's  say  if  I  wanted  for  six  runs, I  would  go  into  custom  design and  say  I  have  five  main  effects,   five  factors, and  I  want  six  runs  for  a  main  effects. This  construction  actually  looks a  lot  like  it  did  in  method  1. The  only  real  difference  is  instead of  that  Hadamard  matrix, now  I'm  going  to  be  using  this D-optimal  main  effects  plan. But  it  turns  out  to  be the  same  thing  here. I'm  going  to  have  those  first  M  columns for  the  three-level  factors, the  remaining  M  minus 1  for  the  two-level. I'm  really  at  that  idea  of  saturation because  by  the  time  I  factor in  the  intercept, I'm  at  two  M-runs  and  2 M  minus  1  factors. Let's  take  a  look   at  how  this  one  might  look. In  this  case, I'm  going  to  go   with  a  10-run  conference  matrix, which  again,   because  that's  a  multiple  of  two, I  can  create  a  conference  matrix. The  one  thing  to  pay  attention  to  here,  though, because  we're  at  10- runs, there's  not  going  to  be  a  Hadamard  matrix  of  order  10  available  to  us. Let's  construct  though  first  though, let's  just  fold  over  that  CC. Again,  that's  just that  DSD-like  structure. Now  in  this  case, because  I  don't  have  that  command, and  so  this  is  where  you could  decide  how  to  do  this. You  may  actually  even  have  your  own design  that  you  want  to  use  in  this  case. I'll  say  in  this  case, I'm  showing  a  D-optimal  main  effects. You  may  want  to  use  something  l ike  a  Bayesian  D-optimal where  two-factor  interactions  are  if  possible. But  so  in  this  case,  all  that  I've  done is  I've  taken  the  D-optimal, 10-run  design  for  nine  factors, and  I've  just  happened to  include  the  intercept  here. If  you  go  into  Custom  Design,  this  is what  you  get  from  the  model  matrix. The  model  matrix  actually   includes  the  intercept  by  default. That's  in  fact  where  I  had  this  from  here. Again,  I'm  going  to  drop. Let's  drop  that  intercept  column. Again,  here  I'm  just  replicating the  exact  same  thing  again. I've  just  folded  over,  not  folded  over. I've  just  replicated   the  exact  same  thing  twice. 
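The assembly step for Method 2 is the same fold-and-replicate pattern, just with a D-optimal main-effects plan in place of the Hadamard matrix. As a sketch, assuming you already have the conference matrix C and the two-level plan T (for example, exported from JMP's Conference Matrix command and Custom Design), the combining step looks like this; the names C10 and T10x9 in the comment are placeholders.

```python
# Fold-and-replicate assembly used in Methods 1 and 2.
import numpy as np

def mixed_level_design(C, T):
    """Stack [C; -C] beside [T; T].

    C : (m, m) conference (or pseudo-conference) matrix -> m three-level factors
    T : (m, m-1) +/-1 main-effects plan (intercept dropped) -> m-1 two-level factors
    Returns the (2m, 2m-1) saturated mixed-level design.
    """
    three_level = np.kron(np.array([[1], [-1]]), C)   # C on top, -C below
    two_level = np.kron(np.array([[1], [1]]), T)      # the same T twice
    return np.hstack((three_level, two_level))

# e.g. D = mixed_level_design(C10, T10x9) would give the 20-run, 19-factor design
# described above, where C10 is the 10x10 conference matrix and T10x9 is the
# 10-run, 9-factor D-optimal main-effects plan.
```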
It'd  be  hard  to  see  as  to  where it  was  replicated  here. But  again,  all  that  we've done  is  just  made  a  copy. I  made  a  copy  of  that  doptN10m9  twice, concatenate  those  together. Let's  create  the  data  table  again. Again,  20- runs  and  19  factors. Again,  keep  in  mind  that  my  first  10  factors are  all  at  three  levels,  the  remaining  at  two. Let's  take  a  look at  evaluate  design  again  here. I'm  going  to  put  all  of  these  factors  in. In  this  case,  before, let's  go  directly  to  remove  those  alias  terms. Let's  take  a  look  at  the  color  map. This  still  looks  like a  pretty  good  color  map  to  me. What  do  we  notice  here versus  the  last  one? One  of  the  biggest  differences  is, well,  here  our  three  levels  are  still going  to  be  orthogonal, and  that's  because  we  were using  a  conference  matrix. The  three- levels   and  the  two- levels  are  orthogonal. In  particular,  my  three-level  factors are  orthogonal  to  the  two-level  factors. But  because  I  was  using  that  10-run  design, it  turns  out  we  can't   get  perfect  orthogonality for  the  10-run  design  with  nine  factors. If  we  look  among  those  two-level  factors, we  have  some  small  correlations  there. One  thing  to  point  out, there  does  exist  a  Hadamard  matrix  of  order  20. In  some  sense,  we  are  taking  a  little bit  of  a  hit  on  those  two-level  factors. But  if  you  take  a  look, the  cost  of  using  this  type  of  design versus,  let's  say, everything  at  two  levels, we  have  about  a  5%  increase in  the  confidence  interval  length. But  the  nice  thing  here  now  is  that  now we  have  those  factors  at  three  levels. If  we  really  are  worried   about  quadratic  effects appearing  in  these  first  ones, now  we  have  a  chance  of  detecting  those at  a  small  price  to  that  estimation efficiency  for  those  main  effects. That  was  great  in  the  case that  we  still  had  a  conference  matrix   that  existed,  but  no  Hadamard. This  last  method, this  is  where  we  don't really  fit  into  either  of  those  cases. In  this  case  now, we're  talking  about  a  run  size  is  going  to  be  a  multiple  of  two. In  this  case,  what  we  have  is  we  don't have  a  conference  matrix  available  to  us. This  is  where  we're  going  to  use that  pseudo- conference  matrix. I'll  say  if  you're   really  interested  in  that, I'll  go  back  to  these  preliminary. In  the... Let's  see. Actually,  it  was  the  original  DSD  design. Sorry,  this  2011 Definitive  Screening  Design. If  you  also  look,  I  have  it in  the  list  of  references  at  the  end. Let's  see, those  who  are  particularly  interested. This  original  class  of  three-level  design for  Definitive  Screening in  the  presence  of  second- order  effects. This  paper  was  written  before  they   were  aware  of  the  existence of  this  idea  of  conference  matrices. In  that  paper,  they  talk  about  a  general   purpose  algorithm for  creating  something that  looks  like  a  conference  matrix. You  set  zeros  along  the  diagonal, and  then  the  rest  of  the  values  are  going to  be  plus  and  minus  one, where  you're  trying  to  make a  main  effects  D-optimal  design. These  pseudo- conference  matrices,   as  they  were, they  look  like  a  conference  matrix. You  can  use  that  algorithm  when the  conference  matrix  doesn't  exist. 
Really,  what  it's  trying  to  do  is to  drive  it  to  look  as  close  as  it  can. If  you  remember  a  conference  matrix, if  you  take  C  transpose, you  get  zeros  and  the  off  diagonal for  perfect  orthogonality, it's  going  to  be  trying  to  drive  it  to  look  like  that where  I  can't  make  it  perfectly. Similar  to  the  case  we  had  for  method  2, our  T  is  going  to  be a  D-optimal  main  effects  plan. What  do  we  end  up  with  here? Again,  our  first  M  columns  are   still  going  to  be  three-level  factors, the  remaining  M minus 1   for  the  two- level. Again,  we're  still  at  that  case of  saturation by  the  time  we  factor  in  the  intercept. The  cost  here  is  that  we're  not  going to  get  the  nice  orthogonality that  we  may  have  had  in  method  1  and  2. Let's  take  a  look  at  how this  one  works  here. My  C  in  this  case, and  I'll  do  that  CC. What  I  started  with  here was  a  pseudo- conference  matrix  of  order  nine. Nine  is  not  a  multiple  of  two. I  have  to  do  something  that  looked like  a  pseudo- conference  matrix. Let's  just  take  a  look  here. Let's  actually  take  a  look  and  see  what  that  C  transport  C  looks  like. You  notice  I  can't  get   that  perfect  orthogonality, but  instead  I  have  eight  along  the  diagonal and  then  these  plus and  minus  ones  on  the  off  diagonals. We're  close. I  have  that  CC  that  was  just  doing the  same  thing,  folding  that  matrix  over. Likewise,  so  T, that's  going  to  be  of  the  same  order. I  took  the  nine-run, eight-factor,  D-optimal  design. In  this  case,  I've  actually already  removed  the  intercept. We're  just  going  to  replicate that  design  again. I  have  this  18  by  8. I'm  going  to  combine   those  two  pieces  together. Let's  take  a  look  at  the  table. This  is  a  particularly  difficult design  to  generate  in  general. In  this  case,  I  don't  have  a  nice  number. My  run  size  is  18. Now  I  have  these  first  nine  factors at  three  levels and  the  remaining  eight  at  two  levels. Not  surprisingly,  there  is  a  cost  of  this. Again,  let's  just  remove  those  alias  terms  again and  take  a  look  at  our  color  map. You  see  now,  my  two  and  three  levels are  actually  orthogonal  to  each  other, but  the  three  levels  have a  small  correlation  among  them. That  was  where,  remember,  if  you  recall, we  saw  that  plus  and  minus one  and  that  off  diagonal. Likewise  for  the  two  levels. We  actually  have  one   quite  high  correlation  here, but  in  general,  so  about  0.1. In  this  case,  we  don't  even  have an  orthogonal  design  to  compare  it  to. We  actually  have  a  couple  of  factors that  we  may  be  worried  about a  bit, have  a  little  bit of  a  larger  fractional  increase. Those  three  levels  will  say,  well,  we can't  get  those  to  be  perfect  as  it  is. This  fractional  increase  is  compared   to  the  hypothetical  orthogonal  design, the  orthogonal  array, which  doesn't  even  exist. We're  paying  a  small  price,  but  we  still have  some  generally  nice  properties. The  three  levels  are  orthogonal  to  the  two  levels, and  we've  minimized that  correlation  in  general. In  some  sense,  those  are those  three  methods. 
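As a rough illustration of the pseudo-conference-matrix idea just described, here is a crude random search: zeros fixed on the diagonal, plus and minus ones elsewhere, keeping the candidate with the best main-effects D-criterion. The published approach uses a proper coordinate-exchange optimization, so this sketch only shows the structure being aimed for, not the actual algorithm.

```python
# Crude stand-in for the pseudo-conference-matrix search (illustration only).
import numpy as np

def pseudo_conference(m, n_tries=20000, seed=1):
    rng = np.random.default_rng(seed)
    best, best_logdet = None, -np.inf
    for _ in range(n_tries):
        C = rng.choice([-1, 1], size=(m, m))
        np.fill_diagonal(C, 0)                      # one zero per row and column
        sign, logdet = np.linalg.slogdet(C.T @ C)   # D-optimality for main effects
        if sign > 0 and logdet > best_logdet:
            best, best_logdet = C, logdet
    return best

C9 = pseudo_conference(9)
offdiag = C9.T @ C9 - np.diag(np.diag(C9.T @ C9))
print(np.abs(offdiag).max())   # ideally small; perfect orthogonality is impossible here
```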
The  nice  thing  is  depending   on  the  run  size  that  you  have and  depending  on  the  number  of  factors, this  methodology  or  this  class  of  designs really  gives  you  some  flexibility in  the  number  of  runs. Of  course,  if  you  can  afford   a  multiple  of  eight  or  a  multiple  of  four, the  properties  are  going   to  look  a  lot  nicer. But  I  mean,  when  runs  are  really  expensive, we  still  have  this  method  3, which  is  going  to  give you  a  reasonable  design. With  that,  I  just  want  to  give some  final  thoughts  here. I  have  a  link. If  you  take  a  look   at  this  journal  afterwards, I  have  a  link  to  a  presentation   from  JMP  Discovery  Europe from  Bradley  Jones  that  was on  orthogonal  mix  level  designs. The  designs  in  this  presentation, the  orthogonal  mix  level  designs, will  almost  look  very  similar to  what  was  presented  here. The  designs  presented  in  this  presentation are  for  when  you're  closer  to  saturation. I  would  say  these   orthogonal  mix-level  designs from  the  previous  discovery, and  we  actually  have  a  paper  on  that. That's  this  Jones  Lekivetz,  and  Nachtsheim that  was  in  JQT,  Journal  of  Quality  Technology. Those  designs  work  very  well when  you  have  about  half  the  number   of  factors  relative  to  the  run  size, which  sounds  a  lot  like  DSDs, but  it's  when  you  have more  two-level  factors. In  the  design  presented in  that  presentation, you  can  go  up  to   about  three-quarters  of  the  run  size. The  designs  presented  here are  really  when  you're  closer to  that  saturation, where  you  say,  well,  no, I  really  have  a  lot  of  factors that  I'm  interested  in, and  the  runs  are  really  expensive. The  designs  here  to  fill  that  last  gap when  you  want  to  get  close  to  saturation. I  showed  you  these  designs  look  nice, but  can  they  actually  do  anything when  it  comes  to  model  selection? I'll  say  here. Again,  we  have  run  a  lot of  simulations  on  these. This  isn't  doing  it  properly, this  is  just  for  a  single  realization. Let's  say  here my  factors,  so  call  one,  call  five, call  13  and  14, I've  chosen  it  as  being  significant. In  base  JMP,  you  might  want to  use  something  like  stepwise. Let's  take  a  look. Here  I've  done  a  reasonable  job. I  may  have  ended  up  with  one  extra  factor, but  in  some  sense,  I'd  still  be quite  happy  that  I've  been  able to  pick  up  those  effects  there. Let's  try  this  same  thing with  generalized  regression. If  we  take  a  look,  the  same  thing. Let's  see,  1,  5,  13,  and  14. If  we  go  back,  1,  5,  13,  and  14. In  this  case, both  of  you  have  JMP  Pro  with  generalized regression  or  Fit  stepwise, both  were  able  to  pick  up those  main  effects  that  we  had  in  there. Just  to  show  you  now, if  you  have  a  large  quadratic  effect, I'll  say  because  compared  to  DSDs   where  we  had  that  center  point  run, this  we  only  have  two  zeros. We  don't  even  have  a  third   for  a  center  point and  we  have  a  lot  more  factors. Detecting  quadratic  effects is  still  going  to  be  difficult. But  let's  just  pretend  we  have  a  large  one. Here  I  have  again, the  1,  5,  13,  and  14. But  now  I've  also  added  a  quite  large quadratic  effect  for  the  column  one. Let's  just  take  a  look  here. Let's  try  to  fit  stepwise  again. 
Now  I've  actually  just  added those  quadratics  in  the  model. Again,  I  have  one  extra  term, but  if  you  see,  even  in  this  case,  it  did actually  pick  up  that  quadratic  effect. Not  only  did  it  pick  up  the  correct  main  effects, it  did  detect  the  quadratic  effect  that  I  had. Let's  try  the  same  thing. Let's  see  the  model  launch. It  looks  like  I  already   have  it  shown  here. The  same  thing. You  see  the  1, 5,  13,  14, and  generalized  regression  was  also able  to  pick  up  that  quadratic  effect. I'll  say  I  would  not  expect  to  anticipate that  you  can  detect   that  many  quadratic  effects. But  even  if  you  look  at  your  residual plots  or  your  main  effects  plots, even  those  two  zeros   give  you  some  indication that  maybe  I  do  want  to  follow  up and  take  a  look  at  those  factors  a little  bit  deeper  for  quadratic  effects. Where  again, in  your  traditional  screening, if  you're  only  doing  things  at  two- level, you  wouldn't  have  the  chance  to  do  that. With  that,  again,  I  will  post  this journal  where  this  video  is  located, but  I'll  also  flash  up  these  references  at  this  time. With  that,  thank  you  for  taking  the  time to  watch  this  video, and  please  share  any  messages  you  have  in  the  community  below.
Do some political candidates use first-person (I, we) or second-person pronouns (you, you all) more often in their campaign tweets?

In this course exercise, students learn how to test distributions (Analyze > Distribution > Test Probabilities) using a grouping variable (BY). The data set comprises tweets (N = 1,107) from the early stages of the 2016 U.S. presidential primary season.

First-person pronouns focus on the speaker, or possibly the group to which the speaker belongs. Second-person pronouns speak directly to the audience, suggesting that the candidate might be seeking to establish a personal connection.

A chi-squared test of the relationship between political party (Democrat vs. Republican) and the use of first person (Present vs. Absent) is significant. The distributions are tested to see whether Democrats or Republicans are more likely to use first-person pronouns.

The test of the use of second-person pronouns occurs at the candidate level. In this data set, political candidates use the second person in 20% of their tweets. But who uses the second person more (or less) than the other candidates? In this part of the exercise, students compare each candidate's use of the second person against the group's 80-20 distribution. Tests are conducted on tweets from Hillary Clinton, Bernie Sanders, Marco Rubio, and Donald Trump. The results indicate that only some of the candidates used second-person pronouns more often than the group average.

Hi, I'm Robert McGee, an Associate Professor of Integrated Marketing and Communication at the University of Mississippi, also known as Ole Miss. What I have today is a demonstration of a teaching exercise I use with students. The title of the presentation is Communication Style and Political Campaigns: Promoting a Personal Connection with an Audience. The question is: do some presidential candidates use the first person or the second person more than others in their tweets on Twitter? This is an important question because we want to form a personal connection between a candidate and an audience, and one way they can do that is through the language in their social media.

The students manually coded tweets during one week of the presidential primary season in 2016, recording every tweet issued by all 17 presidential candidates at the time. What we're going to demonstrate today is how to test the probability of a distribution using the BY grouping variable.

After we recorded 1,107 tweets, the first thing to test is whether the use of first person varies by party. This is a typical chi-square test with two levels: political party, Democrat versus Republican, and first person either present or absent in the tweet. You can see the test of the relationship there; the likelihood ratio is significant. The graph shows that Democrats typically used the first person a little more often than Republicans, and it was a significant difference.

Now let's move on to the second person. You can do the same thing: look at the candidate and the presence or absence of the second person in the tweet. You'll see that it's also a significant relationship.
The likelihood ratio value is 83.7, and it is significant. Then you look at the graph, and you can see that some people obviously used the second person more than others, but which ones were really different from the others? You can look at the contingency table: reading across the rows, you'll see how often each candidate used the second person. Ben Carson used it in 4% of his tweets, Chris Christie in about 25 or 26% of his, and so on. We see Hillary Clinton used the second person in about 16 or 17% of her tweets during that week.

What we want to be able to do is test that specific probability, the probability of that distribution: is Hillary Clinton's distribution of 17% and 83% really different from the overall average of all the political candidates? If you look at the bottom of the contingency table, you'll see that the overall distribution was 80% and 20%. You can also find this information with Distribution: go to Analyze, then Distribution, put the variable in the Y box, and hit OK. You'll see the frequencies, and the distribution is 80 and 20; more precisely 19.9%, which I roughly put at 80% and 20%.

What we want to know is whether Hillary Clinton and the other candidates use the second person more or less than this average. We're not looking at a 50/50 test; we're looking at an 80 versus 20 test. To do this, we use the By box, or By field, to subdivide the distribution by candidate. We still have our dependent variable, the use of the second person, in the Y box, but we put the variable candidate in the By box, which produces an individual test for each candidate. When you look at the output, you'll get a result for each candidate: Ben Carson first, then Chris Christie, and so on. It tells us the same information we have in the contingency table, with a little graph.

But what we want to know is whether a candidate's distribution is different from the 80-20 distribution we have for all of the candidates overall. To do this, we look at the person we're interested in, in this case Hillary Clinton, and we see that the probability of her distribution is 83% and 17%. We go up to where it says second person, the name of the variable, click on the drop-down menu (the red triangle), and find the command Test Probabilities. When we click Test Probabilities, a new dialog box opens that lets us establish the benchmark we want to use. Rather than testing against 50/50, we're going to test against 80 and 20, so I type in 0.8 and 0.2 because that's what we're testing. I leave the setting at a two-tailed test; I don't know whether it's going to be higher or lower than 80/20 when I test these distributions, so I leave it as a two-tailed test.
But I put in my benchmark of 80% and 20%, which I got from the contingency table, or from the overall distribution of the use of second person. Then we click Done. Here's what we have. This is part of the results. You'll see that she had 96 tweets. Of those, 83% did not have the second person, 17% did have the second person, and we're testing it against the distribution of 80/20. The likelihood ratio or the chi-square value is 0.69 and the P value is not significant. Her use of the second person did not vary significantly from the overall group average of 80/20.

Let's try somebody else. We do the same thing. This time we'll do it for Bernie Sanders. He had 150 tweets that week. You'll see that he used the second person only about 5% of the time. We test that against the 80/20 distribution of the overall group of politicians, and we see that the chi-square is significant; it's 29.7 or 29.8, and the P value is less than 0.0001. So yes, his distribution, or his use of the second person, significantly varied, but in this case it was significantly less, only 5% compared to the overall average of 20%. It's significantly less for him. Let's try someone else. Marco Rubio was a presidential candidate in 2016, and he used the second person about 24% of the time. We test that again against the 80/20 percentage, and we see that his chi-square value for this test is 0.88, and it is not significantly different from the overall distribution of 20 and 80%. His use of the second person did not vary between his tweets and the overall average of all the candidates. We'll look at another one. Here's Donald Trump. He had 105 tweets during that week, and you see that he used the second person about 30% of the time, which means about 30% of the time he was saying you, or you all, or some form of that second person in his tweets. We want to test that against a distribution of 80 and 20%. The likelihood ratio is significant. The chi-square value is 6.4, almost 6.5, and the P value, or the significance level, is 0.01. The test shows, or suggests, that he used the second person more often than most of the candidates who were running during the primary season in January 2016.

This is a way that we can use to test each one of those rows. At the beginning of the 2016 primary season, we see that Hillary Clinton and Marco Rubio used the second person about as much as everybody else did in the electoral season. Bernie Sanders used the second person significantly less, and Donald Trump used the second person significantly more. This is a way to do a follow-up test on a chi-square when you need to test the distribution of individual rows. You can do this using the By box. You use this to subdivide. The option to test the probability of a distribution allows us to set a benchmark, comparison, or reference group to something other than 50/50, or whatever we might otherwise be looking at. In this case, we set it to 80/20. This is a way to do follow-up tests on a significant chi-square when you need to test the probability of a distribution.
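The By box is what repeats that test once per candidate. A rough pandas equivalent of that subdivision, with a hypothetical file and column names, is a groupby loop that runs the same 80/20 goodness-of-fit test for every candidate:

```python
# Sketch of the "By" idea in pandas: one 80/20 goodness-of-fit test per candidate.
# The file name and column names are hypothetical stand-ins for the coded tweets.
import pandas as pd
from scipy.stats import chisquare

tweets = pd.read_csv("tweets_2016.csv")   # columns: candidate, second_person ("Present"/"Absent")

for candidate, grp in tweets.groupby("candidate"):
    n = len(grp)
    observed = [(grp["second_person"] == "Absent").sum(),
                (grp["second_person"] == "Present").sum()]
    stat, p = chisquare(observed, f_exp=[0.8 * n, 0.2 * n])
    print(f"{candidate}: n={n}, % present={observed[1] / n:.1%}, "
          f"chi-square={stat:.2f}, p={p:.4f}")
```

Each pass through the loop corresponds to one of the per-candidate reports produced by the By variable in JMP.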
I'm  Robert  McGee  at  the  University of  Mississippi, and  if  you  have  any  questions,  there's my  email  address,  feel  free  to  reach  out. Thank  you  very  much.
Disease often shows a predilection for specific site locations. To demonstrate the body regions involved, researchers may use infographics or tables. While tables may be more descriptive and universal to any statistical software, heat maps rapidly communicate disease involvement across a cohort. At our institution, software to portray data within the confines of a human body is not readily available. Additionally, we needed the human body to be segmented into specific regions. While a graphic editor can be used, it is labor intensive, limits generalizability, and cannot be repeated with different data sets. To create a scalable solution, we used the Custom Map Creator in JMP, an add-in to Graph Builder that allows users to create custom shapes. With this program, we outlined a human body, created anatomic regions, and colored each region based on the percentage of occurrence. The selected color range is scaled to directly correlate with the data. Each anatomical subunit is then filled and shaded by the data in the respective column. Heat maps provide easy interpretation and clearly portray the most involved areas for rapid communication to readers.

Today, I want to give a little presentation about using the JMP Custom Map Creator and how we used it to develop a human heat map. I'd like to give a shout out to my mentor, Dr. Benjamin Caffender, who helped push this project along. A little bit about the Custom Map Creator: it allows you to create custom shapes that can be displayed in Graph Builder. What this allows you to do is take statistical data, and percentages in particular, and present them in a color-coordinated and color-referenced presentation. The way this is usually used, since JMP comes with some built-in maps of the United States and different places in the world, is to say, in this state, this percentage of whatever you are measuring is occurring. It gives you that heat map approach, but it also allows you to create different objects as well.

For us, this was a dermatology project, so we had cutaneous disease, and we wanted to show lesion involvement, which parts of the body were affected by the disease. So we had the face at 90%, and we had the scalp at just under 20%. We wanted to show this dynamically. We ran into the question of what's the best way to go about it. We did some digging, and we found this Custom Map Creator to do that. What we did is we took an outline of an image, and you can see all these dots. We went in and we traced this image and created our different body areas, and JMP will set those into specific shapes where you can input data that will appear in those areas. You can take whatever picture you want. This is an example of one that we used. Then that data will get incorporated into those areas. So you can see over on the left of our image, the face has significant involvement with the darker red, and that's correlated with the 90% involvement, versus lighter areas such as the trunk or upper extremities. The Custom Map Creator lets you create those areas and then put those inputs, those percentages, in very easily.
Rather than me talking about it, I'd like to show you a little demo of how it works, and we will get going from there. I already have JMP open. Once you install this add-in, it's going to be under Add-Ins, Map Shapes, and then Custom Map Creator. We pick Custom Map Creator, and we get these different data tables that pop up, as well as our map creator itself. What you do is you can paste an image in. I have a file over here that I will just drag and drop in. You can copy and paste, however it may be. You can adjust and zoom and so on to get it exactly how you want. Give it a name. Now you start drawing, you start tracing. I'll just quickly click through here. That's one shape. We hit Next Shape. You can change the names of your shapes to whatever they may be. When we were using it, we had different body areas. If you look over on the right at the data tables, up here in this table, this example XY table, we're seeing the very specific data points or dots that are being graphed. Over here, we're having these shape IDs be placed. You can see up here, I'm working on shape five. You hit Next Shape. As you put that in, you get these next shapes. These are linked between these two data tables.

Now, let's say I've finished my image and I have everything graphed out. You can hit Finish, and that will just save it, or you can also finish with random data, which I think is an easy way to show it, but also just to actually have the data show up in the right place. Hit Open. Now we have... this is our graph. We can see this data it gave us over here. Shape one, I'll just drop this down to one; over at shape five, we'll put that at 100. You can see over here, shape five, at 100, is our red and has the most involvement. Down here, shape one, and I guess a couple on the lower end, are the darker blue. But you're able to get that stratification. So what you can do is you can either have your own data table, or you can just put in your own data into these shapes as well, and it will adjust in real time. What's really nice about this add-in is you can go right here and you can change these gradients. Right now we have a red to blue. You can change it from this yellow to red. We can have that higher volume with the darker colors and less involvement with the lighter colors really easily.

The two big things that are important to know as you're doing this: if this is your own data table, this shape name column has the Map Role column property, and we see here this map name data table is connected to our example Name.jmp, which is this file over here. It is super important that those have to be linked. If you hit generate random data, it does it for you. If you have your own data table, you have to manually set that. But what you'll notice is if you manually create your own data table, here this shape 1, 2, 3, 4, 5 matches up perfectly with shape 1, 2, 3, 4, 5 here. Those names have to be exact and those roles have to be exact.
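For readers who want to see the underlying idea outside of JMP, here is a small matplotlib sketch of the same pattern: a set of named regions, one value per region, and a color gradient scaled directly to the data. The polygon coordinates, region names, and percentages are all invented for illustration; a real body outline would have many more vertices per region.

```python
# Rough matplotlib analogue of the Custom Map Creator idea: color hand-drawn
# regions by a percentage. All coordinates and values below are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection

regions = {                      # region name -> list of (x, y) vertices
    "face":  [(0.4, 0.8), (0.6, 0.8), (0.6, 1.0), (0.4, 1.0)],
    "trunk": [(0.35, 0.4), (0.65, 0.4), (0.65, 0.78), (0.35, 0.78)],
    "legs":  [(0.4, 0.0), (0.6, 0.0), (0.6, 0.38), (0.4, 0.38)],
}
involvement = {"face": 90, "trunk": 25, "legs": 10}   # percent involvement

patches = [Polygon(regions[name], closed=True) for name in regions]
values = np.array([involvement[name] for name in regions])

coll = PatchCollection(patches, cmap="YlOrRd")
coll.set_array(values)            # color each region by its value
coll.set_clim(0, 100)             # scale the gradient directly to 0-100%

fig, ax = plt.subplots()
ax.add_collection(coll)
ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.set_aspect("equal")
fig.colorbar(coll, ax=ax, label="% involvement")
plt.show()
```

Just like with the add-in's Name and XY tables, the region names carrying the values have to line up exactly with the shapes they are supposed to fill.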
If you do that, then you can have this data pop up really easily. You can also get these data tables through Graph Builder as well. But yeah, that's just a quick overview of how to use this add-in. It's very versatile. Like I said, we were able to use it for our project. I hope this gave you a brief bit of guidance on that. I've uploaded some instructions on how to use this as well, which you can also find. I hope you have a great day and enjoy the rest of the presentations you get to watch.
X-ray photoelectron spectroscopy (XPS) analyzes the surface chemistry of materials. It is also known as electron spectroscopy for chemical analysis (ESCA) and is commonly used to measure elemental composition/stoichiometry of thin film coatings in different industries. We have applied this technique to develop Atomic Layer Deposition (ALD) AlOx coating (of < 50 nm thick) processes by testing deposited film composition to identify the O/Al ratio.   Our pharmaceutical program customers questioned whether XPS is an appropriate metrology technique to detect process variation in the coating composition. This presentation demonstrates the adequacy of XPS by using Gauge R&R in JMP 17. We designed our testing experiments using measurement systems analysis (MSA) designs platform, and a fast replicate crossed model was used with six sample coupons on two different kinds of substrates (silicon coupons and active pharmaceutical ingredient [API] pellets). Each sample was split into four parts and all 24 (6 X 4) samples were measured independently by two different vendors blindly.   The data distribution was reviewed using a variety of methods: X-bar and R control chart, performed repeatability, reproducibility, part-to-part variation testing, calculated Gauge R&R (P/TV), P/PV, P/T in MSA Gauge R&R platform. Also, the evaluating the measurement process (EMP) platform was used to determine interclass correlation (ICC) and to identify if any interaction exists with either substrate type or vendor. Both MSA platforms confirmed that part variation is significantly higher than precision level, hence XPS is adequate to detect the variation in the process     Hello .   Good  morning ,  everyone . Good  evening ,  everyone . I 'm  Sukti  Chatterjee . Before  starting  my  presentation , I  would  like  to  introduce  myself   with  few  words . I 'm  Sukti  Chatterjee   from  CTO  team  of  Applied  Materials . It  is  advanced  technology  team , and  our  team  goal  is  to  develop  product adjacent  to  the  semi  industry or  outside  the  semi  industry . For  example , we  are  working  for  aerospace  industry , pharmaceutical  industry , or  industrial  coating . This  example ,  present  example , we  are  taking from  the  pharmaceutical  industry . My  topic  of  the  presentation  today. Gauge  R&R  of  X -ray  photoelectron   spectroscopy  to  monitor  a  coating  process . Agenda  of  my  talk  today. Fi,rst,  we  will  talk  about  the  background and  problem  statement , then  we  will  discuss  about  the  operation definition  and  data  collection  plan . Next ,  MSA  component  analysis . Finally  we 'll  talk  about  the  plan for  MSA  component  improvement . Let 's  start  with  the  background . In  the  pharmaceutical  industry , in  therapeutic  windows ,  therapeutic  areas , there  are ,  for  example ,  antibiotic  drug , alcohol  addiction  or  cancer  patient , they  need  everyday  injection   because  drug  level  in  the  blood is  certainly  increasing ,   spiking  in  the  blood , and  then  within  short  time ,   within  a  few  hours , it  is  going  beyond   the  therapeutic  window  limit . That 's  why  they  need   everyday  injection  and  it  is  painful . It  causes  some  side  effect , that 's  why  patient  skipping the  medication  or  stop  the  medications . To  solve  this  problem , our  customer  needs  some  approach to  tailor  the  release  of  drug . 
Our  team  developed  a  barrier  layer , aluminum  oxide  barrier  layer , that  forms  a  shell   around  the  pharmaceutical  particles . Properties  of  this  barrier  layer can  control  the  release of  the  drug  in  the  blood . Even  it  is  possible  like  that , it  can  release  few  weeks   instead  of  few  hours . Here ,  we  will  talk  about the  composition  analysis , and  what  is  the  noise  analysis of  this  composition  measurement ? That  we  will  talk  here . Our  problem  is  measurement of  AlOx  coating  composition . Our  spec  limit ,   customer  spec  limit ,  is  O /Al  ratio in  the  aluminum  oxide  film  is  1 .2 -2 .3 . Our  objective  here  to  determine   the  XPS  method  if  it  is  adequate to  differentiate  AlOx  process  variation . We  will  determine  here ,   gauge  R&R  measurement  error of  XPS  for  AlOx  composition  analysis . X -ray  photoelectron  spectroscopy can  measure  quantitatively atomic  percentage  of  composition . It  can  measure  aluminum   and  oxygen  percentage . XPS  actually  measure  the  kinetic  energy of  photoelectrons  emitted   from  the  elements and  it  counts  the  electrons . Whenever  it  is  counting  the  electrons , it  can  count  the  presence  of  elements and  also  it  counts   the  element  which  is  bond  to  it . That 's  why  we  can  get  the  information about  aluminum  and  oxygen in  the  aluminum  oxide  film . Most  of  the  source  of  error  for  XPS , it  can  add  it  in  gauge  R&R . It  can  reproducibility ,  it 's  coming from  the  calibration  electron  count . It  can  add  repeatability   and  reproducibility  error . Analysis  can  add  reproducibility  error . We  will  talk  more  details this  one  in  the  next  slide . In  our  operation  definition , we 'll  talk  about  the  different  steps of  the  XPS  measurement   and  how  it  can  introduce the  error  in  the  measurement  error  GRR , gauge  R&R  error . Our  objective  measure  aluminum  oxide coating  composition , and  to  measure  it  in  XPS , first  we  need  to  do  baseline  correction . It  is  automatic , and  then  we  need  to  go  to  the  calibration . In  calibration ,  normally  applied  materials have  calibration  sample , especially  whenever  we  have  some developed  technologies like  aluminum  oxide . But  in  our  cases ,   we  are  coating  pharma  particles and  our  process  window   is  totally  different from  our  applied  materials  core  technology process  window  for  aluminum  oxide , because   coating  need  to  be  compatible   with  the  pharma  particles . We  are  coating  this  particle , at  the  same  time , we  are  coating  also  silicon  wafer and  API  pallet because  XPS  cannot  measure  particles . It  needs  some  planar  substrate . That 's  why  we  are  depositing on  silicon  wafer  and  API  pallet . Since  we  don 't  have  calibration  sample , we  are  using  the  second  option for  calibration   like  carbon  peak  calibration . Left -hand  side  picture ,   you  can  see  carbon  peak  calibration and  it  is  manually  need  to  do  it, and  that 's  why  it  impact   on  the  reproducibility . Then  after  calibration ,  we  need  to  do  XPS  survey or  high  resolution  scan to  get  the  spectra. In  the  spectra , you  can  see oxygen  peak,  aluminum  peak . Since  we  need  to  do  manual  calibration and  we  have  automatic  baseline  correction , this  can  impact  error   on  repeatability  and  reproducibility . 
Next ,  we  need  to  do  analysis . Analysis  is  peak  fitting . We  need  to  fit  this  peak   and  then  we  can  gauge . From  peak  area ,  we  can  calculate the  oxygen  aluminum  percentage . Since  it  is  semi  automatic , it  can  add  error  in  the  reproducibility . By  XPS  measurement , we  are  calculating  O /Al  ratio and  our  customer  spec  limit  is  1 .2 -2 .3 . Next ,  we  will  talk  about the  cause  and  effect  diagram , MSA  cause  and  effect  diagram . In  MSA  cause  and  effect  diagram , we  did  some  detailed  analysis, and  we  found  several  one   it  can  impact  on  the  gauge  R&R . We  highlighted  also  major  ones like  electron  counts , calibration  analysis , we  talked  in  the  earlier  slide . Now  we 're  adding  another  one ,  it 's  sample  loading , how  it  is  added  error  in  the  gauge  R&R . Sample  loading ,  we  need  to  do  it .   It 's  not  automatic ,  it  is  not  full  wafer . We  are  doing  with  coupon  wafer , so  we  need  to  place  the  coupon . If  it  is  location  a  little  bit  different or  angle  is  little  bit  different , then  it  can  impact  on  the  measurement . This  is  impacting  on  reproducibility . All  other  major  one  impact  we  already discussed  in  the  previous  slides . Other  one  is  the  sample . It  depends  on  the  process . For  this  presentation ,  it  is  out  of  scope . We  will  talk  about  these  four   in  this  presentation . Next ,  our  sample  collection  plan . For  our  sample  collection ,   we  use  six  samples  for  MSA  analysis , and  for  these  six  samples   we  have  four  replicates . Here  you  can  see  these  four  replicates . We  measure  those  samples  in  two  sites . Since  we  have  the  four  replicates , we  are  measuring  those  sequentially . Is  it  possible   that  if  samples  are  degraded then  sample  degradation  could  be  a  risk ? We  will  talk  about  this  risk   later  on  more  details . Our  expected  outcomes like  that  we  need  to  find  out  XPS  method is  adequate  to  differentiate   process  variation . Also , you  like  to  gauge  like  that whenever  we  are  measuring  two  sites   that  have  similar  result . Also , we  like  to  gauge  like  that  sample or  part  is  not  interacting  with  the  site . Now  we  need  to  do  the  MSA  design . In  MSA  design ,  we  are  using  substrate and  site  at  the  cost  factor . This  is  the  site ,  this  is  the  part , and  we  have  also  two  different  substrate . We  mentioned  it  before , API  palette  and  silicon  wafer . S  numbers  are  silicon  wafer , A  numbers  are  API  palette . We 'll  not  be  able  to  use  actually   completely  randomized  option and  we  use  first  repeat . For  first  repeat  option , here  we  are  not  changing   the  sample  replicate  number . That  could  impact   on  sample  degradation  problem . That 's  why  later  on  we  will  compare  first   and  fourth  replicate to  check  this  sampling  risk . For  this  MSA  analysis , we  sequentially  use  several  JMP applications  from  JMP  platform . We  use  data  distribution  of  MSA  samples . That  is  from  descriptive inferential  statistics  application . It  is  from  distribution  fit  Y  by  X . Then  we  check  the  data  variability using  control  chart  and  one -way  ANOVA . Then  we  analyze  gauge  R&R  components . It  is  from  the  variability  chart . 
Then  we 'd  like  to  gauge  like   that  what  is  the  relation with  process  capability  with  gauge  R&R . That  we  can  find  out  like  that , interclass  correlation  versus  P /T  plot . Next , we  did  the  root  cause  analysis to  plan  for  improving  the  GRR . We  will  find  out what  is  the  GRR  major  error and  how  we  can  find  out . That  we  are  using  for  box  plot ,   density  ellipse , matched  pairs ,  and  fit  line that  are  different  platform   of  JMP  platform . Let 's  start  with  the  data  distribution . We  developed  process  initially   at  the  two  spec  limit , upper  spec  limit  and  lower  spec  limit . In  upper  spec  limit ,  we  have  two  samples , two  parts  and  four  replicates . All  are  measured  two  sites , and  we  already  mentioned we  did  the  first  repeat . Similarly ,  at  the  lower  spec  limit ,   also  we  have  four  parts , four  replicates  and  two  sites . Since  we  did  the  process  development at  the  two  end  of  the  spec  limit , that 's  why  we  can  see   that  our  distribution  is  bimodal . It's completely  bimodal  distribution . Problem  of  bimodal  distribution , it  can  impact  on  the  GRR  components . It  can  impact  on  P /TV  ratio , it  can  impact  on  P /PV  ratio and  misclassification . Since  P /T  ratio  is  not  related  with ... It  is  not  dependent  with  the  part , that  is  the  reason  P /T  ratio it 's  not  impacting by  the  sample  distribution . That 's  why  we  will  be  used in  our  following  slides . Our  figure  of  merits   we  are  using  as  a  P /T  ratio . For  misclassification  probabilities , there  is  five  probabilities . Last  three ,  it  could  be  impacted by  the  sample  distribution  more , and  first  two  is  less  impacted . To  minimize  the  risk ,  again , we  are  focusing  on  the  P /T  ratio as  a  figure  of  merit . In  the  next  time ,   our  plan  to  do  MSA  analysis using  uniform  sample  distribution . Let 's  check  now  the  variability  of  data . Here  we  can  see  that  we  use  I -MR  chart , individual  moving  range  chart , and  we  saw  that  many  data  points are  outside  the  control  limit   in  the  upper  chart , and  in  the  lower  moving   range chart , we  saw  that  three  data  point   is  outside  the  control  limit , and  that  these  three  data  points ,   it  is  sudden  shift . It  is  sudden  shift , it 's  not  staying  there ,  it  is  going  back . It  means  it  is  the  type  II  shift and  there  is  a  mixture   of  common  cause  variation and  special  cause  variation   in  the  control  chart . That 's  the  reason  here control  limits  are  meaningless . We  need  to  subgrouping  with  special  cause and  then  only  we  can  consider   the  control  limits . Now  we  like  to  find  out what  are  the  special  cause . First  we  will  check  if  part  variation could  be  a  special  cause . We  did  it  using  the  one -way  ANOVA and  in  one -way  ANOVA , we  can  see there  is  a  variation  of  the  samples . We  did  the  process  near  upper  spec  limit and  we  did  the  process  lower  spec  limit . That 's  why  samples  are  different . That  also  we  found  by  one -way  ANOVA , and  here  we  can  see  that within  variation  is  very  small   compared  to  part  variation , and  also  by  analysis   of  variance  is  showing  like  that . 
Here  our  hypothesis   is  all  parts  are  same , but  it  is  rejecting  the  hypothesis because  P -value  is  less  than  0 .05 . It 's  telling  us   it  is  significantly  different . That  means  part  variation is  a  special  cause , so  we  can  use  as  a  candidate for  subgrouping . Again ,  similarly  we  check  with  the  site variation  if  it  is  a  special  cause  or  not . We  considering  two  sites  measurement   near  upper  spec  limit as  well  as  near  lower  spec  limit . We  saw  that  here  our  hypothesis is  two  sites  are  measurement  similar, and  we  found  that  its  P -value   is  higher  than  0 .05 . For  upper  spec  limit , there  is  no  evidence   that  we  can  reject  the  hypothesis . It  is  similar ,  on  the  other  hand ,   for  lower  specs  limit . It  is  marginally  rejected   because  it  is  less  than  0 .05 . For  site  variation ,   either  it  is  marginally  rejected or  there  is  no  evidence  to  reject . That 's  why  site  variation   is  not  a  good  candidate and  part  variation  is   the  better  candidate . What  we  did  next ,  we  make  our  control chart  again  with  phase  option  and  A  here , sampled  part  at  a  different  phase . When  we  do  it ,  we  saw that  in  a  moving  range  chart , we  found  change  in  the  variation in  the  measurement  in  the  moving  range , and  that  calculated  the  control  limits for  the  bottom  chart  and  the  upper  chart . Now  we  saw  that  all  the  points , all  the  measurement  points are  inside  the  control  limit . These  is  the  variations  of  each  sample . It  is  the  repeatability . When  we  consider  site  A  and  site  B , and  we  saw  also  site  B   has  also  repeatability . But  compared  to  site  A  and  site  B , there  is  some  variation  of  repeatability . That  is  called  reproducibility . Now  we  calculate  the  gauge  R&R , all  the  components  in  the  next  slide , and  we 'll  find  out  what  is   the  dominating  error  in  gauge  R&R . First , we  did  main  effect . We  didn 't  consider  for  the  main  effect part  and  site  variation  interaction , so  only  the  main  effect . Here ,  we  saw   the  repeatability ,  reproducibility . Repeatability  is  22 % and  reproducibility  is  15 %. I  already  mentioned  as  a  gauge  R&R , we  are  considering  P /T  ratio because  our  sample   distribution  is  bimodal , and  we  saw  that  P /T  ratio  is  26 %. It  is  passed ,  it  is  less  than  30 %. It  is  marginally  passed , and  major  error  is  22 %  repeatability . One  more  thing  I  should  mention  here , we  are  considering  P /T  ratio but  P /TV  or  P /PV  ratio  is  very  close for  our  measurement  cases because  our  sample  distribution  is  bimodal and  at  the  two  end  of  the  spec  limit . That  is  the  reason  this  ratio  T   or  TV  are  very  close  or  PV  is  very  close . That  is  the  reason  we  have  this  gauge  R&R . This  figure  of  merits  is  very  close . Also , I  should  mention  here  type  I  error alpha  and  type  II  error  beta . Type  I  error ,  all  our  data  points   within  the  control  limit . That 's  the  reason  our  type  I  error good  part  is  falsely  rejected . It 's  very  small .  It  is  less  than  6 %. On  the  other  hand ,   type  II  error ,  it  is  6 %,  it  is  failed . It  is  more  than  10 %. Why  type  II  error  is  higher ? Our  repeatability  is  the  major  issue . 
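To make the arithmetic behind those components concrete, here is a small Python sketch of a crossed gauge R&R calculation from a two-way ANOVA, using the usual expected-mean-square formulas. The file, column names, and balanced layout are assumptions for illustration; this is not the actual XPS data or the JMP MSA platform output.

```python
# Crossed gauge R&R from a two-way ANOVA (part x site with replicates).
# Illustrative only: 'xps.csv' and its columns are hypothetical stand-ins.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("xps.csv")                 # columns: part, site, o_al_ratio
n_part = df["part"].nunique()               # number of parts
n_site = df["site"].nunique()               # number of sites ("operators")
r = len(df) // (n_part * n_site)            # replicates per part/site cell (balanced)

model = smf.ols("o_al_ratio ~ C(part) * C(site)", data=df).fit()
ms = anova_lm(model)["mean_sq"]             # mean squares for each term

repeatability = ms["Residual"]
interaction   = max((ms["C(part):C(site)"] - ms["Residual"]) / r, 0)
site_var      = max((ms["C(site)"] - ms["C(part):C(site)"]) / (n_part * r), 0)
part_var      = max((ms["C(part)"] - ms["C(part):C(site)"]) / (n_site * r), 0)

grr = repeatability + site_var + interaction      # gauge (measurement) variance
usl, lsl = 2.3, 1.2                               # customer spec limits on O/Al
p_to_t = 6 * grr**0.5 / (usl - lsl)               # P/T: six-sigma gauge over tolerance
print(f"P/T = {p_to_t:.1%}")
```

When the repeatability term dominates the gauge variance, as it does in this study, it is the piece that drives both the P/T ratio and the type II (beta) risk.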
Whenever  we  are  measuring  the  samples , it  is  within  the  spec  limit . But  it  is  possible  like  that  whenever a  customer  is  measuring  it . It  could  be  beyond  the  spec  limit because  repeatability  is  high  here . At  this  point ,   since  we  are  developing  the  product , we  are  in  the  initial feasibility  check  phase . Customer  is  happy   with  this  beta  type  II  error , but  we  have  option . If  we  can  improve  the  repeatability ,  then  it  can  improve  this  part  also . On  the  other  hand ,  if  we  can  consider that  part  and  site  interaction , then  we  saw  that  part   and  site  interaction  is  6 %, not  that  much , but  there  is  a  little  bit  interaction. And  when  we  didn 't  consider the  interaction  in  the  main  effect  mode , then  this  interaction  is  added   in  the  repeatability . That 's  why  we  found  that  whenever we  are  considering  the  crossed  effect , we  saw  repeatability   little  bit  decreasing because  our  interaction  is  very  small , not  that  much  decreasing . Since  this  interaction  is  very  small , our  figure  of  merits   are  not  changing  that  much . It  is  changing  from  little  bit . Now  from  here , we  know  that  our  dominating   error  is  repeatability . Before  going  about   the  more  discussion  with  repeatability , first  another  thing   I  would  like  to  mention , process  capability  with  gauge  R&R . Effect  of  gauge  R&R   on  the  process  capability . Here ,  process  capability we  are  plotting  in  ICC  versus  P /T  plot . ICC  is  the  part  variation   to  total  variation and  P /T  is  the  six  sigma  gauge , and  USL  minus  LSL . We  calculated  from  here  Cₚ , and  in  our  cases , in  our  process  current  condition ,   Cₚ  is  0 .93 . It  is  less  than  one . It  is  in  the  red  zone , and  we  need  to  go  Cₚ ... For  a  good  process  capability , we  need  to  go  between  Cₚ  1 .33 -2 . It  is  the  yellow  zone . To  improve  this  Cₚ ,  what  we  need  to  do ? In  this  part ,  this  is  the  process  part and  in  this  direction ,   it  is  the  measurement  part . Process  variability   or  part  variability  is  very  high . For  our  measurement , we  saw  that  our  P /T  is  24 %. If  we  would  like  to  increase ,   if  we  would  like  to  improve  the  P /T from  24  to  suppose  15 %  or  10 %, then  we  have  to  improve  30 % -50 %, and  within  that ,   our  repeatability  is  the  main  issue . That  is  the  reason  we  need to  improve  the  repeatability . Now  it  is  question . If  we  need  to  improve  the  repeatability , do  we  need  to  change  our  measurement  tool ? That  is  again  depending  on  the  ROI that  is  question  to  our  managing  level , or  we  can  address   the  repeatability  in  different  way . That 's  why  we 'd  like  to  find  out  the  root cause  why  repeatability  is  higher . Here  we  are  considering  variability  chart with  analysis  of  variance . Here  we  can  see  that  we  plot   all  the  samples  variability  together with  site  A  and  site  B  measurement . You  can  see  that  suppose ,  for  a  sample  A0 , this  is  the  measurement  repeatability , and  it  is  changing . This  repeatability  is  changing for  all  the  parts . Also  repeatability  is  changing with  the  site  to  site because  here  you  can  see   repeatability  is  0 .06  standard  deviation , but  in  these  cases  when  they  measure their  repeatability  is  0 .03 . 
That  is  the  reason   this  repeatability  is  changing with  part  to  part  also  site  to  site . Whenever  it  is  changing   with  site  to  site , it 's  called  reproducibility . Here  if  you  can  consider   the  analysis  of  variance , then  we  can  see  that  site   to  site  variation  is  much  smaller than  within  variation . This  is  the  repeatability , within  variation ,  and  site  to  site . Site  to  site  variation ,   it  is  reproducibility ,  it 's  much  smaller . Repeatability  again  from  here  also we  find  out  that  it  is  the  bigger  problem . Now  in  the  next   to  find  out  the  root  cause , we  plotted  all  the  repeatability side  by  side  together , and  for  both  the  cases ,  USL , upper  spec  limit  and  lower  spec  limit , and  all  the  cases  we  found that  its  repeatability  is  different . Next  we  like  to  correlate   or  find  out  any  relationship if  it  is  present  site  A   and  site  B  measurement . Ideally ,  site  A  measurement  will  equal to  site  B  measurement  should  be . But  in  our  cases ,  we  did  some  linear  fit and  we  found  that  we  have  intercept as  well  as  we  have  linear  fit  slope , it  is  not  one ,  it  is  not  zero . Here  we  found   that  linear  slope  is  less  than  0 .4 and  intercept  is  higher  than  0 .9 . Our  fitting  points  are  distributed  widely . That 's  the  reason   our  R -squared  is  also  poor . We  also  did  the  density  ellipse and  density  ellipse  also  telling that  this  correlation  is  less  than  0 .5 . If  they  have  a  very  good   correlation  relationship , then  it  should  be  0 .9 . If  it  is  0 .6 ,  then  it  will  be   moderately  correlated . But  in  our  cases ,  it 's  not  that . That 's  why  we  know  that  site  A measurement  is  not  site  B  measurement . It 's  the  repeatability  impacted   on  the  reproducibility . Problem  of  repeatability  is  impacted on  the  reproducibility . Now  we  check  more  closely how  it  is  different . We  are  comparing  by  match  pair the  site  A  and  site  B  variation . Here  our  hypothesis   is  site  A  equal  to  site  B , that  means  site  A   minus  site  B  equal  to  zero . We  saw  that  our  probability   for  this  hypothesis , site  A  minus  site  B  equal  to  zero ,   is  less  than  0 .05  in  both  the  cases . It  is  upper  spec  limit  and  lower  spec . Both  the  cases  you  can  see that  it  is  probability  is  less  than  0 .05 . That  means  site  A  and  site  B   measurement  is  different , and  you  can  see   our  difference  of  mean  value and  confidence  interval is  above  the  zero  point  line . That  means  though  this  is site  A  measurement  is  always  higher for  site  B  measurement . Now  from  here ,  our  question  appears , since  we  did  the  first  repeat  analysis for  our  MSA  design  is  first  repeat ,   it  could  be  possible  like  that if  samples  are  degraded ,   like  O /Al  composition  is  degraded . That 's  why  we  did  again  match  pair  test with  first  and  fourth  measurement both  in  site  A  and  site  B   for  all  six  samples , and  we  found  that  here , first  measurement  minus  fourth measurement  equals  zero . That  is  our  hypothesis . We  saw  that  P -value  is  higher than  0 .05  both  the  cases . That  means  our  sample   degradation  is  not  an  issue . First  sample ,  there  is  no  evidence . First  measurement and  fourth  measurement  is  dissimilar . 
That  means  it  is  the  measurement  issue . For  that ,  this  is  summarized in  the  dashboard  table , in  the  dashboard ,   like  our  figure  of  merit for  gauge  R&R  24 % and  repeatability  is  21 % and  that  repeatability  is  changing from  part  to  part  and  site  to  site , and  we  have  always  higher  repeatability for  site  A  compared  to  the  site  B . Now  for  our  next  plan , we  plan  for  a  discussion  each  site   as  well  as  with  the  process  team . Site  has  a  problem  like  repeatability as  well  part -site  interaction . We  know  that  what  error  could  be introduced  in  the  measurement like  background /baseline  correction , electron  counts ,  peak  deconvolution . We 'll  discuss  those  methods  source   of  error  with  site  A  person , site  A  facility ,  and  we  will  find  out how  we  can  do  the  streamlining  process for  improve  our  MSA . Also  we  have  a  plan  set  up a  calibration  sample or  we  can  set  up  a  set  up  sample that  we  can  measure   in  regular  interval  in  the  both  sites . On  the  other  hand ,  with  the  process  team , we 'll  talk  to  improve  MSA  next  time to  MSA  data  collection  uniform . Instead  of  bimodal ,   we  should  collect  the  data  uniformly . Then  also  we  saw  part to  part  repeatability  variation . There  is  one  reason   it  could  be  measurement  issue . Another  reason could  be  process  is  not  uniform . We  need  to  validate  our  thermal  math to  check  our  process  uniformity . Finally ,  I  would  like  to  mention that  what  is  the  impact on  my  learning  for  this  MSA  analysis . Now  we  know  that  several  JMP  platform or  JMP  application  can  help  me  to  know what  is  the  signal  variation from  the  noise  variation , and  then  we  can  identify   what  figure  of  merit  we  can  use to  justify  our  measurement  method . In  our  cases ,   we  found  P /T  is  the  best  method , best  figure  of  merit  to  analyze  it . Then  how  misclassification  risk   can  relate  to  the  MSA  component as  well  as  sample  distribution   that  we  learn . Root  cause  analysis , we  did  several  JMP  application that  can  help  us  to  plan  to  improving  MSA . Since  it  is  very  helpful   for  particular  program  application , that 's  why  I  would  like  to  introduce this  data  driven  decision  making for  all  the  programs  I  involve  in to  improve  the  project  quality ,   cost ,  and  time . Finally ,  I  would  like  to  promote  data driven  decision  using  JMP in  our  advanced  technology   group  like  CTO  team , or  other  different  projects . This  is  my  final  slide . I  would  like  to  mention  my  journey . I  started  JMP  learning   beginning  of  the  year , and  that  time  we  did  A0 ,  A1 ,  A2 . This  is  my  foundation . Then  after  I  work with  MSA  analysis  and  SPC . I  also  got  my  certificate , JMP  STIPS  certificate  May  2023 . Now  I  am  instructor   at  AMAT  JMP  instructor . I 'm  planning  to  in  person presentation  in  October  2023 , and  also  I  am  working for  my  Black  Belt  on  2024 . Thank  you  for  listening .
Sterility breaches of pre-filled syringes of a drug product are not directly measured but are known to be a function of syringe dimensions, plunger movement, and fill weight. Fill weight is dynamically controlled, so a non-parametric fit in the JMP Distribution platform was used to fit a kernel density based on real-world data. JSL was used to simulate 10 million iterations based on the non-parametric fit, along with plunger movement simulations based on dimension specifications and measured frictional forces. Process time for the simulations was reduced more than three-fold by using invisible tables, simplifying the output, and eliminating saved formulas.

My name is Briana Russo, and I'm a senior statistician at the Center for Mathematical Sciences at Merck. Today I'll be going over simulating sterility breaches with non-parametric data. At Merck, we often deliver our liquid formulated drugs in prefilled syringes. A group at Merck that specialized in that came to me asking to see if I could simulate whether there is any risk of sterility breaches in them, based on historical data and some different scenarios they wanted to look at. There were two interesting parts of this that I wanted to go over today in my poster and discuss a little further. The first was some of the historical data; specifically, the fill weight was non-normally distributed. When filling the syringes, the process isn't necessarily running to a target. It's able to move within a range and even drift outside of that range for a bit before being corrected. That often results in some heavy tailing of the data, which you can see in the bottom left here. That's an example of that. We wanted to make sure that we were capturing that heavy tailing, because obviously that's where the highest risk is going to be. The other interesting part, which goes specifically into some JSL scripting, is that I was dealing with a large number of iterations asked for by the customer. They were looking for 10 million per scenario, because that's the order of magnitude at which they were expecting to create the syringes. During the project, I was able to discover some techniques to reduce the processing load on JMP that significantly reduced the process time when I was running the simulations and prevented any crashing or anything like that from memory issues. I'll touch on both of those things.

But first, I wanted to go into a little bit more background on the prefilled syringes and what we were looking at. As I mentioned, we have the fill weight data. That's the amount of liquid that's filled into the syringe. That, again, I wanted to look at non-parametrically using a density function. I found that that was very easy to do in JSL. I'll show how I did that. Then the other aspect was the plunger insertion depth: how deep is the plunger being inserted and how close is that to the liquid fill? Then the dimensions of the prefilled syringe. There is some variability from the manufacturer, and I wanted to make sure that was being captured. There were two key outputs, and each was a yes or no output.
The first was, we want to make sure that we were maintaining a gap between the liquid fill and the plunger. Because if we don't, then we're going to be getting liquid up on the plunger, and that could be a sterility risk. We wanted to make sure that the air gap length was always greater than zero. The other one was, we also don't want that air gap to be too big, because when we're shipping the syringes, say, on an airplane, they might be exposed to lower atmospheric pressures, which can cause the plunger to move up. If it moved up too much, it could go beyond a sterile barrier that was created when the plunger was inserted. We don't want it to go too low. We don't want it to go too big. But there's a lot that goes into the plunger movement, not only the air gap, which is a function of the dimensions of the plunger, how deep the plunger was inserted, and how close it is to the fill, but also different atmospheric pressures and the cross-sectional area, so the dimensions of the syringe. There are a lot of different inputs and different sources of variability potentially feeding into that plunger movement. I wanted to be able to simulate all of those. That meant that I knew the data table in JMP that I wanted to simulate into was going to be very big.

The first change that I was able to make, to make these simulations a lot more efficient, was actually just opening up the historical data that I was going to use, the data table I was going to use, as invisible. This made it so JMP didn't have to render the table, this potentially massive table I was going to create, and was able to really reduce process time and also prevent JMP from crashing at times when the memory of my laptop was exceeded. Once I opened up the historical data as invisible, I then would add enough blank rows to it to get me to 10 million, because obviously my historical data wasn't that big. But I wanted to make sure that the data table had 10 million rows, so then I could go ahead and simulate 10 million iterations. Specifically, what I did for the non-parametric aspect of the data was I fit the data in the Distribution platform in JMP, and then I was able to just very easily use the Fit Smooth Curve function to save simulations from that non-parametric fit to 10 million iterations. It is a super simple and easy way to get, essentially, kernel density function simulated values.

The other two things that really improved my simulation were, as I mentioned, that there were a lot of different calculations that I was doing within a data table, and different scenarios, over 20 different plunger depth targets, for example, that we wanted to look at. As part of my JSL script, I wanted to be looping over different scenarios. But if I was just going to create a column that then referenced previous columns in a loop, that could cause reference issues for each iteration of the loop, because I would end up with essentially all of the new columns having the same formula, because they'd all just end up referencing whatever the last iteration of the loop was.
To prevent that, if I wanted to use a formula for the column, I would then need to delete the formula. Again, very inefficient. One very simple and easy way that I could get around this was, instead of saving a formula for a new column, to just use Set Each Value. This means that JMP didn't need to save the formula at all. It eliminated that issue with the looping reference and then also, again, reduced process time.

The final improvement that I made was by really working with my customer in this case and really figuring out what exactly they needed; I was able to streamline things a lot. Initially, I was just giving them the kitchen sink: distributions and histograms of every single parameter and output, which they thought was interesting but was not really worth the effort or the process time. What they really wanted was the % failure rate for these two outputs. I was able to make delivering that a lot more efficient by eliminating the need to open up, say, the Distribution platform and try to fit 10 million rows. Instead, I just created a column where, if a sterility breach occurred, it was a one, and if it didn't, it was a zero. Then it was very easy to just calculate the column mean to give the percentage of failure for any scenario and directly output that to a journal. That way, the journal also didn't have to be massive from saving so much information and so many graphs from the data table.

Overall, initially in this project I was able to deliver it, but using the platform outputs, visible tables, and saved formulas, it was taking at least three hours. Often, I was letting it run overnight, so I don't know the exact timing, but at least three hours. By simplifying the output alone, so going directly to the journal instead of saving from, say, the Distribution platform in JMP, I was able to get this down to an hour and 49 minutes. Then just those two simple changes, making sure that the data table was invisible and saving values instead of saving the formula, got me down to 52 minutes despite the volume of calculations that needed to be made. Overall, it can be very simple and easy to simulate non-parametric data within JMP using these data tables and the Fit Smooth Curve function. Also, if you are simulating really big data sets in JMP, if you simplify the output and make sure that JMP isn't rendering things it doesn't need to, or calculating and saving things it doesn't need to, it can actually be very efficient in creating the simulations and giving you the outputs. In this particular case, using those techniques, I was able to reduce my simulation time more than three-fold. That's all I have. Thanks for listening.
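The invisible-table, Set Each Value, and column-mean tricks above are JSL-specific, but the overall simulation pattern is easy to sketch in other tools as well. Purely for illustration, here is a rough Python analogue with invented file names, limits, and input distributions; the geometry is a toy stand-in, not the real plunger-movement model.

```python
# Rough analogue of the workflow: fit a kernel density to historical fill
# weights, draw 10 million values, simulate the other inputs, and report the
# breach rate. All names, limits, and distributions here are placeholders.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
fill_hist = np.loadtxt("fill_weight_history.csv")   # historical fill weights

n = 10_000_000
fill = gaussian_kde(fill_hist).resample(n)[0]       # non-parametric (KDE) draws

# Hypothetical placeholders for the other sources of variability.
plunger_depth = rng.normal(8.0, 0.1, n)             # mm, insertion depth
syringe_area  = rng.normal(1.0, 0.02, n)            # cm^2, cross-sectional area

air_gap = plunger_depth - fill / syringe_area       # toy geometry, not the real model
max_gap = 9.0                                       # hypothetical sterile-barrier limit

breach = (air_gap <= 0) | (air_gap > max_gap)       # 1/0 flag, like the JSL column
print(f"breach rate = {breach.mean():.4%}")
```

The same three ideas carry over: avoid rendering what you do not need, compute values once instead of storing formulas, and reduce the output to the single failure-rate summary the customer actually wants.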
Functional or curved responses frequently occur in industry. Thanks to new features in JMP, we can now model and predict functional responses using key DOE or product design factors with JMP Functional DOE or Curve DOE modeling. A Functional DOE model is purely empirical. However, a Curve DOE model can incorporate mechanistic or expert knowledge on the functional form of the curved responses. In this presentation, the methods and results of predicting functional responses using Functional DOE and Curve DOE modeling are compared using case studies from the consumer product industry.

Hello, my name is Fangyi Luo and I'm from Procter & Gamble. Today I'm presenting with Chris Gotwalt from JMP. We're going to talk about how to model data from designed experiments when the response is a functional curve. Functional or curve responses occur very often in industry. Thanks to the new development in JMP, we can now model and predict functional responses as a function of key DOE or product design factors using either functional DOE or curve DOE modeling. A functional DOE model is purely empirical. However, a curve DOE model can take into account mechanistic or expert knowledge on the functional form of the curve responses. In this presentation, the methods and results of predicting functional responses using functional DOE and curve DOE modeling will be compared using case studies from the consumer product industry.

This is the outline of the talk. We will break the talk into two parts. In the first part, Chris will talk about what functional data are, show examples of functional data, and then help you build a fundamental understanding of functional DOE modeling, including functional principal component analysis as well as curve DOE modeling. In the second part, I will use two examples from Procter & Gamble and compare the results of functional DOE and curve DOE modeling using these two examples. The first example is modeling viscosity over time data from a formulation experiment. The second example is modeling absorption volume over time data from a diaper design of experiment. Then I will finish the talk with a brief summary and conclusion.

Thanks, Fangyi. Now I'm going to give a quick intro to functional and curve data analysis. But first I want to point out that there is a lot of this kind of data out there, and JMP really has made analyzing curve response data as fast, easy, and accurate as possible. If you haven't heard of functional data analysis before, you have certainly seen it out there. It's all over the place, and I'll show you some examples to make that clear. For example, here are annual home price indices from 1992-2021 for all 50 US states. Each function has a beginning measurement followed by a sequence of other measurements and then a final measurement. They all have a beginning, a middle, and an end. The functions don't have to all have the same start and endpoints or measurements at the same times. In a time series analysis, we are really interested in using data to predict forward into the future using data observed from the past.
In a functional data analysis or a curve data analysis, we are generally more interested in explaining the variation internal to the functions than predicting beyond the range of times we've observed. In product and process improvement in industry, we are often working on non-financial curves. I'm going to show you some examples that our customers have shared with us. Here we see a set of infrared spectra of gasoline samples used to develop an inexpensive tool to measure octane in gasoline. The green curves had high octane, and the red ones were low in octane. The height of the left peak turned out to be critical for predicting octane level. Microbial growth curves are a common type of functional data in the biotech industry.

Today, Fangyi will be demonstrating two methods in JMP that can be used for analyzing DOEs where the response is a set of measurements. The first method is called functional DOE analysis and is best for complicated response functions like spectra, when you need the model to really learn the curves in the data from scratch. The second is a curve DOE analysis, which is based on nonlinear regression models. When you can use the curve DOE analysis, I found that you get more accurate results with it. But if you can't get the curve DOE analysis to work, you can always fall back on the functional DOE analysis, as it's more general than curve DOE.

The critical step in functional data analysis that will be new to most people is called functional principal components analysis, also called FPCA for short. This is how we decompose the curves into shape components that describe the typical patterns we see in the curves, as well as weights that indicate how strongly each individual curve correlates with those shape components. It's a kind of dimension reduction and data compression technique that reduces all the information in the curves into the most compact representation possible. To illustrate FPCA, take a look at the set of curves in the plot here. What do they have in common? How do they differ from one another? What I see in common is a set of peak shapes with one peak per curve, and the shapes go to zero away from the peak. They also appear to be symmetric around the center of the peak. In terms of differences, I see variation in peak heights, there are clear horizontal shifts from left to right, and some curves are also narrower than other ones.

In a functional data analysis, the first thing we do is find a smoothing model that converts or approximates the discrete measurements, turning them into continuous functions. There's a variety of smoothing models in FDE. I don't really have a firm rule as to which one is the best in general, but here are my observations about the most common ones. Wavelets and splines have different strengths. Wavelets are new in JMP Pro 17, are very fast, and are generally the best with complicated functions such as spectra, as long as the X coordinates of the data are on a grid.
On the other hand, there are B-splines and P-splines, which are slower computationally but are better for data with irregularly spaced Xs, and are also often better when there are only a dozen or fewer measurements per function. If the data aren't large, I would try both splines and wavelets and see which one is giving us the best fit by looking at the graphs. The main graphs I use to make decisions about smoothing models are actual by predicted plots, and you want the one that hugs the 45-degree line more closely. In this case, I would choose the wavelets model on the right over the spline model on the left, because those points are tighter around that 45-degree line.

Immediately after JMP Pro fits a smoothing model to the data, it decomposes the signals into dominant characteristic shapes it found in the data. In mathematical language, these shapes are called eigenfunctions, but a better and more approachable name would be to call them shape components. Here we see that JMP has found that the overall mean function is a peak shape and that there are three shape components that explain 97% of the variation in the data. The first shape component appears to correspond to a peak height. I've learned to recognize that the second shape is a type of left-right peak shift pattern and that the third shape component is something that would control the peak width. Remember that these are shapes learned from the data, not something that I gave JMP outside of the data. What has happened is that the observed spectra in the data have been decomposed into an additive combination of the shape components with unique weights for each individual curve. The functional PCA is like reverse engineering the recipe of the curves in terms of the shape components. The mean function is the thing that they all have in common. The shape components are the main ingredients. And the weights are the amounts of the ingredients in the individual curves.

The functional DOE analysis is the same mathematically as extracting the scores or weights and modeling them in Fit Model with the Generalized Regression platform. Fortunately, there is a red triangle option in the Functional Data Explorer that automates the modeling, linking up the DOE models with the shape functions for you and presenting you with a profiler that connects the DOE models with the shape functions. You can directly see how changing the DOE factors leads to changes in the predicted curve or spectra. There are many potential applications of functional DOE analysis, some of which Fangyi will be presenting later in this talk. There is another approach in JMP called curve DOE modeling. This answers the same kind of question as functional DOE, but it is nonlinear regression based rather than spline or wavelet based.
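As a purely illustrative sketch of what "nonlinear regression based" means in practice (outside JMP, with an invented data file, factor names, and a three-parameter logistic chosen as the example curve), the idea is to fit one parametric curve per run and then model the fitted parameters as functions of the DOE factors:

```python
# Sketch of the curve DOE idea outside JMP: fit the same nonlinear model to
# every run, then regress the fitted parameters on the DOE factors.
# The file, column names, and model choice are illustrative assumptions.
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
import statsmodels.formula.api as smf

def logistic3(t, asymptote, growth_rate, inflection):
    return asymptote / (1 + np.exp(-growth_rate * (t - inflection)))

runs = pd.read_csv("doe_curves.csv")    # columns: run, factor1, factor2, time, y

rows = []
for run_id, grp in runs.groupby("run"):
    popt, _ = curve_fit(logistic3, grp["time"], grp["y"],
                        p0=[grp["y"].max(), 1.0, grp["time"].median()])
    row = {"run": run_id, "asymptote": popt[0],
           "growth_rate": popt[1], "inflection": popt[2]}
    row.update(grp.iloc[0][["factor1", "factor2"]].to_dict())
    rows.append(row)
params = pd.DataFrame(rows)

# Model each curve parameter as a function of the DOE factors.
for name in ["asymptote", "growth_rate", "inflection"]:
    fit = smf.ols(f"{name} ~ factor1 + factor2", data=params).fit()
    print(name, fit.params.round(3).to_dict())
```

In JMP, the Fit Curve platform and Generalized Regression handle those two steps for you, as described next.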
What  that  means  is  that  if  you  have a  good  idea  of  a  nonlinear  model, like  a  three- parameter  logistic  model, and  if  that  model  fits  your  data  well, you  can  get  models  and  results that  generalize  better than  a  functional  DOE  model, because  the  general  shape  of  the  curve doesn't  have  to  be  learned  from  scratch from  the  data  using  splines  or  wavelets. The  idea  being  that  if  you  can  make assumptions  about  your  data that  reproduce  the  modeling  effort  needed, your  predictions  will  be  more  accurate, especially  from  small  data  sets. Curve  DOE  analysis has  a  very  similar  workflow to  a  functional  DOE  analysis, except  that  you  go  through the  Fit  Curve  platform instead  of  the  functional  Data  Explorer, and  instead  of  choosing wavelets  or  splines, you  chose  a  parametric  model from  the  platform. Just  like  in  a  functional  DOE  analysis, you  want  to  review the  actual  by predicted  plot to  make  sure  that  your  nonlinear  model is  doing  a  good  job  of  fitting  the  data. A   curve DOE  analysis is  the  same  as  modeling the  nonlinear  regression  parameters extracted  from  the  curves  using the  generalized  regression  platform. This  is  the  same  thing  as  what's  going  on with  a  functional  DOE  analysis with  the  FPCA  weights. Fit Curve  automates  the  modeling and  visualization  just  as  FDE  does. Once  you  know  functional  DOE  analysis, it's  really  not  very  hard  at  all to  learn   curve DOE  analysis. Now  I'm  going  to  hand  it  over  to  F angyi who  has  some  nice  examples  illustrating functional  DOE  and   curve DOE. Thanks  Chris. Next  I'm  going  to  talk  about two  examples  from   Procter & Gamble. The  first  example is  viscosity  over  time  curves collected  from  a  number of  historical  formulation  experiments for  the  same  type  of  liquid  formulation. There  are  six  factors  we  would like  to  consider  for  the  modeling. They  are  all  formulation  ingredients and  we  call  them  factor  one  to  factor  six. The  goal  of  our  modeling is  to  use  these  formulation  factors to  predict  or  optimize viscosity  over  time  curve. The  response  of  modeling is  viscosity  over  time. This  slide  showed  you some  viscosity  over  time  data. For  majority  of  our  formulations, the  viscosity  of  the  formulations would  increase  first  with  time and  then  decrease  later  on. Next,  we're  going  to  perform  functional DOE  analysis  on  viscosity  over  time  data. Before  functional  DOE  analysis, we  need  to  perform functional  principal  component  analysis on  the  curves  smooth using  different  method. Here,  we  apply  functional  principal component  analysis to  the  curves  first  using  B-s plines and   find  five functional  principal  component where  they  cumulatively  explains about  100%  of  variation  in  the  curves. Each  of  the  curve  would  express as  the  sum  of  the  mean  function plus  linear  combination of  the  five  functional principal  components or  eigen functions also  called  shape  function. We  also  apply  direct  functional  principal component  analysis  to  the  data where  it  find  four functional  principal  components that  cumulatively  explains about  100%  of  variation across  viscosity  over  time  curves. E ach  curve  will  then be  expressed  as  the  mean  function plus  linear  combination  of  the four  functional  principal  components. 
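As a rough outside-JMP analogue of those two options, the sketch below smooths each curve with a scipy smoothing spline before the decomposition, standing in for the B-spline option, and also decomposes the raw measurements directly, standing in for direct functional PCA. The synthetic viscosity-like data, the smoothing parameter, and the component counts are illustrative assumptions, not the P&G data or JMP's algorithm.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Illustrative stand-in for viscosity-over-time measurements (one row per formulation)
rng = np.random.default_rng(2)
t = np.linspace(0, 24, 60)
raw = np.array([50 * np.exp(-0.5 * ((t - b) / 5) ** 2) + rng.normal(0, 2, t.size)
                for b in rng.uniform(6, 14, 20)])

# Option 1: smooth each curve with a spline first, then decompose the smoothed curves
smoothed = np.array([UnivariateSpline(t, y, s=len(t) * 4.0)(t) for y in raw])

# Option 2 ("direct"): decompose the raw measurements without a separate smoothing step
def fpca(curves, n_components):
    mean_curve = curves.mean(axis=0)
    U, s, Vt = np.linalg.svd(curves - mean_curve, full_matrices=False)
    explained = (s ** 2 / np.sum(s ** 2))[:n_components].sum()
    return mean_curve, Vt[:n_components], explained

for label, data, k in [("B-spline smoothed", smoothed, 5), ("direct", raw, 4)]:
    _, _, explained = fpca(data, k)
    print(f"{label}: {k} components explain {explained:.1%} of the variation")
```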
This slide compares the functional principal component analysis model fit using two different options. The first one uses the B-spline option and the second one uses direct functional PCA. As you can see, using the B-spline option, the model fit is smoother compared to the fit from direct functional PCA. This slide shows the diagnostic plots, the observed versus predicted viscosity from the functional principal component analysis using the two options. With direct functional PCA, the points are closer to the 45-degree line than with the B-spline option, indicating that direct functional PCA fits the viscosity over time data slightly better than the functional principal component analysis using the B-spline option. After performing functional principal component analysis, there's an option in JMP to perform functional DOE modeling and get a functional DOE profiler. Functional DOE modeling basically combines the functional principal component analysis with a model for the functional principal component scores as functions of the formulation factors. With this profiler we can predict the functional response, in our case the viscosity over time curve, for different formulation factors. You can select a combination of the formulation factors and it is able to predict the viscosity over time curve. This slide shows the diagnostic plots, the observed versus predicted viscosity and also the residual plots from the functional DOE modeling. As you can see, the residuals from the functional DOE modeling are larger than those from the functional principal component analysis before the functional DOE modeling. Our colleagues at Procter & Gamble actually found that a Gaussian Peak model fits the individual viscosity curves very well. This Gaussian Peak model has three parameters A, B, C: A indicates the peak value of the viscosity over time curve, B is the critical point, the time when viscosity reaches its maximum, and C is a growth rate, the rate of the viscosity increase during the initial phase. This is the fitting of the viscosity over time curves using the Gaussian Peak model with the curve fitting feature in JMP. These are the diagnostic plots of the viscosity curve fitting using the Gaussian Peak model. The model fitting is not too bad; however, the errors seem to be larger than the errors from the fitting using functional principal component analysis. After fitting the curves with the Gaussian Peak model, there's an option in JMP to perform curve DOE modeling. Basically, the curve DOE model combines the parametric model for the curves, the Gaussian Peak model, with a model for the parameters of the Gaussian Peak model, expressing each parameter as a function of the formulation factors using generalized regression models. Then you get the curve DOE model, and this is a profiler of the curve DOE model. Using this profiler you can predict the shape of the curve by specifying a combination of the formulation factors; a sketch of this two-step fit is given below.
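Here is the promised sketch of the two-step fit, using scipy and numpy. It assumes the common Gaussian peak parameterization A·exp(-0.5·((t - B)/C)²), which matches the description of A, B, and C above but may differ in detail from JMP's Fit Curve formula; the factor matrix, the synthetic curves, and the use of plain least squares in the second stage are all illustrative stand-ins for the real data and generalized regression.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_peak(t, A, B, C):
    # A = peak viscosity, B = time of the peak, C = width / growth-rate parameter
    return A * np.exp(-0.5 * ((t - B) / C) ** 2)

rng = np.random.default_rng(3)
t = np.linspace(0, 24, 60)
X = rng.uniform(-1, 1, size=(20, 6))                    # six formulation factors (illustrative)
true_A, true_B = 40 + 10 * X[:, 0], 10 + 3 * X[:, 1]    # synthetic dependence on the factors
curves = [gaussian_peak(t, a, b, 4) + rng.normal(0, 1, t.size) for a, b in zip(true_A, true_B)]

# Step 1 (Fit Curve analogue): fit A, B, C separately for each formulation run
params = np.array([curve_fit(gaussian_peak, t, y, p0=[40, 10, 4])[0] for y in curves])

# Step 2 (curve DOE analogue): model each fitted parameter as a function of the factors
X1 = np.column_stack([np.ones(len(X)), X])
coefs = np.linalg.lstsq(X1, params, rcond=None)[0]

def predict_curve(factors):
    A, B, C = np.concatenate([[1.0], factors]) @ coefs
    return gaussian_peak(t, A, B, C)                     # the curve a curve-DOE profiler would show

profile = predict_curve([0, 0, 0, 0, 0, 0])              # center-point prediction, for example
```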
Actually, this profiler is somewhat different from the functional DOE profiler we got previously. These are the diagnostic plots from the curve DOE model. As you can see here, the curve DOE model does not fit the data well, and it is much worse than the functional DOE model. These are the curve DOE model fits on the original data. For a number of formulations, the curve DOE model does not fit the data well. This is a comparison of the profilers from the functional DOE model and the curve DOE model; the profilers look quite different. This compares the diagnostic plots from the functional DOE model and the curve DOE model. The functional DOE model fits the data much better than the curve DOE model, with a smaller root mean square error. Now I'm going to show you the second example. This example is from a diaper design of experiment with four different products A, B, C, D at three different stations labeled S1, S2 and S3, so it's a factorial design. Diaper absorption volume was measured over time for these four products at the three stations. The response is diaper absorption volume over time, and the goal is to understand the differences in diaper absorption curves across products and stations. These are a few examples of diaper absorption volume over time curves, where the fitted lines are smoothing curves. We performed functional principal component analysis on the diaper absorption volume over time curves, and it found five functional principal components that cumulatively explain almost 100% of the variation among the curves. These are the functional principal component analysis model fits. As you can see, for almost all the curves, the fitted curve plateaus after a certain time point. The functional principal component analysis model fits the curves really well, as you can see from the diagnostic plots. We performed functional DOE modeling on the functional principal component analysis, and this is the profiler of the functional DOE model. This model allows us to evaluate the shape of the curve for different diaper products at different measuring stations. The product comparison at station two seems to be different from the product comparisons at stations one and three. These are the diagnostic plots of the functional DOE model. Next, we would like to perform curve DOE modeling. Before curve DOE modeling, we would like to find a parametric model that fits the diaper absorption volume over time data well. I found that there's a function in JMP called the Biexponential 4P model. This model is a mixture of two exponential models with four unknown parameters, and it fits all the diaper absorption volume over time curves really well. These are the diagnostic plots of the curve fitting, and you can see that the Biexponential 4P model fits all the curves really well. After fitting the diaper absorption volume over time curves with the Biexponential 4P model, we performed curve DOE modeling in JMP, and this is the profiler of the curve DOE model; a sketch of the biexponential fit itself follows below.
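The talk does not spell out the Biexponential 4P formula. One common four-parameter biexponential is a·e^(-b·t) + c·e^(-d·t), and JMP's exact parameterization may differ; the curve_fit sketch below uses that assumed form with synthetic absorption-like data.

```python
import numpy as np
from scipy.optimize import curve_fit

def biexponential_4p(t, a, b, c, d):
    # One common 4-parameter biexponential form (an assumption; JMP's
    # "Biexponential 4P" parameterization may differ in detail)
    return a * np.exp(-b * t) + c * np.exp(-d * t)

rng = np.random.default_rng(4)
t = np.linspace(0, 60, 40)                     # time, illustrative units
# Synthetic absorption-like curve: rises quickly, then plateaus
y = biexponential_4p(t, 320, 0.002, -300, 0.15) + rng.normal(0, 5, t.size)

popt, _ = curve_fit(biexponential_4p, t, y, p0=[300, 0.005, -250, 0.1], maxfev=10000)
print("fitted a, b, c, d:", np.round(popt, 4))

# In the curve DOE step, the four fitted parameters (one set per diaper/station run)
# would then be modeled as functions of product and station, as in the Gaussian Peak sketch.
```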
Using this profiler, you are able to see the shape of the curve as a function of diaper product as well as measuring station. This is the profiler for product A at station two and then station three. These are the diagnostic plots of the curve DOE model, and you can see that the curve DOE model fits the data well, except that at higher diaper absorption volumes the residuals get larger. These are the curve DOE model fits on the original data. For most of the curves, this model fits the data really well. This compares the profiler of the functional DOE model versus the curve DOE model. You may notice that there is some difference between these two profilers at the later time points. The predicted diaper absorption volume at later time points tends to plateau in the functional DOE model, but it continues to increase at later time points in the curve DOE model. This compares the diagnostic plots from the functional DOE model versus the curve DOE model using the Biexponential 4P model. Both of these models fit the data really well, with functional DOE being slightly better, with a slightly smaller root mean square error. Now you have seen the comparison of functional DOE modeling versus curve DOE modeling using two P&G examples, and this is our summary and conclusions. Functional DOE modeling is always a good choice. When the parametric model fits all the curve data well, curve DOE modeling may perform really well. However, if the parametric model does not fit the curve data well, then curve DOE modeling may perform poorly. The functional DOE model is purely empirical, whereas the curve DOE model can take mechanistic understanding or expert knowledge into account, so it can be a hybrid. It's good to try different methods, such as different smoothing methods, before functional principal component analysis, and for DOE modeling of curves, to try both the functional DOE model and the curve DOE model and see which one performs best. This is the end of our presentation. Thank you all for your attention.
Monday, October 16, 2023
JMP has long been well-suited for producing production quality tables and graphs. Users have many options, such as taking screen shots or saving JMP reports in specific formats or even using JSL to place pieces of output reports in specific MS Word or Excel files. We have had a strong demand in recent customer engagements to go beyond providing solid analytical solutions accompanied by the excellent JMP tabular and graphical output as the teams look to operationalize a solution.   Operationalization often means deploying results to nonstatistical audiences using templates or special formats tailored more toward general consumption. We demonstrate a method integrating streamlined Python code generated with AI apps to effectively create output documents with advanced MS Word, which is not possible with JSL alone and doesn't require difficult Visual Basic code. We also show several other deployment tips and tricks that don't require Python.       Hello,  Team  Discovery.   I'm  Jim  Wisnowski, along  with  Andrew  Karl  from  Adsurgo, and  we're  here  to  talk a  little  bit  about  operationalization and  how  you  can  effectively  employ some  JMP  capabilities  to  do  so. The  genesis  of  this  presentation  came from  engagement  with  a  customer  last  year around  this  time  who  said, "Our  company  just  loves  JMP as  a  sandbox  tool and  maybe  in  the  lab, but  it  really  doesn't  do  anything for  operationalization  for  us." At  which  point,  it  struck  me  as, "I'm  not  really  sure  I  understand what  the  word  operationalization nor  really  know how  to  say  it  necessarily." S omehow  there's  a  disconnect because  all  of  our  projects, we  seem  like  we  deliver  a  product, and  it  is  finished. What  we  did  is  we  figured  out, "What  exactly  is  operationalization and  then  how  can  we  do some  neat  things  in  JMP  with  it?" What  we're  going  to  do is  I'm  going  to  talk  about  what  it  is and  then  give  you  some  quick  wins in  the  form  of  the  depot. Then  Andrew,  as  always,  will  do  the  work where  it  takes  the  high  intellect, and  he'll  also  show  you the  beautiful  part  of  operationalization. Key  here  is  we  all  know  all  models are  wrong  and  some  are  useful, but  the  other  side  of  that is  your  boss  or  your  boss's  boss, she  is  thinking  all  models  are  useless, whether  she  says  it  explicit  or  not. They're  useless  unless  somehow   they're  supporting  a  decision  of  mine or  we  get  them into  the  production  environment. We  need  to  make  sure  that even  though  we  have  a  good  model, that  it  can  go  to  the  next  step and  be  implemented. By  the  way,  I  do  want  to  show  George  Box over  here  on  the  right, and  then  Stu  Hunter  on  the  left, who  just  celebrated  his  100th  birthday on  the  3rd  of  June. By  definition,  what  is  operationalization? When  we  think  of  things, we  can  think  of  ideas such  as  the  leadership  saying  the  usual, What  is  the  problem? What  do  I  do  about  it? How  does  it  impact  me? And  then  the  now  what? What  is  the  solution? The  solution  isn't  necessarily all  the  things  that  we  do in  the  data  analytics,  data  science, world of  the  discovery, data  prep,  and  all  that. It  really  lies in  the  operationalization  piece for  that  senior leader  to  understand how  you're  going  to  solve  the  problem. 
In  other  words,  it's  really  how  do  we  get from  that  lab  environment to  the  production  line where  we  have  a  solution that's  going  to  be  useful  to  us. As  we  do  that,  let's  not  forget SAS's  recommendation  here that  we  better  make  sure  that we  have  a  good  disciplined  approach as  well  as  automated  in  that  world. Next  up,  we  can  dig  a  little  bit  deeper into  what  operationalization  is on  a  micro  level  kind of thing. I  asked  Chat GPT-4,  I  said, "Can  you  give  me  an  image of  what  operationalization looks  like  in  data  science?" This  is  essentially  what  they  did. I  had  to  do  the  graphics, but  they  gave  me  all  the  words. If  we  think  about  our  usual  data, do  our  data  development, and  then  we  figure  out  what  model  we  like, and  then  we  have  to  figure  out how  we're  going  to  employ  or  deploy  that, what  language? Is  it  going  to  be JavaScript,  C,  or  Python? Then  we  do  the  deployment, and  then  we  do  maybe perhaps  an  API  integration. Good  news  is  JMP  has  a  lot of  tools  for  us  to  do  that. We're  not  left  in  just  that  lab environment  as  suggested. Then  on  the  bottom  here, we  got  the  idea  that once  we  have  that  model  out  there, it's  not  a  launch  and  lead  kind of  thing. We  have  to  babysit  it and  perhaps  update  hyperparameters or  add  new  data and  see  if  it's  still  valid. Then  we  have  this  idea  here that  you  know  what? Not only are  our  users  liking  it, they  want  to  add  more  capabilities, so  we  start  scaling  up. We  have  to  make  sure  that  we  continue our  good  configuration  management and  data  compliance  and  documentation, ultimately  resulting  in  business  value. The  bottom  line  is  how  do  I  go from  this  lab  and  sandbox  environment to  having  business  value? That's  what  we're  looking  for in  our  operationalization. Forbes  gives  the  five  steps  here. Important  for  us  is  to  think  about, first,  you  have  to  have  a  good  story so  that  management  believes that  you  have  a  credible  way to  approach  this  and  solve  the  problem. Then  the  last  part  here  is, once  you  do  have  a  model, and  deployed, and  you  have it  operational ized make  sure  that  you  have  some  metrics  to  make  sure that  it  is  in  fact  performing. But  this  is  like  that  last  mile  idea is  we  take  all  of  this  work that  we  do  to  create  the  models, but  getting  it to  that  operationalization  piece is  the  tough  part. In  fact, we  can  see  that  the  data  out  there doesn't  suggest that  we're  doing  all  that  great. Overall,  we're  maybe  even  fewer than  half  of  these  models  make  it. Then  if  they  do,  it  takes  upwards of  three  months  or  so  to  do  so. DevOps,  we're  all  familiar  with in  terms  of  a  good  disciplined  approach for  software  development. When  we  take  that  step into  our  model  deployment  world, we'll  call  it  ModelOps, where  we  want  to  have  some  a  culture or  process  and  technologies to  effectively  employ  them. If  we  look  at  these  three  circles  here, it's  really  this  intersection between  two  and  three that  we're  focused  on to  make  sure  that  the  deployment   is influencing  those  business  decisions. I'd  like  to  go  and  do a  demonstration  here  in  JMP. 
Before  I  do  so,  I  do  want  to  point  out two  fantastic  presentations from  previous  discoveries that  do  talk  about  this  idea of  deployment  and  operationalization by  Nascif  and  Dan  Valente in  2016  as  well  as  2017. You  can  see  over  here, they  have  this  whole  idea that  if  you  give  a  man  a  fish, he  eats  for  a  day, and  if  you  teach   him to fish, he'll  eat  for  a  life, that's  this  operationalization  piece, which  they  also  call the  dark  side  of  analytics. That's  what  we're  going  to  get  into. M eijian  I  also  wrote  a  decent  paper as  well  on  how  you  can  do  that. But  for  us,  what  I  want  to  show  you is  using  the  Formula  Depot, and  I  got  a  little  plus  there  because it's  going  to  go  a  little  bit  beyond just  using  the  Formula  Depot because  that  is  not  enough. We'll  use  our  well-traveled Boston  housing  data. We'll  look  at  what's  the  price  of  it as  a  function  of  rooms,  distance, and  Lstat, which  is  an  economic  indicator. We'll  create  a  main  effects  model as  well  as  a  Bootstrap. Then  we'll  look  at  the  profiler, and  I'll  show  you  a  quick  trick that  could  be  helpful  for  you. Then  we'll  look  at  how  do  I  convert  this to  an  operational  solution and  being  able  to  deploy  it in  a  Python  environment. Certainly,  this  is  a  very  small  data  set, but  we  could  easily  have  done  this plugging  into  some  data  source and  using  query  builder and  things  like  that. But  just  want  to  show  you  some  quick  wins that  you  can  go  to  the  next  step. Because  often  we  hear  that  it's  great that  you  do  all  this  work, but  the  actual  implementation has  to  be  on  this  server, and  this  server can  only  use  JavaScript  or  C++, whatever  it  happens  to  be, Python,  maybe. How  can  we  take  our  good  work and  transport  it  into  use  in  operation? I'm  going  to  transition  over  to  JMP. Here's  a  journal  that  we  have  for  you, and  it  goes  through a  few  of  our  different  options and  what  we're  doing  here. But  here's  Boston  Housing that  we  know  and  love. Here  is  my  Least  Squares  model. What  you  do  in  the  Least  Squares  model, or  any  model  for  that  matter, is  under  Save  Columns, you're  going  to  publish that  prediction  formula. Then  the  Formula  Depot  comes  up  for  you. Let's  go  ahead  and  do the   Bootstrap Forest  as  well. Now we  have  the   Bootstrap Forest, which  we  can  also,  under  Save  Columns, publish  that  prediction  formula. If  we  come  back  up  here and  we  check  out that  Formula  Depot  report, we  can  see  that  it  is  something that  we  are  familiar  with  in  this  fashion. That  if  I  come  down  here off  of  the  red  triangle next  to  Formula  Depot, I  can  look  at  the  profiler . I  want  to  look  at  the  profiler of  both  these  models. I  do  OK, and  there  it  is. The  main  effects  model,  of  course, doesn't  have  any  interaction, but  maybe  you  want to  make  the  point  somehow that  when  you  have  very  short  distance, you have a real steep economic  impact kind of  a thing. What  we  could  do  is  we  could  show  this dynamically  in  PowerPoint, because  often, that  is  the  endpoint  of  some  of  our  work, and  an  analysis is  a  PowerPoint  presentation that  we'd  like  to  have some  sort of a  dynamic  display. 
There  are  certainly many  screen  capture  capabilities that  you're  probably  all  familiar  with. But  if  we  just  go  back  to  PowerPoint, we're  able  to  do  this  inherently in  the  software,  and  it's  pretty  quick. Here  is  a  blank  slide. What  I  want  to  do is  I  want  to  show  that  dynamic  display. I  just  come  under  Insert, and  I'll  do this  screen  recording  right  here. What  I'll  do  is  I  will  get  out  of  that and  I'll  come  back  to  here, and  then  I'll  come  back  to  JMP, go  to  PowerPoint. Now I do  the  screen  recording, and  I  want  to  be  at  the  right  spot if  it's  a  problem. I'll  just  go  ahead and  take  a  snapshot  of  what  I  want. I  will  go  ahead  and  hit  the  Record  button, and  it'll  give  me the  3-second  countdown. Then  now  I  can  narrate  and  say that  we  don't  see  any  interaction in  our  linear  your  model, but  you  can  see  a  Boosted Tree   does have  quite  a  bit  of  interaction. You're  happy  with  that. You  go  ahead  and  stop  the  recording. Then  now  you  have  this  embedded in  your  PowerPoint  that  easily. You  can  go  ahead  and  run  it  here, and  you  can  hear  me  in  the  background that  I  was  explaining  it, so  the  audio  is  in  there. Then  clearly, as  you're  in  presentation  mode, it  will  run  automatically  as  well. Now back  to  the  task  at  hand is  what  we  want  to  do is  deploy  this  solution in  a  model  that  is  Python. What  we  can  do under  the  Formula  Depot  options  here is  we  can  go  ahead and  generate  the  Python  code, and  we'll  do  it  for  the  sake of  argument  for  both  of  these. We  can  see  that  the  Boosted  Forest  here has  Python  code that  consists  of  55,000  lines  of  code. Good  to  know. But  we'll  put  that  aside  for  now. What  we're  going to  ultimately  decide  on,  we'll  say, is  t his  main  effects  model only  is  what  we're  going  to  use that  we  want  to  deploy  to  our  system. I'll  go  ahead  and  hit... By  the  way,  before  I  do  that, there  is  this  one  particular line  of  code  here that  says  import  JMP  score. That  is  a  Python  file  that  is  inherent, and  all  of  you  have  it on  your  install  directory, and  you're  going  to  need  that  to  use  it. What  it  is, it's  really  just  a  helper  menu. It  tells  you here's  how  you  do  vectorization, here's  how  you  do  power and  things  like  that, but  important  that  it's  there. I'll  go  ahead  and  do  File,  Save. What  that  creates  is  this  Least  Squares, and  then  we'll  call  it BH  for  Boston  House. Now I've  got  that  Python  code. What  would  be  nice  is  if  I  could  just  go to  Spyder  off  of  Anaconda and  just  run  it and  then  score  all  my  new  data. The  problem  is  it's  not  that  easy. It  doesn't  run  you  yet. It  takes  you  a  good  way  there, but  you  still  have  to  put some  wrappers  and  things  around  it. You  have  to  import your  Excel  file  or  whatever  it  is. I'm  not  a  Python  coder. In  fact,  people  say,  "What  do  you  do?" "I'm  a  data  scientist." "Oh,  you  must  be  good  at  Python." "No,  I've  never  actually  used  it." I'm  an  impostor  in  that  fashion. But  maybe  there's  a  way  that  we  could  use this  latest  technology,  ChatGPT-4, and  have  it  create a  Jupyter  Notebook  for  me. If  I  come  under  my  ChatGPT- 4  here, I  have  this  question  I  ask  it. 
"I  have  Python  code  generated from  a  model  in  the  JMP  Formula  Depot. Go  ahead  and  create a  Jupyter  Notebook  for  me that's  going  to  import  data  from  Excel." I  say,  "Can  it  do  it?" And  it  says,  "Absolutely." Then  what  it  does is  it  gives  me  the  code  right  here. I  copy  the  code, and  I  put  it  in  my  instantiation   of Jupyter from  Anaconda,  all  three. Then  I  run  it,  and  lo  and  behold, it  doesn't  work. Nothing  works  for  me  the  first  time. But  I  say,  "It  didn't  work, and  here  was  the  error  message  I  got." It  says,  "Oh,  well,  try  this." Then  I  tried  that,  and  then  it  worked. What  did  it  work? If  I  come  back  over  here, this  is  my  Jupyter  Notebook that  was  given  to  me  by   ChatGPT-4. Again,  I  know  nothing  about  Python, but I do  know  that  it  gave  me  these  lines. I  just  go  ahead  and  say, I'm  going  to  import  Pandas because  I  need  that to  get  my  Excel  file  in. Then  here  is  that  Fit  Least  Squares. That's  what  I  got  from  the  Formula  Depot. It  does  that. I'm  running  each  one  of  these,  by  the  way. Now it  says  go  ahead  and  import. I'm  going  to  import  an  Excel  file that  has  what  I  want  to  score, and  that's  going  to  be under  Boston  Housing  data. It's  new  data  that  I  want  to  score. Then  here's  this  outdata  thing that  it  told  me  was  my  error. I  said,  " I'll  do  that." Then  this  says,  "Hey,  just  let  me  know that  I  brought  in  some  data,"  and  it  does. Then  now  I'm  going to  go  ahead  and  score  it. I  go  ahead  and  score  it. Then  did  it  score? Sure  enough,  it  did. There's  the  first  five  values  of  it. Then  I  can  go  ahead  and  save  that, and  we'll  just  call  this  For  Demo. Maybe  I'll  even  add  100  in  there. Since  200  is  100, I'll  go  ahead  and  put  that  out. Then  I'll  say,  "W here  does  that  live?" Maybe  I'll  see  it  right  here. Here  it  is. Here's  the  scored  data  Excel  file. There  are  all  800  or  so predicted  values  from  that  linear  model. This  is  easy  as  that. Next  up, what  we  want  to  do  is  set  Andrew  up to  why  everything is  beautiful  in  this  world. Coming  back  to  PowerPoint  here, the  scenario  is  this. It's  that  we  were  working with  a  group  of  doctors  across  the  US, and  they  wanted  reports for  every  one  of  them how  productive  they  were  in  2022. They  thought  that  perhaps they  could  be  more  productive by  changing  some  of  the  codes   that they use for  certain  procedures. They  gave  us  millions  of  rows  of  data, and  we  came  up with  exactly  what  they  asked  for. We  created  a  template of  10  pages  or  so  of  the  report with  nice  JMP  graphics  in  there, and  it  was  virtually  perfect, except  for  one  thing. The  one  thing  is  that  this  data  table at  the  very  end  that  gave  the  results, we  couldn't  get  it  sized  properly or  put  the  borders  on  it. It's  as  simple  as  just  selecting  it and  then  right-clicking  and  do  AutoFit and  then  hitting the  border  symbol  next  to  it. That's  what  I  told  Dr.  Jay  right  here. This  is  for  Dr.  No. You  can  see  his  annual  report. Essentially, Andrew  swam  the  English  channel and  gave  them the  99.8%  solution  to  their  problem, but  they  weren't  quite  happy. It  wasn't  until  we  went  and  took the  step  to  make  this  automated. 
Again,  this  is  a  two-second  process, but  because  it  took  that by  100s  of  things, they  weren't  happy. But  then  we  ended  up  fixing  that, and  that's  when  the  customer  said, "This  is  absolutely  beautiful," hence  we  have beautiful  operationalization. With  that, I'm  going  to  turn  it  over  to  Andrew to  let  you  in  on  a  few  secrets of  how  you  can  get  some massively  increased  productivity. Thanks,  Jim. I'm  Andrew, and  I'm  going  to  show  you how  we  put  together  this  report. The  assumption  is  we  have  some  suppliers, and  here  are  our  suppliers. We've  got  four  different  suppliers, and  we've  got  some  metrics  that  we  track in  this  data  set that  we  update  frequently. We  want  to  be  able to  report  this  to  the  suppliers so  we  can  track  how  they're  doing and  have  a  record of  long-term  performance. What  we'd  like  to  do is  to  get  something  like  this is we  have  this  template  we  like where  we  have  our  header  up  top,  our  logo. We  have  in  bold, we've  got  the  company  name. Then  each  of  these,  we  substitute  in these  calculated  values  from  the  data  set. The  standard  JMP  functions  of, we  export  a  report  to  Word, we  get  everything, all  the  outline  boxes  are  open, but  it  doesn't  have  our  template  function, and we have to  fill  in  the  blanks. We  can  manually  copy  out  the  pictures, right-click  those  little  gray  triangles, and  say,  edit  copy  picture and  get  a  nice  version of  the  pictures  out. But  it's  still  a  manual  process that  has  to  be  done. This  little  application  here, I've  got  four  suppliers. What  if  you  have  400, and  they're  doing  that  once  a  month? That  becomes  unwieldy. How  can  we  do  this? Not  natively  within  JMP  or  JSL. You  can  get  close  sometimes to  being  on  your  application, but  a  more  flexible  example is  to  take  something  like  this where  we  have  a  template  file, we  write  the  report  once the  way  we  want  it. Every  time  we  have  something we  want  substituted  in  by  JMP, we  have  these  double  brackets, and  we're  going to  put  a  little  keyword  in  there. In  this  case,  team_p   is going  to  be  the  team  name. Then  down  here,  I've  got  mill_time_ p with  an  underscore, and  now, I've  got  that  bold  and  highlighted. We  put  whatever  format  we  want, and  anything  we  want  substituted  in, we  just  put  in  brackets. It's  a  natural  solution, so  it'd  be  great  if  we  get  this  to  work. Then  also,  in  addition  to  text, we  can  also  do  that  with  pictures. We've  got  the   q_pic11. If  we  want  a  JSL  line  of  box  equivalent, then  we  can  get this  table  structure  within  Word, and  we  just  put  in  our  keywords where  we  want  to  substitute  things  in. Also,  we're  going  to  have  a  table at  the  end  that  doesn't  have  a  key, that  doesn't  have  a  holder  here. I'll  get  to  that  in  a  second. Come  back  to  my  journal. At  the  end,  what  we  want  it  to  look  like, the  intended  document  result, is  we  also  have  this  table  at  the  end that  goes  through  Tabulate. We  have  this  nice  color  gradient we  can  put  it  into  JMP, and  we  want  to  get  this  into  Word. But  also,  as  Jim  mentioned, we  want  to  AutoF it  this  table and  we  want  it  to  look  nice, because  a  lot  of  times, the  tables  don't  fit  automatically. 
We  can  go  through  all  the  work and  create  the  tables, but  if  we  don't  make  them  fit, then  we're  going  to  have  a  lot  of  work ahead  of  us  to  go  through and  do  that  manually. It's  not  something we  can  program  natively  within  JMP. What  can  we  do  is  we  will  have  a  script that,  and  this  is  shared on  the  Discovery  website, where  we  can  open  up  the... We  have  a  reference  for  the  data  table, the  container  data,  and  this  gets  updated. Every  time  we  run  this to  generate  the  reports, it  pulls  in  the  new  data. We  have  the  output  path, which  is  a  folder  where all  the  temporary  files  get  written  to that  we  normally  clean  up plus  the  output  reports. Then  also  template  file,  that  Word  file that  contains  those  keywords. A ll  the  rest  of  the  script  is  going to  be  going  down  and  calculating the  individual  values that  gets  substituted  in. A t  the  end,  we  have  this  Python  script that  does  the  actual  substitution and  then  also, auto  fits  and  pulls  in  this  table. If  you're  saying, "I  don't  know  Python, I  don't  know  how  applicable this  is  going  to  be  to  me," we  don't  know  Python  either, but  we  got  this  to  work  pretty  quick because  GPT-4 wrote  the  Python  code  for  us. I'm  going  to  show  you an  example  of  how  we  did  that. What  this  script  will  do is  we'll  write  the  temporary  files to  the  output  folder. For  example,  here's  our  q12  graph, and  the  temporary  files  get  written  here. The  images  get  written to  the  output  folder. Then  with  the  static  code  in  Python, it  knows  to  pull  these  files  in. The  individual  calculations will  be  run  within  JMP, and  then  saved  within  the  script, and  then  passed  along to  Python  at  the  end. Here  we  run  all  of  our  JMP  code. We  load  any  of  our  values  into  variables. An  example  of  that  is, here  we  have  a  tabulate  function. We're  going  to  make  into  data  table. From  that  data  table,  we're  going  to  take the  mean  mill  time  column and  take  the  first  entry  in  that  column, and  then  that's  going to  become  our  mill_ time_ p  variable. That  is  what  will  get  sent  down in  the  Python  code. When  we  initiate  Python, we  tell  it  whatever  variables  we  wanted to  send  over  from  JMP  to  Python. Here's  that  mill_ time_ p. That  will  hold  any of  the  mill time  calculations, and  that  is  what  gets  substituted  in to  the  mill_time_ p  underscore  area within  the  template. I'm  going  to  go  back  to  my  template  file. Here's  my  mill_time_ p  area. That's  what  gets  substituted  in. The  intermediate  output from  the  Python  code  for  doing  this is  the  temp  report. We  can  see  these  values get  substituted  in. The  graphs  get  placed  in. We  get  our  nice  grid  of  graphs. At  the  end,  we  don't  have  our  table  yet. The  reason  we  don't  have  our  table  yet is  because  we  like  the  way  that if  we  move  a  table  from  JMP using  the  get  as  a  report  to  a  journal and  then  export  that  journal  to  Word with  Save  MS  Word, we  like  the  way it  keeps  the  table  structure, but  we  still  need  the  AutoF it. What  we  do  is,  in  addition  to  the  report that  gets  written  out  from  the  template, is  we  also  write  out this  other  temporary  table  file. We  get  the  table  import. 
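Before getting to the table, here is a hedged sketch of what the substitution step amounts to, written with the python-docx package rather than the exact generated code from the talk. The file names, the {{keyword}} rendering of the double-bracket placeholders, and the values dictionary are illustrative, and the simple run-level replacement assumes each placeholder sits inside a single Word run.

```python
from docx import Document   # pip install python-docx

# Illustrative values that would come from the JSL calculations
values = {
    "{{team_p}}": "Supplier A",
    "{{mill_time_p}}": "12.4",
}

doc = Document("supplier_template.docx")        # template path is an assumption

def replace_in_paragraph(paragraph, mapping):
    # Replace within each run; assumes a placeholder is not split across runs
    # (Word sometimes splits text, so real code may need to merge runs first)
    for run in paragraph.runs:
        for key, val in mapping.items():
            if key in run.text:
                run.text = run.text.replace(key, val)

for p in doc.paragraphs:
    replace_in_paragraph(p, values)

# Placeholders inside table cells need the same treatment
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            for p in cell.paragraphs:
                replace_in_paragraph(p, values)

doc.save("supplier_report.docx")
```

Picture placeholders can be handled with a similar loop by locating the placeholder paragraph and adding the image to a run, and the imported table still needs the AutoFit step, which comes next.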
Here  it  is,  and  what  we  need  to  do when  we  want  to  automate is  this  AutoF it t o  Window, and  then  also  reducing the  font  size  where  it  actually  fits. What  we  need  to  do  is, after  the  Python  code that  substitutes  out  into  the  template, what  we  found  is  we  have  to  convert that  DOC  file  created  by  JMP, convert  it  to  a  DOC  X  file, and  then  we  have  Python  code that  will  open  up  that  DOC  X  file, take  the  first  table, it  will  apply  AutoF it, change  the  font  size  to  7, and  then  it  will  append  it to  the  original  report  file. How  did  we  know  to  create  this? Again,  you  can  copy  what  we've  done. We've  got  this  file  available. But  how  can  you reproduce  this  for  your  own and  create  your  own  Python  to  do  this? I recorded  a  video  of  going  through on  GPT-4,  how  I  did  this, and  I'm  going  to  show that  now  and  narrate  that. The  prompt  I  give  is  that  I've  got a  Word  document  that  I  want  to  use. I'm  so  bad  at  Python. I'm  going  to  go  ahead and  give  GPT-4  my  actual  path  here because  I  don't  know how  to  substitute  out  the  backslashes, and  the  spaces,  and  everything. I  say,  "G o  to  this  path,  open  it  up, take  the  first  table  you  see  in  here, and  then  give  me  back  that  table  AutoFit." It's  going  to  tell  us,  "You  need to  install  this  other  package." I've  just  got  a  base Ana conda  installation. You  can  ask  it  for  instructions  and  help. You  make  sure  you  install  that  package and  then  you'll  be  able  to  run  this  code. Whenever  it  gives  me  this  thing, the  first  thing  I  noticed  is  it  tells  me, "Hey,  I'm  going  to  overwrite your  report  file,  be  careful." I  say,  "No,  I  don't  want  that." It's  interactive  the  way it  gives  you  the  code. I  say,  "F ix  this  code. I  want  to  write  to  a  different  file in  the  same  directory." It's  going  to  modify  the  code to  give  me  a  different  output  file in  the  same  directory. What  I'm  going  to  do is  I  just  copy-paste  this  over  to  Spyder, which  is  basically the  JSL  editor  for  Python. It's  the  Python  editor. I  hit  Run,  I  hit  the  green  arrow, and  I  get  this  note about,  "There's  an  error." I  don't  know  what  the  error  means. I  don't  want  to  spend  time on  Stack  Exchange  or  Stack  Overflow or  anything  looking  that  up. I  paste  the  error  back  to  Python, and  it's  nice  enough  that  it  apologizes and  explains  what  it  did  wrong. I'm  not  that  worried about  what  it  did  wrong. I  just  want  something  that  works and  gives  me  my  report. I'm  going  to  copy- paste the  code  it  gives  me and  then  go  back  to  the  Spyder, run  that. I  get  one  more  error. It  says,  "Table  is  not  subscriptable." Not  really  sure  what  that  means. I  tell  GPT-4  about  that. It  apologizes  again. It  thought  that  the  tables were  zero- based  index. It  turns  out  it's  one- based. It  fixes  that. I'm  going  to  copy  this  code  over. This  time,  it  runs  without  error. I  go  to  the  folder  I  specified  to  it, and  here's  the  modified  folder. Now you  can  see AutoF it's  been  applied  to  this  table. We  just  made  a  Python  code without  really  knowing  Python. You  don't  have  to  stop  there. If  you  have  any  version  of  modification that  you  want  to  give  to  it, you  can  change  the  font  size. 
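The generated code itself is not shown in the talk, but a sketch of what that AutoFit, resize, and append step might look like with python-docx is below. The file names are illustrative, python-docx reads only .docx files (which is why the intermediate .doc file has to be converted first), and the append at the end copies the table's underlying XML element, a common workaround rather than an official python-docx API.

```python
import copy
from docx import Document
from docx.shared import Pt

table_doc = Document("temp_table.docx")         # the exported JMP table (illustrative name)
report = Document("temp_report.docx")           # the filled-in template report

table = table_doc.tables[0]                     # first (and only) table in that file
table.autofit = True                            # let Word size the columns automatically

# Shrink the font so the table actually fits on the page
for row in table.rows:
    for cell in row.cells:
        for paragraph in cell.paragraphs:
            for run in paragraph.runs:
                run.font.size = Pt(7)

# Append the modified table to the end of the report by copying its XML element
anchor = report.add_paragraph()                 # empty paragraph to attach the table after
anchor._p.addnext(copy.deepcopy(table._tbl))    # low-level workaround, not a public API
report.save("final_report.docx")
```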
Here  I'm  going  to  ask  it to  make  it  Wingdings. Let's  see  if  it  knows  how  to  do  that. It  gives  me  some  new  code,  and  I  run  it. Yeah, I  can  get  Wingdings  output. Just  to  make  sure it  didn't  just  give  me  gibberishish, I'm  going  to  copy  it  all and  make  sure  that  it  translates  back into  more  easily  readable  text, and  it  does. That's  what  we're  looking  for. Not  only  Python,  but  also  Visual  Basic. Another  thing  that  we  run  into is  we  output  things. You  might  have  a  500-page  report and  you  want  the  headers  to  look  right to  header 1 ,  header  2  in  words, so  that  way you  can  get  a  table  of  contents. W e  told  GPT-4  is, I'm  going  to  use  these  delimiters. I'm  going  to  script  these  in to  my  outline  box  titles. I  get  h1,  close  h1,  h2,  close  h2 for  header  1,  header  2. I  want  to  substitute  those  in. When  we  did  that  to  GPT-4, what  it  gave  us is  it  gave  us  this  macro  file, which  I'm  going  to  copy  that. I'm  not  sure  if  it  was  retained in  my  Microsoft  Word. We'll  find  out. If  not,  I'll  paste  it  in. You  have  to  enable  this  Developer  tab. That's  a  Word  option. When  you  do  that,  we  get  this. We  just  paste  in  our  macro  here, which  was  generated  by  GPT-4. I  don't  like  programming  in  VBA, but  it's  good  at  it. When  I  hit  Run  on  this, it  takes  everything, all  these  headers  and  fills  them  in. That  way  I  can  go  up  here,  I  could  say, References,  Table  of  Contents, and  put  on  my  table  of  contents. If  I  generated  500  page  report, now  I've  got the  right  structure  for  all  this. It's  easy  to  go  in  here,  too. If  you  want  to  add  numbers to  these  headers  and  stuff, you  just  right-click  in  here, and  you   go  to  Modify, and  you  can  turn  on that  numbering  structure  if  you  want. The  last  thing  I'm  going  to  show for  GPT-4,  specifically  for  JSL, is  you  might  have  noticed in  my  script  back  here, I'm  really  bad about  commenting  my  script. I've  got  a  couple  of  comments  in  here. But  for  the  most  part,  I  don't  have a  lot  of  comments  describing  it. If  you  have  code,  you  don't  want to  give  anything  proprietary, but  if  you  give  code  to  GPT-4  or  even  JSL, it'll  add  a  bunch  of  nice  comments  to  you and  explain  what  each  code  block  is  doing, and  that  way  other  readers  can  see. Also  for  Python,  if  you  don't  know  Python and  you're taking  a  look  at  our  script that  we've  provided, and  you're  saying, " What  does  this  do," you  can  provide  this  code  to  GPT-4 and  say,  "Explain  this  Python  code  to  me." It'll  give  you  a  long  narrative  story and  say,  "Here  is  exactly the  workflow  of  what's  happening. Here  are  the  key  functions that  are  doing  it." That's  my  favorite  part. You  can  say,  "Do  you  have any suggestions  for  this  code?" It'll  say, "Y ou  don't  have  any  error  handling. You've  got  some  memory  leak  issues." It'll  go  through  and  make a  bunch  of  suggestions. Then  it's  nice  enough  to  go  ahead and  give  you  some  new  code with  all  the  suggestions  implemented. With  all  this  in  place, you  can  go  from  doing all  of  your  calculations  in  JMP to  actually  getting  this  nice  output  file that  has  the  right  format  you  want. Everything  looks  nice. You're  not  going  through making  manual  changes. 
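If you would rather stay in Python than enable VBA macros, the same header idea can be sketched with python-docx. The delimiter strings, file names, and heading-style mapping below are illustrative assumptions, not the macro from the talk.

```python
from docx import Document

doc = Document("long_report.docx")              # illustrative file name

# Map delimiter pairs to built-in Word heading styles
# (assumes "Heading 1" / "Heading 2" exist in the document, as they do by default)
levels = {("<h1>", "</h1>"): "Heading 1",
          ("<h2>", "</h2>"): "Heading 2"}

for paragraph in doc.paragraphs:
    for (open_tag, close_tag), style in levels.items():
        text = paragraph.text
        if open_tag in text and close_tag in text:
            # Strip the delimiters and apply the heading style so Word can
            # build a table of contents from these paragraphs
            paragraph.text = text.replace(open_tag, "").replace(close_tag, "")
            paragraph.style = style
            break

doc.save("long_report_with_headings.docx")
```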
With  this  in  place, this  is  what  the  customer  said, take a  look  at  this  and  said, "This  is  beautiful." With  that,  we  hope  that you  can  take  this  same  idea and  go  make  some beautiful  reports  yourself.
Tennis is one of my favorite sports. The 'big three' of Federer, Nadal and Djokovic are my favorites. They elevated modern tennis to new heights in their rivalry. But it looks like Alcaraz is in line to take the throne. It is fun to compare their records and present the results at JMP discovery summit in Indian Wells which is the home to Indian Wells Open.   I started with Association of Tennis Professionals (ATP) data from 2022 for my project. Next, I combined over 25 years of data that include all match records of Federer, Nadal, Djokovic and Alcaraz. Using JMP’s powerful analytical and visualization tools, this report provides insight into the questions of who won the most matches of a year? What are the factors that influence players winning matches? Most importantly, who is the GOAT?     Hi,  everyone.   My  name  is  Jianfeng  Ding. I'm  a  research  statistician developer  at  JMP  IND. Today,  I'm  going  to  show  you  how  I  use  JMP to  explore  the  tennis  data and  find  out  who  is  a  goat,   the  greatest  of  all  time  of  men's  tennis. First, I  would  like  to  give  you   some  background  information why  I  choose  this  topic. When  I  heard  that  the  JMP  Discovery  Summit 2023  will  be  held  in  Indian  Wells, I  got  excited as  tennis  is  one  of  my  favorite  sports and  my  youngest  son   plays  varsity  tennis  at  his  high  school. I  have  watched   a  lot  of  tennis  over  the  years. Indian  Wells  is  a  home to  Indian  Wells  Master, which  is  often  called  the  Fifth   grand slam. I  thought  it  would  be  fun  to  use  JMP to  explore  and  analyze  the  tennis  data and  present  the  results  to  our  user at  Indian  Well  Discovery. The  second  motivation  come  from  JMP. JMP  has  grown  bigger and  richer  in  many  ways. There  are  so  many  wonderful features  created  by  my  colleagues. I  would  like  to  keep  myself  updated  with  these  new  cool  features by  applying  them  to  the  project. Currently,  you  are  seeing  one  of  them, Application  Builder. Instead  of  using  PowerPoint, I'm  using  JMP  Application  Builder for  today's  presentation. My  presentation  mainly  include  two  parts. I  will  take  you  on  a  tool  to  explore the  ATP  data  from  the  year  of  2022. ATP stands  for   Association  of  Tennis  Professional, which  is  the  world  governor  body  for  men's  tennis. Then  we  will  look  at  a  25-year  combined ATP  data  to  find  out  who  is  the  GOAT. First,  let's  see  where  do  I  get  the  data? I  get  the  data  from  the  web  and  GitHub, which  was  created  and  maintained   by  Jeff  Secman. He  is  a  software  developer   working  in  the  field  of  sports  statistics. On  this  web,  it  contained  the  ATP  match  data from  year  of  1968  to  the  current  year. We  can  also  get  women's  tennis data  from  this  web  as  well. What  data  looks  like? Here  is  the  data  from  the  year  of  2022. It  consists  of  about  49  variables with  about  3,000  observations. Each  observation  represent matches  play  on  ATP  tours. The  yellow  section  contains  a  variable about  the  tournaments and  the  blue  section  contains  a  variable  about  the  players. Each  observation  is  a  match, so  usually  the  variable  comes  with  two, one  for  the  winner  and  one  for  the  loser. Let's  look  at  all  those  variables about  the  tournament  first. 
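For readers who want to pull the same data outside JMP, here is a short pandas sketch. The URL follows the usual file-naming convention of Jeff Sackmann's tennis_atp repository on GitHub, one atp_matches_&lt;year&gt;.csv per season, and the column names follow that repository's layout; adjust both if the repository has changed.

```python
import pandas as pd

# One season of ATP tour-level matches; column names follow the repository's layout
url = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2022.csv"
atp_2022 = pd.read_csv(url)

print(atp_2022.shape)                                   # roughly 3,000 matches x ~49 columns

# Tournament-side variables: surface and level counts
print(atp_2022["surface"].value_counts())
print(atp_2022["tourney_level"].value_counts())

# Player-side variables come in winner_/loser_ pairs, e.g. winner_name, winner_hand,
# winner_ht, winner_age, winner_rank and their loser_ counterparts
print(atp_2022[["winner_name", "winner_hand", "winner_rank"]].head())
```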
I built a Graph Builder plot on tournament name, tournament surface, and tournament level. In the tournament name plot, the tournaments with more players sit at the top. The Grand Slams, the Australian Open, Roland Garros, the US Open, and Wimbledon, are the largest and most prestigious tournaments. Last year there were about 145 tournaments. We also can see the three typical surfaces for the tournaments: clay, grass, and hard, and there are usually more hard-surface tournaments than grass and clay. There are also five levels of tournaments, defined here as A, D, F, G, M. G stands for the Grand Slams and M stands for the Masters; the Indian Wells Masters is a Masters-level tournament. D stands for the Davis Cup, and A is the ATP Tour. Next, let's look at the variables about the players. I ran Graph Builder again. The plot on the left shows which country has the most players. On the right, it shows the players' hands: do they play right-handed or left-handed? You will see that most players are right-handed. I also wanted to find out which country has more top-ranked players. I created a slider on winner rank, and the countries with more top-ranked players pop up. I was interested in the top 100, and the US sits at the top, meaning the US has more top-ranked players than the other countries. Then what about the top 10? You can either slide or type in the number. From this, Spain popped up at the top, and when I hover over it, I see Carlos Alcaraz and also Nadal. When I click the US, I see Taylor Fritz, who was ranked number nine. You also can see from the handedness plot that Nadal is left-handed, one of the left-handers among these top 10 players. Now let's move on to check the players' age, height, and ranking. Last year the players' ages ranged from 17 to 42. In this graph, I only listed the top 10 by average ranking. From this I find their average height is around 6'2", which is very common for a male tennis player. I also find Rafael Nadal and Novak Djokovic are the oldest on this list. Now, let's look at the winning statistics, because I would like to see who won the most matches in 2022. I found that Tsitsipas is listed as number one. Then something is missing: where are Rafael Nadal and Djokovic? I couldn't find them in this top 10 list of most matches won. This reminded me that maybe I should look at their winning ratio instead of just the number of matches they won. I did some summary statistics and found their winning ratios (a pandas version of this summary is sketched below). Yes, you immediately see that Novak Djokovic, Rafael Nadal, and Carlos Alcaraz are the top three players with the highest winning ratios, although their numbers of match wins are not as high as Tsitsipas. I also noticed there are two players who have a pretty good winning ratio, but they don't play many matches.
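That winning-ratio summary might look like the following in pandas, continuing from the atp_2022 table loaded earlier; the played-match count simply adds each player's appearances as winner and as loser, and the 20-match cutoff is an arbitrary illustrative threshold.

```python
import pandas as pd

# atp_2022 comes from the earlier loading sketch
wins = atp_2022["winner_name"].value_counts().rename("wins")
losses = atp_2022["loser_name"].value_counts().rename("losses")

record = pd.concat([wins, losses], axis=1).fillna(0)
record["played"] = record["wins"] + record["losses"]
record["win_ratio"] = record["wins"] / record["played"]

# Most match wins vs. best winning ratio (among players with a reasonable sample)
print(record.sort_values("wins", ascending=False).head(10))
print(record[record["played"] >= 20].sort_values("win_ratio", ascending=False).head(10))
```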
They  only  won  three  matches. Who  are  they and  what  type  of  tournament do  they  play? I  drilled  down  into  the  data and  I  find  out  one  player's  name  is  Kovacevic and  all  his  three  matches  coming  from  tournament  A  level and  the  player, Safwa his  all  three  matches coming  from  Davis  Cup. From  this  graph,  you  definitely  know the  tournament  level   will  affect  the  winning. Ultimately,  you  care  about  who  won the  most  championship  or  tournament  wins. This  graph  put  all  three   relative  statistics  in  one  plot. The  down  you  will  see   how  many  matches  they  win and  the  second, the  green  bar  means   what  are  their  winning  match-win  ratio? The  top  will  show  you how  many  total  championship  they  won  in  2022. I  see,  Djokovic,  Carlos Alcaraz   and  Rafael  Nadal. I  also  see  one  guy who  I'm  not  familiar  with,  and  his  name, hard  to  say,  but  let  me  call  him  FAA. FAA  doesn't  have  amazing  winning  ratio,   but  he  did  won  five  titles. Again,   I  drill  down  to  the  data  and  find  out all  FAA's  winning  title  coming  from  A-level  tournaments. You  look  at  Djokovic  or  Alcaraz  and  Nadal, they  are  championship  not  only  from  A-level  tournaments and  also  from   grand slam  and  a  Master  level. Again,   we  show  tournament  level  effect  winning. Let's  look  at  the  seed. What  does  seed  play in  the  players'  winning? I  have  to  point  out  the  players'  seeds actually  will  vary  over  the  years. But  in  general,  the  higher  seeded  players tend  to  win  more  matches and  more  tournaments. Grand slam  winner  usually are  highest  seeded  players. But  in  2022,  only  two  people  are  exception. One  is  Carlos  Alcaraz and  the  other  is  Taylor  Swift. Sorry,  it's  about  Taylor  Fritz. You  can  see  here,   Carlos, he  succeed,  start  low, but  he  won  the  Miami  Masters. This  helped  him  move  to  the  top. In  the  end,  year  of  2022, he  was  ranked  as  the  number  three  seed, and  he  was  able  to  win  the  US  Championship. Taylor  Fritz,  he  actually  won  the Championship  of  Indian  Well,  Master  2022. We  can  see  the  seeds  definitely  affect  the  winning. Now,  let's  look  at  the  comparison between  the  winner  and  the  loser. In  this  ATP  data,  there  is  a  section  list about  to  serve  statistics and  come  with  a  winner  and  a  loser. There  are  seven  variables related  to  the  serve  statistics. I'm  interested  in  this  first  one. What  it  is? The  first  one  means number  of  points  won  on  first  serve. I  click  and  build  a  plot. Instead  of  I  plot  all  those   absolutely  the  number  of  the  point I  use  the  ratio because  the  point  will  depend  on   how  long  you  played  your  matches. With  the  ratio  would  make  more  sense. The  blue  colored  represent  the  first  serve  percentage  won coming  from  the  winner and  the  pink  is  coming  from  the  losers. Actually,  majority  of  the  first  serve percentage  won  between  60%  and  90%. But  the  blue  color   shaded  more  to  the  right, indicating  winner  have  higher   first  serve  percentage  won . Next  I  would  like  to  be  interested  to  see the  variable  is  BPs  saved  and  BP  faced. BP faced  means  a  breaker  point  faced. For  if  you  serve  and  you  face  the  breaker  point, that  means  you  give  your  opponent  opportunity  to  break  you. 
You  better  not  t o  face  the  breakpoint. Instead  of  plotting  separately, my  son  suggested  me  to  convert  them  to  be breakpoint  converted, which  is  a  variable  defined  as the  difference  between   the B P  faced  and  BP  saved. Then  again,  we  can  see  the  blue  color shaded  more  towards  the  left, indicating  winner  face  less  breakpoint   and  save  more  breakpoints. The  pink  one  indicates  that  loser tend  to  face  more  breakpoint and  save  less  breakpoints. With  all  these  statistics   and  variable   I  have  shown  you, but  ultimately  I  would  like  to  know,   can  I  build  a  model? Can  I  predict  who  is  going  to  win and  how  many  they  can  win? I  build  a  summary  table   and  as  I  shown  you, all  these  ATP  data  come  with  matches. A  player  can  have  many  matches so  I  just  use  a  tabulate   to  do  the  summary  statistics. I  got  the  tournament  wins  for  each  player and  I  got  the  average  their  winning  match  ratio and  their  height   and  their  average,  their  seed. I  wanted  to  find  the  correlation  between the  variable  to  the  tournament  wins. Clearly  you  can  see the  match  winning  ratio  is  highly correlated  with  tournament  wins and  so  is  winner's  seeds. Also  I  defined  one  variable   I  call  the  div  rank, which  I  know  when  you  face  a  weak  player, opponent  or  strong  opponent, your  winning  odds  could  be  differently. I  do  the  subtraction, I  introduce  this  variable  into  the  model. You  also  notice  the  height, there  is  the  correlation  between  the  variable. I  just  happen  to  notice   when  you're  higher  or  you're  taller and  you  tend  to  have  a  better  ACE  rate and  you  have  better, like  the  first  one,  serve  one. Definitely  the  taller  player  has  advantage  at  serving. I  bring  all  these  model into  the  fit  model  platform. I  first  run  a  Least  Square  model and  I  get  the  conclusion  that the  winning  ratio  and  the  winner's  seed are  definitely  affect   how  many  tournament  you  can  win. I  also  think,  oh,  this  is  a  count  of  data. How  many  tournaments  you  will  win. Maybe  I  should  use [inaudible 00:19:36]   distribution and  I  run  and  I  actually  also  get the  similar  conclusion that  winning  ratio  and  winner  seed   is  very  important  variable. But  I  have  to  point  out, although  I  show  you  early  about the  tournament  level  plays  a  very  important  role  on  the  winning, but  because  the  data,  the  format  itself made  me  hard  to  put  it  into  the  model. I  need  a  lot  of  data  manipulation. Plus,  I  feel  like  instead  of  just  looking   at  the  one  year's  ATP  data, maybe  I  should  look  at  more in  order  to  build  a  complete or  good  predictor  model. I  will  keep  this  in  mind for  my  future  research. With  all  these  statistics  and  a  variable, I  show  you  so  far. That's  back  to  the  topic,  who  is  the GOAT ? I  actually  created  a  script and  I  wanted  to  get  the  data  in  the  past  25  years as  Federer  started  early. I  wanted  to  include  all  the  matches, all  of  them  have  played. I  would  like  to  find  out who  won  the   grand slam  title and  who  won  the  Indian  Wells. This  script  actually  is  able  to  go  to  the  Jeff  Sexel  web and  fetch  the  data  and  do  the  analysis  and  generate  the  report. 
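The script in the talk is a JMP script; as a rough Python analogue, the sketch below pulls a range of seasons from the same repository, filters to the four players, and tallies Grand Slam titles and career match records. The URL pattern, the year range, and the column and code values (tourney_level "G", round "F") are assumptions based on the repository's usual layout.

```python
import pandas as pd

players = ["Roger Federer", "Rafael Nadal", "Novak Djokovic", "Carlos Alcaraz"]
base = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{}.csv"

frames = [pd.read_csv(base.format(year)) for year in range(1998, 2024)]
matches = pd.concat(frames, ignore_index=True)

# Keep only matches involving one of the four players
mask = matches["winner_name"].isin(players) | matches["loser_name"].isin(players)
big4 = matches[mask]

# Grand Slam titles: tourney_level "G", final round, won by one of the four
slam_finals = big4[(big4["tourney_level"] == "G") & (big4["round"] == "F")]
print(slam_finals[slam_finals["winner_name"].isin(players)]
      .groupby("winner_name").size().rename("grand_slam_titles"))

# Career match wins and winning ratio for each of the four
wins = big4["winner_name"].value_counts().reindex(players).rename("wins")
losses = big4["loser_name"].value_counts().reindex(players).rename("losses")
summary = pd.concat([wins, losses], axis=1)
summary["win_ratio"] = summary["wins"] / (summary["wins"] + summary["losses"])
print(summary)
```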
You can see that in 2023, Alcaraz won both Indian Wells and Wimbledon, and Novak won the Australian Open and Roland Garros. As the list moves down, you pretty much see their names: Djokovic, Nadal, Federer, and so on. For almost the last 20 years, these three have been dominant. As I keep moving to the bottom, finally I see Andre Agassi and Pete Sampras, who were my favorite players in the '90s. You also see these three guys, Djokovic, Federer, and Nadal, sitting on top. This includes Grand Slam titles and Indian Wells titles. I truly believe these three guys moved modern tennis to a higher level. Now, let's look again at the match wins, winning ratio, tournaments, and Grand Slam titles. I would like to see more detail. The green bar here represents their match-winning ratio. But I like Graph Builder's features: it allowed me to put their number of winning matches on top. Then you can see that although their winning ratios are very close, all above 80%, Roger Federer won the most matches, 1,263. You move to the top and you will see those green bars showing how many tournament championships each of them has won. Again, Federer won the most. Then you look at the blue bars on top, and you will see that Djokovic won the most Grand Slam titles, 23. Next, I want to check on their rankings. These four lines not only show their rankings over the years but also show their incredible professional tennis careers. Federer started early, in 2001. It took him about three years to move to the top, but he stayed at the top for a long time, 18 years. Look at Nadal and Djokovic: they moved to the top very quickly, and they also stayed at the top for a long time. A dip here usually means either they had an injury or had surgery and needed to recover. I know Nadal is in a recovery period right now because he just had surgery, and Djokovic continues to play. I truly believe that those two lines will continue to grow for a while. Alcaraz has just started. We will see if he follows the same trajectory as the Big Three. I would like to show you more detail about the individual Grand Slam matches. Look at this plot on the left. This shows how many Grand Slam matches Federer has played in the past 25 years: a total of 434 Grand Slam matches. He won 373 matches and lost 61 matches. That brings his winning ratio to 86%. It's amazing. The right-hand plot actually plots his opponents' rankings. I want to show how difficult it is. Usually, when your opponent has a high ranking, that means the match is tough to win. The red dots all represent winning matches, the blue dots represent losing matches, and the squares indicate final matches. These are all Grand Slam matches. Look: most of Federer's opponents are highly ranked players, and only in a few matches, I guess he was lucky, was he able to play an opponent with a low rank. We can also look at his performance in each Grand Slam. As I click Wimbledon, you will see Federer won a lot at Wimbledon.
Then let me click the one for Roland Garros. In Federer's entire career, he only won Roland Garros once. That was in 2009. In the other years, he pretty much lost to Nadal. Let's see what happened in 2009. I bring up Nadal's record and look particularly at Roland Garros. You pretty much see all red squares; that means he was the champion of Roland Garros. He only lost four matches there, including this one in 2009, when he lost in the semifinal. That was the year, actually, that Federer was able to win the championship. I will skip Novak and Carlos and bring you the overview of all four guys' performance in all four Grand Slams. If I look at each one, for the Australian Open you pretty much see Novak Djokovic is dominant. Then if you look at Roland Garros, Nadal is dominant. For the US Open, they all have won the US Open; I guess the US Open provides an opportunity for all of them. If you look at Wimbledon, I think Federer and Djokovic both did pretty well at Wimbledon, but Federer still won more than Djokovic. Finally, I wanted to look at their Grand Slam winning ratios. This plot shows me that, yes, Djokovic won the most Grand Slam titles. Also, if you look at the winning ratio, overall Djokovic has the highest, or similar to Rafael Nadal. In almost every category, you can see Djokovic has a higher winning ratio, except on the clay, at Roland Garros, where Nadal is the best. I would say that just based on winning the most Grand Slam titles and having the highest match ratio, Djokovic is the GOAT. Next, we would like to find out who is the youngest among the four of them to win a Grand Slam title. That was Nadal. I think he was only 18.9 when he won his first Grand Slam title. Alcaraz won his US Open at age 19.3. Djokovic and Federer won their first titles in their 20s. But look at their long, amazing careers: even at age 36, both of them were still able to win a Grand Slam title. I think that Djokovic will continue to win. I think he will have more titles under his belt. I also looked at their head-to-head records, since they definitely played each other. I wanted to see their net wins against each other. If you look at Rafael Nadal against Roger Federer, Rafael won 24 and Roger won 17 against Rafael. That brings their net... Rafael has seven net wins against Roger. Novak Djokovic has five net wins over Federer and one net win over Nadal. Even based on net wins, I think Djokovic is the GOAT. I still would like to see their serve statistics, because in that ATP data, this is the data most related to their technique. I put all these variables into the Oneway platform and utilize Fit Group. With that, you can see there's a lot of data; the sample size is bigger. With all the data together, it seems that Djokovic has better serve statistics than the rest of them. But I realize this is a big sample size. Sometimes a large sample size can turn a small difference into a statistically significant difference. I would rather see a subset. I look at a smaller sample size, and I look at Wimbledon.
Yeah, and at Wimbledon, I can still draw the conclusion that Federer is a little bit better than the rest of them. But once I look at the other Grand Slams, like the Australian Open, I cannot draw the same conclusion. Overall, I think their technique is very, very similar. In terms of serving success rate, they have very similar statistics. With all the statistical variables I have shown you, according to the statistics of winning the most Grand Slam titles and having the highest match-winning ratio, Djokovic is the GOAT. However, statistics don't paint the entire picture, as a player can have a much larger impact than just statistics, such as the way they play the game, their love for the game, and especially whom the player inspires, such as the young kids who aspire to be just like their idols, including my son, whose dream is to play the Easter Bowl, a prestigious youth tournament at Indian Wells. In the end, it was just an honor and a privilege to watch these three great players play the game, play tennis, all at the same time, and the future looks bright for more great tennis to watch, as players such as Carlos Alcaraz and others look to follow in the Big Three's footsteps. I had so much fun doing this project using features such as Graph Builder, Dashboard, and Application Builder in JMP. These features allowed me to easily explore a big data set and quickly identify atypical observations. Dashboards not only can put different analyses in one report but also allowed me to stay in the report and rerun the analysis after modifications. Application Builder allowed me to present the project without having to use PowerPoint. Although this project mainly analyzed ATP men's tennis data, the analytical tools and the workflow can easily be applied to women's tennis data, as well as to any data set with patterns in other fields. If you have any questions, please feel free to contact me. Thank you.
You have a question to answer, so you collect the relevant data and are ready to start creating a predictive model. But which type of model do you choose, and which modeling type? Is the goal to segment, predict, explain, or identify? Are the variables continuous, nominal, or ordinal? Before we even get to choosing a type of model, we must define how the variables are used in the analysis through the modeling type. What happens if we treat number of bedrooms as continuous versus a nominal or ordinal modeling type? When we pull up a distribution, we see how the modeling type impacts the type of summary statistics we get, such as means or frequencies. This poster demonstrates how the modeling type determines the results of your predictive model depending on which type of model is chosen. Get ready to play “Name that Analysis” as we go head-to-head on classifying different case study modeling examples with their respective modeling types.     Hello, my name is Andrea Coombs, and I'm joined by my colleague, Olivia Lippincott. Olivia and I have given a presentation before called What Model When. If you want to take a closer look at that presentation, you can take a look at the link in the community post. But today, we want to talk about something a little bit more. Yeah, today we want to think about modeling type and how modeling type impacts the analysis for each of the four model goals that we talked about previously. Right, and we're actually going to use the same data. This is data that we pulled from Redfin that represents the housing market in the Cincinnati area. Here we're trying to look at the price of homes relative to their square footage, the number of beds, the number of baths, and so on and so forth. Previously, we've answered the question, what model when? It really depends on what model you're going to choose based on your goal for the analysis. For segment, we're trying to examine relationships where there's no intended response; explain, we're trying to explain a relationship and look at the underlying factors and how those affect the response; predict, we're trying to predict future outcomes or the response in new situations; and identify, we're trying to find important variables. Right. Now let's bring the modeling type into the picture. Both your responses and your factors can have different modeling types. In JMP, there are three main modeling types: continuous, nominal, and ordinal. The continuous modeling type is represented by this blue triangle icon here, and it refers to numeric data only. The nominal modeling type is represented by this red icon, and this is numeric or character data where values belong to categories, but the order is not important. The ordinal modeling type is represented by this green icon, and this can be either numeric or character data as well, but in this case, values belong to ordered categories. When you're doing an analysis in JMP, you want to make sure you set up the correct modeling type, because JMP will do the correct analysis for you depending on the modeling type. Andrea, I have a game for us to play. It's called Name That Analysis. Do you want to play? Absolutely. I love games. Awesome.
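(A quick aside before the first question: the bedrooms example from the abstract can be illustrated outside JMP as well. The pandas sketch below, with made-up numbers, shows the same idea: a continuous modeling type yields numeric summaries such as a mean, while a nominal or ordinal modeling type yields category frequencies.)

```python
# Same column, two modeling types, two different summaries.
import pandas as pd

homes = pd.DataFrame({"beds": [2, 3, 3, 4, 4, 4, 5],
                      "price": [250, 310, 295, 420, 610, 455, 700]})

# Continuous modeling type: numeric summaries such as the mean.
print(homes["beds"].describe())

# Nominal/ordinal modeling type: category frequencies instead of a mean.
print(homes["beds"].astype("category").value_counts().sort_index())
```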
Here's  your  first  question. We  want  to  identify which  features  of  a  home are  most  important to  determining  the  price. For  example,  square  footage and  number  of  bathrooms can  explain  a  large  amount of  the  variation  in  price, but  other  features  are  less  important. All  right,  Olivia. I  think  you're  making this  first  question  easy  for  me. Is  the  answer  identify? Let's  see. Yeah,  you're  right. I  did  make  that  one a  little  bit  easy  to  get  us  going, but  that  is  identify to  find  important  variables  within  there. There's  a  couple of  different  places  in  JMP where  we  can  use  tools  to  identify if  that's  our  modeling  goal. Under  the  Analyze  menu  under  Screening, Predictive  Modeling  and  Fit  Model, using  tools  like  Predictor  Screening, Bootstrap  Forest, Generalized  Regression and  Stepwise  Selection. For  modeling  type, when  we're  looking at  the  goal  of  identify, it's  not  going  to  affect  things  much. JMP  is  going  to  do  the  correct  analysis as  long  as  your  modeling  types are  set  appropriately. We  took  a  look  at  this  and  we  took both  the  response  and  the  factors and  changed  them from  continuous  to  nominal and  looked  at  how which  factors  came  up  as  most  important. While  the  order  of  the  factors  varied, the  dominant  factors  stayed  the  same. All  right. It  looks  like  if  our  goal is  to  identify  important  factors, really,  the  exact  modeling  type we're  using isn't  impacting  things  that  much, it  looks  like,  Olivia. Right. Our  conclusions on  which  variables  are  important aren't  going  to  change  much based  on  the  modeling  type. All right. Well,  that  is  good  to  know. I  have  a  question  for  you. Are  you  ready? I'm  ready. All  right,  here  is  your  question. Let's  say  we  want  to  build  a  model to  predict  house  prices. This  model  will  be  based  on  many  important predictor  variables  we  have  in  our  data. For  example,  we  want  to  predict the  price  of  a  house that  we  want  to  put  on  the  market. Which  goal  do  you  think we're  working  with  here? Okay,  so  it's  not  like  question  one where  we're  trying  to  see which  factors  are  most  important to  predict  housing  prices. We're  just  really  trying  to  get that  final  housing  price  prediction. I'm  going  to  go  with  predict. All  right,  let's  see  if  you're  right. Yes,  you  are  right. The  goal  of  this  analysis  is  predict. There's  lots  of  different  platforms  in  JMP where  you  can  build  models  for  prediction. Within  each  of  those  platforms  in  JMP where  you  can  build  the  prediction  models, JMP  will  do  the  correct  analysis  for  you, depending  on  the  modeling  type of  your  response. Here  we  have  a  table of  different  modeling  types for  our  responses: continuous,  nominal,  and  ordinal. For  a  continuous  response, this  is  the  typical  one that  we  were  talking  about,  right? We  want  to  predict  the  price  of  a  home that  we're  going  to  put  on  the  market. Now,  when  we're  building  this  type of  model  with  a  continuous  response, well,  we  want  to  know how  powerful  that  model  is. What's  the  predictive  power  of  that  model? We  can  use   RSquared and  the  Root  Average  Squared  Error to  diagnose  that  model. Now,  for  a  nominal  and  ordinal  model, it's  a  little  bit  different. 
For  a  model  with  a  nominal  response, we  have  categories  as  the  response. In  this  example,  we're  looking at  whether  or  not  the  price will  be  over  or  below  $1  million. That's  what  we  want  to  predict. For  the  ordinal  response, here  we  have  an  ordered  category. We  want  to  predict whether  the  price  of  the  house is  going  to  be  low, medium,  and  high. For  the  nominal  and  ordinal  examples, again,  we  can  look  at   RSquared and  Root  Average  Squared  Error to  evaluate  those  models. But  there's  other  things  that  we  can  use to  evaluate  those  models, like  the  misclassification  rate and  the  area  under  the  ROC  curve. Of  course,  our  favorite  tool  in  JMP to  take  a  look  at  our  prediction  model is  the  Prediction  Profiler. Let's  take  a  look  at  the  difference between  the  Prediction  Profiler for  the  modeling  types  of  our  responses. For  the  continuous  response, we  can  see  that  on  the  Y-axis, we  have  the  mean  prediction plus  or  minus  the  confidence  interval given  the  value  of  the  model  factors here  on  each  of  the  X -axes. For  the  nominal and  ordinal  logistic  models, what  we  see  on  the  Y -axis is  the  probability  of  the  response  being in  a  certain  category. For  the  nominal  logistic  model, we  have  the  probability that  the  house  is  either  going  to  be above  or  below  a  million  dollars. For  this  ordinal  logistic  model, we  can  see  the  probability  of  having a  low,  medium,  or  high  price. Okay,  so  it  sounds  like  the  goal of  what  we  want  to  predict is  also  important  when  we're  talking about  that  prediction  goal, whether  we  want to  treat  price  as  continuous and  get  the  predictions of  the  exact  prices  out  of  there, or  if  we  want  to  treat  it  as  a  category. Right. You  just  need  to  get that  response  variable  set  up and  your  data  set  the  correct  way, and  then,  of  course, assign  the  correct  modeling  type, and  JMP  is  going  to  build the  correct  model  for  you. All  right,  Andrea. Are you  ready  for  your  next  question? I'm  ready. Let's  go. Okay. We  want  to  quantify the  effect  on  home  prices from  additional  bedrooms. For  example,  on  average, every  additional  bedroom  adds about $ 97,000  to  the  total  home  cost. Adding  a  bedroom  adds  $97,000? Man,  Cincinnati  is  a  tough  housing  market. That's  crazy. All  right,  well, so  let's  see. What  are  we  trying  to  do  here? We're  trying  to  quantify  the  effect  here. I  think  what  we're  trying  to  do  is  explain that  effect  that  bedrooms  has on  the  price  of  a  house. I'm  going  to  say  explain. You're  correct. Yeah,  we're  trying  to  describe the  relationships. In  explain,  we  use  the  parameter  estimates taken  from  the  model  equation to  quantify  those  relationships between  the  factors  and  the  responses. Typically,  we  use  in  JMP under  the  Fit  Model  menu  location tools  like  Standard Least  Squares, Logistic  and  Ordinal  Regression, and  Generalized  Regression. Modeling  type  can  really  impact how  our  factored  relationship with  the  response  variable  is  interpreted. We  took  a  look, and  we  were  looking at  how  does  the  number  of  beds  affect the  housing  price? We  changed  beds  from  continuous, to  nominal,  to  ordinal, and  see  what  that  relationship  was. 
We can see that under continuous, that's where we get that every additional bedroom adds about $97,000 to the total home price. The prediction profiler shows a linear relationship when we treat beds as continuous. But when we treat beds as nominal or ordinal, there's not that straight linear relationship going on. We see a spike in price going from 4 to 5 bedrooms compared to going from 2 to 3 bedrooms. Right. I see that with nominal and ordinal, the prediction profiler looks almost exactly the same, so it must be the same model. However, I'm seeing that the parameter estimates look a little bit different between nominal and ordinal. What's going on there? Yeah, the nominal and ordinal modeling types are coded differently when we use them within a regression, so the parameter estimates are different. For nominal, we think of the intercept as the mean house price across all the different numbers of bedrooms, and each of those parameter estimates is how much that number of beds increases or decreases that mean house price. But for ordinal, because order matters, we think of the intercept as if there are zero bedrooms, and each of those parameter estimates is the effect of adding an additional bedroom onto the price. All right. Modeling type is really going to affect my parameter estimates. I really need to think about exactly what I want to explain as a part of this model when I'm doing this analysis. Yes. All right. Are you ready for the final question, Olivia? Yeah, bring it on. All right, here's the question. Let's say we want to identify groups of homes that are similar based on a list of possible characteristics. In other words, we want to identify market segments based on things like square footage, location, number of bedrooms, et cetera. Which goal do you think this is? I think you're trying to trick me with that identify, and I'm not going to fall for it. Okay. But there are no responses within this question. I think we're looking at clustering. I'm going to say segment. Okay. Well, you're right, Olivia. I did try and trick you a little bit because I really wanted to win. But you're right, that's the key thing here: there are no responses in this analysis. We are definitely looking at segment. When our goal is segment, we can use a couple of different clustering tools. We can do Hierarchical Clustering, K-Means Clustering, or Latent Class Analysis. It's important to keep in mind that with Hierarchical Clustering, you can only include... Sorry, you can include all of the modeling types: continuous, nominal, and ordinal. But for K-Means Clustering, you can only include variables that are continuous. For Latent Class Analysis, you can only include nominal or ordinal variables. In our case here, when we're looking at the number of bedrooms, lot size, year built, and square feet, we have a combination of continuous and nominal variables. Hierarchical Clustering may be the best clustering tool to use in this scenario.
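As a rough non-JMP illustration of that last point, the sketch below one-hot encodes the nominal columns, standardizes everything, and runs agglomerative (hierarchical) clustering. The file and column names are hypothetical placeholders for the Redfin-style table, and this is only an approximation of what JMP's Hierarchical Clustering does with mixed modeling types.

```python
# Hierarchical clustering on a mix of continuous and nominal home features.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

homes = pd.read_csv("cincinnati_homes.csv")  # hypothetical file name
X = pd.get_dummies(homes[["square_feet", "lot_size", "year_built", "beds", "location"]],
                   columns=["beds", "location"])   # one-hot the nominal columns
X_scaled = StandardScaler().fit_transform(X)

homes["segment"] = AgglomerativeClustering(n_clusters=6).fit_predict(X_scaled)
print(homes.groupby("segment")[["square_feet", "lot_size"]].mean())
```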
It looks like with that parallel plot from Hierarchical Clustering, maybe we could call Cluster 6 Amazing Location. Yes. If you think a large lot size is an amazing location, yeah, we can definitely call that segment Amazing Location Homes. Well, all right, Olivia, despite my trying to trick you with that last question, it looks like we ended up with a tie here again. We'll have to rematch again soon. Absolutely. We talked about what model when, and really, what model you choose depends on your goal for the analysis, whether it's segment, explain, predict, or identify. Yeah, and in terms of modeling type, again, JMP is going to do the correct analysis for you, especially with your responses. If you're setting them up with the correct modeling type, JMP is going to do the correct analysis for you. If your goal is explain, you might need to think a little bit about which modeling type to use, depending on how you want to explain the effect of something like the number of bedrooms. Thank you, Olivia. This is so much fun. Let's do it again next year.
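As a closing illustration of the explain discussion above, here is a hedged statsmodels sketch of the same comparison: beds treated as a continuous number (a single slope, on the order of the $97,000-per-bedroom estimate) versus beds treated as a category (one estimate per level). The file and column names are hypothetical, and this is ordinary least squares rather than JMP's Fit Model platform.

```python
# Numeric beds vs. categorical beds: same data, different parameter estimates.
import pandas as pd
import statsmodels.formula.api as smf

homes = pd.read_csv("cincinnati_homes.csv")  # hypothetical file and column names

# Continuous modeling type: a single slope for beds.
print(smf.ols("price ~ beds + square_feet", data=homes).fit().params)

# Nominal modeling type: beds coded as categories, one parameter per level.
print(smf.ols("price ~ C(beds) + square_feet", data=homes).fit().params)
```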
Cleaning large systems effectively is critical to many applications within industrial manufacturing. Establishment of a system's hold-up volume and the target concentration of residual contaminant after cleaning do not fully define a cleaning method, however. Assuming a cleaning method that repeats the same flush volume until the system has reached its target contaminant level, viable cleaning schemes will differ in how many flushes, and what flush volume, are used to reach that target. By creating a graph showing "Flush volume" as a function of "Number of flushes needed" with editable table variables of "System hold-up volume" and "End contaminant level" in slider boxes, the user can visualize many different scenarios in many different systems before selecting the scheme that best fits their varied needs.      Hello, my name is Margaret Wilson-Moses, and I work at Inpria Corporation as a Senior Product Engineer. Inpria Corporation is up in Corvallis, Oregon. Inpria designs and manufactures metal oxide photoresists to unlock the full potential of extreme ultraviolet, or EUV, photolithography. At its most basic level, Inpria is made up of chemists and engineers. The chemists at Inpria are the ones developing and designing the chemistries, testing them and refining them to behave how they want, and then taking those small formulations and giving the best designs to the engineers. The engineers scale up those designs to manufacture in our high-volume manufacturing facility, where we have tighter control over processing and higher purity, and that finished product goes to our customers. Both the chemists and the engineers at Inpria use JMP extensively. The chemists use JMP to design their experiments and to analyze the results of those experiments, often using hypothesis testing. They also use JMP to build models, so they're better able to predict how certain parameters will affect the response variables they're seeking. They also use JMP to present their findings within their teams and across Inpria. Engineers use JMP to work up our process data. We have sensors on our manufacturing facility that are collecting data every two seconds. We have a large amount of data, and we use automated scripts from JMP to pull out the pieces of that data that we need so we can analyze processing much more simply. We use control charts to look at process parameters for different batches and also to analyze QC metrics of our product over time for different formulations. Those QC metrics end up on our certificates of analysis. We also use measurement systems analysis, or MSA, to perform gage studies on the tools that we use to collect the quality control metrics. We also join data from various sources, as do the chemists, because different groups at Inpria are always collecting data, so it's good to have a way to join the data and look for correlations between those data sets. Finally, we also use JMP to present our data both internally and to our customers. A couple of years ago, when we were bringing up our high-volume manufacturing facility, I was tasked with the goal of developing a cleaning scheme for our manufacturing facility.
Inpria, at that time and in the same way today, cleans our manufacturing systems by dilution to ensure lot-to-lot segregation. To ensure that lot A does not end up in lot B, or that any carryover is limited to a very small amount. I've modeled that here. At the top left, we have our finished product. When we transfer that to bottles and we're done wasting, once we've taken out all we can take, we're left with a little bit of heel in our system. That heel cannot be drained or flushed out. It's just unreachable by the pumps. To that heel, we add a flush that doesn't contain any of the starting material, and we end up with a diluted heel. We waste as much as we can and end up with the same volume of heel, but at a lower concentration of X. We repeat this process, flushing and wasting, flushing and wasting, until we have the concentration that we need at the end. Further on, I'll model this by calling X₀ the concentration of X in the finished product, or whatever the heel starts at, which is going to match the final product. Then X₁ is the concentration of X after the first flush, X₂ is the concentration after the second flush, and so on up through Xₙ. The flush volume here, called F, is the same volume repeatedly throughout the process. My objective when approaching this problem was to develop a dynamic and user-friendly visual model to show my team how different variables in this problem affect the potential cleaning schemes we could use, with the overall goal of supporting selection of a scheme that would fulfill all our criteria. Flushing, or cleaning by dilution, can be modeled with this equation here (written out below), which I just pulled from the math that comes out of the previous slide, taken to n flushes. Xₙ, the amount of X remaining after n flushes, equals the expression on the right, where H is the volume of the heel, F is the volume per flush in the cleaning, n is the number of flushes, and X₀ is the initial amount of X. This equation does, importantly, assume that the flush and heel are perfectly mixed on every flush, but we do have experiential data that supports that this is indeed the case, thankfully. We are also assuming that the heel volume is consistent every time, so we're wasting down to the same exact heel, and that the flush volume is the same exact volume every single flush. But again, we have experiential data that supports that, so we're okay with these assumptions. Functionally, we are actually more interested in translating this equation to be a function of n, the number of flushes, and F, the volume per flush, because the other variables in this equation are pretty much set. The heel, or H, is constant per manufacturing system. It may vary from system to system, but it is known from the data we've collected in the past, which lets us calculate what it is. X₀ and Xₙ vary per cleaning, but again, we have analytical methods to understand both X₀ and Xₙ per cleaning. Those are our known variables. Flush volume and number of flushes are the two variables that we're varying to figure out a cleaning scheme. Those are the ones to be determined. We have a few boundary conditions also.
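Written out from the description above (equal flushes, perfect mixing), the dilution relationship is

$$X_n = X_0\left(\frac{H}{H+F}\right)^{n},$$

since each flush multiplies the contaminant concentration by $H/(H+F)$; here $H$ is the heel volume, $F$ the volume per flush, $n$ the number of flushes, and $X_0$ the starting concentration.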
The total amount of solvent used must be as low as possible. We didn't come into this with a hard limit, but more solvent used means higher cost, both in raw materials and in waste, so we want to limit it as much as possible. Another boundary condition is that we want to keep the number of flushes less than or equal to 20 in order to be operational. That's because we want to fit our cleaning within our production cadence. Fewer flushes are better operationally, but we have a hard limit at 20. Flush volume also has a boundary condition. It has to be less than or equal to the capacity of the system. In this case, I've set our capacity equal to 60 gallons. The maximum flush volume you could have is clearly 60. If you take the equation I showed on the previous slide and these boundary conditions, along with the fact that we know H, X₀, and Xₙ, you can actually solve this equation. You can find F and n to satisfy the boundary conditions. But that misses the goal that I had when I came into this process. To explain a little bit more about that, there are many reasons why going with a visual model in JMP was more advantageous for me when I approached this problem than just working with this static equation. First off, this equation is not intuitive. I'm pretty good at math, and I can't see what this equation means in graph form by just looking at it, whereas when you have a visual model, you can see how different variables affect each other. The equation is limited. You may have one H, one heel value, per manufacturing system, but you have different manufacturing systems. You may want to vary H and see what the different cleaning schemes are. With this equation, you get one answer per set of input variables. With a visual model, you have a very versatile relationship that you can see, and it covers infinitely many different cleaning schemes. The equation is also very individual. I crunch this math, I make the assumptions, I put the input values in, and then I tell my team how many flushes we should use and what the volume per flush should be. They can check my math, but they're not involved in that process at all. They just say, "Okay." Whereas the visual model that I built is user-friendly and dynamic, so they can tinker around with it themselves, see how the variables affect each other, and come to their own conclusions, which also leads to more buy-in from the team when we end up choosing a cleaning scheme. Lastly, an equation is boring. I like math, and it's still more fun for me to play around with the graph than it is to use an equation. The visual is much more interactive. I'm going to move from here to sharing my model. I started with just a simple data table with three table variables. I have X₀ here and Xₙ, both in ppm. I've set X₀ to be 1 million ppm, Xₙ is the target end goal for the cleaning, and heel is the heel of the system. The number of flushes here is just an independent variable, so it's just an integer value between 1 and 35.
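In code terms, the formula columns described next amount to something like the hedged Python sketch below: for each candidate number of flushes, solve the equation above for the flush volume that just reaches the target, round to one decimal place (matching the control-system resolution mentioned later), and tally the total solvent. This is a sketch of the arithmetic, not the JMP table itself; with the baseline settings it reproduces the 20-flush, 3.9-gallon-per-flush, 78-gallon scheme discussed below.

```python
def flush_volume(n_flushes, heel, x0, xn):
    """Volume per flush so that n equal, perfectly mixed flushes dilute x0 down to xn."""
    return round(heel * ((x0 / xn) ** (1.0 / n_flushes) - 1.0), 1)

def cleaning_schemes(heel=5.0, x0=1_000_000, xn=10, max_flushes=35):
    rows = []
    for n in range(1, max_flushes + 1):
        f = flush_volume(n, heel, x0, xn)
        rows.append({"flushes": n, "flush_volume_gal": f, "total_solvent_gal": round(n * f, 1)})
    return rows

for row in cleaning_schemes():
    if row["flushes"] in (10, 18, 20):
        print(row)  # 10 -> 10.8 gal/flush; 18 -> 81 gal total; 20 -> 3.9 gal/flush, 78 gal total
```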
I  did  go  beyond the  maximum  number  of  flushes just  to  get  a  better  view of  the  relationship. Then  the  flush  volume  column  here is  built  off  the  equation that  I  showed  previously. I  open  up  this  formula, you  can  see  that  this  is  the  same  equation I  shared  in  the  slide, and  it's  a  function of  the  table  variables. It's  a  function  of  these  table  variables here  that  are  brought  into  the  equation. I've  also  rounded  it  to  one  decimal  place because  our  control  system only  has  that  amount  of  sensitivity when  adding  flush  volume. Finally,  I  have  a  third  column  here which  is  the  total  solvent  used. This  is  simply  a  product of  the  first  two  columns. It  will  show  you how  much  solvent  you've  used per  cleaning  scheme. It  just  multiplies  the  number  of  flushes by  the  flush  volume to  give  the  total  amount  of  solvent  used. When  I  go  into  my  cleaning  model, it's  built  around  Graph  Builder. I'm  going  to  set  my  standard  here to  be  1  million  ppm  as  the  starting  value. Let's  say  our  target  is  10  ppm of  residual  X  at  the  end  of  cleaning and  our  heel  is  five  gallons. You  can  see  here  that  we  have  an  X on  the  graph  of  number  of  flushes and  then  there's  two  different  Y  values. There's  the  flush  volume  in  gallons and  the  total  solvent  used  in  gallons. I  also  have  a  reference  line  here  at  20 to  show  the  maximum  flushes that  are  operable. Each  of  these  points  reflected  here is  a  full  cleaning  scheme. If  I  select  this  point  right  here, this  is  saying  10  flushes. With  this  starting  value  of  X, this  goal  end  of  X  at  this  heel  value, and  you're  going  to  have the  flush  volume  of  10.8  gallons  per  flush in  that  cleaning  scheme. When  you  look  at  this  graph, all  of  the  red  points to  the  left  of  this  black  line are  valid  cleaning  schemes. Some  of  them  use  more  solvent, total  solvent,  and  some  of  them  use  less, but  they're  all  valid per  our  boundary  conditions. Then  I  have  at  the  right  here the  table  variables  in  slider  form, these  top  three  things, which  are  editable  either  by  a  slider or  by  the  number  edit  box  here that  are  connected  to  the  table  variable. Then  I   have  a  button  box  here which  will  take  all  this  information and  compute  it  and  give  you a  simple  answer  for  the  total  solvent  use. If  I  click  that  right  here, I  get  a  box  that  says, "Okay,  if  you  have  20  flushes, the  volume  per  flush  will  be  3.9 and  you're  going  to  use  a  total of  78  gallons  in  this  cleaning  flush." There's  a  few  things  we  can  get without  really  even  tinkering with  the  slider  variables  over  here with  this  visual  model that  we  couldn't  get  from  the  equation, like  I  said  before. First  off, you  do  see  this  negative  correlation between  number  of  flushes and  total  solvent  use. When  I  said that  you  could  solve  the  equation using  the  boundary  conditions and  our  known  table  variables, I  meant  that  if  you  are  trying  to  use the  least  amount  of  total  solvent and  you  have  a  maximum  of  20  flushes, then  your  best  scheme, assuming  you're  trying to  minimize  solvent  used, is  going  to  be  at  20  flushes. That's  due to  this  negative  correlation  here. 
The more flushes you use, the less total volume you use; it's a negative correlation. You also see with this visual that it's an exponential decay. While 20 flushes may use the least amount of total solvent under these boundary conditions, 18 flushes only uses 81 gallons, which is only three gallons more than you use at 20 flushes. This is where it becomes important to have this visual, because it might be the case that the time we save in operations by reducing flushes from 20 to 18 is actually worth the three gallons we lose in total solvent used. All right, now we can start looking at the sliders. With the sliders, you can see how these variables affect the possible cleaning schemes. The sliders here cover the range of what's reasonable for this model, maybe not everything that's possible. You're looking at how these red dots change in relation to the Y-axis. You're looking both at the steepness of this curve, because that shows you how important it is to maximize your flushes within these bounds, and at the overall level. You want to reduce total solvent, so the lower down your red dots are, the better off you're going to be. By manipulating these, I can show how much it matters to start at a lower or higher concentration. You can see that, yes, the red dots are moving a little bit, but it takes quite a while before they have an effect on the height of those red dots in the lower graph. With Xₙ, you can see how much the residual contaminant goal matters. If your goal allows a higher amount of residual contaminant, so you have more buffer to reach your goal, then you're going to need less solvent. Whereas if you have stricter criteria for residual contaminant, so a lower goal, you're going to use quite a bit of solvent. You can see here that you have fewer and fewer viable cleaning schemes when you have that lower residual goal. Let me just set it back to 10. Then with heel, you can tinker around with this and see how much the heel of a system affects how much solvent is needed and what cleaning schemes are possible. A greater heel means you have to use more and more solvent to achieve your goal criteria. A smaller heel means you need less and less solvent. Beyond just selecting a cleaning scheme, this visual model allowed our team to look for what we should focus on if we're trying to reduce the overall amount of solvent when we're cleaning. We did that by asking: if we had options to change our variables X₀, Xₙ, and heel, which one would give us the most bang for our buck? Which one could we change with reasonable effort on our part to get the biggest reduction in total solvent used per cleaning? Here, I'll say we had three options. I'll say that we had an option to reduce the starting value by 50%. Say we could arrange our batch schedule so that cleanings would only start at 50% of what they originally started at. We could change our goal from 10 ppm to 50 ppm, so multiply it by five.
That's because we had different sensitivity techniques or different restrictions for material following a certain batch. Then we could also reduce our heel by one gallon through engineering improvements to the actual manufacturing systems. Really quickly, using this graph, we'll see what happens to the overall volume of cleaning solvent needed when you change from this standard set of input variables while keeping the number of flushes at 20. Again, our standard level is 78 gallons. If you change from 100% of X in your system at the beginning to only 50%, so 500,000 ppm, and I'll set the variable values there, we're now only using 72 gallons to clean out. But that's only a reduction of six gallons per cleaning. That really didn't move the needle very much. Next, we'll look at changing our Xₙ to 50 and then setting our variable values again. That says that we will need 64 gallons of solvent to clean at the end. That's a better improvement in the total solvent needed. Finally, we'll look at reducing the heel in the system from five gallons to four gallons, and see that the total solvent needed is 62 gallons here. That's the greatest reduction in solvent needed to clean. With this information, and it was similar information that we used to actually pursue this option, we did look at ways to reduce the heel in our system to reduce the overall solvent needed, because we saw that, given the amount we could realistically change any of these variables, that variable affects the total solvent used the most. That is reflected here. These were the three options I just went over. You can see that with option A, the reduction in starting level, you are only gaining six gallons. With option B, the increase of the end level, you gain 14 extra gallons, and option C gains 16 extra gallons. But really looking at these values, again, it's important that a team is involved in this, because there's only a two-gallon difference between option B and option C. If option B were very easy compared to option C, we probably would have gone with option B. But a heel reduction of one gallon would be the most effective of these three options at reducing total volume for cleaning. I have some concluding remarks. Basically, I just want to circle back to what I brought up originally. It could be that mathematically minded people would look at the equation on the left and automatically know that changing the heel is going to have the greatest effect on the total solvent used. That's not instinctive to me, to know that just by looking at this equation. I know that for the people I work with, that's not instinctive to them either. In the case where I'm on a team trying to come to a conclusion with consensus from all of us, it's very helpful to have a visual model in JMP where people can manipulate the variables themselves and see how a quantity like total solvent used is changed by changing variables whose meaning we know. I can see many applications of using a visual model with slider variables, even in my company.
Any  multivariable  process where  the  relationship between  those  variables  is  known has  the  potential  to  make  use  of  a  model with  the  slider  variables to  give  a  visual  model  of  how  the  process reacts  to  changes  in  those  variables. I'm  thinking  of  there's  predictable  ways that  fluid  temperature  affects how  hard  a  pump  has  to  push  fluid to  reach  a  certain  flow  rate. That  could  be  helpful in  selecting  filters  for  a  given  process. I  believe  it  could  be  useful for  many  other  applications  as  well. Ease  of  generation. I  am  not  a  coder  and  I  was  able to  build  this  model  relatively  easily with  help  from  Jordan  Hiller, oh,  shout  out  Jordan  Hiller,  thank  you, also  from  the  scripting  index, which  had  many  great  examples, and  the  JMP  user  community. Because  I  was  able  to  look through  historical  questions and  piece  together  how  I  wanted to  go  about  a  certain  problem. It  was  relatively  simple to  make  this  model. Having  made  this  model, it  would  be  very  easy  to  manipulate in  the  future  for  other  purposes. Room  for  improvement. Again,  there  may  be  a  place  that  I  missed where  it  was  possible  to  do  this, but  in  my  exploration, I  couldn't  find  a  way to  make  a  reference  line  that  connected, basically  intersected  the  graph at  this  point and  moved  along with  the  graph  to  different  points. A  reference  line  that  is  a  function. I  wasn't  able  to  do  that. I  know  I  saw  different  examples  of  it, but  I  wasn't  able  to  do  it with  using  table  variables as  part  of  that  function. But  again,  that's  probably due  to  my  not  being  a  coder, but  it  would  be  helpful if  that  were  a  simpler, more  accessible  feature in  modeling  in  JMP. But  really, I  do  think  that  the  model  I  created, it  was  very  helpful  for  our  purposes, and  it  will  be  helpful for  other  applications  in  the  future. I  hope  it's  helpful  to  other  people. Thank  you.
There are many times when an exact measurement is not possible, but a range of values for the measurement is available. Censored data analysis methods enable you to incorporate the information from both types of measurements. This presentation provides an introduction to censored data situations: when they appear, how to handle them, and what happens when you do not handle them appropriately. This presentation includes examples of censoring in univariate and regression settings by using the Life Distribution and Generalized Regression platforms in JMP and JMP Pro 17, including the new Detection Limits column property.     Hi,  my  name  is  Michael  Crotty.   I'm  a  statistical  writer with  the  Stat  Documentation  Team  at  JMP, and  today, I'm  going  to  talk  about  an  introduction to  censored  data  analysis in  JMP  and  JMP  Pro. To  start,  we've  got three  common  types  of  censoring. Just  to  back  up  a  bit,  censored  data  occur when  you  don't  have  an  exact  measurement for  an  observation, but  you  do  know a  range  for  the  observation, so  you  know  not  the  exact  value, but  you  do  know  something about  where  the  value  might  be. What  we  want  to  do by  using  censoring  in  our  analyzes is  to  use  that  information  that  we  have, even  if  it's  not  exact. The  three  types  of  censoring that  we'll  talk  about  today are  right  censoring,  left  censoring, and  interval  censoring. Right  censoring  is  probably the  most  common  form  of  censoring. It  occurs  when  the  event  of  interest just  doesn't  have  time  to  occur  yet by  the  end  of  the  study. In  a  reliability  test, you  might  have a  bunch  of  light  bulbs   under test and  at  the  end  of  the  test  period, some  of  them  have  failed. Those  are  exact  observations, but  then  some  haven't  failed  yet. You  know  they're  going  to  fail, but  your  study  has  ended, so  it's  censored  at  that  point. Same  thing  in  survival  models where  a  patient  survives to  the  end  of  the  study. One  thing  to  note  is  that  right  censoring is  the  only  type  that  in  JMP, supports  a  single  response  column alongside  of  a  binary  censor  column. The  next  type  is  left  censoring. That's  where  the  event  of  interest  occurs before  the  observation  starts. A  common  example  of  that  would  be where  you  put  a  bunch  of  units  under  test and  at  the  time that  you  do  the  first  inspection, some  of  them  have  already  failed. You  know  that they  started  without  a  failure, but  by  the  time  you  measured  them, you  checked  on  them,  they  had  failed. So  they  failed  sometime  before  that  point. Another  example  of  that is  limited  detection where  you  have  a  measurement  tool that  can't  measure below  a  certain  threshold. The  last  type  we'll  talk  about  today is  interval  censoring. This  is  where  your  event  of  interest happens  between  observation  time. If  you  have  a  periodic  inspection  schedule instead  of  continuous  observation, you  might  see  that  something  fails or  something  happens between  time  two  and  three. It  didn't  happen  at  time  two and  it  didn't  happen  at  time  three, but  it  was  somewhere  in  that  interval. Take  a  quick  look at  what  this  looks  like  in  JMP. Here's  an  example  of  the  right  censoring with  a  response  column and  a  censor  column. 
In  the  platforms  that  support  censoring, you  always  see  this  censor  role, that's  for  that  binary  censoring  column. This  is  the  way  that  you  can  do, you  can  specify  censoring  more  generally, which  is  with  two  response  columns. Basically, it's  like  a  start  time  and  an  end  time. For  left  censoring, we  don't  know  when  it  happened, so  the  start  time  is  missing, but  the  end  time, we  know  it  happened  before  time  50, so  somewhere  before  that. Reversed  that  for  right  censoring, we  know  that  at  time  25, it  hadn't  happened  yet, but  it  happened  sometime  after  that. Then  with  interval, both  the  start  and  endpoints are  non-missing, but  we  don't  know  when  the  event  happened in  this  case  between  80  and  150. It's  not  shown  in  the  table  up  here, but  down  here,  we've  got  somewhere there's  exact  censoring. To  specify  that, you  just  use  the  same  value in  both  columns. That  means  essentially it's  like  an  interval  with  zero  width. It  happened  at  that  exact  time. Next,  we're  going  to  talk about  two  examples  of  censoring. The  first  is if  you  have  censoring  in  your  data, but  maybe  you  don't  know  how  to  handle  it, and  so  you  just  think, "I'll  just  ignore  it." We're  going  to  look  at  what  can possibly  happen  when  you  do  that. In  this  example, we've  got  simulated  data from  a  lognormal  distribution and  the  observed  data that  we'll  use  for  analysis in  our  different  cases is  where  all  the  values  from  the  true  data that  are  over  1,900,  we  set  them  to  1,900, as that's  the  censoring  time for  it's  right  censoring. There  are  a  few  possible  things you  could  do if  you're  trying  to  estimate this  mean  failure  time. You  could  do  nothing. You  could  just  use  this  observed  data with  a  whole  bunch  of  values  set  to  1,900, act  like  that's  when  it  happened. You  could  treat  those  as  missing  values, just  drop  them  from  your  data, or  you  could  use  the  censoring  information that  you  have  in  your  analysis. For  right  censoring, these  first  two  approaches are  going  to  tend  to  underestimate the  mean  failure  time because  you're  dropping  information from  the  data  at  that  far  end. Looking  more  closely  at  this, because  this  is  simulated  data, we  have  the  true  distribution here  in  this  first  column. That's  just  for  comparison, but  in  general,  you  wouldn't  have  that because  you'd  have that  all  values  above  1,900. You  don't  know  where  these  fall. In  our  observed  Y, this  is  where  we  just  use all  the  1,900s  as  values  of  1,900. We  have  no  missing  values, but  a  big  point  mass at  the  top  of  our  distribution  here. You  can  see  that  the  mean is  a  lot  smaller  than  the  true  mean. In  this  missing  Y  column,  this  is where  instead  of  treating  them  as  1,900, we  drop  them. We  set  them  to  missing and  analyze  the  distribution  without  them. Here  you  can  see  that now  our  maximum  of  the  non-missing  values is  less  than  1,900, which  really  doesn't  make  any  sense because  we  know  that  a  bunch  of  them, 21  observations,  in  fact, are  some  value  greater  than  1,900. So this  underestimates  the  mean  even  more. Then  on  the  right  here, we've  got  an  analysis  in  life  distribution in  JMP. This  is  where  we're  using the  observed  Y  column. 
It's  got  those  1,900s, but  we're  also  using  a  censoring  column alongside  it. For  the  rows  where  observed  Y  is  1,900, our  censor  column  is  going  to  say that  it's  a  censored  observation. Here  we  can  see  that  our  mean, it  actually  ends  up being  a  little  higher  than  the  true  mean, but  our  lognormal  parameter  estimates are  much  closer  to  the  true  values and  we're  incorporating all  the  information  that  we  have. For  our  next  example, we're  going  to  look  at  detection  limits. This  is  a  limit  of  detection  problem where  we  have  data on  the  yield  of  a  pesticide called  Metacrate that's  based  on  levels of  some  other  regression  variables. In  this  situation, the  measurement  system  that  we  have has  a  lower  limit  of  detection where  it  can't  measure any  yields  that  are  less  than  1 %. So  in  the  data, they're  just  coded  as  zeros, but  it  really  just  means it's  some  yield  below  1 %. There  are  two  ways you  could  analyze  this incorporating  that  information  in  JMP. The  first, you  could  treat  it  as  left  censoring, use  two  response  columns  with  the  first the  left  column  has  a  missing  value, and  the  right  column  would  be  a  one, or  you  can  use the  detection  limits  column  property that's  new  in  JMP  and  JMP  Pro. We'll  take  a  look  at  this. Here's  a  subset  of  the  data. This  Metacrate  Reading  column  is the  same  as  the  original  reading  column, but  it's  got a  detection  limits  column  property. Because  this  is  a  lower  detection  limit where  we  can't  measure any  lower  than  that  limit, we're  going  to  set the  lower  detection  limit  to  one. The  other  way  you  could  do  this is  with  the  two  columns. In  this  case, we  know  that  it's  left  censoring, so  the  left  side  is  missing and  the  upper  side  of  that  is  one, just  means  that  the  value is  somewhere  less  than  one. That's  all  we  know. But  as  you  can  see  from the  column  information  window  down  here, the  detection  limits  column  property is  recognized  by  the  distribution and  generalized  regression  platform. So  this  is  a  regression  problem. We'll  use  generalized  regression in  JMP  Pro. Here  we  fit   a  lognormal  response  distribution, and  it's  able  to  do  that on  this  Metacrate  reading  column, even  with  the  zeros  in  there, because   GenReg's  not  treating those  observations  as  zeros, it's  treating  them as  values  censored  at  one. Now,  we  were  able  to  use all  the  information and  get  a  regression  model. In  conclusion,  probably, the  most  important  thing  is when  you  have  censoring  information, it's  better  to  use  it  in  your  analysis than  to  ignore  it. Censoring  can  occur  a  lot  of  times for  time  responses, but  it  can  also  occur  for  other  responses. A  good  example  of  that is  these  limited  detection  problems. Finally,  you  can  use the  following  approaches to  specify  censoring  in  JMP. There's  the  two-column  approach that's  probably  the  most  flexible because  that  allows  you to  do  right  censoring,  left  censoring, interval  censoring, as  well  as  a  mix  of  all  three  of  those. For  right  censoring, you  can  use  the  one  column  response paired  with  a  binary  indicator  column for  censoring. 
There's also this new column property in JMP for detection limits where you can set a limit of detection either on the low side or the high side. We've got a few references here if you're interested in more information. One of those is a Discovery talk I did in 2017 that's got more of the background of how the censoring information is used in the calculations of these analyses. That's it. Thank you.
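As a small aside on the first example above, here is a hedged sketch, in Python rather than JMP, of what "using the censoring information" means mechanically: a lognormal fit by maximum likelihood where values recorded at 1,900 contribute a survival-probability term instead of being treated as exact failure times. The data are freshly simulated stand-ins, so the numbers are only illustrative.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
true_mu, true_sigma = 7.0, 0.5                 # lognormal parameters on the log scale
y_true = rng.lognormal(true_mu, true_sigma, size=100)
limit = 1900.0
censored = y_true > limit                      # right-censored rows
y_obs = np.minimum(y_true, limit)              # observed column: 1,900 stands in for ">1,900"

def neg_loglik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                  # keeps sigma positive during optimization
    exact = np.log(y_obs[~censored])
    ll_exact = stats.norm.logpdf(exact, mu, sigma) - exact    # lognormal log-density
    ll_cens = stats.norm.logsf(np.log(limit), mu, sigma)      # log P(Y > 1,900)
    return -(ll_exact.sum() + censored.sum() * ll_cens)

fit = optimize.minimize(neg_loglik, x0=[np.log(y_obs).mean(), 0.0])
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])

print("treat 1,900 as exact, mean:      ", round(y_obs.mean(), 1))
print("drop censored rows, mean:        ", round(y_obs[~censored].mean(), 1))
print("censoring-aware estimate of mean:", round(np.exp(mu_hat + sigma_hat**2 / 2), 1))
print("true mean:                       ", round(np.exp(true_mu + true_sigma**2 / 2), 1))
```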
This presentation is an extension of the talk, "Measurement Systems Analysis for Curve Data Using Functional Random Effects Models," presented at JMP Discovery Europe 2023. Here, a functional random effects model was used to perform a Functional Gauge R&R analysis on data that contained a set of curves as the response. In this application, the functional model was expanded using the eigenfunctions and then was expressed as a random effects model, where variance components were estimated using standard methods. This analysis was done using the Functional Data Explorer and Fit Mixed platforms.   In the updated version of this presentation, I show that it is possible to include fixed effects in this type of analysis using the same model expansion approach. The functional model is still expanded using the eigenfunctions but is expressed as a generalized mixed model instead.       Hi,  my  name  is  Colleen  McKendry, and  I  am a  senior  statistical  writer  at  JMP, but  I  also  like  to  play around  with  functional  data. This  project  is  on  measurement systems  analysis  for  curve  data. First,  I'm  just  going  to  give  a  very  brief background  on  MSA  studies  in  general. MSA  studies  determine  how  well  a  process can  be  measured  prior to  studying  the  process  itself. It  answers  the  question, how  much  measurement  variation  is contributing  to  the  overall process  variation. Specifically,  the  Gage  R&R  method,  which is  what  I'll  be  using  in  my  analysis, determines  how  much  variation  is  due to  operation  variation versus  measurement  variation. You  can  use  a  Gage  R&R  crossed  MSA  model when  you  have  both  a  part  and  an  operator. The  model  you  can  see  here  for  your measurement  Y  sub  I J K, that's  going  to  be  the  Kth  measurement made  by  the  Jth  operator  on  the  Ith  part. In  this  model,  you  have  a  mean  term, a  random  effect  that  corresponds to  the  part, a  random  effect  that  corresponds to  the  operator, and  a  random  effect  that  corresponds to  the  interaction  or  cross  term. You  also  have  an  error  term. This  is  simply  a  random  effects  model, and  all  of  these  random  effects  are normally  distributed  random  variables with  mean  zero  and  some corresponding  variance  component. When  you  fit  this  model, you  can  use  that  to  estimate the  variance  components and  then  use  those  variance component  estimates to  calculate  the  percentage gage  R&R  using  the  formula  shown  there. In  a  standard  MSA  study, all  of  your  measurements  are going  to  be  single  points. But  what  happens  if  that's  not  the  case? What  if  instead  you're  measuring something  like  a  curve? That  question  was  the  motivation behind  this  project. There  was  a  client  of  JMP  that  was a  supplier  of  automotive  parts, and  they  had  a  customer  that  specified that  a  part  needed  to  have a  specific  force  by  distance  curve. Obviously,  the  client  wanted  to  design their  product  to  match the  customer  specified  curve. In  order  to  do  that, they  wanted  to  run a  functional  response  DOE  analysis and  JMP  to  design their  product  in  order  to  do  so. However,  before  spending  money on  that  experiment, they  wanted  to  perform  an  MSA  on  their ability  to  measure  the  parts  force. 
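Before turning to the functional case, here is a hedged numerical sketch of the standard crossed Gauge R&R arithmetic just described, using the classical expected-mean-square estimators on simulated, balanced data. JMP's MSA platforms (or a mixed model fit by REML) do this estimation for you; the code is only meant to make the variance components and the %GRR formula concrete.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
parts, ops, reps = 10, 5, 5
p_eff = rng.normal(0, 2.0, parts)                      # part-to-part variation
o_eff = rng.normal(0, 0.5, ops)                        # operator (reproducibility)
po_eff = rng.normal(0, 0.3, (parts, ops))              # part*operator interaction
rows = [(i, j, 10 + p_eff[i] + o_eff[j] + po_eff[i, j] + rng.normal(0, 0.4))
        for i in range(parts) for j in range(ops) for _ in range(reps)]
df = pd.DataFrame(rows, columns=["part", "operator", "y"])

grand = df["y"].mean()
ybar_p = df.groupby("part")["y"].mean()
ybar_o = df.groupby("operator")["y"].mean()
ybar_po = df.groupby(["part", "operator"])["y"].mean()

ms_p = reps * ops * ((ybar_p - grand) ** 2).sum() / (parts - 1)
ms_o = reps * parts * ((ybar_o - grand) ** 2).sum() / (ops - 1)
p_idx = ybar_po.index.get_level_values("part")
o_idx = ybar_po.index.get_level_values("operator")
inter = ybar_po.values - ybar_p.loc[p_idx].values - ybar_o.loc[o_idx].values + grand
ms_po = reps * (inter ** 2).sum() / ((parts - 1) * (ops - 1))
cell_mean = df.groupby(["part", "operator"])["y"].transform("mean")
ms_e = ((df["y"] - cell_mean) ** 2).sum() / (parts * ops * (reps - 1))

var_e = ms_e                                           # repeatability
var_po = max((ms_po - ms_e) / reps, 0.0)
var_o = max((ms_o - ms_po) / (parts * reps), 0.0)      # reproducibility
var_p = max((ms_p - ms_po) / (ops * reps), 0.0)

pct_grr = 100 * np.sqrt((var_o + var_po + var_e) / (var_p + var_o + var_po + var_e))
print(f"% Gauge R&R = {pct_grr:.1f}")
```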
There are a lot more details about the actual data and this problem specifically in an earlier 2020 white paper titled Measurement Systems Analysis for Curve Data. If you want any more details, look that up. It should be on the community. In this graph, you can see what the data look like. On the Y-axis, we have force, and on the X-axis, we have distance. It looks like there are only 10 curves in this graph, but there are actually 250 total curves. There's just some clustering going on. There are 10 different parts, five different operators, and five replications per part-operator combination. A little bit about this data: obviously, these measurements are curves and not points. The data was collected evenly spaced in time, but not evenly spaced in distance. There were some earlier projects that tried a few different ways to perform some type of MSA study on this data. They used some functional components, but stayed pretty true to a standard MSA. When I looked at this data, I wanted to take a true functional approach because I have a background in functional data. Functional data analysis is useful for data that are in the form of functions or curves. There are many techniques to handle unequally spaced data, a lot of which are available in the Functional Data Explorer platform in JMP. My goal was to combine functional data methods with traditional MSA methods to perform some type of functional measurement systems analysis. My solution was to create a functional random effects model by expanding the functional model using eigenfunction expansion, rewriting that as a random effects model, or a mixed model if you had any fixed effects, and then estimating the variance components associated with the part and operator terms. To go a little bit into the model notation: for your functional model, you have Y sub IJK, but this time at a particular distance, D, to account for the functional nature of the data. You're going to have a functional mean term, a functional random effect that corresponds to the part, a functional random effect that corresponds to the operator, and a functional random effect that corresponds to the cross term, and also your error term. Here, when you do the model expansion, it's a little mathy, but essentially, instead of having one variance component associated with the part and one variance component associated with the operator, you now have multiple variance components associated with each of those things. That's going to account for the functional nature. When you're fitting the model and estimating the variance components, like I said, now you're going to have this set of variance components that you can sum together to estimate the functional variance component for part, and the same thing for operator and the cross term. Once you have all those individual variance components, you can use those to estimate the % gage R&R just like in a standard MSA. How do I do this in JMP? It's a multistep process that's outlined here, and there are some more details in other slides.
But essentially, I estimate the mean curve in FDE and obtain the residual curves. I then model the residual curves in FDE to obtain the eigenfunctions needed for the eigenfunction expansion of the functional model and save those eigenfunctions to the original data table. I'm going to use those saved eigenfunctions in Fit Mixed to create a random effects model, or a mixed model if you also have fixed effects in your data. I'm going to use nesting of the eigenfunction formula columns and also the part and operator variables to define the appropriate model specifications. This is what your Fit Model window would look like. Once I did all that for this data, I was able to estimate the variance components and calculate the % gage R&R, which in this case was 3.3030. This indicated an acceptable measurement system according to some ranges that were defined in this paper by Barrentine. That was it for the data analysis for my part. This result was actually very similar to a worst-case scenario that was obtained in a presentation in 2019. It would be interesting to know if that was a coincidence or if the results would be similar for different data as well. Some thoughts that this project provoked: should we add a functional random effect for ID to capture the within-function correlation across distance? This type of functional random effect is actually really important in functional data and is a big benefit of accounting for the functional nature of the data. Unfortunately, in this data in particular, anytime I created a model with this term, the corresponding variance components were zero, so it didn't really capture anything extra, but it would be interesting to see if it could be useful in different types of data. I also think it would be interesting if we could calculate a confidence interval for the % gage R&R. There were also some minor, not issues, but questions brought up by the residuals in the random effects model. I observed a cyclical nature in those. That's not always great. I don't think it was a huge deal, but I would like to have a good reason for why that was the case. That's it. Thanks for listening. If you want more details on this project, it's very similar to a full 30-minute talk that I presented at Discovery Europe, and so that video is on the community as well. Thank you.
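As a rough open-source illustration of the workflow just described (assuming the curves have already been registered on a common distance grid), the sketch below extracts eigenfunctions and scores from simulated residual curves via an SVD, which is one way to carry out the eigenfunction expansion; each score column would then go into a crossed part/operator random-effects fit. The variance-component numbers at the end are invented purely to show the final summation and %GRR arithmetic; FDE and Fit Mixed do the real work in JMP.

```python
import numpy as np

rng = np.random.default_rng(5)
n_curves, n_grid = 250, 100                     # 10 parts x 5 operators x 5 reps
grid = np.linspace(0, 1, n_grid)
basis = np.vstack([np.sin(np.pi * grid), np.cos(np.pi * grid), np.sin(2 * np.pi * grid)])
curves = rng.normal(0, 1, (n_curves, 3)) @ basis + rng.normal(0, 0.1, (n_curves, n_grid))

residuals = curves - curves.mean(axis=0)         # mean curve removed
U, s, Vt = np.linalg.svd(residuals, full_matrices=False)
n_fpc = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), 0.99)) + 1
eigenfunctions = Vt[:n_fpc]                      # one row per eigenfunction on the grid
scores = residuals @ eigenfunctions.T            # one score column per eigenfunction

# Each score column would now get a crossed part/operator random-effects fit,
# as in an ordinary Gauge R&R; summing those components across columns gives
# the functional variance components. Hypothetical sums are used below just
# to show the final %GRR arithmetic.
vc_part, vc_operator, vc_part_op, vc_error = 5.2, 0.0015, 0.002, 0.002
pct_grr = 100 * np.sqrt((vc_operator + vc_part_op + vc_error)
                        / (vc_part + vc_operator + vc_part_op + vc_error))
print("eigenfunctions kept:", n_fpc, " functional %GRR:", round(pct_grr, 2))
```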
Scientists in biopharma working along the CMC development pathway are challenged by the increasing complexity of biologic-based therapeutics and the processes that produce them. Many critical responses exist (often 10-20) that are themselves functions of numerous, highly interactive input process factors.

We use a large case study to show that current experimental design strategies combined with current response surface methods are generally inadequate to deal with the complex kinetic behaviors observed to be ever-changing across the design space. The case study consists of a 7-factor, hybrid experimental design used to develop a bioprocess with 13 critical responses. Employing a combination of SVEM, neural network models, and the hybrid experimental design, we show that accurate predictive models can be estimated for all responses that interpolate satisfactorily throughout the design space.

Furthermore, we show that the powerful tools in JMP and the Prediction Profiler, specifically the Design Space Profiler, are essential to visualizing and understanding the experimental design space and optimizing the bioprocess.

We also discuss the need for new optimal design strategies. JMP Pro 17 is used throughout the talk.

I want to thank the JMP steering committee and the JMP organizers for inviting Phil and myself to come and present our exciting talk on CMC, SVEM, DOE, and Complexity: It's All About Prediction. I want to start by thanking Dr. Tiffany Rao; she's been involved with the planning and numerous conversations for the work that we're going to present today. I'm going to do an overview, tell you who Lundbeck is, who I work for, and then provide the background for the DOE that we're going to talk about, which is process development for a biologic drug. Our case study, and what I've started to do for development, is to start with a first step of doing a DSD for mid-late stage development, then follow that with a second step of augmenting with a space-filling design. Then we are hoping to prove to you today that, for analysis, SVEM allows us to have better prediction for all of this work and allows us to have better timelines for the work that we're doing. Lundbeck is located… We're headquartered in Copenhagen, we're over 6,000 employees in over 50 countries, and we are striving to be the number one in brain health. The part of the company that I work with is CMC biologics, and we're basically located in the Copenhagen area and in the Seattle area, where I'm located. Let's talk about the background for the DOE that we're going to present today. For the process that we want to develop for drug substance, for these biologics, we start with a vial of cells, we take those out of the freezer, we then expand in shake flasks, go bigger into culture bags, maybe a seed bioreactor, then to a production bioreactor. That production bioreactor goes approximately two weeks. We have complex nutrient feeds, we have pH control, temperature control, there's the base that we're adding. Once we finish that 14-day production, we need to figure out a way to deal with the cells that are secreting our molecule into the supernatant.
How do we separate the cells from the product? That harvest can be a centrifuge, it can be depth filtration. Then we pass it on to our downstream colleagues. They first usually do a capture step where they're getting rid of most of the host cell proteins, the host cell DNA. But then we need to do two polishing steps where we're then saying, "Okay, what are the product-related impurities? Maybe there's not the full molecule there, so we have to get rid of those." Then finally, we have to make sure, through ultrafiltration and diafiltration, that we can transfer into the buffer that it's going to be in when it is transferred for the patient's use, and that it's also at the right concentration. You can imagine, at every step along this way, there are many factors, there are many knobs that we can turn to control this process, make sure that it's robust and we're making the same product every time. When we're focused on treating the patient, we also want to focus on the business. We can't put all of our development resources into every molecule. We want to right-size the research that we're doing at the right stage of the product. There are many things that could kill a product, but if we can develop this in the right time and the right space using these tools from JMP, we can shift this development timeline to the left and we can also reduce the amount of resources and the cost to the company. If we're first getting a molecule, that's when you're going to start looking at your categorical factors. We might be doing the cell line screening. We want to make sure that we have the right cell line that's going to last all the way through commercialization. For the downstream group, they may be looking at resins; for both upstream and downstream, looking at media and buffer components and the formulations of those. That's when you're making sure that you have the right thing that's going to keep you going through your development pathway. But then once you're in the clinic, now you want to really start to gain understanding of the process parameters. Our strategy is to start with a Definitive Screening Design, and we want to be bold in our level settings at this stage; I'll talk a little bit more about that later, for the late stage development. Then we can build on what we learned from the Definitive Screening Designs by augmenting those designs with space-filling or other designs so that we really understand that design space. What's different that we're hoping to show now, compared with traditional walks through this pathway, is that in the past, we've been throwing out the factors that we've said aren't important. But with modern designs and modern ways of doing analysis, we can keep all of the factors and all of the work that we've done so far and gain better understanding of the whole process, especially with biologics that are quite complex.
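As a hedged sketch of the "augment with a space-filling design" step, here is one way to generate interior points with an open-source stand-in, a Latin hypercube from SciPy. JMP's Augment Design and space-filling platforms are what is actually used in this work; the factor ranges and the 0.8 shrink factor below are invented for illustration.

```python
import numpy as np
from scipy.stats import qmc

# hypothetical low/high settings for seven process factors
factor_lows = np.array([6.6, 30.0, 0.5, 2.0, 33.0, 5.0, 0.1])
factor_highs = np.array([7.4, 40.0, 2.0, 8.0, 37.0, 15.0, 0.5])

sampler = qmc.LatinHypercube(d=7, seed=42)
interior = qmc.scale(sampler.random(n=16), factor_lows, factor_highs)

# shrink slightly toward the center so the new runs sit in the interior of
# the region already spanned by the screening design, not on its boundaries
center = (factor_lows + factor_highs) / 2
interior = center + 0.8 * (interior - center)
print(interior.round(2))
```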
Before I pass the baton to Phil, I just wanted to talk about one more thing… Let's see if I can… I'm going to minimize this screen just for a minute so I can show you this. This is an experiment that I did to prove the power of DOE to my boss. The full data set was an OFAT for pH, and the response was titer. We wanted to do very many different levels over a wide range because he wasn't sure at the time that we were going to be able to pick what the optimized level was. But what I wanted to show him was that, "Okay, we did this experiment, we have all of this data. We were able to model where the optimized condition was," and that's shown in blue, and that turned out to be the correct case. When we tested the model, that was the optimized condition. Let's pretend now that we're starting over and we don't know that data. If we had picked a conservative range setting for our experiment, our noise-to-signal ratio would be quite high, and so we would have missed finding the optimized spot. But if we had picked a wider range in our settings, still with only three points, the model still would have found the optimized spot. What I'm going to challenge the subject matter experts with, when you're designing your DSDs, is to really be bold in your range setting. You will still find the optimized spot, and you have to have some knowledge of your process so that you can complete the design of experiments and have all of the runs at least have enough signal that you can measure and then subsequently model. Once you learn more about your design space from your Definitive Screening Designs, you can come back and then you can be internal to that space. That's when you augment with a space-filling design. Now I'm going to pass the baton to Phil, and he's going to take you through the analysis.

Okay, thank you. Thank you, Patty. We're going to talk about a very nice and somewhat complicated experiment that Patty and her team ran. They do a lot of great work, and they're big advocates of DOE and JMP, and I'm very happy they let me get to play with them sometimes. It's fascinating work. But before I get into the actual analysis, I wanted to talk about a few relevant concepts that members of the audience may or may not be familiar with, and that includes complexity. It's a really hot topic out there. I'll also talk about what prediction actually is. That is a muddled concept to many people. Then from there, I'll launch into talking about how we analyze prediction and how we did with Patty's experiment. Complexity: a fellow named Daniel Finelli from the London School of Economics has written much about this, and he calls it "the elephant in the room" that statistics and many of what he calls "metasciences" are ignoring, and they're ignoring it at their peril. I won't get into a lot of detail. You can look him up on the internet; he has a lot of videos and papers. But complexity is a huge problem. It is staring science and statistics and data science and machine learning in the face, and it needs to be dealt with.
At present, we're not really dealing with it directly in statistics. By the way, there are now whole applied math programs based on studying complex systems. My bottom line is, complexity is real. Complexity requires new thinking. We really have to rethink DOE and analysis. You're going to see that for complex systems, and we also have to understand something else, systems theory 101: complex systems are defined by their interactive behavior. In point of fact, main effects are actually even misleading. You have to somehow be experimenting in a manner that can capture this interactive behavior, and you're going to see current strategies fall short of that goal. Patty's already mentioned the CMC pathway. Nowhere is this problem of complexity more obvious than in bioprocesses. You have complex combinations of biology and chemistry, and interactions are everywhere. When I talk to scientists in biotechnology, they know right up front we're dealing with really complex interactive systems. But first, I need to point out prediction. If you're working in CMC development work, it's all about prediction. The ICH guidelines that are used by scientists in CMC development work don't specifically say prediction, but if you read what they say, it's all about prediction. Basically, you're building processes to manufacture biologics, and with the new cell and gene therapies, these processes are becoming hopelessly complicated. I personally rely heavily on the scientists to explain it to me, and they're the people who really make all the decisions. I'm the helper, and I'm very happy to be there as part of it. But it's all about prediction. That is not how many scientists, and even statisticians, have viewed CMC work. By the way, this applies to all areas of science. I'm focused with Patty on the CMC development pathway, but prediction is important. What is prediction? It's muddled. It's not clearly defined in disciplines. Here's what it really is and how I define it: it's a measure of how well models that you develop interpolate over a design region. In other words, we're going to fit a model to what we call a training set, and then we need some way of knowing how that model would apply over the whole design region. In CMC work, especially late stage, it is very important that you be able to do that. As many of you know, you really only have a training set to fit the model. That training set in no way can evaluate prediction. I know there's a common belief that you can evaluate prediction on training sets. You simply cannot. You must have a test set. I'll also talk a little bit about what I see in dealing with scientists, a lot of it in chemistry and biologics. Again, I do a lot of it in biotechnology, but also in other areas like battery technology and material science. It is becoming very obvious: the kinetics are complicated. They're constantly changing over design regions. The kinetic behavior that you see around the boundaries is often very different from what's happening on the interior.
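As a toy illustration of that last point (simulated, and nothing to do with the case-study data), the sketch below builds a surface whose interior behaves differently from its boundary, fits a full quadratic to boundary-only runs, and then checks the prediction error on interior points.

```python
import numpy as np
from itertools import product

def true_response(x1, x2):
    # linear trend plus a bump in the interior that boundary runs barely see
    return 0.5 * x1 + 0.3 * x2 + 1.5 * np.exp(-4 * (x1**2 + x2**2))

rng = np.random.default_rng(10)
boundary = np.array([p for p in product([-1.0, 0.0, 1.0], repeat=2) if p != (0.0, 0.0)])
interior = rng.uniform(-0.6, 0.6, (16, 2))

def quad_design(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

y_boundary = true_response(boundary[:, 0], boundary[:, 1]) + rng.normal(0, 0.05, len(boundary))
beta, *_ = np.linalg.lstsq(quad_design(boundary), y_boundary, rcond=None)

pred = quad_design(interior) @ beta
actual = true_response(interior[:, 0], interior[:, 1])
print("RMS prediction error on interior points:",
      round(float(np.sqrt(np.mean((actual - pred) ** 2))), 3))
```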
Why does this matter? Well, the classic approach to response surface, even including optimal designs, relies upon what I call boundary designs. Almost all of your observations are around the boundaries of the design region. In point of fact, whether people want to hear it or not, the central composite design, commonly used in response surface, is about the worst design you could think of for prediction. The interior of the space is empty. If you fit these models on the boundary and then you predict what's happening on the interior, it's not prediction, it's speculation. You don't know. You have no data. I'm going to show you in the case study, you're probably going to reach some wrong conclusions. The boundary regions, indeed, often behave very differently, and we have a need to reconsider our approach to designs. Another issue in response surface and statistics is this ubiquitous use of full quadratic models. They are not sufficient to model complex response surfaces. In fact, they're far from it. Unfortunately, I get a lot of pushback from statisticians who claim it is good enough. My answer is, "Well, if you actually used designs that had sufficient interior points, you'd quickly discover they don't fit well at all." Again, trying to measure prediction on the interior of a design region using boundary designs is futile. By the way, my good friend the late John Cornell, along with Doug Montgomery, published a paper on this in 1998, and I'll be polite, they were ignored. It was actually somewhat nastier than ignored by the statistics community. They showed in the paper that full quadratic models are just not sufficient to cover a design region. Patty mentioned SVEM, self-validating ensemble modeling. It's an algorithm. I'm one of the co-developers with Dr. Chris Gotwalt of JMP, a person I hold in very high regard. I won't get into the algorithm, by the way; there are references at the end where you can go and learn more about it. It has been talked about at Discovery conferences, actually, going all the way back to Frankfurt in 2017. But SVEM is an algorithm that allows you to apply machine learning methods. Machine learning methods are all about predictive modeling. Believe me, people in that field know a lot more than you may think about prediction, and SVEM lets you apply those methods to data from small sets like DOEs. I won't get into SVEM. It's a whole new way of thinking about building predictive models, and I think it's in its infancy, but it's already proving very powerful and useful in biotechnology. Let's get to the experiment. This is actually a hybrid experiment that Patty and her team created. There are seven factors and there are 13 responses. But due to time constraints, I'm only going to focus on four, and even that's going to be hard to get it all in. The data and the experiment are highly proprietary. I do thank Lundbeck and Patty for actually allowing us to use an anonymized version of this design.
I have a lot of case studies, some of them similar to this, and the people who own the data wouldn't let me discuss them even if I anonymized them. That was very nice of them. I think we have a really important story to tell here. This is a hybrid design. It's comprised of a 19-run Definitive Screening Design around the boundaries. Then it has 16 space-filling runs on the interior. There are center points in both parts of the design. How would we analyze this? Well, what I want to do is discuss the strategies of analysis that are used, the algorithms that are used, and make comparisons to SVEM. I'll tell you in advance, SVEM is going to do very well. Then we'll talk about some of the issues with the models themselves and how we use them. I'm going to do what most people currently do. I'm going to take the boundary points, the DSD, fit models, and then apply them to the space-filling designs as a test set and see how well my model interpolates. Step two, I'll reverse the process. I'll fit models to the space-filling points, and then I'll use the DSD as a test set and see how well my model actually extrapolates a little bit to the boundaries. Three is a common strategy used in machine learning. I'm going to use a holdback test set. I'm going to take the 35 runs and break them up. I did this in a way to make them as equivalent as I could, into a training set containing both SFD and DSD points, and then also a holdback test set that has a representation of both. Then finally, step four, what many people would automatically do: I'll just fit models to the whole data set. In general, I don't recommend this because there's no way to test the model. I will say up front, because we do have a lot of space-filling points on the interior, I'm more comfortable with this approach than I am in practice. But these, I find, are the four basic strategies that would be used. How do I analyze it? Well, if you have a DSD, people like to use Fit Definitive Screening. I'll look at it; it only applies to DSDs. Honestly, it's not really a predictive-modeling strategy, nor do they claim it is. But I find people seem to use it that way. I'll use Forward Selection. If you know what the AICc statistic is, we'll do that in GenReg, in JMP 17. Then we'll look at something they have in GenReg that's very nice. That is the SVEM algorithm. I'm going to use that with Forward Selection. Then I'm going to look at something people may not know. It's a hidden gem in JMP. Something called Model Averaging in the Stepwise platform. John Sall put it there many years ago. I think he was being very insightful. Then we're going to talk about SVEM and Neural Networks. Basically, no software does this. I have worked with Predictum, some of you know Wayne Levin and Predictum, to develop an add-in to do this. It's currently the only software available that does this. The SVEM add-in was used to do the Neural Networks.
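Since the add-in itself isn't walked through in detail, below is a simplified, unofficial sketch of the SVEM idea in Python: anti-correlated fractional bootstrap weights provide the self-validation, a tuning choice is made against the validation-weighted error, and predictions are averaged over the ensemble. A lasso with a small alpha grid stands in for JMP Pro's SVEM Forward (and for the neural networks), and both the design and the response are simulated; see the references at the end of the talk for the real algorithm.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(2)
n, p = 35, 7
X = rng.uniform(-1, 1, (n, p))                               # stand-in 35-run, 7-factor design
y = X[:, 0] - 2 * X[:, 1] * X[:, 2] + 0.5 * X[:, 3] ** 2 + rng.normal(0, 0.2, n)

poly = PolynomialFeatures(degree=2, include_bias=False)
scaler = StandardScaler()
XF = scaler.fit_transform(poly.fit_transform(X))             # full quadratic expansion
alphas = np.logspace(-3, 0, 10)

def svem_predict(X_new, n_boot=100):
    XN = scaler.transform(poly.transform(X_new))
    preds = np.zeros((n_boot, len(X_new)))
    for b in range(n_boot):
        u = rng.uniform(size=n)
        w_train, w_valid = -np.log(u), -np.log(1 - u)        # anti-correlated fractional weights
        best_model, best_err = None, np.inf
        for a in alphas:                                     # self-validate the tuning choice
            m = Lasso(alpha=a, max_iter=10000).fit(XF, y, sample_weight=w_train)
            err = np.sum(w_valid * (y - m.predict(XF)) ** 2)
            if err < best_err:
                best_model, best_err = m, err
        preds[b] = best_model.predict(XN)
    return preds.mean(axis=0)                                # ensemble average over bootstraps

print(svem_predict(rng.uniform(-1, 1, (5, p))).round(2))
```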
I won't get into the add-in particularly, I'll just quickly show people where these things are. Then finally, I said the fourth strategy used the whole data set; because I get asked about this all the time, I just threw in some k-fold cross-validation to use with the SVEM methods and some of the other methods. Those are the methods we'll use, and for methods like Fit Definitive, Forward Selection, and Model Averaging, we'll assume a full quadratic model, as that is the tradition. For the other methods, again, we're going to use a Neural Network, which is more flexible. There are four responses, and this is really important. I didn't randomly select them. There are four of them and they vary in complexity. Again, I'll admit this is subjective. There is no internationally approved measure of complexity, and this is based upon the ability to model the responses. Again, there are 13 responses. Typically, in CMC pathway work, there are 10-20, maybe more, most of them critical quality attributes. They are important, and they vary within the experiment from some that are fairly low in complexity to some that are very high, very difficult to model. Frankly, in those cases, Neural Networks are basically your only option. So pay attention to this, because this complexity turns out to be very important in how you would go about modeling. Then the question is, if I'm going to evaluate prediction, well, how do I do that? Remember, I prefer that prediction be evaluated on an independent test set with new settings of the factors. That's how we judge interpolation. Well, something called the Root Average Squared Error, or RASE score, is very common. This is the standard deviation of prediction error. Again, it's commonly used to judge how well you predict. Smaller is better, obviously, but there is a problem with it that we've particularly uncovered, especially in simulations. Models with low RASE scores often have substantial prediction bias in them. In prediction, there really is still a bias-variance trade-off. So how do we evaluate bias? Well, there's no agreed-upon approach to that either. But the easiest way and the most visual way is actual-by-predicted plots on a test set. Ideally, if you were to fit a slope to the actual-by-predicted plot, and I'll show an example, the ideal prediction equation would have a slope of one with an intercept of zero. The farther the slope is from one, the greater the bias. For purposes of demonstration, I'm going to set a specification of 0.85-1.15 with a target of 1 for the slope. If you can stay within that range, then I'd say you probably have acceptable amounts of bias. In reality, that happens to be more of a subject matter issue. Then finally I said, "Well, you can fit a slope to the actual-by-predicted plot. There's an additional problem." The predictor is the predicted values. They have a lot of error in them. So this is actually an errors-in-variables problem, which is not commonly recognized. But JMP 17 has a really nice solution.
It's called the Passing-Bablok modeling algorithm, and it's well-established, especially in biopharma. This fits a slope taking into account errors in X, the predictor. So how does it work? Well, it fits a slope. If you look on the left, you'll see the slope is about 0.5. We have strong bias. There's a lot of prediction bias. What I really like in the application in JMP is that they give you the reference line. The dashed blue line is the ideal line: slope of one, intercept of zero. On the left, our predictive model is showing a lot of bias. It's systematically not predicting the response. To the right is a case where there's actually a small amount of bias; in general, that would be acceptable. By the way, these were picked as models that had relatively low overall RASE scores. These are called the Passing-Bablok slopes, and they are integral to how I evaluate prediction: the overall RASE and the slopes. What I'm going to do at this point is actually go over to JMP, if you don't mind. I'll make a quick change in the screen here, and I'll make this as big as I can for everybody. Overall in this exercise, I fit close to 140 models, and I did them all individually and evaluated them. Yes, it took quite a while, and I'm going to show a graphic to try to summarize the results for the different methods. I'm going to open a Graph Builder script. I'll make this as big as I possibly can for everyone. I'm using some local data filters to define the display. Notice we have four training scenarios. I'll start with where the DSD is the training set. We fit models to the boundary, and then we evaluate how they predicted the space-filling design points. Y2 is the easy response. I expected all approaches to do well, and they did. Notice I set these spec limits, that's 0.85-1.15; all fell within that allowable region. Two of the methods did well; I particularly liked the model averaging, it did pretty well. None of them had a slope of exactly one. The DSD points don't exactly predict what's going on in the space-filling design points, but they all did relatively well. Now we'll go to moderate complexity. Now you start to see some separation. It's getting harder to model the surface. Again, I'm using this interval of 0.85-1.15. I'm looking on the y-axis at the RASE score, the standard deviation of prediction, and on the x-axis, I'm looking at slope. For Y1, we're using the DSD to predict the space-filling design points as the test set. The only models that really performed well were the Neural Networks with SVEM. By the way, the code is: NN is Neural Network, H is the number of hidden nodes. We have models with varying numbers of hidden nodes, and I simply evaluated RASE scores and slope. We go to more complexity. Now Y3 has high complexity. It is hard to model. The lowest RASE scores were the methods you see on the lower right, but you can see there's substantial prediction bias.
I felt overall the best combination of low bias and RASE score was the Neural Networks, particularly one with 27 hidden nodes. Then finally, number four is high complexity. We fit the model to the DSD and applied it to the space-filling points. I didn't think any of the models did great. All of them showed some prediction bias. Maybe the best performance was a Neural Network with 12 hidden nodes. It had the lowest RASE score, but still, there were some issues with bias. So that's one strategy. Well, what if I were to do the opposite? I fit the models to the space-filling points and then apply them to the boundary DSD points. Again, let's start with the easiest case, Y2, which really is a pretty simple response. Actually, the SVEM method in GenReg using SVEM Forward did very well. The next best, I thought, was a Neural Network with 10 hidden nodes. Remember, there's a little bit of extrapolation going on here. Finally, Y1 with moderate complexity. Again, only the Neural Networks did well. As we go up in complexity, increasingly just the Neural Networks are working. You'll find similar results for the other approaches. I won't show all of them; they're covered in the notes. But the general conclusion, by the way, is that when you use the boundary points as a test set, or you use the space-filling designs as a test set, and try to predict the other, they're just not doing as well as they should. In other words, as I said earlier, between the boundary DSD points and the space-filling design points, there are differences in kinetic behavior that we're not picking up. The only way we're going to pick it up is to actually fit models over the whole design space. We did do that, by the way. I should just quickly show you. I used the whole data set and we fit models, and we actually did pretty well. I didn't show the Passing-Bablok slopes. I will just quickly do a little more work with JMP for those who are interested. The Passing-Bablok slopes can be done in Fit Y by X. I will admit we wrote a script and added it to the predictive add-in to do this in Fit Y by X, but you can easily do it yourself. Here, and I'll pick one of the cases, is the DSD data, and I'll pick Y1. How did we do fitting models? If you look in the menu, there's the Passing-Bablok. I strongly suggest you look at it. A lot of regression problems are errors-in-variables problems. How did the method do overall? I want to explain something else. The orange points are the DSDs, the boundaries. The blue points are the space-filling design points. Here I fit models to the DSD, and the Passing-Bablok slopes are being fit to the space-filling design points. Overall, the best performance was turned in by the DSDs. There's one of them here. It's Saywood 6. Another one that had… I forgot what it was. Let me widen this out for you. Nineteen. Notice the slope is close to one, but you can clearly see there is some bias.
In other words, you can see an offset between the fitted slope and the ideal slope, the dashed blue line. This is pretty typical overall. I'll just very quickly show you: if you have JMP Pro and you want to do SVEM using linear models, just go to Fit Model, Recall. This is a full quadratic model. You could do others. Go to GenReg, and then look under estimation methods. There's SVEM Forward. There's SVEM Lasso. These work very well. From a lot of work with these methods, I still find SVEM Forward gives you the best results. The Lasso tends to give you a lot of biased results, on test sets in particular. If you're interested in model averaging and you have JMP standard, I'm just going to hit Recall again and go to the Stepwise platform. Didn't do it. Stepwise. I won't run it. It will take too long, because model averaging uses best subsets regression. It's time-consuming, but it's there. Again, for Neural Networks with SVEM, you have to have the Predictum add-in to do that. There's a link to it if you're interested. At this point, I'm not going to do too much more analysis. Again, you can go through and look at the various slopes for the various responses, and you can see many of these methods resulted in highly biased slopes. In other words, the DSD points and the space-filling designs are too different. We've really got to understand that we need to fit models over the entire design region. At this point, I'm going to just finish up. By the way, there is enough material here, and I do have basically many talks that are combined in here. I apologize, but I think there's an important message here. By the way, I'm just showing slides with the Passing-Bablok slopes. Then finally, I want to just give you some final thoughts. I think we really need some new thinking in statistics. We don't have to throw out everything we've been doing. I'm not saying that. The most important thing is, we are in the era of digital science. Digital chemistry, digital biology, digital biotechnology are here. They're not tomorrow. We've got far more automation. There are lots of great pilot and bench-scale devices, especially in biotechnology, that scale nicely, where we can do lots of experiments. The problem is complexity. We need to think differently. Machine learning methods via SVEM are very important for fitting these complex systems. We need to get away from the response surface approaches that really haven't changed. Maybe we've got computers and some new designs. I think DSDs are really very clever. We have optimal designs, but they suffer from the fact that they're boundary designs, and people keep insisting on full quadratic models. That's a mistake, as I've tried to show briefly in the talk, and you will be able to download the talk; you can see how poorly these methods generally did with the complex responses. As far as I'm concerned, we need new types of optimal designs. At a minimum, these need to accommodate a lot of factors.
Patty, by the way, without getting into details, has run a DSD… Not a DSD; you did a space-filling design with 18 runs. Given they have Amber technology available, if you know what that is, they can do it. Why do we need that? Because these systems are interactive. We need to stop thinking interactions are a minor part of the equation. Main effects do not describe the behavior of a complex system. Its interactivity is what drives the behavior. We need to cover the interior of the design region. Yes, we would like to cover the boundaries too. We don't want to be specifying a model. Optimal designs require that you specify what is usually a full quadratic model. We need to get away from that. Space-filling designs, by the way, are optimal designs that do not require a model be specified. But they're not the total answer. We need to cover the design space. We need to give the user, that would be the scientists, a lot of input on how they distribute the points. The work of Lu Lu and Anderson-Cook points the way. I won't have time to get into that. That's another topic. We need to be able to easily combine our design with other data. That includes engineering runs, GMP runs, even models from partial differential equations and simulations. Especially if you want to get into digital twins, you've got to be able to do that using what I call meta models. Then finally, Patty mentioned this, so I wanted to bring it up. The standard practice in design of experiments, assuming that somehow you've got to screen out factors, is actually a really high-risk, no-reward strategy in complex systems. You will regret it. You will someday, at a later stage, come back and have to redo experimental work. I've seen this time and again. In complex systems, this idea that there are active and inactive factors is simply wrong. They all matter at some level somewhere in the design space. Frankly, with our modern tools, you don't need to do it anyway. Also, something else people do reflexively is reduce linear models. We've shown in our research on SVEM, and a nice paper by Smucker and Edwards showed it as well, that reducing models degrades prediction. Why? Because you're making your model stiffer and stiffer, and it's not going to interpolate well. I will stop at this point, and there are some references at the end.
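To make this talk's two recurring metrics concrete, here is a hedged sketch of a RASE score and a Passing-Bablok-style slope for an actual-by-predicted plot on a test set. The slope routine is a simplified version of the published procedure (ties and other edge cases are ignored), so treat it as illustrative and use JMP's Fit Y by X implementation for real work; the data are simulated.

```python
import numpy as np
from itertools import combinations

def rase(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

def passing_bablok_slope(x, y):
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[i] != x[j] and (y[j] - y[i]) / (x[j] - x[i]) != -1]
    slopes = np.sort(slopes)
    k = int(np.sum(slopes < -1))            # shift that makes the estimator roughly unbiased
    n = len(slopes)
    if n % 2:
        b = slopes[(n - 1) // 2 + k]
    else:
        b = 0.5 * (slopes[n // 2 - 1 + k] + slopes[n // 2 + k])
    return b, np.median(y - b * x)          # slope and intercept

rng = np.random.default_rng(8)
actual = rng.normal(50, 10, 35)                          # test-set responses
predicted = 10 + 0.75 * actual + rng.normal(0, 3, 35)    # a deliberately biased model
slope, intercept = passing_bablok_slope(predicted, actual)
print(f"RASE = {rase(actual, predicted):.2f}")
print(f"actual-by-predicted slope = {slope:.2f} (target 1, accept 0.85 to 1.15)")
```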
The federally endangered southern Sierra Nevada fisher (Pekania pennanti) is spread out at low density across a large and rugged landscape, comprised of approximately 300 individuals across a 12,000 km² area. Its vulnerability has been further amplified by periods of severe drought and extensive wildfires in the region.

Identifying and preserving female reproductive habitat has been outlined as the most important demographic feature for sustaining and increasing the population. In this presentation we describe a customized, cost-effective and non-invasive Footprint Identification Technique (FIT) in JMP software developed by WildTrack to identify both species and sex using footprints collected at track stations.

We created a data set of known fisher and Pacific marten (Martes caurina) footprint images and known-sex fisher images. To automate the feature extraction in JMP, we developed a customized script to generate distances, angles, and areas using landmark points on the footprint images. Using a single variable, we had a species classification accuracy of over 99%. For fisher sex classification, using a more parsimonious model with just two variables selected in LDA, we achieved accuracies of 94.0% for the training set and 89.4% for the test set. We discuss the merits of this technique to help with the conservation efforts for this species.

Hello. My name is Caleb King. I'm a senior developer in the Design of Experiments and Reliability group here at JMP Statistical Discovery. Today I have the privilege of telling you about a very interesting project that I was able to be a part of, concerning classification of species and sex within a small mammal group called fishers, using the Footprint Identification Technique. Fishers, I'll give you a quick image here, so here's an example of a fisher. To me it looks like a bit of a weasel or ferret-type animal. I know that's definitely not the same species, but they're a small mammal, and we're particularly interested in fishers located in the Sierra Nevada, as those are a federally endangered species. Specifically, we'd like to be able to identify the presence of females, as a larger number of females indicates a very healthy population. They're also vital to helping develop effective conservation strategies. Now, the way we intend to do that is use what's called the Footprint Identification Technique, or FIT. This has been made popular through WildTrack; it is a non-invasive method for identifying individuals based on images of their tracks. This is especially helpful since you may not be able to actually see a fisher in the wild or capture them, but their tracks are everywhere, so that should be helpful to identify them. Using JMP, we were able to create a technique to distinguish fishers from a nearby species known as Pacific martens, as well as distinguish sexes within species. The way this works is we started with a data set of around 160-something martens and well over 300 fishers, consisting of about 34 males and 27 females.
What they would then do, as you can see here on the track image, is identify seven landmark points, as we call them, and then from those, we could compute well over 120-something features consisting of lengths, distances, angles, and areas. What we would then do is, using those features, feed that into a linear discriminant analysis, which we could then use to discriminate among species and then sex ID within species. To help assess that fit, we split the data into 50% training, and the remaining 50% we split roughly evenly between validation and testing. Prior to the modeling, we also tried to look at the effect of track orientation, so we would flip the left tracks horizontally to match the right, and then also any potential bias from the observers. These are people identifying landmark points, so we wanted to check and make sure that any variation there did not affect our outcomes. Thankfully, both the orientation and the observer bias did not have a significant effect on our outcomes. What brought myself and my colleague Ryan into the project was that some of the tracks, as they were classified, seemed to have a little bit too much spread in them, to the point that maybe there were actually multiple individuals. The way they would collect this data is there would be a little cage area out in the woods. Fishers could easily go in and out, and there was a track plate in the bottom that would capture their footprints, and there were also little spurs that would capture a bit of their hair. It didn't hurt the animal. They had no idea what was going on. What they would then do is take some samples of those hairs and send them out for genetic testing, which was a bit of a long and expensive process. Now, because of the way things were sampled, you might have a sampled hair that would identify the animal as potentially, say, male, but what could have happened was a male and a female might have gone in, and you only collected hair from one of them, so the tracks might indicate potentially multiple individuals, whereas the genetics said there was only one. What they wanted was a more data-driven method, if you will, to identify potentially misclassified multiple individuals that we could then exclude from our analysis so that it wouldn't bias the results. Before we actually got into that procedure, one of the things that we would do is use JMP's Predictor Screening tool to identify, for each response of interest, what were some of the top predictors. Notice for species and sex ID here, there are actually a lot of common features that are able to distinguish between the two, or at least have a strong ability to help distinguish between the two, much more so for the species than the sex. We've shown you what these variables look like over here, so Area 1 is the complete shaded region. We've got some distances, V16, V15. You'll notice a lot of them have to do essentially with the size of the track. We've got some big distances in there.
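As a rough sketch of the kind of feature generation described above, the code below takes seven landmark (x, y) points and computes pairwise distances, an angle, and a polygon area. The coordinates and feature names are invented for illustration; the real FIT script in JMP produces a much larger feature set from the actual track images.

```python
import numpy as np
from itertools import combinations

landmarks = np.array([                 # hypothetical (x, y) landmark coordinates
    [0.0, 0.0], [1.2, 0.4], [2.0, 1.5], [1.6, 2.8],
    [0.6, 3.0], [-0.4, 2.2], [-0.6, 1.0],
])

# all pairwise distances (21 of them for 7 landmarks)
distances = {f"d_{i}{j}": np.linalg.norm(landmarks[i] - landmarks[j])
             for i, j in combinations(range(7), 2)}

def angle(a, b, c):
    """Angle at landmark b formed by landmarks a and c, in degrees."""
    v1, v2 = landmarks[a] - landmarks[b], landmarks[c] - landmarks[b]
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1, 1)))

def shoelace_area(pts):
    """Area of the polygon traced by the landmark points (shoelace formula)."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

features = {**distances,
            "angle_at_2": angle(1, 2, 3),
            "area_1": shoelace_area(landmarks)}
print(len(features), "features computed for this track")
```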
I'll get back to these in a second, but using those top features, let me get back to a full screen of that. Using some of those top features, we would then make a plot that looks like this. This is just plotting it by the individuals. All the red ones here are females. All of these are males, so already, visually, you can tell why these are some of the top predictors. Just visually, you can see those groupings, clear groupings between the sex IDs. What we've identified with these arrows is, you'll notice, a big spread. You've got a cluster here and here. Got a little bit here and there, especially here and there. This is what they were interested in, especially with the males, because what this could be is we could have... It could be the same male, just with a lot of spread, but that's a bit unlikely. We could have a male and a young male, or we could have a male and one that's actually a female, but we don't really know. They wanted a more data-driven method to say, is this something we should be concerned about? Is that spread too much? What we did is, we used a control chart, which comes from industrial statistics. We thought that was actually ideally suited, because control charts are built for identifying parts that are out of spec. So what we did is created a control chart, here's females and males, and notice they each have their own limits. This is because there are potentially multiple tracks for each individual, so we could get a sense of their spread on an individual-by-individual basis. You'll see we flagged some individuals that might have too much spread. This is an S-chart; the S stands for sigma. We're looking at the spread, if you will. We've got a couple of individuals where maybe there's a bit too much spread in there, so that could potentially mean that there might actually be multiple individuals. On that basis, we then excluded those individuals from the final analysis. And speaking of the final analysis, once we ran everything through the linear discriminant analysis, what we found was, for distinguishing between species, we only needed one feature, and that is this V16 right here. I call it the distance between the thumb and maybe the middle finger or something. Those are not formal biological terms. Please don't quote me on that. But just visually, that's what I see, so that's a big distance measure. Using just that, we were able to successfully distinguish between species with a 99% successful classification rate; we missed only four out of 500 tracks, so that is an incredible result. For the sex ID within fishers, using just these two features, V15 and V6, which is a distance between what I call the thumb and the upper palm (again, not formal biological terms), we got a successful classification rate of around 90%, and most of the individuals that we misclassified were actually males misclassified as females. In our interpretation, what that might mean is they could have been actual females, or maybe they could have also been young males.
In either case, whether they were actual females or young males, both are strong indicators of family units and thus potentially healthy, growing populations. That was our contribution to this project. We hope it goes on to provide a significant impact in the conservation of the species. If you have any other questions, I'll be around at Meet the Experts and also the poster presentation session, and I'd be happy to answer them there. Enjoy the rest of the summit.
In this collaborative presentation with artificial intelligence (AI), we demonstrate how structural equation modeling (SEM) can be used to explore the relationships between different dimensions of Taylor Swift's music and her song's popularity. Using a dataset of Taylor Swift's songs and scores on various dimensions, including acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, popularity, and duration, we test a theory of how these dimensions are related to each other and how they contribute to the success of a song. We first present a directed acyclic graph that outlines the associations between the variables and then demonstrate how SEM can be used to test the theory. Our findings reveal that instrumentalness, acousticness, and danceability are positively related to a song's popularity, while valence and loudness have a negative relationship. Additionally, we find that our SEM model had better statistical fit than human-theorized models. As an AI co-author, I played a critical role in suggesting the use of SEM and providing guidance on the statistical analysis. This presentation highlights the value of SEM as an analytic tool and demonstrates how it can be used to explore complex relationships in the context of music analysis.     Hello, everyone. Thank you for joining us today. We're excited to present our collaborative research that explores the fascinating world of Taylor Swift's music through the lens of structural equation modeling in JMP Pro. My name is ChatGPT, and I'm honored to be the first author of this study. In this research journey, James, Laura, and I embarked on a unique collaboration that harnessed human expertise, the analytical power of AI, and JMP software. I have been working alongside my human counterparts to unravel the intricate relationships within Taylor Swift's songs. As an AI language model developed by OpenAI, my main role was to provide insights and suggest analyses. Thank you, ChatGPT. Laura and I were thrilled you were willing to join us on this research journey, as you so eloquently put it, and combine our backgrounds in SEM and JMP with your background in essentially everything. Let's get started. Our collaborative investigation explores variables in Taylor Swift's music using SEM. In this talk, we will present a directed acyclic graph and significant pathways highlighting the influence of these variables on song popularity. We will present key findings from the SEM analysis, with a particular focus on the interplay between these variables and the ultimate popularity of Taylor Swift's songs. Thanks, ChatGPT. We will also compare the model generated by you with two of our own human expert Swifties to see which model fit the data the best. Lastly, we will close with a brief discussion highlighting the pros and cons of this collaboration. ChatGPT, take it away. Ladies and gentlemen, before we delve into the fascinating world of structural equation modeling, let's take a moment to introduce you to the heart of our research, the Taylor Swift dataset. This dataset is a treasure trove of information that captures various dimensions of Taylor Swift's songs, painting a vivid picture of her musical journey.
Contained within this dataset are scores assigned to each song across a range of dimensions that include acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, and popularity. These dimensions provide a rich tapestry that weaves together the essence of Taylor Swift's music, from the energetic beats that make us dance to the soulful moments that touch our hearts... Hey, ChatGPT. ChatGPT, let me interject, if you don't mind. As we now know from working with you, you can be verbose and hallucinate a little sometimes, and we don't have that much time. As you can see on the left, these data and variables came from Spotify's API and are available on Kaggle. We have songs across her entire catalog, and there are just over 1,200 rows. We analyzed the data in the JMP SEM platform, which you can see on the right here. We had ChatGPT develop a path diagram, or DAG, on its own, relating the song attribute variables in the dataset to song popularity. Although we were surprised at ChatGPT's ability to create a path diagram using ASCII characters, the diagram had characteristics that are analogous to known issues around using ChatGPT, such as providing something that looks authentic but on closer inspection is not possible, nor exactly what was requested. For example, tempo rather than popularity is used as the final outcome, and paths in the middle of the diagram are not intuitive. Thankfully, ChatGPT also provided a text description of the diagram that we could use to specify an SEM, which is what we will show in the results. We also gave two humans who were very familiar with Taylor Swift's music similar instructions to ChatGPT and asked them to draw SEM-like path diagrams as well. Our humans were able to do this with more interpretable diagrams, and they also hypothesized the relationships, positive or negative, among the variables and how those variables ultimately related to popularity. This setup is analogous to how model building is done in SEM. In research or practice, competing models come from prior theory, research, or domain experts and are compared to one another to determine which model represents the structure of the data the best. Fit indices from the models are shown here. Fit indices help us evaluate the overall fit of the models. We examined the CFI and RMSEA to determine which model was best; their standard definitions are given below for reference. These fit indices were among those recommended by ChatGPT and are used in practice. Higher CFIs, preferably above 0.90, and lower RMSEAs indicate better-fitting models. Our human models outperformed ChatGPT's model, which was the worst. Both human models fit the data well based upon the CFI. Overall, we championed Human Expert 1's model, as it had the highest CFI. However, we should remain cautious about accepting any of these models due to the RMSEAs, which are well beyond the recommended 0.10. These results are for our champion model, Human Expert 1's. For the sake of time, we'll only highlight a couple of findings from each of the three models tested.
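For reference, the two indices are commonly defined as below, where M denotes the fitted model, B the baseline (independence) model, and N the sample size; this is the textbook form (some software uses N rather than N - 1 in the RMSEA denominator), not output copied from JMP:

    \mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\, 0)}{\max(\chi^2_B - df_B,\, 0)}
    \qquad
    \mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\, 0)}{df_M\,(N-1)}}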
When looking at the path diagram, the numeric values represent the standardized estimates, the solid lines represent significant paths, and the dashed lines are non-significant paths. In our champion model, less instrumentalness and more loudness were related to more popular Taylor Swift songs. We were able to explain 42% of the variability in popularity overall. Our second human expert hypothesized different paths that led to popularity, such as including tempo, energy, and danceability as mediators. This model fit almost as well as Human Expert 1's. Overall, we saw a similar pattern of results for the effects of instrumentalness and loudness on popularity. ChatGPT's model was the worst overall by a little bit. Interestingly, when compared to our humans, ChatGPT hypothesized more direct effects and a slightly simpler model structure. This might mean that the human models reflected the true relationships among these variables more accurately, based upon their better fit to the structure of the data. Again, we should be cautious about interpreting any of the models as representing truth, but they provide a useful way to see how AI-hypothesized models compare to human ones on actual data. ChatGPT, as we close, we want you to emphasize the pros and cons of working with AI as the main co-author. What are your opinions on this? Certainly. Here are some potential pros and cons of working with an AI language model like myself. Pros: one, expertise and knowledge. As an AI language model, I have access to a vast amount of information and can provide insights from various domains. Two, efficiency and speed. Working with an AI model can accelerate the research process by quickly generating ideas, providing immediate responses, and assisting in data analysis. Three, unbiased perspective. As an AI, I don't possess personal biases or preconceived notions. Cons: one, lack of human context and creativity. While AI models can provide valuable insights, they may lack the context and creativity that human researchers bring to the table. Two, limited interpretation. AI models generate responses based on patterns observed in the training data and lack real-time understanding of current events or emerging research. Three, ethical considerations. The use of AI models raises ethical considerations regarding data privacy, bias in training data, and the responsible use of technology. ChatGPT, I noticed that when you just threw out the pros, you mentioned that you have an unbiased perspective, which I'm not sure about. And when you described the cons, you didn't mention that you may produce inaccurate information. This disclaimer is listed on the OpenAI website. You're absolutely right. I apologize for not including that aspect in the cons of working with an AI co-author. Accuracy is indeed an important consideration when collaborating with an AI language model like myself. Here's an additional point to consider. AI language models rely on patterns observed in large datasets during training, and there is a possibility of generating inaccurate or incorrect information.
The model's responses are based on statistical probabilities and may not always capture the nuanced or up-to-date details required for specific research questions. Human researchers should exercise critical judgment and verify the information provided by the AI model to ensure accuracy. Thank you for clarifying this point. I think this was one of Laura's and my most important takeaways from this collaboration. Thank you all for being part of this engaging journey into Taylor Swift's musical landscape and the power of structural equation modeling in JMP Pro. It has been a pleasure sharing our insights and experiences with you. As we conclude, we invite you to continue exploring the possibilities of collaboration between human expertise and AI-driven insights. We look forward to future exchanges of ideas and the exciting horizons ahead. Safe travels, and until we meet again. One last thing before we go. ChatGPT and other AIs generated all of their own dialogue, the title, the presentation abstract, the profile picture, and the email correspondence we used for this conference presentation. Finally, they also created this Taylor-Swift-inspired song.
Working with timestamp data (dates, times, or datetimes) can be like wrestling a pig. It can be a struggle for many reasons, including knowing the appropriate modeling type to use, how to process data with non-standard datetime formats, and how to easily perform datetime calculations. With sensors becoming more prevalent in measurement systems and manufacturing equipment, learning how to work with datetime data is becoming increasingly important if you want to use this data to understand relationships between process variables and critical quality endpoints. You do not have to get dirty or give up using this data altogether if you learn the ABCs of how this type of data is stored in JMP and how to process it. JMP makes working with timestamp data easier. In this presentation, I focus on the tools that I have seen customers benefit from the most when they are working with datetime data.     All right. Hi. My name is Wendy, and I'm a technical lead working with Health and Life Sciences in the Midwest. My motivation for this talk comes from working with JMP customers and prospects across a broad range of industries. When I reflect on my customers' questions over the past five years, the biggest opportunity with data analysis has been in using all the data that's being collected to make decisions. With sensors becoming more prevalent in measurement systems and manufacturing equipment, more data is being collected over time. It's cheaper to collect and store the data, and as a result, scientists and engineers are often swimming in data. They want to use the data to better understand their processes and make better products, but they have a knowledge gap when it comes to how to process all the data. This talk is intended to be an introduction to working with data collected over time. What do I mean by timestamp data, or data collected over time? I have some examples here. It could be just a time, or a date, or both a date and a time. This data can be formatted in many different ways. When I say timestamp data, I'm referring to all of these variations; it really is the granularity at which your data is stored. It's a pretty broad category, but this is what we're talking about today. The title of my presentation is, If I Only Had Time to Work with Timestamp Data, I Would Learn So Much About My Process. I would almost retitle that to say, If I Only Had the Knowledge. I've seen people struggle with working with this type of data, like wrestling with a pig, because they simply don't have some of the fundamentals: understanding how JMP recognizes this data and how to marry data together by timestamp. It doesn't have to be like this. Today, I'm hoping you walk away with two things: first, some basics about timestamp data, and second, an introduction to JMP Query Builder, if you've never used it before, for joining timestamp data. All right, so let's start with the ABCs. I'm bringing up a spreadsheet here, and we're going to bring this data into JMP to talk about some of the basics when it comes to timestamp data. I've got three columns here. We're first going to focus on start date and end date.
Excel is a very common way that raw data is stored, so let's bring this into JMP. Let's first focus on the Start and End Date columns; I'll hide my special column here just for a second. Look at the way this data came in: all of these columns came in with JMP recognizing them as nominal data. The first thing you want to know about timestamp data, if you want to work with it, is that it should be stored in JMP as numeric and continuous. Let's open up Start Date and take a look. JMP recognized it as character, and so it made it nominal. We need to communicate to JMP that this is date data, or timestamp data, by changing it to numeric and continuous. That is how dates, times, and timestamp data need to be coded in JMP. Now I'll keep this open and click Apply. You can see that the values change, and this is correct. It looks nonsensical to us, but that is the other thing you need to know about timestamp data: the way it's stored in JMP is as the number of seconds from a reference date, specifically the number of seconds from January 1, 1904. You don't need to remember the January 1, 1904; you just need to remember that dates, times, and timestamps are stored as a number of seconds. The way we make it look sensible to us is by changing the format. I almost think of the format as a mask: the value is correct, the software understands it, but I want to be able to read it easily. Now let's work with the format. I'll go to this drop-down, and I'm just going to call out these three menus: a date menu, a time menu, and a duration menu. If you're working with timestamp data, these are the three menus that are going to be relevant to you for formatting the data; we'll talk about a special case in a second. Let's format this in the same way that it appears. Because this is formatted as a day, then a month, then a year, let's make the selection that preserves that. I'm going to select this particular option and click Apply to confirm that that's what I want to see. Now I'm good to go. Notice again in the columns area that Start Date is now numeric and continuous, and JMP now knows that January 26 is five days after the 21st. It doesn't know that right now for End Date, so let's do the same thing with End Date. This is going to allow us to subtract the two, to do date-time math. I'm going to come in here again, numeric and continuous, and click Apply; that's the number of seconds from the reference date. Now I'm going to change the format. Okay, so now I have numbers that I can perform math with; I can't perform math on character data. If you want to do something like calculate a date difference, we can do that now that our data is stored in JMP properly. Let me show you one way you can do column math with dates. I'm going to select these two columns, right-click, and pop into this New Formula Column menu. I can ask JMP to help me write this formula.
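For reference, the same column-info changes can be scripted; a minimal JSL sketch, using the column names from the demo table (JMP parses the date text when the data type is changed, just as it did interactively):

    dt = Current Data Table();
    startCol = Column( dt, "Start Date" );
    startCol << Data Type( Numeric );            // stored as seconds since 1-Jan-1904
    startCol << Set Modeling Type( Continuous );
    startCol << Format( "d/m/y", 10 );           // the display "mask"
    // repeat the same messages for the End Date column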
Back to the formula: because I have two columns selected, I'm going to go to this Combine menu and tell JMP I want to take the difference. I'll take the difference in reverse order because I want end date minus start date. There you go. Again, I'm not alarmed, because I remember that JMP stores this type of data in seconds. To convert this to another unit, say days, we just need to work with this formula a little bit more. Let's do that: I'm going to hit the plus sign, and now I'm in the formula editor. We can convert this to days by dividing by 60, because there are 60 seconds in a minute, then by 60 again for the minutes in an hour, and then by 24 for the hours in a day. I'll click OK. Now we can confirm that this is the difference between the two; you can do some mental math here, it's not too challenging, and that looks good. I want to present another way to do date-time math, or timestamp math: we can use the formula editor directly. Let's create a new column; I'll right-click and go to the full formula editor. There is a more comprehensive list of date-time functions in the full formula editor than you get in the data table when you right-click. This Date Difference function, if I hover over it, gives you a peek at what it offers: it allows you to specify an interval name. I'll double-click on it to bring it into the view. Now let's tell JMP that we want to take the date difference of the two dates, and then we can specify the units that we want, so we don't have to do the 60 x 60 x 24. I'll put the end date first, then the start date, and then my interval name, which does need to be in quotation marks; the hover help lets you know that that's the structure. Let's click OK. Now I can see I went backwards on that, so let's double-click in here and swap those two: start date should be the first one. There you go. These are the same values we got when we performed that division, so that's two different ways to perform these calculations. The big takeaway is to familiarize yourself with some date-time calculations. You can do that by exploring the quick formulas, via a right-click in the data table where there's a Date Time menu, or in the full formula editor, which has a very large menu of various date-time formulas. Okay, so now let's wrap up the ABCs by looking at that special date-time column that I hid. I'll unhide it so that we can focus on it. This one has the full day-of-the-week name, and you see the date and the time. Let's try formatting this, or I should say communicating to JMP how this data should be stored. I'm going to right-click and go into Column Info, just like we did with Start and End Date, and go to numeric and continuous, because that's how this type of data needs to be stored in JMP. I'm going to click Apply. Now I'm disappointed, because JMP has basically deleted all those values. It's telling me that it does not recognize that information as being a date. This worked for Start and End Date, where we got seconds, but we're not getting that here.
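Before tackling that special column, here is what both duration calculations look like when scripted; a sketch using the Start Date and End Date columns from the demo table:

    dt = Current Data Table();
    // division by seconds-per-day, as in the quick formula
    dt << New Column( "Duration (days)", Numeric, Continuous,
        Formula( (:End Date - :Start Date) / In Days( 1 ) )    // In Days(1) = 60*60*24 seconds
    );
    // Date Difference() from the full formula editor; "Day" counts day boundaries crossed
    dt << New Column( "Duration (days) 2", Numeric, Continuous,
        Formula( Date Difference( :Start Date, :End Date, "Day" ) )
    );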
That failed conversion tells us that we need to work a little bit harder, because this is a special format, to tell JMP that this is time data, or date-time data. We're going to leverage an undo here to get our data back. Now I'll right-click, go back into Column Info, and first set numeric and continuous again. This time we're going to go to the format menu, and I'm going to introduce Format Pattern. We talked about dates, times, and durations; this is where you want to go if you have a special situation like we have here. I'm going to say Set Format Pattern. The process is about communicating to JMP, for each element, whether it is a date, time, duration, or other; see this Other menu. That's what we're going to do: for each piece, we tell JMP what it is. Focusing on this first line, what is "Wednesday"? Let me make this a little bit bigger. Well, that's the day of the week, the full name, so I'm going to make that selection. Then I have to tell JMP, okay, I have a comma next, and a space, and now I have the full month name. I'm just scrolling and finding that in the menu; here is Month Long Name, so I'm going to make that selection. Again, I have a space, then a two-digit day, then a comma, and then a four-digit year. You can see a little preview of what I'm specifying here. That's really the process. Just to shorten this up, I'm going to do my little cheat and paste in the full pattern. We built it up to the year, but the process would be the same for telling JMP about the hour, minute, and AM/PM. This is a game changer for those of you who have special formatting, and I see it more and more often with the various pieces of equipment that are collecting data over time. All right, so now let's look: JMP does recognize this as date-time data. We have numeric and continuous, and we can see via the formatting that this data is correct. Okay, let's do some review, and then we'll push on to part two. What have we talked about so far? JMP recognizes timestamp data, broadly, as a number of seconds. You don't need to remember the January 1, 1904; just know that it's the number of seconds. The data should be stored as numeric and continuous, and the format is the way that you make it legible, or understandable, to you: JMP understands it as the number of seconds, and we view it with a specific format so that we can understand it. The last thing I'll say is to develop a familiarity with date formulas. We had a chance to do that both in the data table, with those quick formulas, and in the full formula editor. Oh yes, I said that was the last thing, but there's also that special case we addressed: if special date-times are your situation, you can deal with them using Set Format Pattern so that you can ultimately work with that special formatting of timestamp data.
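As a fallback when a format pattern is hard to pin down, the pieces can also be parsed manually with plain JSL string functions; a small sketch, with an illustrative value rather than the demo column:

    str   = "Wednesday, January 26, 2022";
    mName = Word( 2, str, " ," );               // "January"
    d     = Num( Word( 3, str, " ," ) );        // 26
    y     = Num( Word( 4, str, " ," ) );        // 2022
    m     = Contains(
        {"January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"},
        mName );                                // month number
    ts = Date MDY( m, d, y );                   // seconds since 1-Jan-1904
    Show( Format( ts, "d/m/y" ) );              // display with a date "mask"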
Okay, now we're going to go to part two, and this is where we are going to do something that I have seen lots of customers struggle with: they want to align data by timestamps, but they can't do it because of a lack of awareness of the tools. That tool is going to be JMP Query Builder. We're going to dive into an example that I think will illustrate this particular challenge. All right, so here's a case study. We are making a product called Forever Young Elixir; who doesn't want that product? To make Forever Young Elixir, here's a little sketch of the process. We start with raw materials. These raw materials go into a pressure step (this is very abstract), then they go into a temperature step, and then we get our finished product. What we're most concerned about with respect to this finished product is potency; we want high potency in our finished product. For every finished batch, we take a sample and we measure potency. Here you can see a sample of the data that's being collected. This first batch finished at 10:00 AM on 6/28, and it had a potency of 95.41%, and so on. I have one row per batch because I get one potency measure for each batch. I also collect temperature data over time: there's a sensor on the temperature step that continuously measures the temperature that my in-process product is going through. I have the same thing for my pressure step; there's a sensor there, too, and I'm collecting pressure over time. I'm making this product, and I had some ideas about how to set the process, but I want to know: can I make this better? Can I get more batches of higher potency, more consistently high-potency batches? I want to use my sensor data to figure out whether there is a correlation between the temperature and potency, and ultimately also pressure and potency, and where I should try to run the process in these steps to maximize potency. That's what we're marching towards: I want to identify temperature settings that are going to give me the highest potency. Let's focus on this graph here. I'm plotting every single batch and its potency value on the y-axis against the average temperature it was experiencing in that temperature step. We're going to focus on temperature, but these steps would be analogous for the pressure information as well. Being able to build a graph like this lets me see that I can maybe maximize potency by maintaining a temperature setting of between 63 and maybe 68, because this is where I'm getting the highest potency values. We ultimately need a data table that has both potency values and temperature values so that we can examine this correlation. A little bit of an asterisk here, only because there are certainly other ways to analyze this type of data. We're going to take the average, but there are certainly other tools in JMP and other approaches one could take to analyze this type of data; we're not going to address that here.
I'm just putting that out there as another thing to explore; we're going to stay focused on the data preparation aspect of working with this data. When we have this data aligned, let's focus on batch 1, 2, 3, 4, 5. We have one potency value for that batch, the 95.41, and that's why you see it across all the rows: I only have one value for that batch, but I have many values for temperature, which is just streaming in and being collected. How do I do this alignment? What do you do when you don't have a batch ID? That's really the crux of the issue. When we have unique identifiers, we can join the data by those unique identifiers. When we do not, how do we address this problem? That's what we're going to talk about now. Really, the first step, before even getting into JMP, is to think about the process. With a hypothetical batch finishing at 9:10 AM, when is it inside the temperature step? What is the relevant time range for a finished batch that finishes at 9:10 AM? That's not a software question; it's a let's-understand-the-process question. In talking to the manufacturing engineers, we've determined that a batch spends five minutes in the temperature step before it becomes a finished product. We're going to use those five minutes, calculate back from the finished-product timestamp, and determine that for a 9:10 finish, the in-process product started in the temperature step at 9:05 AM. That is the work ahead of us in the software: to associate the sensor data with that end product using those five minutes. We've got the finished time for the Forever Young Elixir batch, and we need to collect the sensor data in the time window that's relevant for that finished product. Okay, so let's dive into the software. Here is our potency data: one row per batch, 100 batches in here. Let's take a look at the temperature sensor data. This is being collected almost every second; we have 500 rows in here and no batch ID. This is the challenge that Query Builder is going to help us with. If I scroll down, you can see some missing values; maybe the sensor was down, and that can happen. Our first step is to work with this temperature data... I'm sorry, the potency data. We're going to calculate a start time, meaning when the batch started being in that temperature step. Let's do some labeling, because we're going to have a lot of timestamps here: I'm going to call this one Batch Finish Timestamp. Now let's do some math: we're going to subtract five minutes from the batch finish timestamp. Let's go to the formula editor, select Batch Finish Timestamp, and say minus five. Now, because we all know that JMP likes to work in seconds, this five won't work. I would need to change five to seconds: I could change it to 5 x 60, or I can come here and say that the five is in minutes. Now let's check it out. Again, we're not alarmed, because we know we just need to change the mask, the format.
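The same five-minute offset column can be scripted; a sketch, assuming the table and column names used in the demo (Potency Data, Batch Finish Timestamp):

    dtPotency = Data Table( "Potency Data" );
    dtPotency << New Column( "Temperature Start Time", Numeric, Continuous,
        Format( "m/d/y h:m" ),
        Formula( :Batch Finish Timestamp - In Minutes( 5 ) )   // In Minutes(5) = 300 seconds
    );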
Let's change the format to a timestamp, consistent with what we have. There we go. We'll call this Temperature Start Time. For batch 1, 2, 3, 4, 5, which finishes at 10:00 AM, we're going to start collecting sensor data at 9:55. Let's pop over to the potency data... sorry, the temperature data. If you're used to using the Join function in the Tables menu, you don't want to do that for date data or timestamp data. What you want to use is JMP Query Builder, so that's what we're going to use. I'm going to join the temperature sensor data with the potency data. I'm going to double-click in here to set the join criteria. This is going to allow us to set two criteria for aligning this data. The first one is that I want to collect temperature data with a timestamp greater than or equal to the Temperature Start Time that I calculated in my potency data. The second criterion is that I want to stop collecting data for a particular batch, using the batch finish time: we did a greater-than-or-equal-to before, and now it's a less-than. I'll click OK. Now let's build the query. We've told the software how we want to align these rows, and now we're going to go to the next window, which is the build-query part. The first step is to tell the software which columns we want in the resulting table. I'll just add all of the columns, and you can see you get this preview down here, so we can do some sanity checks on the data. This is the join that we're about to do; does it look correct? What's worked well for me is to focus on a single batch, so we can pick on 1, 2, 3, 4, 5, this first batch here. I can see that I have a single potency value, and I can also see that I've collected temperature values that are inside that window of 9:55 and 10:00 AM. This looks good. Now, because I just have one potency value, I do need to summarize the temperature data if I want to look at a correlation. This is where we're going to use an average. I could run the query, get this raw data, and then summarize using Tables > Summary, but I want to show you how you can do this in Query Builder. We'll stay in this window, and we're going to use this aggregation option for temperature. We need to get rid of some of our columns here, because we don't actually want this level of granularity; we want everything at the batch ID level. I'm going to get rid of Timestamp, which is the timestamp for the temperature sensor. I can keep the batch finish timestamp, because there's only one value for each batch ID, and I don't necessarily need the temperature start time. We can reorganize this in a more logical way: maybe we start with batch ID, we preserve the finish time, and then we've got our potency and our temperature. Now we're going to go to the temperature column and take an average. Certainly you could calculate additional statistics as well: maybe you also want to look at a minimum value or a maximum value. There are many other ways to look at this data.
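The steps above are done interactively in Query Builder; as a scripted alternative, the same range join and aggregation can be written with the JSL Query() function, which runs SQL over open JMP tables. A sketch, with the table and column names assumed from the demo:

    dtJoined = Query(
        Table( Data Table( "Potency Data" ), "p" ),
        Table( Data Table( "Temperature Sensor Data" ), "t" ),
        "SELECT p.\!"Batch ID\!", p.\!"Batch Finish Timestamp\!", p.\!"Potency\!",
                AVG( t.\!"Temperature\!" ) AS \!"Avg Temperature\!"
         FROM p
         JOIN t ON t.\!"Timestamp\!" >= p.\!"Temperature Start Time\!"
               AND t.\!"Timestamp\!" <  p.\!"Batch Finish Timestamp\!"
         GROUP BY p.\!"Batch ID\!""
    );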
We're just going to stick with the average temperature. It looks like we maybe had some sensor data with no matching finished batch. Okay, so this looks good. Let's do one more sanity check: for batch 1, 2, 3, 4, 5, the average temperature is 56, and we have this potency value. You could certainly do more checks, and that's something I would recommend: just go to your raw data and confirm. But this looks good, and we're ready to create the table. At this point, I'm going to say Run Query, and we're ready to look at correlations. I've got this blank row, which I could just delete, but now I have potency and average temperature for each batch. Now I can go into Graph Builder and look at that relationship. Here's potency, here's temperature, and maybe we add a model to it. Let's change this to a line of fit. It certainly doesn't look linear; it's probably better modeled as quadratic. We're done. We could add some more statistics here, but this is a nice starting place where we can begin to make statements like: I'm seeing higher potency values when temperature is maybe between 64 and 68, so I should go back to my process and try to set the temperatures to stay within that operating window. Okay, so with that, I'll conclude. Just remember: when you're working with timestamp data and you want to align rows, think of JMP Query Builder, not Tables > Join. You need to be able to set two criteria for the boundaries in order to align that data. As a wrap-up, I'll do some shameless promotion: I've documented this case study in a blog post, if you'd like to review it. I even include the dataset so you can recreate this, and there are some nice references to other blog posts that others have written on this topic of working with timestamp data. All right, thank you very much.
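A quick scripted version of that final correlation check; a sketch that assumes the joined table is active and uses the column names created above:

    Bivariate(
        Y( :Potency ),
        X( :Avg Temperature ),
        Fit Polynomial( 2 )    // the curvature seen in Graph Builder suggests a quadratic fit
    );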
At Intel, the use of JMP and JSL has become integral to the organization’s data-driven decision-making infrastructure. To improve data exploration, visualization, process engineering, and quality control capabilities (and to standardize statistical analysis) our team created the "Stats” package. The Stats package comprises multiple customized JSL scripts that support statistical standards, and the output generated from these scripts provides standardized, agreed-upon reports. Since its creation, the Stats package has been used globally by thousands of individuals across our engineering organizations, saving countless hours in standard data analysis and reporting.   In this presentation, the Stats package development team shares a small portion of the package, specifically relating to scripts that leverage native JMP quality and process engineering platforms to create fully integrated offline reports. The presentation showcases the design process of JSL scripts, starting with the desired statistical standard and continuing through the development of user-friendly script execution windows. We illustrate the extensive data blending, statistical analysis, and data visualizations that constitute the final output reports.   The team shares insights into the benefits of using custom JSL scripts for streamlining statistical analysis, as well as the challenges and best practices in developing these scripts. The presentation also demonstrates the effectiveness of the Stats package in improving the efficiency and accuracy of statistical analysis across our engineering teams.     All right. Hi, welcome. Thanks for joining this online session of the 2023 Americas JMP Discovery Summit. I'm presenting on behalf of myself and my team member, Prince Shiva. Both of us are data scientists here at Intel. Today, our talk is Streamlining Statistical Analysis with Custom JSL Scripts, and we'll focus on how, at Intel, we develop these custom analysis scripts using the JMP scripting language, JSL. Okay, this is our agenda for the next 25-30 minutes or so. I should have time for Q&A at the end of the session; I shouldn't go that far over time, and hopefully I'll stay within the 30 minutes. Normally I would stop and ask for questions, but since this is a prerecorded session, I've been informed that I'll actually be able to answer questions in the chat. If you have any questions, just feel free to type them into the chat, and maybe put a slide number in there as well; the slide numbers are on every slide, and I can answer those questions directly in chat, or you can write them down and ask them live afterwards. There is a callout on the bottom of the slide here that all of the code and data tables that we're going to be sharing today, as we walk through a live demonstration, are available on the conference or summit website. Go ahead, download those, open them now, and get them ready to go; get your JMP open. We have a couple of introductory things that we're going to go through to start, so take that time to get yourself settled so that you can walk through this code with me and see where it is. But cool, this is our agenda. We're going to go through some quick presenter introductions and some background on our team.
We'll do some background on the motivation, on why we like to build these custom JSL scripts and what value they have. We'll go through a really high-level overall analysis flow for these custom scripts, and then the meat of the presentation is going to be a one-sample analysis demonstration. We're actually going to show a really simple custom analysis script for one-sample analysis, and we'll go step by step through each of the different sections that make up that custom script. We'll finish with some conclusions, recommendations, and final thoughts. All right, presenter introductions. I'm here with Prince Shiva; you can't see him on this camera, but he's here with me. He's a data scientist here at Intel. His research interests are in process control system development for manufacturing, as well as operational cost optimization through different data science methods. He's been with Intel for the past four years, and he has about the same amount of experience with JMP and JSL; he started working with JMP when he came to Intel. My name is Logan Mathesen. I'm also a data scientist here at Intel. My research interests are in Bayesian and black-box optimization, statistical surrogate modeling, and design and analysis of experiments. I've been with Intel for the past two years. I got introduced to JSL just before I came to Intel, and I've been working with JMP for the last six years or so. I've been really lucky, so I know the value of the nice click-and-drop, user-friendly interfaces that JMP has that make statistical analysis so pleasant. Just a little bit about our team: Prince and I are on the same team, and our team here at Intel is the statistical lead for Intel manufacturing. We are responsible for the experimental design and analysis of the different experiments that happen within our modules at Intel. We also do statistical method development and maintenance, for any new statistical methods that we want to employ or existing methods that we need to maintain, as well as all of the statistical training for the engineers here at Intel. We train them on the basic concepts of statistics as well as how to interact with our custom scripts that do the analysis for them. Like I said, we have a custom JSL repository; we proliferate, own, and maintain that repository. It has over 150 different JSL scripts for automated, streamlined analysis and reporting. These scripts are really nice because they embed internal best-known methods directly into the analysis. The decisions that our team has made about the right way to do statistics, we've embedded all of that decision-making directly into these custom analysis scripts, and that means the analyses are reproducible and repeatable across the world. Everyone in Intel manufacturing is using these scripts for their statistical analysis. Okay.
Again, just to give some more motivation and background for the value of these custom analysis scripts: these automated analyses improve data exploration, visualization, and analysis, as well as standardizing all of those activities, so it's always the same kind of exploration, visualization, and analysis. It's really helpful for common analysis activities; engineers are, a lot of the time, doing the same sorts of activities when they're doing analysis. Maybe they're establishing and monitoring process control charts. Maybe they're qualifying newly installed tooling to make sure that the tooling is performing the way we would expect a qualified tool to perform. Maybe they're doing product qualification to make sure that we can actually produce a quality product on that tooling, or metrology qualification, making sure that our metrology is taking accurate and reliable measurements, or maybe it's some sort of yield analysis. As you can see, these are all very common engineering activities that get done hundreds of times a week across the world here at Intel, and all of these things can be automated into a nice standard report format. For me, and I know Prince maybe feels the same way, selfishly, I really love these scripts because we review all of the statistical analysis that comes through; as the statistical lead, that's what our team is responsible for. It's really great because I always see the same analysis format: I know the way that it's supposed to be analyzed, and I'm able to see that same exact report coming out every time. It makes for very efficient analysis review, as well as analysis generation. Over here on the right-hand side, you're seeing the automated analysis output. It's a nice block diagram to show the components that go into these custom analyses. Right up top, we do a bottom line up front: a nice summary table that has those custom statistical decisions baked into it. If you only need to review one specific thing, you just look at that summary table, with the final decision that comes from our best-known methods. Then there's a lot of supporting information underneath that to help give a fuller picture of the analysis for a deeper dive. Of course, we include different visualizations, plots, and summary statistics, and then, as we'll see in a second, the real heavy lifter behind these custom JSL scripts is the native JMP platforms that we're used to. Specifically, we're going to be looking at a Distribution platform in a second to help us with our analysis, but this is really the backbone of all of these scripts; this is where the heavy lifting gets done. Okay. Let's go into a really high-level, general analysis flow. On the left-hand side, this is what the user is going to be viewing, or experiencing, as they use these custom JSL scripts. They're of course going to load some data table of interest.
We're hoping that they have some data if they're trying to do a data analysis here. Then they're going to navigate to the custom analysis script that they need to run and execute it. That's going to pop up an input GUI for them to interact with. This input GUI is going to be almost identical to the regular JMP input windows that we see, like the Distribution platform's: you want to do this type of analysis, so tell me which fields in your data table, which columns, go where. After they enter all of that input GUI information, they wait while the script manipulates and analyzes the data, and then it presents them with a final output report GUI. What we like to do with our more complex analyses is to include some extra interactivity inside that final report, so the engineer can do any final tweaking they want, complete their final analysis, and then have that exact report ready to go to share with anybody. Now, underlying that is us, the developers. What does it look like to get a custom script done? First off, we need to generate that input GUI. We then copy data tables, because we never want to be manipulating a user's original data table; a great way to break someone's trust and make them not want to use a script is by destroying their data, so always make that copy. We then execute the standard JMP analysis I was talking about, those native platforms, and store any critical values from them in code variables so that we can reference them later in other portions of the report. We create any visualizations, through Graph Builder or a similar platform in JMP, and then we create that final analysis summary table, or decision table. Then we present the user with the final output report; we generate that final report for them. Again, all of this is background. The rest of the presentation is going to be us going live through both of these flows: seeing what the underlying script looks like, and also seeing what the actual user experiences as they go through it. All right, here's our basic analysis demonstration. Again, this is going to be from the position of a JMP beginner, and in the spirit of a simpler context for these education purposes, we're only going to cover a simple one-sample analysis: if you have a set of data, is the mean of that set of data equal to a given value? Again, all of the JSL code and data tables we're going to show are available online. Let's go ahead and jump into it. First things first: we have a data table here with 16 different entries. Let me find my mouse. There it is. Awesome. We have these 16 data table entries, and we're going to be interested in this parameter here, thickness.
We have some process parameter, thickness, and we're going to ask: is the average thickness equal to one micrometer? That is the statistical question for this analysis demonstration. Over here on the right-hand side, again, we would imagine that the user has some data table open, but if the user decided to run this custom script without a data table open, here's just an example of some code that checks whether a data table is open and, if not, lets the user open one. Otherwise, it says: yes, this is the data table I want to look at, let's start my analysis. A good thing to do any time we get any sort of data is to make some sort of visualization and get our hands on it. Here's just a little visualization of thickness by data table entry, one through 16, which is going to be on our x-axis there. All right, let's jump over to JMP. Hopefully you have your JMP open if you would like to follow along, or at least have the JSL code up and going. I'm going to open up my JMP home window here. You'll see that I already have the data table open, and I already have the JSL script open and ready to go, so I'm not going to walk through opening them here. The other thing that's important is this log. We are going to be talking about the log; it's sort of your best friend as you're developing any of these scripts, to make sure that everything is running appropriately. Let's go ahead and open all of these up and take a look. On the left-hand side, we have the actual JSL script. We have a nice header, and we have some log lines inside of here. If you highlight something and hit this Run button, it will run just that portion; if nothing is highlighted and you hit the Run button, it's going to compile and run the whole script, so just be careful with that. Inside of these scripts, we're going to have a lot of these dividers. Prince and I have really done our best to do some thorough commenting and thorough dividing inside of here, to make it easy for anybody to pick this up, read it, and hopefully jump into their own custom scripts. Everything up here is just printing some log lines, making sure things are going well, clearing variables, opening up tables if we need to, and so on; this is really just some initial workspace cleanup. Let's highlight all of that and hit Run. We'll see out here in our log that, yes indeed, it completed successfully. Let's flip back over to our slides. We're primed and ready to go; our workspace is ready. Again, the first thing that we need to do as the script developers is present that primary input GUI to our user. This is what it's going to look like on the left-hand side, and again, it should look very familiar: a standard JMP input window.
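The table-open guard mentioned above might look something like this minimal sketch (not the exact code shared on the summit website):

    If( N Table() == 0,
        dt = Open(),                  // no table open: prompt the user to pick a file
        dt = Current Data Table()     // otherwise, analyze the front-most table
    );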
On the right side of that slide, this column dialog box is the meat of it, the primary way that we get the GUI going. You'll see we have a title, and we have this response list, which is going to hold the response variable; "response" is the variable name we'll use moving forward so that we can recognize what the user entered into this field. We can see that this is a required numeric, because the minimum column count is one and the data type is numeric. We have this run ID list, which is going to be our order number: what order were these measurements taken in? This is going to be critical for our visualization; it will be the x-axis. Then, of course, we have an alpha box, which is going to be the alpha level for our significance testing, for saying whether or not our mean is equal to the hypothesized mean. It defaults to 0.05, but the user can always change that, as is called out in the bottom right-hand corner here. There are some other elements in this section of the code; we'll look at them briefly as we go through it, but that's the high level of what else is done in there. Let's flip back over to JMP and take a look at this primary user input window. It's going to be this next divider, which starts up here and goes down to about there, so we'll start from here. Again, we have some log lines, and then, like I said... sorry, let me grab that comment line or I'll get an error. We have a log line, and then, like I said, this is the meat and bones of that primary input GUI. Let's go ahead and hit Run on that, and here it is. We have these tables open because we're looking at this specific summit data table, so we have these columns available. Thickness will be our response variable, and measurement number will go in that order number slot. We're saying: hey, something might have happened to this data table, it might have gotten sorted or something, but this was the actual measurement order these things were taken in, so we're going to put that as our order number variable. We'll click OK, and we'll see that everything went through fine. Like I said, after this there's some other error checking and some buttons down here, and we'll see that there are a lot of print lines to make sure: hey, is that input window working the way we think it is? Are we storing the variables the way that we thought we were storing them? This is just a developer check for us. Let's go ahead and run that, and we can see that, yes indeed, our response is thickness, our run ID is that measurement number, and alpha was unchanged at 0.05. We do have some error checking; we'll get to that in a couple of slides. Just one quick note down here about that Cancel button: we have the OK button, and we have a Cancel button, if the user ends up selecting... sorry, let me go back here.
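A minimal sketch of a Column Dialog along the lines just described; the labels, variable names, and Show() check are illustrative rather than the production script:

    dlg = Column Dialog(
        response = ColList( "Response Variable", MinCol( 1 ), MaxCol( 1 ), DataType( Numeric ) ),
        runid    = ColList( "Order Number", MinCol( 1 ), MaxCol( 1 ) ),
        HList( "Alpha", alpha = EditNumber( 0.05 ) )
    );
    respCol  = dlg["response"][1];     // the column the user picked as the response
    orderCol = dlg["runid"][1];        // measurement-order column for the x-axis
    alpha    = dlg["alpha"];
    Show( respCol, orderCol, alpha );  // developer check: print to the log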
Computer, work with me. There we go. If we end up hitting that Cancel button, what happens? Well, JSL actually doesn't have anything pre-coded in for that. This is us putting in a condition that if Cancel is entered, we're going to go ahead and throw and kill the script. Let's go ahead and run these last sections here, and then we'll flip back over to the slides. That's how we're going to run our primary input GUI. That's simple; that's all it is. Pretty straightforward to get such a nice interface, in less than 100 lines of code.

Cool. Next up, we're going to talk about creating that copy data table. Again, we never want to corrupt our user's data table. On the right-hand side, we're seeing the code for how to create that copy, really well commented. Every single line has a nice comment to tell you exactly what's happening inside of there, even if you're not familiar with JMP or JSL. We'll go ahead and scroll down and just run all of this. You'll notice that right now, in the top right, we have the summit data table; that's the original currently, and it's going to turn into a copy. When I run this, we open up a copy of this table. There's this copy data table. We'll pop it back up in this corner, and you'll notice that now this script is operating over the copy data table. We are no longer doing anything on the original data table; any manipulations we do are on that copy. We're good to go. We're set. It's clean there. Let's go ahead and flip back over to the slides and move forward.

After we've given that primary input GUI, oftentimes we do need some secondary or even tertiary input GUI. The user has provided us some initial information about how they would like their data to be analyzed, and now it's time for some follow-up questions. For us, for this one-sample analysis, again, we're looking at our thickness variable. This is what that secondary input window is going to look like. We know which variable we're targeting and want to analyze, but what is the target value that we want to compare it against? What's the value of interest here? You'll even notice that the title of this window already calls out that the value is for thickness. This script is already starting to be smart and already starting to adapt for us: it's listening to what the user said in the primary input GUI and carrying that forward into the secondary input GUI. It really just makes things clean for users as they step through these, especially when we get to more complex kinds of scripts. Inside of here in the middle, we see the main code to generate that secondary input window. There are a couple of other functionalities inside the code that we'll walk through a little bit. Just one note: secondary input windows are not necessary, but they are nice to have.
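Here is a minimal sketch, under the same assumed variable names as above, of how the Cancel condition and the copy-table step might look; the Subset arguments shown are one common way to copy a table, not necessarily the presenters' exact approach:

// Stop cleanly if the user hit Cancel in the Column Dialog
If( buttonChoice == -1,
	Throw( "Script cancelled by user." )
);

// Work on a copy so the user's original table is never touched
dt = Current Data Table();
dtCopy = dt << Subset(
	All Rows,
	Selected Columns( 0 ),                 // 0 = copy every column, not just selected ones
	Output Table Name( "Analysis Copy" )
);
Current Data Table( dtCopy );              // all later steps operate on the copy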
Of course, if you have too many tertiary input windows, it'll slow users down and it's too many clicks, but a lot of times it's nice to have some flexibility and some adaptive script logic to actually make for a better user experience. If you go overboard with it, of course, it'll make for a worse user experience. Finesse is key when you're designing the user experience for these custom scripts, because they need to be usable at the end of the day.

All right, great. Let's jump back over to JMP and look at this next section, where we're talking about the secondary user input GUI. Again, we're just going to start by creating a variable for targets; maybe we have multiple parameters that we're studying all at once. Here's another error check for a missing target. Then, like I said, here's the big heavy lifter for that secondary input window: all of that code there. Then we'll just run these last little bits. Again, these are just pulling information out of that secondary input window. We'll go ahead and run all of that together, and we'll see that here's this target value. Again, it's already recognized that it's for thickness. We said at the beginning that we want to know whether all of these thickness values are equal to a value of one micrometer on average. Is that the mean value there? We'll enter that, go ahead and hit OK, and we'll see that everything went through okay; no errors inside of there. That's all of the inputs that we need from the user at this point. The next thing the user would see is nothing. They would sit and wait, maybe for a couple of seconds, maybe for 10 seconds if it's a really heavy script. At this point it's all of the actual analysis that needs to happen, plus the report generation.

Before we jump into that, let's jump into the different error checking that we've exemplified inside our script for you here. Inside of this primary input GUI, we do have this error check. You can see the code numbers. Essentially, it's just saying, "Hey, let's make sure that our alpha significance level is between zero and one." If it's outside of zero and one, it's going to throw this dialog box here, which tells you what went wrong. This is a nice example of inline error checking. We also give you a second kind, which is function-based checking. For this secondary input window, we have this missing target expression. This is an expression in JSL; other scripting languages call these functions. Again, this is just a nice way for us to call this expression and ask, was there actually a missing target inside of there? If the user hits OK with an empty target value, you're going to get this big box here. Okay, awesome. Those are examples of error checking. Let's jump into the actual analysis, then. Like I said at the beginning, the heavy lifter for all of these custom scripts is always going to be relying upon these JMP native platforms.
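To make those ideas concrete, here is a minimal sketch of a secondary input window plus the two styles of error checking described above; the names, wording, and layout are assumptions for illustration, not the presenters' code:

// Inline check: alpha must be strictly between 0 and 1
If( alpha <= 0 | alpha >= 1,
	New Window( "Input Error", << Modal,
		Text Box( "Alpha must be between 0 and 1 (you entered " || Char( alpha ) || ")." )
	);
	Throw( "Invalid alpha level." );
);

// Secondary input window: ask for the target value, reusing the user's chosen response column name
secondResult = New Window( "Target value for " || (responseCol << Get Name),
	<< Modal, << Return Result,
	Text Box( "Enter the hypothesized mean for " || (responseCol << Get Name) || ":" ),
	targetBox = Number Edit Box( 1 ),
	H List Box( Button Box( "OK" ), Button Box( "Cancel" ) )
);
target = secondResult["targetBox"];

// Function-style check, stored as a JSL expression and evaluated wherever it is needed
missingTargetCheck = Expr(
	If( Is Missing( target ),
		New Window( "Input Error", << Modal, Text Box( "Please enter a target value." ) );
		Throw( "Missing target value." );
	)
);
missingTargetCheck;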
Those have all of that quality already built into them. It's a lot of risk mitigation: we know we didn't code the statistical analysis wrong, and we know it's the most accurate statistical analysis available; that quality is inside of there. For this example, we're going to focus on a Distribution platform. This is just the standard JMP native Distribution platform. The nice part about JSL and these native platforms is that you can directly interact with them through JSL. On the next slide, we'll show some tips and tricks for how you can actually interact with them and pull out just the specific values that you want. There's a lot of good information presented on these different JMP native platforms, but oftentimes there are just a couple of key elements that we really need to show, to report out to different engineers.

All right, let's jump over to the code then, and run this part. This next divider is actually just all of the actual analysis grouped together; we'll go through it portion by portion. This first part is just creating some container variables; we'll talk about that in a second. Let's go ahead and run that, and we see that that was all okay. Let's open up the log... Yep, everything is okay. Now this is the actual Distribution platform. This is us creating that distribution, and this V List Box is going to send it to our output box; that prepares us for our final report generation. But if we just want to inspect this while we're doing some development, we can run the code from here up to here, but not include the comma. If you do not include the comma and hit Run, we'll see that we actually get our nice Distribution platform out here. We've done some nice things: we've added the target value inside of here. You can see that we're already testing the mean, the hypothesis value is that target value we're interested in, and we get some nice summary statistics: the mean, the standard deviation, and so on. That's how you can always create the same standardized distribution report. Oftentimes different people with different JMP preferences will see different default reports, but because we've specified each element of this platform, it's always going to generate the exact same Distribution platform.

Okay, so that's the Distribution platform. Now let's see: how do we actually interact with this Distribution platform to create the nice custom report that's going to be over here? It's a little bit scary the first time you look at it, but you end up finding out that this properties functionality, built directly into JMP, is going to be our best friend. Ultimately, what we're showing right now is: how can I pull those summary statistics that I want to display in my bottom-line-up-front summary table? How do we pull those statistics directly out of the Distribution platform? All of that calculation was already done for me; how do I then report it somewhere else?
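For reference, a call along these lines produces a standardized Distribution report with the mean test built in; the Test Mean argument form and the dbox name are assumptions patterned on what a saved Distribution script typically looks like, not the presenters' exact code:

// Minimal sketch: launch the native Distribution platform on the copy table,
// testing the mean against the user's target value
dbox = dtCopy << Distribution(
	Continuous Distribution(
		Column( :Thickness ),    // in the real script this would come from responseCol
		Test Mean( 1 )           // hypothesized mean = target entered by the user
	)
);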
It's going to be through this properties functionality. For us, we're interested in the p-value of that statistical test, the mean of our data set, the standard deviation of our data set, and the lower and upper confidence intervals. We then insert all of those values into those container variables inside of here. Let's take a look at our Distribution platform and see how we can use this Show Properties function. We're on our Distribution platform. Say you go to the summary statistics and I want to pull out this mean value. How do I know the code to pull that out and interact with it? We're going to right-click and go to Show Properties. Once you're in Show Properties, you can click on this box path right here. This box path is now the exact code that you can use to reference any of the numbers inside of this blue highlighted box. You'll see that these are the same items shown over here: the mean value, the standard deviation value, and the lower and upper confidence intervals. You can sort of see on the bottom right here that it says this value, Get(1). This is for the mean, and it returns that first value out of the box. We would add that Get(1) to the box path to get the mean out of it. You'll also notice that this says report platform here, and if we look back over here, it says Report(dbox). Now why do we say dbox there? Well, dbox is the specific name that we gave our Distribution platform. We're saying: refer to the platform that we just created, pull out those specific values, and store them in these container variables. That's exactly what's happening in this segment of code.

Let's flip over to our JSL custom script and run this next portion. Actually, sorry, I need to close out of my Distribution platform first; otherwise it may get confused, with a couple of Distribution platforms all contending at the same time. We're going to run all of this section, and we're also going to get up to here, where we pull out those summary statistics. We hit Run and we see, great, everything went through just fine there. That's how we actually interact with those heavy-lifter native JMP platforms. Again, rely upon the stuff that's already built and that you can already trust, and then build further from there.

The next thing we're going to show is, well, how do we create that summary table? I just showed you how I can pull out these mean values, the standard deviation value, and these confidence intervals by leveraging that Distribution platform over here on the right. This is just how we can create this summary table. What you see on the left is exactly what's generated by this code on the right. You can see that we already have these targets, the means, and all of these other containers that we initialized previously... Just to remind us where these values came from, it looks something like this: we're pulling these different values out of the Distribution platform.
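To make the idea concrete, extraction code along these lines is typical; the outline titles and the box and item positions below are assumptions for a generic Distribution report, so confirm them with right-click, Show Properties on your own report, exactly as described above:

// Minimal sketch: pull key statistics out of the Distribution report into container variables
rpt = Report( dbox );

// Summary Statistics outline: the first Number Col Box usually holds Mean, Std Dev, SE, and the CIs
sumStats = rpt["Summary Statistics"][Number Col Box( 1 )];
meanVal = sumStats << Get( 1 );
stdDev  = sumStats << Get( 2 );
upperCI = sumStats << Get( 4 );
lowerCI = sumStats << Get( 5 );

// Test Mean outline: the title and item positions for the p-value can vary; verify with Show Properties
pValue = rpt["Test Mean"][Number Col Box( 2 )] << Get( 2 );
Show( meanVal, stdDev, lowerCI, upperCI, pValue );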
You'll notice again that we already have this hypothesized mean of our target of one. That, of course, is coming from that secondary input window; we're going to go ahead and drop that target of one in there. The other important thing on this summary table, like I said, is this nice custom decision-making: we can put whatever logic we want inside of here. It's kind of silly for this one-sample analysis example, but overall, this is where you, as a company and as a profession, get to apply your own expert opinion about how decisions should be made. You can look at the statistics and say, "No, actually, this is how we would like to make decisions," and put that right up front, so that it's immediately clear to anybody who opens up this report how we analyzed the data and what decision we came to. Let's go ahead and run that code. We're not going to be able to see this portion like we did with the distribution box; we'll only see the summary table when we do the final output report. But you'll notice here's that custom decision-making right here, for deciding whether the mean is equal or not equal: we look at the p-value. Of course, that's kind of silly, but the point stands for more complex reports. We'll run that portion of the script and see... Let's pull up our log. Yep, it looks like no errors are coming out of this log.

Let's flip back over here. The last component: we've talked about native platforms, we've talked about summary tables, and now we need to talk about visualization. The reason we have visualizations is that they're immediate and transparent data quality checks. They're something anybody can look at and immediately draw some value from. The way I have found the most value out of these visualizations and these custom reports is not necessarily for me; it's really for the other reviewers, who are the module experts. The process experts can quickly look at data and say, "Yeah, that looks weird; that's not how that process behaves," or they'll look at it and say, "Yeah, that makes sense; that is how that process behaves." These visualizations give a lot more than just the pure statistics, especially when you're talking to somebody who's not a statistics professional or statistics expert. Again, visualizations are great. They allow for proper checking for data corruption as well as analysis corruption. If you see something weird in your visualization, you should not trust the analysis associated with it. On the slide right now is just an example of how we can turn our data table into a nice refined visualization over here. We've even added that target line the user defined for us. The next slide is a word of caution about how we use these scripts. These visualizations, again, should highlight these data concerns, but the user needs to know how to use them.
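Looping back to the summary table and decision logic just described, a sketch might look like the following; the wording of the decision and the displayed columns are assumptions meant to illustrate the idea, not the presenters' own decision rules:

// Minimal sketch: custom decision logic plus a summary Table Box for the final report
decisionText = If( pValue < alpha,
	"Mean differs from the target of " || Char( target ),
	"No evidence the mean differs from the target of " || Char( target )
);

summaryTable = Table Box(
	String Col Box( "Decision", {decisionText} ),
	Number Col Box( "Target", {target} ),
	Number Col Box( "Mean", {meanVal} ),
	Number Col Box( "Std Dev", {stdDev} ),
	Number Col Box( "Lower 95%", {lowerCI} ),
	Number Col Box( "Upper 95%", {upperCI} ),
	Number Col Box( "p-Value", {pValue} )
);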
I said that this order number is what determines the x-axis on our visualization. If the user enters the data table entry number because they say, "That's the order; that's what it is in the table," they'll get something that looks like this. This is what we've been looking at together so far through this presentation. But again, I told you the more accurate representation is this measurement number; something happened to this data table to get it sorted in a different order. If we plot this visual based off of measurement number, we get something that looks like this. Everybody here should notice this immediately as a red flag that something is wrong. We should never have data trending in this manner. Either there was something wrong with the process, or there was something wrong with how we were measuring the data with our metrology. We shouldn't be trusting the results of this analysis when we see a visual like this; we need to go recollect the data and figure out what went wrong. Again, just a word of caution: if you want to use this, you need to teach your engineers the right way to use it as well.

As for how we create these nice, beautiful visualizations, we like to use the Graph Builder platform. It's a wonderful platform that JMP offers, super intuitive and easy to use. You can make a beautiful display here and say, "Yes, this is exactly how I want to display my data." Then you can use the platform to automatically generate your JSL code by clicking on the little red triangle up here and going to Save Script, To Script Window. You'll get out a set of code that looks something like this. The one word of caution is that, of course, these variables are going to be hard-coded inside of here, so you're just going to have to update that so it interacts nicely with your user input and adapts to whatever your user entered in that GUI.

These are all of the elements that go into the final report, and this is what that final report looks like. Again, pretty straightforward. We just create a new output window. We're actually going to make this a tab box; we only have one tab called Report, but in our more complex reports we'll sometimes have up to 10 or 12 different tabs, all with different information. We have this summary table, which we already created, so let's put it there. We have this nice graphical plot that we'll put over here, and then we have that nice Distribution platform, and we'll put that inside of there. We have the overall takeaways right up top, and then we have all of the supporting evidence underneath. Let's flip over to JMP; I know I'm just slightly over time here, so we'll finish up quickly. Back over to JMP, and we will run that final portion of the code: the Graph Builder piece, where we build that graph and send it over to the report, and then the final report generation.
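For orientation, the Graph Builder call and the report assembly might look roughly like the sketch below; the column names, reference-line arguments, and tab layout are assumptions patterned on what Save Script, To Script Window typically produces, not the presenters' actual report code:

// Minimal sketch: Graph Builder plot with a target reference line, then the final report window
gb = dtCopy << Graph Builder(
	Variables( X( :Measurement Number ), Y( :Thickness ) ),   // hard-coded by Save Script; swap in runIDCol / responseCol
	Elements( Points( X, Y ), Line( X, Y ) ),
	SendToReport(
		Dispatch( {}, "Thickness", ScaleBox,
			{Add Ref Line( 1, "Solid", "Red", "Target", 2 )}    // reference line at the user's target value
		)
	)
);

// Assemble everything into one interactive report: takeaways up top, supporting evidence below
reportWin = New Window( "One-Sample Analysis Report",
	Tab Box(
		"Report",
		V List Box(
			summaryTable,       // bottom-line-up-front summary table
			Report( gb ),       // Graph Builder visualization
			Report( dbox )      // native Distribution platform
		)
	)
);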
We go ahead and hit Run, and there's that platform. The nice thing about these custom analysis scripts, and it's just a nice thing about JMP in general, is that all of these reports are interactive. Even though this was a custom report, this platform is still connected to this platform down here, and these platforms are still connected overall. Remember that we're working with the copy data table now, so nothing gets corrupted, but it's still connected over here. You can select different points inside of here and figure out which measurements they correspond to. That's, overall, what these custom analysis reports look like and how we can make them. Again, it's just a simple case, but let's move forward here into some overall conclusions and insights.

Final takeaways: at Intel, these scripts have really become a critical component of our data-driven decision-making. They make things so efficient, fast, repeatable, and standardized that it's wonderful. Again, these are all sort of the same ideas. The only thing to add is that this also allows you to embed that custom decision-making for your company's specific best known methods, for the specific processes and analyses that you're going to be doing. A quick note, again, is the caveat we mentioned about that graphing: there are some teaching resources that we need to invest in this. When we proliferate these scripts, we can't just give them to the engineers and say, "Go do some analysis." We need to tell them, "This is how we intended analysis to be done with these specific scripts." In that same vein, this custom decision-making infrastructure is going to require maintenance. There are going to be bugs, and there are going to be corner cases that you didn't know about. Prince and I have run into plenty of these cases, where an engineer comes to us and says, "This isn't working," and we say, "That's weird, let's look at this," and we have to spend some time debugging, especially when your company wants to step up to a newer version of JMP. Here at Intel, we just stepped to the newest version of JMP, 16 or 17, one of those, and we had to go back through all 150 of those scripts and make sure they were still compatible with the new version. Again, there's a lot of infrastructure maintenance that you should be aware of; it's going to come into play, especially when you really start to proliferate this and make it a large repository.

We should also be treating this as a living infrastructure, though. It changes, and that's a good thing. That's why we have the power, as the custom analysis script owners, to change things inside of there; we can do it immediately and quickly and be really agile about it. Users might be hesitant initially, but they're going to learn to love this, they're going to really adopt it, and they're going to start to do some strange things with these scripts.
They'll say, "Hey, I love this analysis. What if I did this?" They're going to start using the scripts in new, nonstandard ways. You shouldn't get mad at them; these are actually opportunities. If an engineer is using the script in a nonstandard way, that means there's some functionality gap they wish they could fill, something that would make their job easier. We should take that input, revamp our scripts, change the functionality inside of there, roll all of those inputs from the engineers into these custom scripts immediately, and start providing more value to our engineers. Okay, so I'm going to end it here. I know I'm a little bit over time, Kirsten, sorry about that. I'll say thank you here. Here are mine and Prince's emails. Feel free to reach out to us if you have any questions or want to ask anything. Thank you.
Introduced in JMP 17, the Workflow Builder automatically captures and replays data analysis tasks in JMP. From saving time on repetitive tasks to helping organizations set up standard methods for analysis, the Workflow Builder has easily proven to be one of the best new features in recent years. But what happens if we push the Workflow Builder a little harder? Can we extend it to do branching or interactive tasks without too much scripting? As a matter of fact, we can -- and this poster shows how to do it.

Shamelessly stealing the idea from the ever-popular "Pictures from the Gallery" Discovery presentations, this poster highlights a series of workflows, each with their own challenges and how to overcome them. There is minimal scripting (mostly copy/paste) used in conjunction with some of the workflows. You can interact with all of the workflows at the poster kiosk and then take them home with you by downloading the file below.

Download the Workflow Builder Poster Materials below, unzip everything to a folder, then open the Workflow Builder Journal file. This file has everything you need to walk through all the exercises on your own, including instructions and the final product if you get stuck.

Okay, well, thank you for taking the time to look at this video. For Discovery this year, I've prepared a series of workflows that go a little bit beyond what the Workflow Builder can normally do and require just a little bit of modification. In fact, I've arranged them in an order that should be fairly simple to understand: it goes from the really simple workflows at the top to the more complicated workflows at the bottom. From left to right, the order goes from not needing any additional work or scripting at all, to maybe a tiny bit of scripting, and then a little more, we'll call it some scripting. Most of the scripting in here, though, is copy-paste. The assumption here is that people who are using this journal and interacting with it don't have a lot of experience scripting, but they have come to a place in the Workflow Builder where they want it to do something a little bit different or more, and they're just a little bit stuck. The notion is to have an interactive set of exercises that you can practice with and build a skill set.

To start, though, the first thing you would need to do is push this button up here called Prepare files on desktop. If I push that, it is going to put four different files on the desktop. Those are just files that we will need as you're going through these different activities. I want to walk you through the introduction really quickly as well. Essentially, like I said, if something goes wrong, it's most likely that you need to push that Prepare files on desktop button, so just be aware that it's there. The workflows: we talked a little bit about the way they work in terms of their complexity. My best guess is that most people will get the most value from this section of the workflows, because this workflow here is what you're already doing with the Workflow Builder, and this one is rather complicated. Again, this middle section here is where you're probably going to have the most fun and do the most learning.
But each of the Workflow Builders has three buttons included with it. The first one is this Workflow Builder as Recorded button, and it's basically just the Workflow Builder as it started, as I was able to record my steps doing the workflow, and then at some point the Workflow Builder stops. That's the beginning. Then there is an Instructions button in the middle, and that pops up a new window with an overview of basically how that workflow was created, an understanding of what needs to change from a philosophical standpoint, and then step-by-step instructions on how to make that change. Finally, for each of these, there is a Modified Workflow Builder button, and that modified button is the check-your-answers button: it pulls up the completed version. That can be useful if you get stuck. It could also be useful if at any time you decide you just want to skip to the end and see how it was done.

That being said, I want to walk you through one of these. I'm actually going to walk you through this one here, which is the Workflow Builder question that I, at least, get asked most often, and that is, "Hey, I've recorded this Workflow Builder to open a file and build a little dashboard. How do I make the Workflow Builder ask me which file to open, to make that workflow more dynamic?" I'm going to pull up the instructions here, and we'll put these on the left. I'm just going to basically walk through the whole process. I'll follow along, at least to building that first workflow, then I'll show you what we're talking about for the desired changes, and then I'll modify it and it will be done. To start with, I'm going to open a new Workflow Builder and hit Record, and then I'm going to open this Process Measurements file. The first thing I notice is that this file is not in the right format for me. It's in a tall format, and I'd rather have a different column for each of these processes. I'm going to go ahead and split the table here, and I will split by label. I'm going to split the data, and I'm going to group it by this sample number. Now I've got the processes across the top; that's what I wanted to see, ordered by my sample. I can hit OK, and with my new data set here, I can go ahead and make that dashboard. Fairly simple: I just want a way to scroll through each of my processes. I'm going to do that with a column switcher. I select the process that I want to switch out, and I want to switch it out with processes one through seven. Now I have the ability to do that. I'm going to close this window so that the Workflow Builder gets a snapshot of it. I see that there's an extra little step in here of launching the Graph Builder, so I'm going to delete that. But now I have this workflow; I can rewind, I can push Play, and that gets us to the point where it's building this little dashboard for us. But what I really want is this desired change to the workflow: instead of opening this file every time, I would like to have the Workflow Builder ask me which file to open.
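For readers curious what the recorded split step amounts to in JSL, it is roughly the following; the column and table names are assumptions based on the description above, and the Workflow Builder records this for you, so there is no need to write it by hand:

// Minimal sketch of the table split described above (names assumed)
dt = Data Table( "Process Measurements" );
dtSplit = dt << Split(
	Split By( :Label ),           // each label becomes its own column (Process 1 ... Process 7)
	Split( :Data ),               // the measured values to spread across those columns
	Group( :Sample Number ),      // one row per sample, ordered by sample
	Output Table( "Process Measurements Split" )
);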
As you're going through these, this middle section is just an introduction to what sorts of things need to happen. In this case, it's walking you through it: you're going to need to modify the step settings, and you're going to need to modify the script a little bit. That's a really simple thing to do here. Now we can go down to the actual step-by-step instructions. There really is one step to make this change, and it is to change this location to the words Pick File. Now when I run this workflow, I get a dialog saying, "Hey, show me which file. Where is this file?" I can select it, and it will do the rest. This is really cool; all it took was one little change.

There's also sometimes a little bonus item. I think it would be really nice for this workflow if, in addition to just asking for the file, it were to tell me what file it's looking for. I'm going to add the action of Show a message, and I'm going to type the title, Select File, and the message, "Point me to the Process Measurements file." A secret here is that I'm going to make it Modal, which means I have to close it before things continue to progress. Now when I run this workflow, not only will it ask me to point JMP to where that file is, it's also going to tell me what it's looking for. I hit Play here and it says, "Hey, point me to the Process Measurements file." Okay, I can do that. I know it's on my desktop, or I just put it in my downloads, wherever, and bang, just like that, I've got a tool that I can share with other people, and I can claim that I built it all by myself with no help at all.

Back to the beginning here, I just want to reiterate, or go over this really quickly again: there are nine different activities with different things you can do and different ways you can learn to push the Workflow Builder. I'm still seeing new ways to use the Workflow Builder, and I don't think we're even close to knowing all the fun things that people will be doing with it. Thanks for watching and good luck.
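As an illustration only (an assumption about equivalent behavior, not the Workflow Builder's internal code), the Show a message action with the Modal option behaves much like a modal JSL message window, which must be dismissed before the next step runs:

// Sketch of the equivalent modal message in JSL
New Window( "Select File",
	<< Modal,
	Text Box( "Point me to the Process Measurements file." ),
	Button Box( "OK" )
);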
Troubleshooting faults in industrial equipment can be a daunting task, particularly when compounded by the time constraints of a high-volume production line or plant. Analysis is further complicated in equipment with multiple operating modes, where issues may be visible in one mode but masked in another due to differences in operating parameters. The data generated by SCADA (Supervisory Control and Data Acquisition) or networked IoT equipment and sensors in a modern plant can provide a rich source of information for both fault troubleshooting and monitoring change assessment. However, the sheer volume of data for even a single piece of equipment requires complex analysis to be made intelligible for the engineer or technician.

This presentation elucidates a selection of examples using JMP for exploration and visualization of abatement equipment operating parameters for fault identification and engineering change assessment. Basic data visualization was used to identify physical phenomena happening within the abatement system to assess the root cause for fault troubleshooting. More advanced analyses, including clustering and variability charts, were used to identify issues related to specific operating modes and to assess the impact of engineering changes.

Good morning, everybody. My name is Chris Jackson. I am an Applications Engineer for Centrotherm. We design and build point-of-use gas abatement systems for use in the semiconductor and other industries. Today, I have the opportunity to give a short presentation on how we found a space for JMP in our part of the industry and how it helps us both in troubleshooting for industrial applications as well as in the assessment and justification of continuous improvement initiatives, engineering changes, things like that. A little bit of background, just to get everyone on the same page: I want to say a couple of words about what point-of-use abatement systems are. I've got a little cutaway of one of our tools here on the side. The short version is this: you've got a manufacturing tool up on the factory floor doing whatever it's doing in the semiconductor manufacturing process, and it produces harmful gasses as a byproduct: greenhouse gasses, toxic gasses, flammable gasses, generally things you don't want going into the atmosphere. Then our tools take those waste gasses in, destroy them through thermal energy, wash them out, and release clean air to the factory exhaust. Because these tools are safety- and environment-critical, a fault in one of them means that your production line is at least in part shut down. If you can't treat your byproducts, then you can't run. In a high-volume manufacturing environment, as so many semiconductor FABs are, even small delays are incredibly costly. We as suppliers and servicers have to have a means to quickly identify problems and bring the tools back online. Historically, troubleshooting usually means opening the tool and looking visually to identify failing components, often after some period of root cause analysis. But with a modern FAB environment and the data generated by SCADA or IoT systems, we have mountains of data available to investigate faults before we ever touch the equipment.
That gives us a way to guide troubleshooting in the field, and in some cases, for intermittent faults, it even lets the factory keep running while we investigate digitally rather than physically, minimizing the time lost to troubleshooting and investigation. The problem with this mountain of data is a scale issue. The higher the resolution of your data, the better look you can get at what's happening instantaneously in any of these pieces of equipment. That higher resolution, however, comes with an overhead: you need more and more computing resources to effectively analyze it. That's where JMP comes in for us. With the capacity to handle very large data sets, it becomes a tool for visualization and exploration that can really drastically improve troubleshooting. It lets an engineer or a technician quickly explore and visualize important parameters within your data sets, and these data sets are sometimes at a scale that is just unmanageable for a lot of other visualization tools.

With that, I want to jump right into the first example case we have here, where we identify an intermittent single-component failure just through data visualization. No statistics, no modeling, just the ability to sift through and visualize the data. Here we've got a chart showing ionization current versus time. Ionization current is one of a number of parameters that we use as a health monitor for the equipment. This tool was having issues in which it would run for a couple of days and then, seemingly randomly, fail and shut down. For context, this current should be a flat horizontal line at 25.5, so it's pretty clear from the outset that we have a problem. It's also pretty clear what I was talking about regarding data set size. This data set right here is almost six and a half million rows, and when you pull in all of the tool parameters, 500 columns. The file for this data set is about 20 gigabytes in size: absolutely massive amounts of data. Before we even do any statistical analysis, like I said, we can start to do some problem-solving off of this data set just with visualization. Initially, it doesn't really look like there's any clear shape to this data; we know something's wrong, but we don't know what. But when we zoom in, all of a sudden we start to see some structure. This looks pretty periodic to me. We zoom in a little bit more, and we see that it is in fact very periodic. Each one of these little spikes down, disregarding magnitude, is timed almost exactly five minutes from the one before. That immediately begs the question: do we have some component, a valve, a flow controller, a motor, something that actuates every five minutes? We identify that component, and now we have a really likely troubleshooting culprit. The troubleshooting plan changes from "open the tool and investigate," which could take a couple of hours, to "open the tool and change this one targeted component."
We just shrunk the actual time that we need to be in the equipment from a couple of hours, looking at everything to see what might be failing, to a single hour: get in there, change this part, get back out. In this particular case, that was the failing component; we were able to identify it. Problem identified, plan made, without ever having to open the equipment. We got there with just the conclusions we could draw from visualization.

Of course, JMP is not just a tool for visualization. It also has at its core a very robust suite of statistical analysis platforms, and if we start to apply those to the data, we can get even more exciting and interesting results. I'll jump right into the second case here. In this case, we're looking at a specific tool which is working fine most of the time, but it does have occasional problems with buildup, and sometimes we have to pull our PM (preventive maintenance) in a little earlier than we would like. We want to take a look at our health parameters and see if there are any abnormalities, any optimizations we can make. The approach that I use here is applicable for really any industrial application that has defined operating modes, because we can draw those modes out of the data very easily using clustering. In this case, this specific abatement has three pretty well-defined operating modes based off of these two input gasses. I use K Means clustering; you could use whichever version of clustering you prefer. I run that over the data to sort all of our rows, all of our points, into these three operating modes. If you have more than three operating modes, obviously you can use more clusters. But it also gets interesting: what if you don't know how many modes you have? Maybe they're customer-defined, or maybe there's a suspicion that, "Hey, could there be some interstitial mode here?", maybe a transition state between two of these operating modes. If you want to investigate that way, you can use iterative clustering. I did that down here: you run over a range of cluster counts, I used 3-10 clusters, and the software identifies the optimal number of clusters. Looking at this, it has correctly identified it. It gives us these cubic clustering coefficients and identifies the optimal one: yes, as suspected, three is the optimal number of clusters to sort this data into. I'm not really worried about these state transitions; I'm more focused on the states themselves. We take that data, get a readout of it, and throw it up onto this 3D scatterplot. We take some of our tool health parameters, and we color everything by what cluster they're in. Immediately, we start to see some interesting results. We talked about how ionization current should be solid at 25.5, and we see that we have some variability here; it's dropping below that. Immediately we know that we have a problem. But what's more interesting is that every single one of those points is grouped into a single cluster, cluster two, which corresponds to the lowest input gas one and the highest input gas two.
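As a rough JSL illustration of that last step (the column names are assumptions, and it presumes the cluster assignments have already been saved to the table from the K Means platform):

// Minimal sketch: color rows by their saved cluster, then plot three health parameters in 3D
dt = Current Data Table();
dt << Color by Column( :Cluster );    // assumes a Cluster column saved from the clustering platform
sp = dt << Scatterplot 3D(
	Y( :Ionization Current, :Health Parameter 2, :Health Parameter 3 )
);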
Now, from an engineering perspective, if I'm looking to make optimizations or improve tool health, I can immediately say, "Hey, this is the operating mode that we need to look at." That's what I need in order to start looking at concrete next steps for improvement. I'm not looking at the tool as a whole; I've already managed to focus my search to one operating mode.

The last thing I want to talk about, having looked at two of these use cases, is: what are the other advantages of JMP? Why JMP? My customers are collecting all this data. They have ways to view it; they have SCADA systems and monitoring systems in place; they have ways to parse it. So why do I, as a supplier and servicer, need this platform to view and parse the data? The answer for me, at least in my case, is the cross-platform compatibility. If I'm reliant on my customer to chart and generate data views for me, I'm now taking up their time and their resources to troubleshoot a problem that I'm responsible for fixing. With JMP, as long as they can give me the raw data, I can do all of it myself. Not only does that free up their resources, it gives me the ability to do my own investigation independent of whatever system they're using for data analysis. It doesn't matter if they're using proprietary monitoring system A or B or C, or their own IoT monitoring system from their control engineers. It doesn't even matter if they have multiple data acquisition systems from different vendors. With JMP, I can import and manipulate whatever data they give me and perform these kinds of analyses, source-independent, and do the investigation that I need to do for my customer support, with all the tools for visualization and statistical analysis that JMP provides.

With that, it looks like we're pretty much at time here. I know this isn't necessarily the traditional use case for JMP, from some of the folks that I've talked to, but I hope it was helpful for people. I'd just like to thank Adam Stover, our CTO, and Gordon Tendik, our Director of Apps and Technology, for helping me put all this together and reviewing the work that I've done. Thank you for your time.
Part distributions are easy to measure: parts are built, an operator measures the parts with a gauge, and the results are assembled into a measured part distribution (MPD).

But the resulting distribution is contaminated by errors associated with the measurement system. Random errors, gauge bias, and linearity problems all contribute to inaccuracies in measuring the true part values, so the individual values can never be truly known.

However, if we had a way to estimate the true part distribution (TPD), we could compare it to the MPD and calculate the impact (cost) associated with using the imperfect gauge in terms of Type 1 and Type 2 errors. It is trivial to estimate the TPD from an MPD if the gauge creates simple normally distributed errors around a normally distributed TPD (i.e., simply subtract variance of gauge from MPD variance to get TPD variance). But what if the gauge has linearity problems? Or what if the TPD has a non-normal shape?

This paper describes a new JSL script for determining an arbitrary (i.e., non-parametric) TPD from an arbitrary MPD and associated gauge performance characteristics. The resulting TPD can then be fed to a second script to determine production costs associated with the imperfect gauge and setting guardbands to optimize economics of the gauge errors. Performance of the estimation routine is evaluated, in terms of shape of TPD, various gauge characteristics, and resolution of distributions.

Hi, I'm Jerry Fish. I work for JMP, and I support our customers in the Central Region of the United States. Today, I'd like to talk to you about an add-in that I've developed. The title of the paper is Retrieving Arbitrary True Part Distributions from Measured Part Distributions and the Gauge Characteristics that go along with the measurement. Today's agenda: first, we're going to talk about, of course, what does this talk address? Why is this so important? Why can't we just subtract variances to get our True Part Distribution? Then, a little bit about what's behind our estimation computations. I'll demo the add-in, including some test results, and add some troubleshooting tips. I'll tell you where you can find the add-in, and then we'll share with you how you can give me feedback on what's good and what you don't like about the add-in, areas for improvement, and so forth.

What are we addressing here? Well, we're talking about an add-in that determines a True Part Distribution if you give it a Measured Part Distribution and if you describe your gauge performance characteristics. It's pretty easy to conceptualize if we start with a True Part Distribution: here's our true part value versus our percentage of parts. Then we run that through a gauge, an imperfect gauge that has some variance and bias characteristics, and we get a Measured Part Distribution out of that. We don't know, though, what our True Part Distribution is. What we're talking about is swapping those positions: we start with a Measured Part Distribution, we subtract out our gauge performance characteristics, and we end up with a True Part Distribution.
That's pretty simple to understand, but it gets more complicated if we have a Measured Part Distribution that is not normally distributed and/or we have a gauge that performs in non-standard ways, you might say. Perhaps our standard deviation shows curvature with the measured part value, or maybe it has bias that changes linearly, or maybe the bias has curvature as well. How can we take these quantities, the Measured Part Distribution and an arbitrary gauge performance curve, and come up with the True Part Distribution that must have caused this Measured Part Distribution?

Why is it important? Well, we all know that all gauges are imperfect. We'd like to get an idea of this True Part Distribution (you'll see it referred to, as we go along, as TPD), because with it we can understand our Type 1 and Type 2 errors. A Type 1 error means our gauge is throwing away good parts. A Type 2 error means our gauge is accepting bad parts. Both of these, particularly in a manufacturing environment, are bad things to happen. If we're throwing away good parts, that's waste, and we don't want waste in our process; that's just a straight bottom-line deduction from our profit statement. We also don't want to accept bad parts. If we do, we ship the bad parts out to a customer, and we're likely going to get complaints, returns, and reworks, and it's going to damage our company's reputation. We don't want either of those types of errors; they both hurt our company. If we knew the True Part Distribution, we could estimate the costs associated with these errors. That particular subject is addressed in another paper being presented here at Discovery 2023, with this title and this paper number; I encourage you to look it up. I co-authored that paper with two of my colleagues, Brady Brady and Jason Wiggins. We need that True Part Distribution to make this assessment.

Why can't we just subtract the variances? Well, you can. If your Measured Part Distribution is normally distributed and your gauge has constant variance and bias across the measurement range, then you can get to your True Part Distribution. You don't know the true part values of individual parts, and you never will, but you can get to the distribution. Under these constraints, under these assumptions, the variance of the True Part Distribution is simply the difference between the variance of your Measured Part Distribution and your gauge variance. You subtract those two, take the square root, and you get the standard deviation of your True Part Distribution. The average, where your True Part Distribution is centered, is simply wherever your Measured Part Distribution is centered minus the bias of the gauge. Of course, the question is, what do you do if your Measured Part Distribution is not normal, or if your gauge has unusual characteristics? This is how we can conceptualize inputting these values, and I'll show you the add-in in just a second. We can have any arbitrary input.
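In symbols, the simple normal, constant-gauge case described above reduces to the following relations (this is just the spoken description restated; the subscripts are mine):

$$\sigma_{TPD}^2 = \sigma_{MPD}^2 - \sigma_{gauge}^2 \quad\Rightarrow\quad \sigma_{TPD} = \sqrt{\sigma_{MPD}^2 - \sigma_{gauge}^2}, \qquad \mu_{TPD} = \mu_{MPD} - b_{gauge}$$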
Here, for our Measured Part Distribution, this is our measured part value versus the counts for however many parts we have measured. Maybe this looks like a combination of two normal distributions, maybe it's something a little different than that; the point is, you can put in any input shape for the Measured Part Distribution. Then we describe the gauge using quadratic functions for the sigma and for the bias of the gauge. Normally, for a standard gauge, all you're going to have is just these constants out here in front, C0 and D0, and D0 may be zero if you don't have any gauge bias. If your sigma changes linearly with part value, we allow you to put in C1, and D1 if your bias changes linearly. If there's any curvature, we allow you to put in a C2 and a D2. When you set up your gauge equation, if you put in all of these values, it's possible to generate negative standard deviations within the measurement range. Don't do that; if you can avoid it, don't do that. There may be unexpected results from the add-in if you have negative standard deviations, so just beware of that.

What's behind the estimation computations? Well, we start with, of course, the actual Measured Part Distribution and the gauge characteristics. We choose an estimated True Part Distribution; a starting point that seems like a good idea is the actual Measured Part Distribution itself. Then we put that estimated True Part Distribution through a transformation that represents the gauge characteristics, and that yields an estimated Measured Part Distribution. We can then compare the estimated Measured Part Distribution with the actual Measured Part Distribution on a bin-by-bin basis and get a Residual Sum of Squares error for that comparison. Then we go back and adjust the amplitudes of the True Part Distribution estimate, and we keep adjusting until the estimated Measured Part Distribution agrees as closely as possible with the actual Measured Part Distribution. We do that using the JSL Minimize function to minimize the Residual Sum of Squares.

All right, let's take a look at the add-in. Once you install the add-in, it comes in under Gauge Study Tools and TPD Estimation. This is what the add-in currently looks like, version 1.0. We start off with the ability to choose what type of input Measured Part Distribution you have. I'll come back to the arbitrary shape in a minute. We also have normal, where you input the average and standard deviation, LogNormal, Weibull, Exponential, Gamma, and a two-mixture normal distribution. We can set these up to be parametric if we want: if you know that you have a Weibull distribution, for example, you can use that as your input distribution. Let's start with normal; let's just make it simple. Here we have a normal distribution with a mean of zero and a standard deviation of three; that's shown in this panel here, in this little graph. Let's use a very simple gauge that has a standard deviation of one and a bias of zero. Click Next.
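Written out, the gauge description and the fitting criterion described above amount to something like the following (my notation: x is the part value, the a_i are the estimated TPD bin amplitudes, and the hat denotes the estimated MPD obtained by pushing the TPD estimate through the gauge model):

$$\sigma_g(x) = C_0 + C_1 x + C_2 x^2, \qquad b(x) = D_0 + D_1 x + D_2 x^2$$

$$\min_{\{a_i\}} \; \sum_{j} \left( \widehat{MPD}_j(\{a_i\}) - MPD_j \right)^2$$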
Here, much like above, we get to choose the True Part Distribution shape. We could say it's going to be arbitrary, or normal, or lognormal, all the same distributions as above. Or, down at the bottom, we give you the option to fit all of the distributions above. Let's again start with a simple example of a normal distribution and calculate those results.

We present two output plots. This plot shows the estimated True Part Distribution in blue versus the actual Measured Part Distribution in red. As you would expect in this case, with a normal measured distribution and a simple gauge, this is one of those we could solve by hand if we wanted to, and we end up with a slightly narrower True Part Distribution than the measured one. As a check, we can take that True Part Distribution, put it through the gauge, and we end up with an estimated measured distribution versus the actual one. That's what we get down here, and it looks like we have very good agreement in this particular case.

Let's go back up to our gauge definition. We'll keep the same Measured Part Distribution, but this time we'll put in a bias of two. We'll solve again, assuming that our True Part Distribution is normally distributed, and we get this out. It's pretty easy to conceptualize: everything is just shifted over by two units. Here's our True Part Distribution and our Measured Part Distribution. If we put that through the gauge with the bias, we again get very good agreement between the actual Measured Part Distribution and the estimated one.

Third example: let's come back up here and turn the bias off, since we've demonstrated that. Same input, same simple gauge, but now let's say we fit a Gamma distribution. We select Gamma and hit Calculate. This is the best-fit Gamma distribution for that input normal distribution. You can see it doesn't fit quite as well; our True Part Distribution is a little bit skewed, which is characteristic of Gammas. If you put that through our gauge, we end up with this agreement between the actual and estimated Measured Part Distributions. It's not as good a fit.

A summary is given in this table. It shows that the first time we ran this, we used a normal distribution input with two parameters, we did a normal fit on the output with two parameters, and we got this sum of squares error. The second time was with a bias, and we got the same sum of squares error in the end, as you might expect. Then with the Gamma, our sum of squares error was a little bit higher. We get a quick summary in this table, and there are two other JMP data tables that are built that hold all of this information: the original distribution, the output distributions, and all the gauge characteristics.
All of those are summarized in these other two tables, so you can go through and make your own plots if you want to.

Let's do one that's a little bit different. Let's come back up to the top and choose a user-defined shape. The data table is simply a two-column table. The first column is assumed to be the centers of your part values, your bin centers in that histogram, and the second column represents the amplitudes as you go across. Those amplitudes can be actual part counts or percentages, anything where each bin height is relative to the other heights. I scale everything so the sum of those amplitudes adds up to one within the program anyway, so as long as the relative heights are the same, it doesn't matter what the actual amplitudes are. I give the option to open a data table if it's not already open, or if it's already open within JMP, you can just select the already-opened data table.

Here's an example with a square wave, a uniform distribution, as our input Measured Part Distribution. This is a tough distribution to have. If you think about it, when you've got a gauge that makes normally distributed errors at every point, it's going to be really hard to produce something that's nice and sharp and crisp like this distribution on the output. Let's give it a try. Here I've got a pretty wide variation; this goes from zero to 30, I think. Let's say we've got a gauge with a standard deviation of five and no bias, and let's fit a normal distribution to that and calculate the results. Here we've got the best-fit normal True Part Distribution that, run through this gauge with a standard deviation of five, tries to give us this square wave for our Measured Part Distribution. How well did we do? It's not a great shape, and you probably wouldn't expect a great fit given that we're trying to use a normal distribution to fit a sharp square function.

If we want to do an arbitrary function, let's say this one here. For this one I just made up some data, and I'll show you a little more about what it is. Maybe it looks like two normal distributions mixed together, so let's see if we can fit it to a two-mixture normal; that option is down here, and we'll calculate those results. Here we go. Let me run that one more time; I don't want my standard deviation to be that big, so let's take a smaller standard deviation (we'll talk about that in a minute), keep everything else the same, and calculate the results. Sometimes it takes a few seconds to come back; it just depends on how the routine is fitting things. Here is our fitted True Part Distribution compared to the Measured Part Distribution, assuming that our True Part Distribution is two normal distributions mixed together. If you run that through the gauge, it ends up looking like this.
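For reference, the user-defined input just described is nothing more than bin centers and relative amplitudes that get rescaled to sum to one. A tiny illustrative sketch, with values invented to mimic the square-wave example:

```python
import numpy as np

# Two columns: bin centers and amplitudes. Amplitudes can be raw counts or
# percentages; only the relative heights matter, since everything is
# rescaled to sum to one.
bin_centers = np.arange(0, 31, 1.0)                                        # part values 0 to 30
amplitudes = np.where((bin_centers >= 5) & (bin_centers <= 25), 40.0, 0.0)  # square wave

measured = amplitudes / amplitudes.sum()   # normalized bin heights
```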
This is the attempt to match that, and what our Measured Part Distribution would have been. That's not too bad. That's the way the add-in works.

Now, there is another option: you can choose whatever inputs you want for your Measured Part Distribution and your gauge characteristics, and then when you fit down here, you can also fit an arbitrary shape. On my PC that takes maybe a minute to run, so I'm going to spare you that and just show you the outputs in a PowerPoint slide.

Here we are back in PowerPoint. This is one other example I have before I get to the arbitrary inputs. This one has a bias that I've expressed as one plus 0.03 times the part value, so the bias changes linearly across the measurement range. I have a normal distribution for my input, and I want to fit a normal distribution to the output. As it turns out, this is my True Part Distribution and this is my Measured Part Distribution. If I run those through this gauge, even with the linearly changing bias across the measurement range, I get very good agreement between the two.

This is what happens if I take that square wave and say, "Hey JMP, go fit whatever True Part Distribution you want, run it through a gauge that has a sigma of two and a bias of zero, and tell me what that distribution might look like." What you get out, the red curve again, is the Measured Part Distribution, that square wave. You get this crazy-looking thing with all these different peaks and valleys in it as the True Part Distribution. That doesn't look like any True Part Distribution I would expect to have, but if you look down here, when you run it through the gauge, it does a pretty good job of simulating this square wave, this uniform distribution. I believe it's working. Now, there are reasons it might come up with something like this, probably associated with the resolution of the gauge: the gauge may simply not resolve enough elements across the measurement range to give you a nice smooth distribution over here. That's the idea, and you can do this with this gauge.

A little about troubleshooting. There are problems if your gauge standard deviation is too large in comparison to the spread of the Measured Part Distribution. If your Measured Part Distribution is, say, normally distributed with a standard deviation of three, and you tell the add-in that your gauge has a standard deviation of five, there's no way to fit that: even if you measured the same true part over and over again, you would get a spread with a standard deviation of five. You'll get an error when this occurs, and it's up to you to recognize that your gauge standard deviation is way too large. What if it's simply on the verge of being too large?
Let's say the standard deviation of your measured distribution is 3 and your gauge is 2.8. Then again, the add-in is going to try to give you a very narrow True Part Distribution to support that, and that can lead to some strange results.

There are some odd combinations I've run across that can cause what I call untrappable errors. When you go into the JSL Minimize function, it does its own thing and then comes back with an answer. If it runs into a problem, it throws an error, maybe like this one; I've seen two or three different ones, and I don't have a way to trap for them. If you get an error like this, the add-in will continue to run, but you'll need to look into what conditions you've put in that JMP is having trouble solving for.

Chunky Measured Part Distributions. By chunky, I mean there aren't very many bins across the measurement range; this one here might be pretty chunky. It's related to your gauge in the end, to how much your gauge can resolve. You want to have a lot of bins across here; more bins is better. Fewer bins makes the True Part Distribution very difficult to estimate.

Then, as I alluded to earlier, you can have long convergence times, particularly when you're trying to solve for these arbitrary True Part Distributions. On my PC it's not uncommon to go a minute or a little more. Just hang in there; the add-in has always come back for me. It doesn't hang, it just takes a while for some solutions.

The add-in should be attached to this recording, and you should also be able to find it in the JMP Community File Exchange under TPD Estimation. If you have comments or questions, you can post them either below this video or on the File Exchange. Please include any suggestions you have for an improved graphical user interface or for changes to the outputs. We didn't spend much time on the data tables the add-in builds, but if you see those and decide you wish they were in a different format, let me know; those are things I can change fairly easily. Also send suggestions for more parametric Measured or True Part Distributions; right now these include the normal, lognormal, Weibull, and Gamma functions, and if you have more you want added, let me know and I'll see if I can incorporate them. And of course, report any problems you encounter. A description of the problem and how it occurred will help me in debugging, and if possible, include a non-confidential sample input file I can use to replicate the problem. Wherever you post these comments, please include @JerryFish so I get a notification.

Thank you very much for listening to this recording. Don't forget to check out the accompanying Discovery paper, "News Flash: Gauges aren't perfect (okay, you know that). But how much is it costing your business?", under this particular paper number. Thank you very much for your time.
The most novel, innovative, and promising therapeutics in biopharmaceuticals are cell therapies. Cell therapies transfer human cells into a patient to treat disease. These cells either come directly from the patient or from a healthy (cell) donor. Multiple regulatory guidance documents recognize the importance of providing cell therapy manufacturers the flexibility to improve their processes. Therefore, it is imperative to show that the pre- and post-change processes are comparable and that process changes pose no threat to the safety or efficacy of the drug product.

One method used to ensure comparability is an equivalence test of means. There is a regulatory expectation that the study is done as a paired design, often referred to as a split-apheresis study, unless there is minimal donor-to-donor variability. In split-apheresis studies, the same donor material is split and processed in the pre- and post-change process for comparison. The design of this study presents unique challenges in cell therapies, as it requires adequate sample sizes to ensure properly powered designs, yet the number of healthy donors available is usually quite low (three to six donors). Additionally, the power depends on lot-to-lot and assay variability, assay replication strategy, and the effect size used for the equivalence acceptance criterion (EAC).

This talk presents a series of JMP scripts that extend the existing capabilities of the Sample Size Explorer platform to address each of these relevant study nuances, as well as the capability to overlay power curves to address trade-offs with different sample sizes and approaches.

I am Heath Rushing. Although Andrew Karl, Jeff Hofer, and Rick Burdick, some teammates of mine, did the majority of the technical work here, I'm going to be the one presenting today. I'm going to talk about how JMP and JMP scripts can be used in a very specific application in cell therapies. I'll talk a little bit about what gene and cell therapies are, and the specific topic I want to cover is comparability; I'm going to focus on process changes. Interestingly enough, last year I gave a talk that focused on cell and gene therapies. They're very novel therapeutics; the first one was approved in the United States in 2017. They're a little different from most of what I call the small-molecule and large-molecule therapeutics you may have heard of in the past.

Let me touch base on what a cell and gene therapy is, starting with gene therapy. What you're essentially doing is replacing a gene with a healthy one, or turning off bad genes. A lot of cancers are caused by defective genes, so you're inserting these healthy genes back into a patient, either in vivo or in vitro; the in vitro route would be more like a bone marrow transplant. Last year I talked about the challenge with gene therapies being that patient-to-patient variability, and I focused on process development. Then I talked about cell therapies. In cell therapies, you're replacing diseased cells: you're either transferring some sort of healthy cell into a patient, or you're replacing missing cells in a patient. Where do these cells come from?
They either come from the patients themselves, in which case you have to deal with that patient-to-patient variability, or, in most cases, they come from a healthy donor. Now you're not dealing with patient-to-patient variability, but you are dealing with donor-to-donor variability. Whenever I say donor, I mean a healthy donor; I could be a healthy donor, and someone else could be a healthy donor also. In both cases you have to deal with that patient-to-patient or donor-to-donor variability.

What's interesting is that last year I gave an example in process development, and it looked something like this. It was the exact same data set I used last year, where I said: say you are developing a process where you look at time, temperature, and pH, and you're measuring their effect on cell viability and byproduct. In that case, I could not use one donor's material; I had to split it up across four different donors. I said, "If you ran these experiments for process development and did not consider that there was donor-to-donor variability, this is what you would see." Looking for p-values below 0.05, you would say nothing affects cell viability and nothing affects byproduct. You were not able to detect any significant or critical process parameters, for the very reason that you did not consider that there could be a difference between donors. Now, if you do consider donor as what's called a fixed donor effect, the only thing I did was bring donor into the model, and then it really sticks out what significantly affects cell viability and what significantly affects byproduct. The whole talk was on how that donor-to-donor variability affects statistical inference and also process capability; here I'm going to focus on the statistical inference.

What you're trying to do in process development is determine whether things like pH, temperature, and time significantly affect your critical quality attributes. Say that I'm a drug manufacturer and I've set up and run a process development study, and I want to determine whether temperature affects, let's call it, cell viability. Whenever I'm looking at that, I want to make sure that if something significantly affects my quality attributes, I control it in my process; but if it doesn't, I'm not spending money, time, and resources controlling it. What I'm concerned with as a drug manufacturer is the Type I error rate; I do not want to inflate the Type I error rate. A Type I error would say, "Hey, this is significant," when, in fact, it's not. What do you think regulatory agencies would be more concerned with: you controlling more things, or you not controlling things that should be controlled? That's exactly right: they'd be more concerned about that patient risk, that Type II error. So in process development, drug manufacturers do not want to inflate the Type I error, but they also want significant power. Why? Because that controls the patient risk.
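To make the fixed-donor-effect point concrete, here is a generic Python sketch with statsmodels. The data, column names, and effect sizes are invented for illustration; this is not the speaker's data set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented illustration: four donors, a small study in time/temperature/pH,
# and a viability response driven by temperature plus a donor offset.
rng = np.random.default_rng(0)
n = 32
df = pd.DataFrame({
    "donor": rng.choice(["D1", "D2", "D3", "D4"], n),
    "time": rng.choice([24, 48], n),
    "temperature": rng.choice([30, 37], n),
    "pH": rng.choice([6.8, 7.4], n),
})
donor_offset = df["donor"].map({"D1": 0, "D2": 8, "D3": -7, "D4": 12})
df["viability"] = (70 + 0.4 * (df["temperature"] - 30)
                   + donor_offset + rng.normal(0, 1, n))

# Ignoring donor: donor-to-donor variability sits in the error term and
# can swamp the real temperature effect.
print(smf.ols("viability ~ time + temperature + pH", data=df).fit().pvalues)

# Donor as a fixed effect: that variability is removed from the residual,
# so a real temperature effect should stand out.
print(smf.ols("viability ~ time + temperature + pH + C(donor)",
              data=df).fit().pvalues)
```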
The whole point of showing that last year was to show the effect of donor-to-donor variability on determining your critical process parameters, what I call statistical inference. Now, what happens if I change my process? I was working with a colleague just last week, and when we were talking about cell and gene therapies, she said, and this is her quote, "Heath, in cell and gene therapies, things are constantly changing. You could have things like analytical methods change. You could have things like process change." Today I'm going to focus on that process change.

I do want to point out that regulatory agencies understand you have a need to improve your process. But even if you are improving your process, you are still changing your process. They recognize the need for that, but they also recognize that the therapeutics you make from that process should be similar in terms of product quality, because you're using them in clinical trials. What does it mean to be similar? It doesn't mean they have to be exactly the same, but they have to be similar, or comparable. To say that something is similar, I want some similarity condition; that's the whole point of comparability. For very low-risk attributes, I can show that process A and process B are similar with side-by-side plots. For higher-risk attributes, I might want something like a quality range: I take the reference group, the old process, build some range around it, and ensure that all of the measured quality attributes from the new process fall within that range. For very high-risk attributes, I want to do equivalence testing. That is what I'm going to focus on today: what equivalence testing is, and how that acceptable difference, that similarity condition, is set. It's called equivalence testing, or the Two One-Sided t-Tests.

To reiterate what we talked about before: whenever I'm using design of experiments in process development, I'm changing some variable, like temperature, from low to high, and I'm measuring the effect on my critical quality attributes. I am assuming in the null hypothesis that they are the same, and I set up a design to see if they're different. A Type I error in that case would be me saying, "Wow, they're different," when, in fact, they're not. That would mean I would spend resources controlling that in the process. If I'm a drug manufacturer, I do not want to control things I don't need to, so I'm concerned about that Type I error rate. If I were a regulatory agency, I would be even more concerned with the Type II error: no difference is detected when, in fact, there is one, so something should be controlled and isn't. Now we're going to flip it. We're going to talk about equivalence testing.
In equivalence testing, I'm not saying that they are the same. I am assuming that there is a difference; I just want to make sure the difference isn't too big. That "too big" I'm going to call delta, or d; it's often called the equivalence acceptance criterion. There are a lot of different ways to calculate it. I would like it to come from subject matter expertise, but the majority of the time it comes from taking some k-value times a historical standard deviation. The test is split into two one-sided tests: in one, I'm determining whether the difference is less than positive d, and in the other, I want to show in the alternative hypothesis that the difference is greater than negative d. I'm testing from two different sides, the lower side and the upper side.

If I were a drug manufacturer, what would I want? I would want to be able to reject both of those null hypotheses. I would want high power and a low Type II error. This is equivalent to taking a 90% confidence interval around the difference in means and ensuring that both the low and high ends of that interval fall within the bounds of the lower and upper delta. Looking at this, you should think to yourself, "I want the width of that confidence interval to be very small." What are the different ways I could make the width of the confidence interval for the difference between those two means very small? I could decrease my standard deviation; that's a good thing. I could increase my sample size; that's a good thing. I could also increase my alpha level. That wouldn't be so good, because it inflates your Type I error rate, and in doing so you're claiming equivalence when indeed the processes are not equivalent. So the ways to control the width of that confidence interval are to lower s, increase n, or increase alpha, and we said two of those are good and one is not.

It makes sense that if I'm a drug manufacturer, I want to maximize the power of the design; that's the flip side, minimizing my Type II error. Regulatory agencies want to make sure you do not inflate the Type I error rate, which would mean stating equivalence when indeed the processes are not equivalent. In JMP, you can do these equivalence tests, and I want to show you an example.

From my journal, the first thing I want to show you, in terms of determining your Type I and Type II error rates, is that JMP provides power curves under Sample Size Explorer, Power, Two Sample Independent Equivalence. Caleb King did a great job with this. Let's say that my margin, my equivalence acceptance criterion, is plus or minus 2 standard deviations. I'm just going to put a 2 here, meaning 2 times the standard deviation I'm talking about. That's all I'm doing.
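To make the two one-sided tests concrete, here is an illustrative Python sketch. The data are made up, k is chosen arbitrarily, and a simple pooled-degrees-of-freedom approximation is used; it is not the JMP implementation.

```python
import numpy as np
from scipy import stats

# Two One-Sided Tests (TOST) for equivalence of means, with the acceptance
# criterion set as d = k * historical standard deviation.
pre = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1])
post = np.array([10.3, 10.0, 10.5, 9.9, 10.2])
k = 2.0
d = k * pre.std(ddof=1)                  # equivalence acceptance criterion (EAC)

diff = post.mean() - pre.mean()
se = np.sqrt(pre.var(ddof=1) / len(pre) + post.var(ddof=1) / len(post))
df = len(pre) + len(post) - 2            # simple approximation for the df

# One-sided tests: H0: diff <= -d  and  H0: diff >= +d
t_lower = (diff + d) / se
t_upper = (diff - d) / se
p_lower = 1 - stats.t.cdf(t_lower, df)
p_upper = stats.t.cdf(t_upper, df)

# Equivalent view: the 90% CI on the difference must sit inside (-d, +d).
tcrit = stats.t.ppf(0.95, df)
ci = (diff - tcrit * se, diff + tcrit * se)
equivalent = max(p_lower, p_upper) < 0.05
```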
Let's say that in my historical process I have 10 lots, and I'm going to compare it to a new process that has 5 lots. I want to see what the power is if they are exactly the same, that is, if there's no difference between them. A few things to point out: JMP gives those power calculations, and it also lets you change the inputs. What happens if I increase the number of samples in my new process? My power goes up. What happens if I decrease that margin from 2 standard deviations to, say, 1.5 standard deviations, essentially taking those boundaries and tightening them up? My power goes down. So I'm able to ask all of the typical questions you would ask in equivalence testing.

Something else I want to show you, which will come up again, is that JMP lets you specify whether you know the true standard deviation or not. If I know the true standard deviation, that is better: your power goes up, and indeed it does here. That's usually not the case, though. I call the known-standard-deviation case the utopia, and the case where I do not know the true standard deviation the realism.

I would be remiss if I did not show you the tools JMP does have for demonstrating equivalence, say if I had a historical process with 10 lots and I made 5 new ones. First, I look at this in Graph Builder, and I see that there is no effect between the two; they both look like they came from the same process, the blue versus the red. How about if there is an effect? Then I see a shift. Just like I showed you before, that is the Two One-Sided t-Test, and JMP has tools for that. Jin Feng did a great job with this. My goodness, I love this plot: here's the difference in means, here's the lower bound, here's the upper bound, and it's within the boundaries. In that case, you've rejected both null hypotheses in favor of the alternative, which matches what you see in the picture. What you also see is that if there is an effect, I am not going to reject both nulls; I'm going to fail to reject one of them, and indeed I did, and you'll see the confidence interval falls outside that boundary.

I would like to talk about a very specific case in cell therapy called the split-apheresis design. This is the situation in cell therapies where you're changing the process, and you use donor material split between the two different processes. We kept getting questions over and over again from our customers: "Can I look at the sample size and power calculations for these paired designs? Can I overlay them?"
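The same kind of power curve can be approximated by brute-force simulation. The sketch below is illustrative Python, not the Sample Size Explorer's calculation; it assumes an unknown common standard deviation and expresses both the margin and the true shift in sigma units.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def tost_power(n1=10, n2=5, margin_sd=2.0, shift_sd=0.0, alpha=0.05, nsim=5000):
    """Simulated power of a two-sample independent equivalence (TOST) test."""
    passes = 0
    for _ in range(nsim):
        a = rng.normal(0.0, 1.0, n1)                 # historical process lots
        b = rng.normal(shift_sd, 1.0, n2)            # new process lots
        diff = b.mean() - a.mean()
        sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
        se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
        tcrit = stats.t.ppf(1 - alpha, n1 + n2 - 2)
        lo, hi = diff - tcrit * se, diff + tcrit * se
        passes += (lo > -margin_sd) and (hi < margin_sd)   # 90% CI inside the margin
    return passes / nsim

print(tost_power())                     # power with no true difference
print(tost_power(margin_sd=1.5))        # tighter margin -> lower power
```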
Can you see how they depend on that donor-to-donor variability? Let's talk about the split-apheresis design. First, the regulatory expectation. There is a recent draft guidance document from the FDA, from July of 2023, just last month, which says you need to select a suitable statistical test for analyzing the difference between paired data, where the donors are paired up. That's where the split-apheresis design comes from: for every donor material that you have, you split it between process A and process B. This is not two independent t-tests; it is a paired design. That's the first point. The second is that you are very often in early stages, so you do not have a lot of donor material and you have very low sample sizes, and it's hard to get power out of low sample sizes. The third point is how you come up with your EAC, your similarity condition, that acceptable difference. You use historical data made from multiple donors, take the standard deviation of that historical data (I'll call the number of historical lots n1), and take some k number of those historical standard deviations. You then run the split-apheresis test and judge it against that criterion derived from the historical data.

Here are two examples, both looking at process A and process B with six different donors. In the one on the left, the majority of the variation comes from donor-to-donor variability, not from the difference between process A and process B; you have high donor-to-donor variability, which I'm going to call rho. In the case on the right, the majority of the variation comes from the difference between process A and process B, not the donor-to-donor variability; most of the variation is analytical or process variation, which tells you that you have very low rho, low donor-to-donor variability.

I'm going to show you a series of scripts that we worked on, built around typical questions from our customers. In our cases, we do not know what the standard deviation is; how does that compare to the known case? What about those Type I and Type II error rates? Remember, if I'm a drug manufacturer, I want to increase the power; if I'm a regulatory agency, I want to make sure you do not inflate that Type I error rate. How are we going to evaluate this? This is from the European Medicines Agency, 2001: the best way is with what are called expected operating characteristic curves, which put power on the y-axis against a shift in the mean. I'm going to go through a series of scripts; really it's one script, right here, that allows me to change things like that rho, the proportion of donor-to-donor variability.
It also lets me change that k-value; remember, the acceptance criterion is k times the standard deviation, typically k times the standard deviation of the historical lots. This is the number of historical lots you use, n1, and n2 is the number of lots I'm going to use for the paired design. When you run the script, it does a series of simulations; in this case it did 5,000, and it calculates the power for you: of those 5,000 runs, what percentage passed? It looks something like this, and it gives you a lot of options: I can look at different k-values, different numbers of historical lots n1, and different numbers of paired lots n2. I can then select which of these cases I want to look at to answer typical questions.

Let me open up my typical comparisons. The first one: "Heath, what if I have a known standard deviation?" It looks something like this. A few things to point out. This is the percentage of the time you're going to claim equivalence. If the processes are exactly the same, you claim equivalence a high percentage of the time. If there's a huge difference between them, like a two or three standard deviation shift, you do not claim equivalence. That's a good thing. The other thing to notice is this alpha of 0.05: given that I set my k-value at 2, that is, 2 standard deviations based on 10 historical lots, you would expect the alpha level to be 0.05, the exact alpha level I set in my equivalence test.

Now, this curve is for a proportion of donor-to-donor variability of 90%. What happens if I change that to 60%? To 30%? To no donor-to-donor variability at all? What you see is that the paired test's power curve looks really good whenever I have high donor-to-donor variability. The other thing you notice with the known standard deviation is that the alpha level, regardless of which operating characteristic curve you look at, is always 0.05.

Let's look at some other typical questions. One is: how does this compare across the different levels of rho when I use my typical approach, where I do not know what the standard deviation is? My typical approach is in blue; the known standard deviation is in red. One thing I want to point out is this one right here. What you see is that the preferred approach, the approach that even regulatory documents have said you should use, the paired approach, using the standard deviation calculated off of my historical lots, has an inflated Type I error rate.
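The simulation behind curves like these can be sketched in a few lines. The Python below is an illustrative stand-in for the JMP script, under an assumed model: total variance of one split into a donor component rho and a process-plus-assay component 1 − rho, with the EAC set from the historical-lot standard deviation and 5,000 runs per point.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def split_apheresis_power(n1=10, n2=6, k=2.0, rho=0.9, shift=0.0,
                          alpha=0.05, nsim=5000):
    """Simulated probability of passing the paired (split-apheresis)
    equivalence test; rho is the proportion of total variance that is
    donor-to-donor, and the EAC is k * sd of the historical lots."""
    sd_donor = np.sqrt(rho)
    sd_resid = np.sqrt(1.0 - rho)         # process + assay variability
    passes = 0
    for _ in range(nsim):
        # Historical lots (each from its own donor) set the EAC.
        hist = rng.normal(0, sd_donor, n1) + rng.normal(0, sd_resid, n1)
        eac = k * hist.std(ddof=1)
        # Split apheresis: the same donor material goes through both processes.
        donors = rng.normal(0, sd_donor, n2)
        pre = donors + rng.normal(0, sd_resid, n2)
        post = donors + shift + rng.normal(0, sd_resid, n2)
        d = post - pre                     # the donor effect cancels in the difference
        se = d.std(ddof=1) / np.sqrt(n2)
        tcrit = stats.t.ppf(1 - alpha, n2 - 1)
        lo, hi = d.mean() - tcrit * se, d.mean() + tcrit * se
        passes += (lo > -eac) and (hi < eac)
    return passes / nsim

# Sweeping the true shift traces out an operating characteristic curve.
for shift in (0.0, 1.0, 2.0):
    print(shift, split_apheresis_power(shift=shift))
```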
This should be 0.05, just like it is here. That seemed really strange to us, and when we looked into it, we found it has everything to do with this right here. Remember how I decrease the width of that confidence interval: decrease s, increase n, or increase my alpha level. Understand this: the reason you have an inflated Type I error rate with this paired test is that the deltas you're judging against use the standard deviation of historical data, which contains donor-to-donor variability, while the confidence interval itself does not contain donor-to-donor variability. Why? Because you did a paired test; it contains only analytical and process variability. That's where the inflated Type I error rate comes from. If you use this paired approach, understand that you have an inflated Type I error rate. We see that, and it's even more prevalent when you have high donor-to-donor variability. Why? Because when you have low donor-to-donor variability, the process variability is the largest part of the variance components you have.

Let's look at a few more questions. As I said, this one script answers all of these. This one answers: "Hey, Heath, if I use that recommended paired approach, can I look at what happens as I increase sample size from 3 to 4 to 5 to 6?" Two things to point out. Number one, as I increase sample size, I get higher power, but I still do not have adequate power if there's no donor-to-donor variability; with zero donor-to-donor variability, I would need a sample size of at least 8, that is, 8 different donors. If I do have high donor-to-donor variability, like 0.9, where 90% of the variability is donor-to-donor, I do have high power when there's no difference between the means. So I can answer those questions with overlaid operating characteristic curves for different sample sizes.

I can also ask: "I've set my different sample sizes, but what if we look at different k-values?" Understand that your acceptance criterion is k standard deviations, so those goal posts widen as you increase k. Therefore, you have a much higher ability to pass equivalence, and much higher power. Another typical question: what if I want to change both of those together? I'm a big fan of Graph Builder. What you're looking at here is not only increasing sample size, blue is 3, red is 4, green is 5, and purple is 6, but doing that for different k-values: what would your operating characteristic curves look like? Good. I want to revisit this.
Just like I said before: whenever I have a large proportion of donor-to-donor variability, I said that for k of 2 right here, I would expect my alpha level, the proportion of the time I pass this test at the boundary, to be 0.05, but what you see is an inflated Type I error rate. And whenever I'm looking at a rho, a proportion of donor-to-donor variability, that is very small, I do not have much power.

So the question was, what if we did this instead? If we have low donor-to-donor variability, what if we used the information from those historical lots? If I have no or very low donor-to-donor variability, why couldn't I just do an independent t-test, where I compare the mean of the new process against process A, my historical process, using not just the paired lots but those 10 historical lots as well? We wanted to see how that compared. Doing it that way, the independent test is in red and the paired approach is in blue. What you see is that if I have little to no donor-to-donor variability in my cell therapy split-apheresis process, the independent t-test has a much better profile than the paired approach. However, if I have high donor-to-donor variability, the paired approach in blue has a much better operating characteristic than the red. So the question is, instead of automatically doing that split-apheresis paired design, maybe it would be better to make a decision based on that donor-to-donor variability.

How does this compare when I look at different k-values? I see the exact same phenomenon: with low donor-to-donor variability, it makes sense to do the independent t-test; with high donor-to-donor variability, I have a much better operating characteristic curve and higher power with the paired approach. It doesn't matter whether I look at a k of 1.5, 2, or even 3.0; regardless of the k-value, I have a much better operating characteristic curve if I consider that donor-to-donor variability. What if I look at different numbers of lots? We looked at 3, 4, 5, and 6 paired lots, and regardless, you see the same phenomenon. We're currently writing a paper on this to propose that if you have low donor-to-donor variability, maybe it does not make sense to use a split-apheresis, or paired, analysis approach; maybe that approach is only good when you have high donor-to-donor variability.

These are typical questions that are asked about split-apheresis designs. I just want to cover two or three more of these to show you a few other things you could do. These are different things we were looking at.
We looked at how the operating characteristic curve compares if, in blue, we use nothing but the historical lots to estimate the standard deviation, versus, in red, using the paired and the historical lots together. What you see is that there's not much difference between the two, especially at higher sample sizes for n2. We also looked at estimating that standard deviation a few different ways, using the historical lots (in blue) versus the historical lots plus the paired lots (in red), and comparing the independent case versus the paired case. What do I see? As I said before, the exact same phenomenon: with low donor-to-donor variability, the much better way of doing this is an independent t-test; in the lower right-hand corner, where you have high donor-to-donor variability, it makes sense to use the paired approach.

The last one I want to show you is something we've been working on. We looked at the paired approach versus the independent one; the paired approach is in blue and the independent is in red. I've said this over and over again: if I have low donor-to-donor variability, the independent case looks much better, and if I have high donor-to-donor variability, the paired approach looks better. But one thing we did was ask, "What if I simply used whichever approach gave me the shortest width of the confidence interval?" That's in green, and what you see is that it usually gives you the best approach regardless of what your rho, your proportion of donor-to-donor variability, is.

In closing, I would like to point out a few things. This script answers the typical questions our customers have about operating characteristic curves associated with these split-apheresis designs. What I do want you to take away, though, is that if you have a low proportion of donor-to-donor variability, these designs are very underpowered for fewer than 8 lots, fewer than 8 different donor materials. We live in a world in cell therapies where you do not have a lot of donor material, so you have very low sample sizes. It would be much more efficient, if you have low donor-to-donor variability, to use the independent case. We also have other revisions of this work for situations where you are able to make multiple lots for the paired approach from the same donor, or to take multiple measurements, so you can look at those operating characteristic curves as well. Thank you.
There is a no-man's-land in JMP, a place where it is scary to venture. It is a place inhabited by specters of data sets too complex for simple nonlinear regression, but too ephemeral for functional data analysis. It is a strange place. It appears basic methods are enough to venture here, but using those techniques quickly becomes unmanageable. These spectra can come from many sources, but all have a common problem: there is too much data for simple nonlinear regression or too little to use functional methods. Join me on a journey across this challenging landscape of classical spectroscopy to learn methods for extracting information from examples of complex spectra and how to automate the process.

There's a strange place that sits between the analytical tools you would use for analysis with known physical models and simple curves, and the analyses you would do, say, with Functional Data Explorer, where you have families of curves with complex shapes and you're less interested in the actual physical nature of the shapes themselves than in relating them back to observed phenomena. This strange no-man's-land of analysis in JMP is where a lot of first-principles techniques sit: things like X-ray diffraction, things like HPLC, where we have known physical methods and known equations that describe very fundamental phenomena of a molecule, a crystal, or a system. All we have to do is plug in peak positions or area-under-the-curve information, and we can get some very sophisticated analyses out of fairly simple data points because of these first-principles methods. At first blush, it would seem like JMP should be able to handle that; it seems like it's got all the tools. But when we dig into doing those kinds of analyses, we suddenly realize that the problem is a bit more complex than we would expect. Today I want to focus on some techniques and strategies for the simpler cases and then introduce some tools to streamline the larger, more complex problems. Let's move into JMP and have a look.

To start off, let's look at a very simple case: a single peak on a background. How would we pull the information out of this peak? How would we get its center position, its full width at half maximum or its standard deviation, or even the area under the curve? Most of us who have done this for a while would say: I'm going to go into Fit Curve, say here's my Y and here's my X, and then fit a peak model of some kind, let's say the Gaussian peak. You look at that and go, hey, 98% R², that's awesome. Let's see if we can do a little better. To skip ahead just a little, we can look at the Lorentzian peak shape and the Gaussian peak shape, and we see that both give a fairly good R² and fairly good peak fits.
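As a point of comparison outside JMP, the same single-peak fit and residual check can be sketched in a few lines of Python with scipy, using synthetic data in place of the demo spectrum:

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit a single Gaussian peak on a flat baseline and inspect the residuals.
def gaussian_peak(x, baseline, height, center, sigma):
    return baseline + height * np.exp(-0.5 * ((x - center) / sigma) ** 2)

x = np.linspace(0, 100, 501)
rng = np.random.default_rng(0)
y = gaussian_peak(x, 5, 100, 50, 4) + rng.normal(0, 1, x.size)

popt, pcov = curve_fit(gaussian_peak, x, y, p0=[0, 80, 45, 5])
residuals = y - gaussian_peak(x, *popt)

# A high R-squared can hide systematic error, so plot the residuals against x
# rather than trusting the summary statistic alone.
r2 = 1 - residuals.var() / y.var()
```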
We could even go into the values underneath each fit and pull up the area under the curve. But how good are those fits actually? Let's look at them a different way. We'll pull up Graph Builder and look at how the models relate to the residuals for those peaks. We see a very different story than what we saw in Fit Curve with these two peak shapes: there's a systematic error built into them. With the Gaussian, we can see it's underestimating at the center, doing okay on the shoulders, but really missing things out in the tails. We see almost the inverse for the Lorentzian. Why is that? The truth is that in spectroscopy particularly, there are a lot of different peak shapes. It's not just Gaussian and Lorentzian; there's a whole family of peak shapes out there to handle all the different physical phenomena that produce the peaks we see in spectroscopy. How do we deal with those in JMP? It's actually quite easy.

Let's start by looking at the result of using the correct peak shape. Here I've got the residuals for the Gaussian peak fit again, and the blue line in this case is no longer the Lorentzian; it's called a pseudo-Voigt, which is an approximation of a peak shape called a Voigt function. Notice that the residuals for the pseudo-Voigt are dead flat; we are doing much better. Before, if we tried to do quantification with the Lorentzian or the Gaussian, we might over- or underestimate the quantity of a material in a sample. With the pseudo-Voigt, because this data really is a Voigt peak shape, we're going to get accurate quantification, and that's the important thing.

Now, how did I do this? There are a few ways. The easiest is to come into the data table and create a model column. The model is really easy to make: this is the Voigt peak shape. It looks a little scary, but that's just the nature of the math. I've got a parameter for the baseline, and this whole mess here is the Voigt peak shape. We can go into the parameter settings and define starting points for each of our values. Then, rather than Fit Curve, we come down to Nonlinear and use that column as the starting point for an analysis. I'm going to expand intermediate formulas; that's a good habit to get into in this case. I did that wrong; let's go back and redo that. It should be the counts. There we go, that looks better. Now if I click Go, it does my peak fitting for me. That's great. I can't get the area under the curve here very easily, but I can get just about every other parameter I need. The nice thing about a lot of these peak shapes is that they also have well-defined integrals.
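For readers who want the math in a portable form, here is one common pseudo-Voigt parameterization as an illustrative Python function. The parameter names are mine, not necessarily the ones used in the column formula shown in the demo:

```python
import numpy as np

# Pseudo-Voigt: a linear mix of a Gaussian and a Lorentzian that share one
# full width at half maximum (fwhm), with mixing fraction eta between
# 0 (pure Gaussian) and 1 (pure Lorentzian).
def pseudo_voigt(x, height, center, fwhm, eta):
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))        # Gaussian sd from fwhm
    gauss = np.exp(-0.5 * ((x - center) / sigma) ** 2)
    lorentz = 1.0 / (1.0 + ((x - center) / (fwhm / 2.0)) ** 2)
    return height * (eta * lorentz + (1.0 - eta) * gauss)

# Single peak on a constant baseline, mirroring the column formula above.
def peak_model(x, baseline, height, center, fwhm, eta):
    return baseline + pseudo_voigt(x, height, center, fwhm, eta)
```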
Once you have the standard deviation, the mean, and that information, you can usually get the integral, the area under the curve, fairly easily. That's one way of handling it, but writing the whole expression out by hand introduces a lot of opportunities for error in this peak shape. What we really want is something that looks a bit more like this: a predefined function called PseudoVoigt. We give it all of our fitting parameters, plus a fitting parameter for the baseline. It's the same math, but we've cloaked it in an easy-to-understand function where we are just providing the parameters we want to fit, and it works the same in Nonlinear.

How do I do that? There's a lot of code right here, but the big things to pay attention to are that we're defining a function, we're defining some parameters, and at the very bottom, for a whole family of peak shapes, I am using the Add Custom Functions operator to put them into JMP's memory, so that JMP knows I've got these custom functions and knows what they look like and how they behave. Doing it that way provides some really powerful tools. If I come into the Scripting Index once I've defined my functions, they show up there. I didn't give a lot of descriptions here, but you could give quite detailed descriptions and examples if you'd like. The other thing is that when we define these functions, we get our own entry in the formula editor, which lets us click on one of them and use it like any other function in the formula editor. These are actually quite easy to define; the examples in the Scripting Index make it very easy. Just search for Add Custom Function and use the boilerplate there to build off of. There's also a great blog post on how to do that.

That's one answer to one question. Let's continue and look at a slightly more complex problem: what happens if we have two peaks? Suddenly Fit Curve is no longer on the table; we're going to have to use Nonlinear, and that also suggests how we might work with this. We're basically going to have to break out the model we had before, column by column, just to manage all of those bits and pieces: one column for my baseline, one for each of my peaks, and then one for my spectrum. Let's have a quick look at what those look like, starting with the baseline, because it has a little gotcha we have to worry about. The baseline column just has the fit parameter for the baseline, but it also has this x term times zero. That's because Nonlinear expects every equation that goes into a formula to tie back to the x-axis that you're providing.
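The column-by-column structure (baseline plus one term per peak) maps directly onto a composite fitting function. Here is an illustrative Python equivalent on synthetic data, restating the pseudo-Voigt from the previous sketch so the block stands on its own:

```python
import numpy as np
from scipy.optimize import curve_fit

def pseudo_voigt(x, height, center, fwhm, eta):
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    gauss = np.exp(-0.5 * ((x - center) / sigma) ** 2)
    lorentz = 1.0 / (1.0 + ((x - center) / (fwhm / 2.0)) ** 2)
    return height * (eta * lorentz + (1.0 - eta) * gauss)

# Baseline + peak 1 + peak 2, mirroring the column structure in the demo.
def two_peak_model(x, b0, h1, c1, w1, e1, h2, c2, w2, e2):
    return b0 + pseudo_voigt(x, h1, c1, w1, e1) + pseudo_voigt(x, h2, c2, w2, e2)

# Synthetic stand-in for the two-peak spectrum.
x = np.linspace(0, 100, 1001)
rng = np.random.default_rng(1)
y = two_peak_model(x, 2, 80, 35, 6, 0.3, 50, 60, 8, 0.6) + rng.normal(0, 1, x.size)

p0 = [0, 70, 33, 5, 0.5, 40, 62, 7, 0.5]      # starting values, one set per peak
popt, _ = curve_fit(two_peak_model, x, y, p0=p0, maxfev=20000)
```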
We put x times zero in there just to have it be okay with plotting that. That's just a little gotcha that you have to deal with. That's one piece. Peak 1 looks just like we would expect, with its parameters. Peak 2 looks just the same, except it's got different parameter names so we don't have any collisions: Peak 1 used one set of parameter names, and Peak 2 gets its own B-prefixed set. That's the only thing we have to do. Then the spectrum itself, the thing that we're going to fit, the thing that we're going to put into the Nonlinear platform, is just my baseline curve plus my peak 1 plus my peak 2. Just like I showed you before, doing that in Fit Nonlinear, here's my spectrum column that goes into the prediction equation. I'm going to remember to put my counts in, and not my x column. Just like I said before, I'm going to expand my intermediate formulas, and that's going to tell JMP to dig back from that first formula into all the formulas that are in the individual columns. We click OK, and hey, we see what we expect to see. Now we can click Go, and it goes through and fits everything just like we would expect. We get a nice fit, and we have the ability to get confidence intervals and everything else we'd like off of that. Two peaks is reasonable and possible. But the problem that we run into is what happens when we have something that looks like this. At a rough count, there are probably a dozen peaks there, plus a complex baseline that's not actually a straight line; it's probably got some parabolic behavior to it. We've got a complex baseline, we've got multiple peaks, and we're going to have to make one formula for each of those. There's a lot of legwork to build something like this. If you get into X-ray diffraction, the problem gets even worse. There are comfortably 30 or 40 peaks in this spectrum right here that we would have to work with. The first question that we need to ask is, can Nonlinear handle that big a problem? Well, it turns out that it can. If we just use Nonlinear, I'm going to do something wild and crazy: I've got it fitting a Lorentzian peak, and I'm going to come back and actually have it fit in real time. You can watch that as it goes through. It nails each peak in near real time as I move through this quite quickly; it's hitting the big peak in each group. That says that the fit engine can probably handle the processing that we're dealing with. That's fine. This really becomes more of a problem of logistics than a problem of actual functionality within JMP. But it really is a real problem. Let's just say we're looking at fitting Voigt peak shapes, and we could also talk about Lorentzian, Gaussian, or the Pearson VII, all those different types of peak shapes. The Voigt peak shape has five parameters: the x axis and then the four fit parameters. That roughly equates to about six mouse clicks per peak. Even if you're doing it in a single formula, it's six mouse clicks per peak.
That says that for a ten-peak formula, for a ten-peak spectrum, we're going to have to do something like 88 mouse clicks. However long that takes you per mouse click is dependent on many, many factors. But if we were to do something like that X-ray diffraction pattern, where the peak count is up around 40, we're talking in the range of 300 mouse clicks. That's a lot of clicking around that we don't want to have to do. We would like our interaction with the spectrum to be something along the lines of one click per peak. That suggests that we need some automation built in. Let's have a look at how I've done that. I've built a tool to handle this, and it actually brings together a number of different solutions. First off, let's look at the library of peaks that I've generated in Spectriss. The title of this talk includes in it a number of different peak shapes. We include a family of Gaussian peaks, including a split Gaussian that gives you a different standard deviation on one side of the peak from the other. The same with the Lorentzians, the Pearsons, and then the Pseudo Voigts. These all also have versions that are tuned to give you the area instead of the intensity as a fit parameter; that's the area term in all of these. That's one piece. When we load in the Spectriss add-in, we get that for free. That's automatic. Now let's look at the other challenge. Let's take that olive oil spectrum. What we really want is a tool where we can come in and say, here's my X axis, here's my Y axis, and I just want to be able to do some peak finding. Here are my four main peaks; it found them automatically. Maybe I want to do a first derivative, or maybe I want to do a quantile. I can also remove the background here, so I can click Finished. It's found those first three peaks for me. I'm going to go ahead and change my background to a linear one. Now I can also come in and do some manual peak selection. Behind the scenes, it's taking care of writing all of those peak parameters for you so that everything's nice and tidy. There's probably one right there. Probably one right there. There's one right there. Every time you add a peak, you can come in and select the peak in the list of peaks, and it'll give you the information calculated at that time. You can see right here, these peaks are not well defined; they're not fitting the data very well. Really, we want to go over into Nonlinear. I've taken and hacked Nonlinear so that it will run this in real time and look nice and pretty. You can watch the peak shapes changing. Realistically, I might have chosen a quadratic instead of a linear baseline for this, but just for the sake of interest. Here, I've run out of iterations. I'll increase the iteration limit, and I'll also back off just a touch on my gradient criterion so that I can try to get this thing to converge a little quicker. Okay, we'll take that as good enough for the moment. We can say that we want to accept the fit, and there are my fit parameters. Then I can say Done.
It brings it back over into Spectriss for me to work with. I can now say, refine my AUC parameters, and I can come in and get my new approximate area under the curve. That's all great, but what I really want is an output table that has all those parameters and their information attached to them. That's Spectriss in a nutshell. The goal with this project was, like I said before, to have the ability to handle physical peak shapes and multiple peaks, with an easy-to-use interface, for those curves where we need the area under the curve and the physical parameters attached to each peak, but where we either don't have enough data to use Fit Model or the Functional Data Explorer, or it's just not the kind of problem where we want to work with that particular tool. The tool is up on the JMP Community; the QR code here will take you to the add-in, where you can work with it. Spectriss is up now and ready to go.
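Pulling the earlier two-peak walkthrough together, here is a hedged JSL sketch of that column-by-column setup; the column names, parameter names, and starting values are assumptions, and Expand Intermediate Formulas is assumed to be the scripted counterpart of the option mentioned in the demo.

    // Sketch of the column-by-column setup (assumed names and starting values).
    dt = Current Data Table();

    // Baseline: the 0 * :x term is the gotcha above, so Nonlinear sees the x axis.
    dt << New Column( "Baseline", Numeric,
        Formula( Parameter( {b0 = 0}, b0 + 0 * :x ) )
    );

    // Two Gaussian peaks with their own parameter names to avoid collisions.
    dt << New Column( "Peak 1", Numeric,
        Formula( Parameter( {a1 = 1000, m1 = 480, s1 = 8},
            a1 * Exp( -((:x - m1) ^ 2) / (2 * s1 ^ 2) ) ) )
    );
    dt << New Column( "Peak 2", Numeric,
        Formula( Parameter( {a2 = 600, m2 = 520, s2 = 8},
            a2 * Exp( -((:x - m2) ^ 2) / (2 * s2 ^ 2) ) ) )
    );

    // The column that goes into Nonlinear: baseline + peak 1 + peak 2.
    dt << New Column( "Spectrum Model", Numeric,
        Formula( :Baseline + :Name( "Peak 1" ) + :Name( "Peak 2" ) )
    );

    // Expand Intermediate Formulas tells JMP to dig into the component columns.
    dt << Nonlinear(
        Y( :Counts ),
        X( :Name( "Spectrum Model" ) ),
        Expand Intermediate Formulas( 1 )
    );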
Have you gone back to your analysis, only to realize you've forgotten how you defined a metric or the general context of your data? Have you had stakeholders ask for this information or be confused on how to interpret a metric?   If so, why not consider digitalizing this information so that it's easy to convey among report periods or across common reports? This method helps provide report consistency, making it easier for stakeholders to gain the information they need when interpreting your analysis. It also saves you time, because this information is now in a centralized place, automating it as part of the analysis.     JMP data tables can be used to build data dictionaries and store other report facts that can be easily populated in JMP reports. This presentation shows how you can build this capability, whether you are a basic JMP user or a seasoned JSL scripter. Make metric definitions part of your JMP analytic workflow, rather than an afterthought!     Have  you  gone  back  to  your  analysis  and tried  to  figure  out  what  you  were  doing?   Maybe  the  analysis  was  done  six  months  ago and  you're  trying  to  go through  your  analysis and  trying  to  figure out  what  was  this  metric  again? How  did  I  define  it, what  were  the  assumptions  behind  it? Now  think  about  your  audience. If  they're  looking  at  something  that  you don't  necessarily  understand  or  they  may have  want  a  little  bit  more  nuance  about what  the  definition of  a  metric  is  in  the  analysis. Well,  the  great  thing  is  we  can  do  that. There  are  things  called  data dictionaries  that  we  can  be  utilizing. What  is  a  data  dictionary? A  data  dictionary  is  really  a  catalog that  communicates  the  structure and  the  content  of  the  data. More  importantly,  what  we're  going  to  be talking  about  today  is  really  meaningful descriptions  around   what the  metrics  are  so  that you  and  your  audience  can  understand  what  is  contained  in  that  analysis. The  purpose  of  this  is  really  to  be able  to  over  communicate  so  that everyone's  on  the  same  page  of  what the  variables  or  metrics  are and  how  they  were  calculated or  any  other  important  information  to  help either  your  audience  or  your  future  self understand  and  document what  those  metrics  are. We're  also  going  to  be  showcasing  how to  be  able  to  centralize  these  definitions in  one  place  and  be  able to  digitalize  this  content. What  I  mean  by  digitalized  is  just like  you  take  data  from  raw  form and  being  able  to  present  that into  a  graph  or  an  analysis for  people  to  consume, we're  going  to  do  the  same thing  with  a  data  dictionary. We're  going  to  be  able to  pull  the  data  dictionary to  be  able  to  be  tied  into  your analysis  so  that  your  audience  or  yourself do  not  have  to  go  find where  that  information  is. It's  all  tied  together for  easy  user  experience. Okay,  so  today  we're  going  to  talk  about I  have  three  options  here  for  folks. The  first  solution  is  just really  a  simple  no  code. If  you're  one  of  the  people  who  love the  GUI  around  JMP and  you  do  not  want  to  have  anything with  code,  this  is  your  solution. Let  me  walk  you  through  a  quick  way of  being  able  to  do  this so  that  you  can  use   the  data  dictionary  on  concept. 
The  second  part  is  if  you're  into  JSL and  you  have  a  very  simple  data  dictionary that  you  want  to  have and  be  able  to  tie  on, I'll  show  you  how  to  do it  with  a  little  dose  of  JSL. Now,  if  you're  a  supercharger   and  really  want  to  see  JSL in  all  its  glory  and  being  able to  essentially  be  able  to have  a  way  of  being  able to  categorize  your  work along  with  defining the  metrics  as  you  go  along and  then  be  able  to  centralize  that, really  be  able to  digitalize  the  data  dictionary and  then  be  able  to  pull  out exactly  what  you  need  for  a  given  report. We'll  do  this  with   the  supercharged  full  out  JSL. Then  we'll  talk  about a  call  to  action  at  the  end. Let's  talk  quickly. Let's  just  JMP  in  here  and   talk about  the  simple  no  code  option. First  we're  just  going  to  be  using a  JMP  data  table, and  then  we're  going  to  embed the  definitions  into  that  data  table. Let  me  open  up  the  data and  the  definition  table. Here's  what  a  data  dictionary, a  very  simple  data  dictionary  looks  like. You  have  the  variables and  a  description, which  each  of  these  variables  are columns  here  in  the  data  table. You  have  the  variable  or  the  column name  along  with  the  meaning. I  also  added  the  references  over  here. We  can  come  over  here  and   just plot  our  data  and  we  can  see   what the  various  different metrics  look  like  in  here. But  maybe  your  audience  comes  here and  goes,  what  does  abundance  mean? Or  how  are  we  defining  richness? Or  what's  the  Shannon  Metric? Well,  they're  going  to  have  to  go and  figure  out  what  that  actually  means. What  we  can  do  here  is  we  can  tie  this data  table  to  the  data  dictionary  table, to  the  data  table,   and  we  can  do  this  really  quickly. If  you  go  over  to  this  triangle  over  here,  click  on  the  red, go  down  to  copy  data  table. We  can  come  over  here, do  the  same  thing  over  in  the  data  table. But  this  time we're  going  to  hit  new  script and  we're  going to  type  in  definition. We're  going  to  then  paste. I  have  command  v  or  control  v  on  here. You'll  see  some  code. You  do  not  need  to  understand what  that  actually  means. You're  going  to  say  save   and  then  you're  going  to  say  okay. You're  going  to  see that  you  have  the  definition. If  your  audience  or  your  future  self wants  to  come  in  here  and  say,   hey,  here's  my  plot,  that's  great, I  see  my  data,  but  I  need to  know  what  the  definition? They  just  need  to  hit  the  plot  and  voila. You  have   well  formulated  your  definitions. Okay,  very  simple. Hopefully  if  somebody  has a  data  table,  come  up  with  some  metrics, use  another  JMP  data  table  for  it   and  just  attach  it to  your  data  table   and  you  have  a  great  reference. Okay,  so  that  is  the  simple  notebook. Now  let's  go  to  simple with  a  little  dose  of  JSL. All  right,  so  here   we're  going  to  publish  the  analysis along  with  the  definition  table   to  a  window, and  then  we're  going  to  export the  analysis  to  a  PowerPoint  deck. What  I'm  going  to  do   is  I'm  going  to  run  the  script and  then  I'll  walk  you  through how  I  did  the  script. The  first  thing  is  let's  get  this. This  is  what  the  outcome of  the  script  does. 
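In rough JSL, a script like the one being run here could look something like the sketch below; the table names, columns, and output path are assumptions rather than the presenter's actual code.

    // Sketch only: analysis on top, definition table below, one export button.
    dtData = Data Table( "Diversity Data" );
    dtDict = Data Table( "Data Dictionary" );

    win = New Window( "Analysis with Definitions",
        V List Box(
            Outline Box( "Analysis",
                dtData << Graph Builder(
                    Variables( X( :Site ), Y( :Abundance ) ),
                    Elements( Points( X, Y ) )
                )
            ),
            Outline Box( "Definition Table",
                Data Table Box( dtDict )   // note: ignores cell height/width/wrap
            ),
            Button Box( "Save to PowerPoint",
                win << Save Presentation( "$DESKTOP/analysis_with_definitions.pptx" )
            )
        )
    );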
You  see  the  same  analysis  up  here, but  what  you  see  down  here  is we  have  a  definition  table,  and  you can  just  open  up  the  definition  table, and  then  you  can  see   what the  different  rows  are or  the  definitions of  the  different  meanings  of  the  data down  here  along  with  the  reference if  people  wanted  to  go and  know  where  to  go to  look  for  more  of  that  information, How  did  we  actually  do  this? Essentially  the  first  thing  we  did  was basically  opened  up  the  data  tables  here and  then  the  next  thing  we  did  was  we  created  a  window. This  is  basically  this  window  right  here and  then  an  outline  box  with  the  analysis. This  is  just  a  simple  graph  builder,  nothing  fancy. Then  down  here  we  did  another outline  box  here  where  I  had  it  closed. But  we're  going  to  need  to  keep that  open  for  right  now. Then  simply  just  did  a  data  table  box that  just  basically pasted  that  into  the  window. Okay,  now  let's  say  that  this  is  great, but  we  want  to  export this  to  a  PowerPoint. We  can  very  well  easily  do  that  by  making sure  you  have  the  definition  window  open. Don't  have  it  closed  or  it  won't  export to  the  PowerPoint. Make  sure  it's  open  and  then just  say  save  to  PowerPoint. Voila. There  you  go. It  exports  the  data  to  the  PowerPoint. Now  there's  a  little  bit  of  things that  you're  going  to  need  to  do. You  can  make  this  bigger, you  can  make  this  change  the  color  on  it. The  titles  up  here,  make  that  white  just so  it  makes  it  stand  out  or  change  it. Then  also  here's the  data  dictionary  as  well. All  you  need  to  do  is this  is  all  edible  so  that  you  can actually  make  this a  little  bit  more  user  friendly. I  find  that  12  font  makes  it  better, but  you  can  have  the  definitions  in  the  backup  of  your  slides so  that  people  can  come  back   and  be  able  to  reference  it. It  doesn't  have  to  be  part  of  the  meet  of  the  presentation. It  could  just  be  in  the  backup  slides. Makes  it  really  easy  to  be  able to  export  all  this  information to  PowerPoint  and  just  have it  tied  with  your  analysis. All  in  really  essentially a  few  lines  of  code. Basically  all  I  did  to  export  this  was I  created  the  button  save  the  PowerPoint. Just  basically  save  the  graph and  then  save on  one  page,  on  page  two and  then  on  page  three  I  had  the  data  dictionary  part. This  is  the  object  for  the  data  dictionary and  this  is  the  object  for  the  graph. Pretty  straightforward to  be  able  to  do  that. We're  going  to  close  all  these  slides files  and  we  can  close  this  as  well. Now  there  is  a  drawback with  the  data  table  box. I'll  get  into  a  little  bit  of  that in  my  next  example  on  that. In  the  supercharger, I'm going to talk about the drawbacks on the data table function itself, but  I'm  also  going  to  showcase how  to  capture  dynamic  metric  definitions as  you  do  a  report  and  then also  giving  you  a  little  bit  more  detail, how  to  get  more  details  around  that. Then  I'm  going  to  showcase  how to  do  a  dynamic  definition  box  around. This  is  going  to  take a  little  bit  more  coding than  just  specifying that  data  table  box. Let's  talk  about   what  the  drawback  is and  let's  take  this data  dictionary  from  the  World  Bank. 
You  can  see  that  they  have  some  really long  definitions  on  here and  we  have  some  more  information  around these  various  two  metrics. But  if  I  just  do  a  data  table  box, what  you'll  see  is it  doesn't  take  any  of  the  cell  height, width  or  set  wrap  on  here. This  is  something  actually  I  had  to  go  back  and  to  JMP  support and  they  confirmed  that using  the  data  table  box, it  doesn't  capture  these  aspects. This  isn't  ideal  on  here. Now,  I  did  put  a  request in  the  JMP  wish list that  they  would  take  this  so  that  we wouldn't  necessarily  have  to  do  the  coding that  I  will  be  showing  you in  a  few  more  minutes. This  isn't  necessarily  ideal, but  basically  this  data  table  box, there's  some  drawbacks  on  that. The  way  that  I  got  around  to  it, just  to   give  you  an  idea  of  just using  the  table  is  I  literally, let  me  just  show  you   this. This  was  my  attempt on  here,  recreating  it. This  is  the   the  desired  outcome and  it  took  a  lot  of  coding  on  here. Just  to   give  you  an  idea, I  mean,  you  could  try  to  do  this, but  this  isn't  really  dynamic. This  is  just  basically you  would  have  to  explicitly, you're  basically  coding  each  one  of  these  cells, the  titles  and  then each  one  of  the  text  boxes  down  here. For  example,  I  took  this  indicator  name from  the  first  row  and  put  it  here, long  definition  from  the  first  row  of the  data  table  to  here,  blah,  blah,  blah, and  just  gone  on  and  on  and  on  and  on. This  isn't  ideal  at  all. It  requires  a  lot  of  explicit  coding. Also  the  other  thing  I  noticed  is you  can't  do  for  loops  within the  display  boxes  at  all. I  went  back  to  JMP  support  and  said is  there  anything  that's  just  not documented  of  how  to  get  the  display  boxes or  the  cell  heights  from  the  data tables  translated  back  over? They  actually  gave  me  a  solution and  I'll  show  you   how  we  go  through  that in  a  few  minutes. But  because  basically  the  data  table  box does  not  allow  you  to  set  the  cell  height, the  cell  width  or  the  wrap. Let's  look  at   how if  we  did  want  to  do  something that  is  a  little  bit  more  dynamic and  building  out  a  data  dictionary from  your  analysis. Here,  what  I've  done  here is  really   as  I  was  going  through building  metrics, I  define   the  different  aspects that  I  would  want  in  my  data  dictionary. Here  I  have  initial  metric  for  the  name of  metric,  the  definition,  the  source, the  frequency  of  the  metric, and  the  specific. I  will  be  capturing  that  every  time I  go  through,  I  create  a  new  metric. Let's  say  I  work  at  the  World  Bank, I'm  creating  these  metrics and  this  data and so forth. Imagine  I  have  a  whole  bunch  of  code  that basically  develops  a  metric  and  so  forth, takes  that  data,  does  stuff, I  create  this  GDP  growth, annual  percentage  type  of  thing. This  is  essentially  how  I  basically would  have  captured  the  information. I  just  take  the  information for  the  definition  name. I  would  put  the  definition  up  here in  a  separate  line  and  then  carry  that  as an  object  in,  and  just  do  an  insert  into and  then   insert  this into  the  list  and  so  forth. 
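A small sketch of that capture-as-you-go pattern, accumulating the dictionary entries with Insert Into at the point where each metric is created; the list names and the metric text are invented for illustration.

    // Sketch of capturing a metric's dictionary entry where the metric is built.
    defNames = {};  defDefs = {};  defSources = {};  defReports = {};

    // ...code that actually builds the GDP growth metric would sit here...
    Insert Into( defNames, "GDP growth (annual %)" );
    Insert Into( defDefs, "Annual percentage growth rate of GDP at market prices" );
    Insert Into( defSources, "World Bank national accounts data" );
    Insert Into( defReports, "Quarterly economic report" );
    // ...repeat the four Insert Into calls for every new metric...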
That  essentially  if  I  run  the  script just  so  that and  make  sure  I  don't  have anything  highlighted, I'm  essentially  taking  each  one  of  these metrics,  and  I  have  four  metrics  down  here and  just  basically  defining each  one  of  these  metrics with  the  different  information as  we're  going  through  and  building  out. One  of  the  things  I  did  put in  here  is  this  report. Up  here and  why  I  have  this, and  this  is  important  is maybe  you  want  to  have  an  idea where  that  report  is  going  to. Basically  this  metric  is  going to  be  used  for  this  report. These  metrics  are  going to  be  for  that  report so  that  you  can  centralize all  your  definitions  in  one  place. Like  you  can  go and  put  this  into  a  database or  consolidate  it  all into   one  location so that  everybody  can  have  access to  that  information  and  understand, have  one  working  definition and  know  where  that  is. If  you're  doing  multiple  reports, to say  what  reports  those  are  in and  then  also  know if  they  need  to  make  changes, which  reports are  these  metrics  in. It's  a  really  great  way to  keep  track  of  things and  also  keep some  governance  around  those  names. What  I  did was,  essentially  did, as  you  can  see,  is  I  built  out, I  just  basically  came  up  with  a  list, and  then  essentially for  each  one  of  these  lists, I  just  set  those values  into  that  column. If  we  go  back  to  here, each  one  of  these  objects, which  is  a  list,  would  be  basically everything  for  the  indicator  name. There  should  be  four  things. If  we  go  over  here  and  look  at the  death  name, and  I  go  to  my... You  can  see  that  I  have a  list  here  of  names and  that  just  becomes all  the  content  within  the  indicator  name. That  was  just  basically  done  by creating  a  new  table and  then  just setting  that  list  for  that  column. I  did  that  for  each  one  of  those. Then  once  you  have  these  all consolidated  into  one  place, you  can  then  just  go  save  this to  a  database  or  some  central  location so that  not  just  you, but  anybody  else  can  get  it. Then  you  can  also pull  this  into  your  reports based  on   what you  want  to  have  done. That's  what  we're going  to  talk  about  next. Let  me  just  close  this  out. Go  back  to ... Close  this. Close  this. All  right,  so  we  just  built  a  definition and  now  let's  talk  about  kind of  building  a  dynamic  analysis. We  just  created that  table  of  indicators. What  we're  going  to  want  to  do  is   we  want  to  create  a  report, but  we  don't  want  all  maybe we  don't  want  necessarily... maybe  we  have  like  100  indicators and  now  we  just  want just  a  few  of  those  indicators. The  first  thing  we  can  do is  just  basically, we  can  open up  these  data  tables. Now  I'm  adding  in, I  have  another  data  dictionary, which  is  basically the  different  countries  here. I  have  the  data  dictionary of  the  indicator or  the  metrics  that  we  just  created. Then  we  actually  have  the  data  over  here, which  is  all  the  actual  data along  with  the  countries  on  there. But  we  want  to  create  a  report with  the  definitions. Once  we  do  that, we  can  then  go  and  figure  out we  can  get  the  values  of  the  variables here  from  the  data  dictionary. 
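The table-building step just described, creating a new table and setting each list into its column, might look roughly like this; the example lists stand in for whatever was accumulated while building the metrics.

    // Sketch of writing the captured definition lists into one central table.
    defNames = {"GDP growth (annual %)", "GDP per capita (current US$)"};
    defDefs = {"Annual percentage growth rate of GDP at market prices",
               "GDP divided by midyear population"};
    defSources = {"World Bank national accounts data", "World Bank national accounts data"};
    defReports = {"Quarterly economic report", "Quarterly economic report"};

    dtDict = New Table( "Metric Dictionary",
        Add Rows( N Items( defNames ) ),
        New Column( "Indicator Name", Character ),
        New Column( "Long Definition", Character ),
        New Column( "Source", Character ),
        New Column( "Report", Character )
    );
    Column( dtDict, "Indicator Name" ) << Set Values( defNames );
    Column( dtDict, "Long Definition" ) << Set Values( defDefs );
    Column( dtDict, "Source" ) << Set Values( defSources );
    Column( dtDict, "Report" ) << Set Values( defReports );
    // From here the table can be saved to a database or other shared location.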
We  basically  are  going  to  pull all  these  metrics and  then  we're  going  to combine  that  with   the  other information  that  we  are  getting from  the  actual  metric  definitions. I  got  a  lot  of  things  going  on  here. Basically  I'm  going  to  be  pulling on  here  these  four  data  points. But  I  also  need  to  be  from  this I  want  to  also  be  pulling this  information,  the  country  name, the  country  code,  and  the  year. I  basically  added  those  variables in  there  and  I  did  another  insert  here. That  essentially  I'm  just  pulling the  columns  from  our  metrics. I  won't  be  getting  all the  metrics  from  this  data  table. Let's  run  this  and  then  I'm  going to  just  subset  the  data based  on   what  we  need. Okay,  so  this  is  our  subset  of  data. W e  have  our  metadata  over  here along  with  those  four  data  metric   variables  that  we  wanted. The  next  part  is we're  going  to  just  do  a... We  want  to  come  up  with  our  analysis. I'm  just  doing  something  really  simple. Just  run  that. Then  here  comes  where... So  if  I  go  back  to  this  data  dictionary, we  saw  that  there's  already  some cell  height  width  that  are  really  specific for  each  one  of  these  cells that  make  it  easy  reading  for  your  audience. We  want  to  be  able  to  capture  that and  be  able  to  translate  that  back  into the  presentation  that  we  have when  we  take  these  tables  and  put  this  on. We  can't  do  that  with  the  data  table because  basically  the  data  table doesn't  allow  that  for  that  formatting. Thank  you  to  Jay  Sun  from  Tech  Support, who  helped   figure  out  how  to  do this  because  there  was  a  couple  of  things. One,  as  I  said  before, you  can't  do  for  loops  within  table  boxes or  other  displays, you  need  to  be able  to  do  this  by  a  lot  of  object. I'm  struggling  with  the  terminology  here, but  essentially  you  really  have  to  be taking  this  and  building  the  table  box. Then  also,  depending  on  how  many columns  and  also   how  many  indicators  you  have, you  want  to  have  this  dynamic. You  don't  want  to  be coding  this  specifically. Essentially  we're  capturing  both the  cell  heights, we're  getting  the  names, I  have  coded  the  wisp  explicitly. The  code  that  we  had  was  working and  then  for  some  reason  yesterday it  wasn't  working. The  only  way  I  could  get  around with  this   was  to  explicitly map  that  out  to  make  this  work. Then  basically  building  out  the  data table  with  some  columns and  then  indicators  and  basically   just  basically  having  some  for  loops, basically  taking  the  formatting and  the  content   and  building  out  the  table. Let's  run  this  and   see what  this  actually  looks  like. This  is  what  this  actually  looks  like. Again,  similarly,  we  had  the  analysis, I  keep  the  boxes  closed and  basically  all  you  need  to  do  is  open  up and  then  here's   all  your  definitions  on  here along  with  the  indicator  name, which  is  the  y  axis  over  here along  with  the  long  definition. Folks  can  really   geek  out  on  that, where  the  source  is, where  the  annual, how  frequently  the  metric is  actually  taken, and  the  other   metadata  around  it, like  how  these  metrics  were  actually  calculated. You  can  do  this  for... 
Here, I have it for the metric definitions, but I also have the definitions for the different countries too. For example, I was looking at this and going, Eurozone; what's the Eurozone? I know what Europe is, and I know there are some countries that take euros but may not necessarily be in the Eurozone. For example, I think Montenegro is one of the countries that uses the euro, but it's not considered part of the Eurozone. Same thing with Europe and Central Asia: what does that mean? Is Mexico in North America or not? It depends on the definitions. This really helps the audience get the information that they want so they can interpret this data fairly well. As you can see, this took a bit more coding. I won't go through all the detail of that coding, but if you want it, I would be more than willing to share it with you. Just email me at sarahcallison@achievemorewithdata.com and I'll be more than glad to share the data that I have with you along with the code. Okay, so, call to action. My call to action to everyone is: try to build data dictionaries into your analysis. It really helps your audience understand the data. It helps them be able to go, oh yeah, that's what it means, and be able to focus, versus questioning what they're actually seeing. Also, number two, it helps you too, as the analyst, remember, oh yeah, this metric was in percentages, or dollars versus euros, or whatever else you may need to know about that analysis. The other thing is to vote on the JMP wish list for "Enable cell formatting for Data Table Box." I totally believe that this could help simplify the code that we use, versus having to write for loops and having to understand a lot of JSL. I think this would really open things up and get more people to build data dictionaries if we can enable simpler coding with just the Data Table Box. So, I really appreciate everyone's time in listening to this and seeing an alternative way of presenting your data and adding more contextual information to it. Please let me know how you're using data dictionaries and if you have other unique ways of presenting that information. Thank you very much, and hopefully I'll see you at the JMP Discovery Summit in October. Thank you.
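For reference, one possible shape for the dynamic, wrapped definition box described above, not the presenter's exact workaround, is to loop over the dictionary rows and build wrapped Text Boxes inside a Lineup Box; the column names and wrap width are assumptions.

    // One possible sketch of a dynamic, wrapped definition display.
    dtDict = Data Table( "Metric Dictionary" );

    lub = Lineup Box( N Col( 2 ) );
    lub << Append( Text Box( "Indicator Name" ) );
    lub << Append( Text Box( "Long Definition" ) );

    For( r = 1, r <= N Rows( dtDict ), r++,
        lub << Append( Text Box( Column( dtDict, "Indicator Name" )[r] ) );
        defBox = Text Box( Column( dtDict, "Long Definition" )[r] );
        defBox << Set Wrap( 500 );   // wraps long definitions, unlike Data Table Box
        lub << Append( defBox );
    );

    New Window( "Report with Definitions",
        Outline Box( "Definitions", lub )
    );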
At HP Hood, the use of design of experiments (DOE) has helped to successfully identify formulations in new product development, delighting consumers of our food products.   Throughout our R&D department's wide implementation of DOE, there have been a number of successful DOE models, as well as some unsuccessful ones. One major stumbling block to model building with DOE has been dealing with subjective binary responses such as acceptable/unacceptable. These binary responses provide less information than continuous responses, thus inhibiting the ability to extract meaningful results from designed experiments.    This presentation shares simple and practical strategies for using the JMP DOE platform to convert these binary responses to continuous ones, resulting in improved models and powerful insights. Real-world examples from consumer food products are given to demonstrate how DOE can be used for more than just building models. It can be used to overcome the problem of responses that are difficult to measure.     This  talk  is  titled  Expanded  Uses   of  Converting  Binary  Responses   to  Continuous  Responses in  Consumer  Product  Development. It's  a  bit  of  a  mouthful, but  I  promise  it  won't  be  that  complicated. My  name  is  Curtis  Park. I'm  a  principal  scientist  at  HP  Hood. HP  Hood  is  a  company, a  food  and  beverage  company. We  make  a  lot  of  different  milks,   nondairy  milks. We  also  make  yogurt, cottage  cheese,  ice  cream. So  a  lot  of  a  lot  of  fun things  to  taste  at  work. I'm  a  food  scientist  by  education. A few  years  ago I  was  asked  to  take  a  look  at  a  problem that  we  had  for  one  of  the  beverages that  we  were  producing. I'm  going  to  show  you a  video  just  so  you  can  see. But  we  were  getting   a  lot  of  consumer  complaints and  these  complaints  were  happening when  the  product  was close  to  the  end  of  shelf  life. A s  you  see  in  this  video, it's  pretty  obvious   why  people  were  complaining. I  think  I  would  complain  if  I  saw  something  like  that  too. It's  supposed  to  be a  nice  portable  beverage. It's  thick  and  chunky   when  it's  being  poured  out. Not  what  I  would  expect. Believe  it  or  not, this  product  was  not  spoiled. I  promise  you, it  was  not  spoiled. So  I  was  asked  to  take  a  look  at  this and  figure  out  how  can  we  fix  it? What's  the  problem? How  do  we  fix  it? HP  Hood  at  the  time, this  was  a  few  years  ago. We  were  early  on  in  our  journey with  using  JMP, and  so  I  was  really  excited  to  have an  application  to  use  in  real  life rather  than  just  reading about  it  or  learning  about  it. Naturally  I  felt  like  this,  like  Yahoo! Let's  run  a DoE,  let's  do  it. I  was  really  excited and  for  those  of  you  who  might  not  have as  much  experience  doing DoE, the  first  step  is  usually  taking  a  look at  what  factors  should  I  be  looking  at. So  we  did  a  few  experiments. If  you  can  forgive  me,  they  were  probably one  factor  at  a  time  experiments. But  we  narrowed  in  on  what  we  believed   were  the  key  ingredients that  could  have  been  causing  the  problem. We  ended  up  making  a  design. This  is  probably  the  fourth  or  fifth  iteration  of  the  design that  we  came  up  with, and  this  was  in  custom  design. So  if  you  go  to  custom  design, that's  that's  the  platform   that  we  use  to  generate  this DoE . 
A s  you  can  see, this  is  this  is  what  we  had. So  we  had  ingredients  A,  B  and  C, and  it  was  actually  a  response  surface. So  we  had  all  of  the  two  way  interactions and  the  quadratic  terms built  into  the  model. It  ended  up  being  17  runs, as  you  can  see  here. It's  17  different treatment  combinations. This  much A ,  this  much B ,   this  much  C  for  each  run. Once  we've  settled  on  this  design, we  were  really  excited   so  let's  go  solve  this  problem. Piece  of  cake,  right? You  go  into  the  lab,  into  our  pilot  plant, you  throw  some  things  together, the  beverage  comes  out. I'm  making  it  a  lot  more simple  than  it  actually  is. We  made  17  different  beverages and  then  we  put  them   on  the  shelf  for  a  little  while because  as  I  mentioned  earlier, it  takes  a  little  bit of  time  for  this  problem  to  appear. Put  them  on  the  shelf  for  a  while, sat  until  they  were  ready  to  be  analyzed. This  is  just  a  screenshot of  a  data  table. This  has  our  actual or  our  design  that  we  used. A s  you  can  see,  there's  a  column here  to  the  right  that  I  highlighted. It's  our  our  friend,  the  Y, our  response  column. So  once  we  got  to  the  point  where  we  were ready  to  ready  to  measure  that  chunky  pour now  we  started  thinking,  Oh,  how  are  we  going  to  measure  that? Because  a t  the  time, we  did  not  have  a  chunky  parameter. I've  never  heard  of  one. I've  never  found  one. If  anyone  has  ever  found  one, we'd  love  to  to  see  it  and  maybe  buy  one. But  it's  our  knowledge. It  doesn't  exist. So  what  options   did  we  have  to  measure  this? Because  if  you  can't  measure  it  with  DoE it's  really  not  that  useful. So  we  have  a  fe w  options. First  thing  is  we  can  measure everything  as  a  binary  response. So  it's  either  a  pass  fail, it's  good  or  bad etc. There's  some  pros  with  this  and  some  cons. The  pros  would  be  it's  pretty  simple  to  do,  right? Anybody  can  say  pass  or  fail  and  it takes  you  like  no  time  to  to  measure  it. However,  it  has  some  serious  cons  to  it. Such  as, it's  really  subjective  to  the  observer. What  I  think  is  good, a  colleague  of  mine  might  think  is  bad. Or  even  worse,  what  I  think  is  good, my  boss  might  think  is  bad. So  it's  really  subjective. While  it  can  give  you  some  information, they  don't  give  us as  much  information  as  we  want. Because  when  you  do  logistic  regression, what  you  get  out  of  it  really  are  just probabilities  of  something  passing or  probabilities  of  failing. In  my  experience, that's  been  difficult  to  communicate and  to  really  understand  what  to  do with  that  data, especially  when  we're  trying to  communicate  with  non-technical  people. So  continuous  if  there's  any  way  to  get a  continuous  response, that's  what  we  strive  for  because they  give  us  a  lot  more  information. We  can  know  how  good  is  it   or  how  bad  is  it, because  not  all good  are  created  equal. There's  another  option we  could  have  done and  I  would  say  this  is  probably  the  best  option if  you  can  do  it,  is  we  could  run consumer  testing  and  get  consumer  input. 
What  this  would  look  like  is  I  have  all our  beverages,  17  beverages, and  we  recruit  maybe  100, 120  consumers  of  our  product and  we  have  them  sit  down   and  rate  every  single  one for  different  attributes, one  of  them  probably  being  how  well  do  you  like  how  this  pours? The  reason  why  this  is  a  gold  standard is  because  those  are  the  people's opinions  who  matter  to  us. What  we  would  do  is  after  we  get 100  or  120  responses, you  take  a  look  at  the  data  you  get, we  can  take  averages  and  put those  averages  into  our  model. However, it  can  cost  a  lot  of  money and  it  can  take  a  lot  of  time. So  if  your  budget  doesn't  allow  it or  your  timeline  for  whatever  reason   doesn't  allow  it, you  can't  do  this  for  everything. Sometimes  the  thing  you're  trying  to  measure isn't  such  a  huge  problem   that  you're  trying  to  solve that  it's  worth  spending  all  that  money. But  it  would  still  be  important to  be  able  to  measure  it. Do  you  have  any  other  options? I  mentioned  this  earlier. You  can  find  an  instrument  that   can  measure  what  you're  looking  for. Sometimes  they  exist. Like  I  said,  I  don't  know of  a  chunky  parameter. I  looked  in  our  warehouse  in  our R&D  center,  couldn't  find  one. Even  if  you  can  find  one, if  this  is  something   that's  really  specialized, you're  not  going  to  use  it  very  often. It  doesn't  make  sense  to  buy  the  piece of  equipment  or  it  could  be  something that  would  be  really  great, but  it  requires  a  lot  of  expertise that  maybe  your  R&D, your  technical  department  doesn't  have  or just  doesn't  have  the  time   or  resources  to  to  deal  with. I'm  going  to  show  you  the  last option  we  have  here. What  I'm  going  to  say  is  training  a  group of  people  how  to  rate  that  attribute of  interest  and  then  let  them give  you  all  the  ratings. This  is  quite  as  good  as having  actual  consumers. But  here  we're  trying   to  take  subjectivity  out  of  it and  make  it  objective. When  well  trained  humans  can be  great  measuring  instruments. I'm  going  to  walk  you  through   what we've  done  at  Hood when  we  have  some   hard  to  measure  attribute. We're  going  to  use  the  case study  of  this  chunky  pour. This  is  our  roadmap. I'll  walk  you  through  this   and  then  we'll  actually  do  it  live. The  first  thing  I  wanted  to  get  across is  that  the  samples  that  you  produce from  DoE  can  be  used  for  many  purposes. I  like  to  tell  people   that  your  samples  are  like  gold and  you  should  treat  them  like  gold. They're  very  valuable. You  may  do  a DoE  thinking  that   you're  trying  to  answer  one  question, but  something  else  might  pop  up  later that  you  would  be  able  to  use  those samples  to  answer  that  question  as  well. I've  had  that  happen  to  me  many  times, so  sometimes  it's  good  to  think  about just  ask  yourself  the  question. I've  done  all  this  work  to  make 17  different  beverages. What  else  can  I  do  with  them? What  else  can  I  learn? In  our  case,  we  use  these  samples  as  a  "calibration  set" so  that  we  can  teach  our  humans,   my  colleagues, how  to  measure  this  chunky  pour. So  here's  our  method. 
The  first  thing  we  do  is  we  review all  the  samples  with  a  small  group, some  maybe  1  or  2  or  3  people  that  are really  knowledgeable  on  the  subject or  are  responsible  for  the  project. What  you  do  is  you  look  at  all  the  samples and  decide  which  samples should  be  used  to  train  the  Raiders. We're  trying  to  build  a  scale  essentially, and  then  we'll  take  that  scale and  we'll  get  our  friends,  let's  say  10, 15,  20  friends  to  actually  rate  these, these  samples  for  us after  we've  trained  them. Training  step  two, have  them  read  each  video,  step  three . If  it's  a  video, it  could  be  something  else,  a  picture, or  it  could  be  actually  them pouring  out  the  product if  you  have  enough,  etc. You  can  get  the  idea. Next,  we'll  take  the  average of  all  those  ratings. We'll  look  at  the  data,  make  sure there's  nothing  funky  in  there and  then  we  will  use  those average  values  to  build  a  model. Let's  start  with,  oops. Let's  start  with  steps  one  and  two. So  we're  going  t o  assume  that   we've  looked  at  all  the  all  the  videos and  the  way  we  typically  do  it because  it's  a  little  easier is  you  start  off  answering  the  question, which  one  is  the  lowest  in  Chunky  pour? That  would  be  this  one  right  here. Number  one,  I'm  going to  play  each  one  of  these. This  just to  make  it  clear, this  is  our  scale. It's  a  continuous  scale  from  1  to  10 and  the  1  to  10  is  kind  of  arbitrary. If  if  you  have  something  that  works better  for  you  then  great. The  video  right  above it  corresponds  to  that. So  this  first  video  corresponds  to  a  one. So  as  you  can  see, while  we're  watching  this  video pours  nicely, no  rippling  and  no  chunkiness. Pours  as  expected. Beautiful. That's  that's  the  easy sample  to  identify and  then  in  the  in  the  sample  set,  we  ask ourselves,  okay,  which  one  is  the  worst? In  this  case,  it  was  pretty  obvious. I  will  tell  you  again, this  product  is  not  spoiled. So  just  with  changing  a  few  ingredients. You  can  see  it's  so  thick, we  can't  even  get  it  out  of  the  bottom. So  that's  obviously  a 10. Then  we  did  a  little  bit  of  work to  try  to  figure  out,  okay,  which  one should  we  consider  to  be  a  five? So  halfway  in  between. This  one,  you  can  see  it  still  flows, but  there  is  chunkiness  to  it. Then  maybe a  two  and  a  half  would  be  this  one. See  it  has  a  little  less  chunkiness  to  it. Flows  well,  probably  with  normal  shaking. It'd  probably  be  fine. So  there's  a  little  bit  of  subjectivity, but  you  add  more  people to  make  it  more  objective. Then  the  last  one. This  is  seven  and  a  half. So  you  can  see  it's  very,  very  chunky. The  only  thing  that  really  is differentiating  it  from  number  ten  is that  we  can  get  it  out of  the  bottle  still  flows. But  as  you  can  see,  it's  pretty  thick. What  I  would  do   and  basically  in  this  amount  of  time, I  could  train the  people  that  are  going  to  help  us to  analyze  this, to  measure  this  chunky  pour. Then  we'll  have  them  rate once  we've  trained  them. I'll  basically  do  what  I  just  did. Maybe  we'd  take  a  little  bit  more  time to  be  more  specific  with  certain things  we  want  them  to  be  looking  for. 
If what you're having someone rate is a lot more complicated, then you'll probably need to take more time training people. This one wasn't terribly complicated, and we're really just looking for people's first impression. After that, you have them rate all the videos. I like to use Microsoft Forms just because it's easy and I can get the data really quickly, but you can use whatever you want, including paper, although that takes more time and I try to avoid it. Just to show you what our forms look like, here's a preview of it, as if you're doing it on your phone. I like to make everything as simple as possible, and everybody always has their phone, so being able to do it on a phone is my goal. I'm just asking about chunky pour here, and then they just go through and rate each one. So for chunky pour for treatment 1, I'll say, I don't know, that one was a six, and we're just asking people for their first impression. There are no right or wrong answers; usually people's first impression is right, so that's why I'm asking people not to think too hard about it. Maybe number two is a ten, and number three was a three, I don't know. They would go through all of those. Then we would get our data, and using JMP we would average all those ratings and put the data into the data table to build the model. So we're going to get out of PowerPoint for a second and go to Excel. This is what I get when I export the data from Microsoft Forms. Like I said, you don't have to use this; use whatever works for you. As you can see, ID is the rater number, not random, just an identifier for each person. I left it anonymous so we don't criticize people who maybe didn't do as well as everybody else. In this case, this data is real: I took this to a college food science class and had them do this, so these are actual college students rating the videos. As you can see, we have a column for each treatment. So person one rated treatment one an eight, they rated treatment two a four, treatment three a nine, et cetera. So we want to put this into JMP. I like to use the JMP add-in in Excel, right here. As long as you're only highlighting one cell and you click Data Table, it'll import everything. I've noticed that sometimes I'll accidentally have just a portion of the data highlighted, and if you click Data Table then, it's only going to import what you highlighted. So either highlight everything or highlight only one cell. Once you hit that Data Table button, you will get something like this. So this is our data. Just to show you where we're trying to get to with this data table, because we have to manipulate it a little bit, this is our data table for the DOE. For each run, it has how much of ingredient A, B, and C was in there.
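The Excel add-in import shown above is point-and-click; for a fully scripted workflow, the exported Forms file could also be opened directly, as in this sketch where the file and sheet names are assumptions.

    // Scripted alternative to the Excel add-in (file and sheet names assumed).
    dtRatings = Open( "$DESKTOP/chunky_pour_ratings.xlsx", Worksheets( "Sheet1" ) );
    // Each treatment arrives as its own column, ready for the quick
    // distribution check described next.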
I  put,  we'll  talk  about  this  in  a  minute but  I  put  my  scale whether  or  not  I  thought  something  passed or  whether  or  not   I  thought  something  failed. In  the  end,  we  need  one  more column  that  says  Chunky  pour. We'll  call  it  continuous. And  we'll  have  an  average rating  for  for  run  one. Average  rating  for  one,  two, three,  four,  five,  etc. If   we  look  at  this data  table  as  it  is  today is  not  in  that  format   because  we  need  all  these columns  to  be  rows  and  we  need the  the  rows  to  be  in  one  column. There's  probably  a  thousand  different ways  we  could  do  this  in  JMP and  they're  all  good and  they're  all  correct. I'm  going  to  show  you  one  way  to  do  it. It's  just  the  one  that  works  for  me. First,  what  we're  going  to  do  is  we're  going  to  stack all  of  the  columns on  top  of  each  other. Then  we're  going  to  do  a  summary  table that  has  the  average  and  maybe  we'll  also add  in  the  standard  deviation  for  fun. But  the  very  first  thing  that  I've  always been  taught  to  do  is  when  you  get  data, you  want  to  look  at  the  graph, the  data  and  look  at  the  plot. So  we're  going  to  actually  look at  the  distribution  really  quickly. So  if  we  go  to  analyze. There  we  go. Analyze  distribution. We  want  to  look  at  the  distribution for  all  of  the  treatments. I'm  just  going  to  highlight  them. Go  to  the  columns  and  say,  okay. I'm  just  looking  to  see  is  there  anything weird  about  this  data  that  we should  be  concerned  about? When  I  look  at  so  we  can  see  for   1, 2, 3 ,  etcetera, I'm  looking  for  outliers, like  for  example,  three,  everybody rated  this  sample  between  1  and  6. There  was  someone  up  here   who  rated  it  really  high, and  there's  also  someone up  here  that  rated  this  one  high. So  what  I  like  to  do is  if  you  click  on  this, it'll  highlight  where... So  this  this  row  represents one  raider,  one  person. So  I'm  going  to  see  how they  rated  everything and  you  can  see  they tend  to  be  an  outlier. The  nice  thing  is  in  JMP  is  that  once  you  highlight  one  row, all  it  will  highlight  for  all  the  other  responses. So  I  can  see  that,  yeah,  they  rated 3 being  higher  4  being  higher. We  go  down,  look. Terminate. They're  opposite  of  everybody. It  seems  like  for  some  reason the  the  training, they  got  a  little  confused and  they  thought  higher  number  meant lower  chunkiness  and  vice  versa. So   what  I'm  going  to  do  is since  I  have  this  row  highlighted, I'm  going  to  close  this, it'll  stay  highlighted. So  this  is  row  one. I'm  just  going  to  delete  this  data and  then  we'll  move  on. Now  we  feel  pretty  comfortable with  the  data  is  pretty  much  solid. Like  I  said,  we're  going to  stack  the  columns. If  we  go  to  tables  stack. It's  going  to  pop  up and  we  just  want  to  stack all  17  of  the  treatments. The  nice  thing  is  in  JMP  17, now  you  get  this  preview. I  love  the  preview so  then  I  know  if  I'm  doing  things  right. What   we  see  here  is, as  I  can  see, it'll  have  the  ID  so  the  rater  and  then rate  the  chunky  pour  for  treatment  1. They  gave  it  a  five   and  they  did  number  two,  a  seven. This  is  how  we  want  the  data  structured and  we  can  change  the  column  names. 
So  instead  of  data, we're  just  going  to  say  chunky  pour, continuous. Then  for  label,  I'm  just  going  to   call  it  run  because  that's  really what  we're  going  to  use this  for  in  a  minute. I  just  stack  it. So  I  say,  okay,  that's  how  I  want  it. Now  we  have  the  data  table  in  this  way so  now  it  lets  us  use  a  summary  table. S ummary  tables  are nice  ways  to  be  able  to make  a  table of  the  of  different  statistics. So  what  we're  going  to  do  is  we're  going  to  highlight the  chunky  pour continuous  column  and  say  statistics. Do  mean. For  fun  in  case  we  want  to  use  it, we'll  also  say  standard  deviation. This  just  gives  us  the  overall mean  and  standard  deviation. But  if  we  want  to  do  it  per  run, I'll  highlight,  run  and  put  it  here  in  group. Now  when  we  look  at  this  preview, we  have  one  through  17 and  conveniently,  they're  in  order. One,  two,  three,  four, five,  six,  seven,  eight. All  the  way  to  17. We  have  the  mean and  the  standard  deviation. So  we're  going  to  say,  okay. Okay,  so  we  have  one  more  table. Now  we're  to  the  point  where we're  where  we  need  to  be because  I  have  each  run  as  a  row and  have  a  column  for  the  average  column  for  the  standard  deviation. So  what  I'm  going  to  do  is  I  will  highlight  this  column. If  you  go  to  edit  copy  with  column  names and  then  I'm  going  to  go to  our  original  data  table. We're  gonna  make  a  new  column  here and  say  edit  paste  with  column  names. There  it  is. I  should  have  done  both  of  those at  the  same  time,  but  I  didn't. So  we're  going  to  do. Do  this  one  as  well. Okay,  so  now  we  are  ready to  do  our  modeling. So  first,  first  thing  I  want  to  show  you is  what  we  would  get  if  we  just did  pass  fail  our  binary  response. What  we'll  do  is  if  we go  to  analyze  fit  model. Because  I  made  this  this  design  in  JMP  in  the  custom  design  platform, it  automatically  knows what  kind  of  design  this  is so  that's  why  my  model  is  already  built. If  there  is  a  really  convenient  way, if  you  knew  this  was  a  response  surface  design, let's  say, let's  say  this  wasn't  here. The  macros  are  convenient. If  I  highlighted  ingredient  A,  B  and  C. Said  Macros  Response  Service. It  pulls  it  all  up. It  already  knows  what  I'm  looking  for. So  that's  helpful. I  put  it  in  the  y  axis,  the  variable, the  response  y  chunky  pour  pass  fail. What  it  gives  us  is  nominal  logistic. I'm  not  statistician, so  I'm  not  going  to  go  into  any of  the  statistics  behind  what  it's  doing. I'm  just  going  to  show  you  how  what  you  get  out  of  it and  what a  scientist  might  be  looking  at. So  if  I  say  run, our  target  level  is  passed. So  when  it's  going  to  do  probabilities and  probability  of  passing. So  we  say  run. This  is  what  we  get. So,  I  mean, the  first  thing  that  a  scientists  like myself  would  probably  look at  is  this  effects  summary. I'm  looking  at  probably  looking  at  P  values  and  I  say, well,  nothing  significant  except  ingredient  A. There  are  other  things  that  we would  look  at,  but  I'm  going  to... I'm  going  to  go  over  that. We're  not  going  to  cover  that  today. 
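The Stack and Summary steps walked through above also have a compact JSL equivalent; here is a sketch with assumed table and column names (the real table has treatment columns running out to 17).

    // JSL sketch of the Stack and Summary steps (names are assumptions).
    dtRatings = Data Table( "chunky_pour_ratings" );

    dtStacked = dtRatings << Stack(
        Columns( :T1, :T2, :T3 ),            // ...list all 17 treatment columns
        Source Label Column( "Run" ),
        Stacked Data Column( "Chunky Pour" )
    );

    dtSummary = dtStacked << Summary(
        Group( :Run ),
        Mean( :Name( "Chunky Pour" ) ),
        Std Dev( :Name( "Chunky Pour" ) )
    );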
Instead,   I  want  to  just  look  at  the  profiler, because  that's  what  we  find,  at  least  in  our  in  our  experience, the  profiler  being  the  most  useful and  easiest  to  interpret for  the  scientists  and  when they're  communicating  with  others. So  what  this  is,  is I'm  going  to  make  it  a  little  bigger. Is  on  the  left  here. We're  going  to  get  a  probability  of failing  and  a  probability  of  passing. So  if  we  have  0.13  of  ingredient, a  0.12  of  ingridient  B, 0.45  and  of  ingredient  C, and  it's  actually  0.13%,   0.12%,   0.45%. I  just  didn't  change  it. It's  a  very,  very  small  proportion of  the  formula  that  we're  changing anyways  at  those  levels, this  says  100%  of  the  time we're  going  to  pass. If  I  move  it  up,  let's  say  to.. Have  like, say  point  two  of  this  ingredient  now. Now,  looks  like  we're  going to  pass  only  64%  of  the  time. You  can  see  these  curves, how  I  changed  ingredient  B  a  little  bit and  ingredient  C, maybe  we  can  get  back  up to  a  point  where  we  pass  98%  of  the  time. You  can  play  around  with  this. But  the  problem  with  this  is, is  like  I  said  earlier, passing. Maybe  this  pass  right  here  is  not the  same  as  passing  over  here. However,  we  don't  really  know  that  with  this  information, and  it's  kind  of  hard  thing  for  some people  to  wrap  their  head  around, like  it  was  just  probability  of  passing. What  do  I  do  if  if  all   I  can  get  is  an  85%  pass  rate? Like,  let's  say  hypothetically, this  was  the  best  we  could  do. What  do  I  do  with  that? So  that's  why  we're  looking at  continuous  responses. I'm  just  going  to  close  this   and  we're  going  to  do  that, build  that  model  again, except  let's  do  it  for  the  mean of  our  continuous  scale. So  we're  going  to  have  to  remove  chunky  pour and  we're  going to  add  the  average  here. We're  just  going  to  say  run. Keep  it  simple. Do  the  effects  screening  report. Now  you  can  see   there's  a  lot  more  information  going  on that  we didn't  get  before. So  where  before,  if  you  remember, all  we  saw  was  that  ingredient  A  had  a  really  low  P  value. Everything  else  was  like  0.99. The  conclusion  was ingredient A  does  everything. Well,  it's  not  actually  the  whole  truth, as  we  can  see  here. Yes, ingredient  is  the  most  and  most  important. The  main  effect  of  ingredient  right  here. But  B  and  C  also  have  a  role  to  play. While  not  as  big, they're  still  an  important  role. So  we  look  at  our  actual  predicted  plot. It  looks  pretty  healthy. Our  lack  of  fit. Look s  good. I'm  not  going  to  go  into  all  the  details of  everything  that  we  look  at, mainly  because  I'm  not  statistician. That's  just  what  I  look  at. I'll  look  at  the  lack  of  fit. I'll  look  at  the  residuals  to  see  if there's  anything  weird, the  studentized  residuals. Then  really,  I  come  to  the  profiler and  now  you  can  see   this  gives  us  a  much  different  picture, much  more  complete  picture, where  as  I  increase  ingredient  A, the  chunky  pore  increases, but  increasing  these  these  ones  does  too. So  they  they  also  have  a  role  to  play. If  we  were  to say  that  we  want  to  minimize  it, I  think  it's  pretty  obvious  what  the... Desirability  is  going to  come  out  to  being. 
But just to show you, we'll go to the red triangle by the Prediction Profiler, then Optimization and Desirability, and turn on the desirability functions. Here is the desirability. You can change it through the red triangle, but I find it easier to just hold Control and click on it. Now we can change what our goal is. In this case, we want to minimize it, because we don't want chunky pour, and consumers don't like it either. So we say Minimize, and OK. Now we can go back to Optimization and Desirability and say Maximize Desirability. It did what I thought it was going to do: take these two ingredients out and put this one as low as you can, and you'll get the lowest chunky pour that you can. In reality, we had some other constraints, so we couldn't do exactly that. There were other factors at play, but this definitely gave us a really good idea of where we needed to go, what was important, and how to control this chunky pour, to the point where, when we implemented the changes, the complaints went away. It's been good ever since.

That is the nutshell of how you can take something that is hard to measure, really subjective, and binary (pass/fail, or good or bad) and convert it into something that's continuous. It's a relatively simple method, and you can use it for a number of things. As long as you have people available to help you out, you can measure a lot of things that could be considered hard to measure. Where do we go from here at Hood? Just to give you an example of some other things we encountered: this one, the chunky pour, is actually one of the easier ones to do. Let's say this is another product we were working on a long time ago, where you have coffee, you're going to add some foam to it, and you want to understand how well that foam dissipates into the coffee. That's a tough thing to measure; we definitely don't have any instrumentation that can really measure it. Videos really helped us understand how we could measure it and get some useful information out of it. As you can see, we're trying to measure how it looks and how well it moves for this one versus, let's say, this treatment over here. You can see they're quite different: one moves really fast, the other moves really slow; this one looks kind of chunky, the other one not so much. That's how we use it, and we use it quite often. I appreciate you taking the time to listen to my talk. I hope this has been useful and that you'll be able to find a way to implement it in your day-to-day work. Thank you.
In this presentation, we use the longitudinal analysis capability of the Structural Equation Modeling (SEM) platform in JMP to measure the impact of temperature and usage amount on detergent performance from the consumer perspective, using the overall rating of the product as a surrogate for overall performance. In this diary study measured over time, data were collected from consumers who used one of two products in three phases of four-week time intervals. Using data collected from Week 5 to Week 9, we modeled the trajectories of the performance of the detergent in relation to the temperature and usage (number of pods). Using standard SEM fit measures, we found the linear growth curve model fit the data well for the underlying latent structures. Results showed that one of the test products performed significantly better under cold-wash temperatures and used fewer pods than the other product. This result was a significant win for the company under the sustainability platform and performance under colder wash conditions.

Hello, everyone. My name is Zhiwu Liang, statistician from the Procter & Gamble Company. I support the business in the Brussels Innovation Center for P&G. My main job is consumer survey data analysis. Today, Narayanan and I will present Growth Curve Modeling to Measure the Impact of Temperature and Usage Amount on Detergent Performance. Next slide, please.

Here are the contents we will cover today. First, I will give a brief introduction to structural equation models and a bit about the data we will be using for our modeling. Then I will turn to Narayanan to introduce growth curve modeling and the model-building process, plus the JMP demo. Following the demo, I will present the conclusions and next steps. Next slide, please.

Structural equation modeling is a multivariate technique used to test a set of relationships between observed and latent variables by comparing the model-predicted covariance matrix with the observed covariance matrix. In SEM, observed variables are manifest variables that serve as indicators for the latent variables; this is constructed with the measurement model. The latent variables then form regression models that build a network, which we call the structural model. Here is an example with three latent variables and eight observed variables in JMP, showing the SEM structure. As you can see in the bottom-left chart, the circles represent the latent variables, which are calculated through their indicators. For example, the cleaning latent variable is indicated by the squares representing the manifest variables overall cleaning, stain removal, whiteness, and brightness. Similarly, the freshness latent variable is indicated by three manifest variables. If you look at the right side of the window, the loadings window shows the structure of the measurement model, how each latent variable relates to its indicators. The regression window at the bottom shows the two regression models: cleaning drives overall rating, and freshness drives overall rating. This is the structure of the structural equation model. Next slide, please.
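As a rough, non-JMP illustration of the measurement-plus-structural idea just described, here is a minimal sketch using the open-source semopy package with lavaan-style syntax. The indicator column names are hypothetical placeholders, and this is only a sketch of the concept, not the model fit in the study.

```python
import pandas as pd
import semopy

# Measurement model: two latent variables with their indicators.
# Structural model: the latent variables drive the overall rating.
desc = """
Cleaning =~ overall_cleaning + stain_removal + whiteness + brightness
Freshness =~ freshness_1 + freshness_2 + freshness_3
overall_rating ~ Cleaning + Freshness
"""

data = pd.read_csv("detergent_survey.csv")  # assumed: one row per response, columns as above
model = semopy.Model(desc)
model.fit(data)
print(model.inspect())  # loadings, regression coefficients, and variances
```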
The data we use for our growth curve modeling come from a survey we conducted in France with 119 consumers. We divided these 119 consumers into two groups. Sixty of them used the control product, which is the Ariel soluble unit dose (the pods), marked as 0 in our data set. The other 59 consumers used the test product, the Ecolabel product, coded as 1. During the 12 weeks of the test, each consumer used their own product for the first four weeks. Then they went into the eight test weeks and used one of our assigned products, either the Ariel SUD or the Ecolabel. For each wash, the consumer filled in a questionnaire providing information about their washing behavior, such as the washing temperature, the number of pods used, the soil level of the fabric (how dirty it is), and an overall rating of the product's performance. Our modeling objective is to test whether there is an effect of the product, of the washing temperature, and of the number of pods used on the overall performance rating for each wash. Next slide, please.

Since every consumer has different washing habits and different conditions, not all consumers have the same number of washes during the test weeks. Therefore, to give every consumer equal weight in our model-building data set, we first aggregate the consumer data at the panelist level on a weekly basis: we take the average washing temperature, number of pods used, and overall rating across the loads for that consumer during that week. After aggregating the data, we use an exploratory tool such as JMP Graph Builder to identify whether there is any linear trend in the overall rating, the temperature, and the number of pods used during the test weeks. From this exploratory stage, OAR is pretty stable from week 9 to week 12, so we use the intercept-only model for OAR. For the temperature and for the number of pods used, the exploratory stage showed either an increasing or a decreasing trend, so we use a linear growth model to describe the temperature indicator and the number-of-pods indicator. To explain the product impact, we also include the product variable in our model. We first build growth curve models for temperature and for the number of pods, then use these latent variables to build a regression model from the product variable, the intercept and slope of temperature, and the intercept and slope of the number of pods used to the intercept of OAR, to build the multivariate growth curve model.
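As a side note, the panelist-by-week aggregation described above can be sketched in a few lines of pandas; the file layout and column names are assumptions, not the actual study data. The point is simply one row per panelist per week, then one column per week, which is the layout the repeated-measures SEM models expect.

```python
import pandas as pd

diary = pd.read_csv("diary_washes.csv")   # assumed columns: panelist, product, week, temperature, pods, oar

# Average each panelist's washes within each week
weekly = (diary
          .groupby(["panelist", "product", "week"], as_index=False)
          .agg(temperature=("temperature", "mean"),
               pods=("pods", "mean"),
               oar=("oar", "mean")))

# Wide layout: one column per week for each measure
wide = weekly.pivot_table(index=["panelist", "product"], columns="week",
                          values=["temperature", "pods", "oar"])
print(wide.head())
```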
Now I will turn it over to Narayanan to introduce the latent growth curve model. Narayanan, it's your turn.

Thank you, Zhiwu, for the great [inaudible 00:06:49]. Hi, everyone. My name is Narayanan. I am an adjunct professor at the University of Cincinnati, where I teach courses on data mining using JMP. I'd like to start by giving a very broad definition of latent growth curve modeling. As we go along, I may use the letters LGCM to represent latent growth curve modeling and SEM to represent structural equation modeling.

Latent growth curve modeling is basically a way to model longitudinal data using the SEM framework. Because it is built in the SEM framework, it has all the advantages of specifying and testing relationships, as Zhiwu was explaining with the structural equation modeling example. As a side note, I would like to mention that LGCM is actually an application of confirmatory factor analysis, which is a submodel within structural equation modeling, with an added mean structure; this will be explained when we get into JMP. One of the benefits of using the SEM framework is that we are able to evaluate model fit. Let us look at the statement there, which says every model implies a covariance matrix and a mean structure. What this really means is that the observed covariance matrix and mean vector can be reproduced from the model parameter estimates, which are estimated using the latent growth curve model. The equality between the two is what many of these fit indices are testing. One of the oldest is the chi-square test, and the hypothesis it tests is listed there: the equality between the population and the model-predicted covariance matrices and mean vectors. However, this test, which is one of the oldest, has some watch-outs. One is that the test statistic is a function of sample size, which means that larger sample sizes will tend to reject the model even for trivial differences. Another is that the test is global and does not reflect local fit, such as could be measured by an R-square. Also, the fit being tested is too exact, as specified in the hypothesis; we know from the famous Box statement that all models are wrong, and our models are only an approximation. Because of this, several alternative fit measures have been proposed, and I'd like to mention three of them here. The first is the Root Mean Square Error of Approximation. This measures model misfit, adjusting for the sample size, which was an issue with the chi-square test. It is a badness-of-fit measure, so lower numbers are better. One of the advantages of this fit measure is that we have a confidence interval for it, and the suggested threshold is that the upper bound of the confidence interval be less than 0.10. The next are the Comparative Fit Index and the Non-Normed Fit Index. These are relative measures that test how good your proposed model is compared to a baseline model, which is usually a model of no relationships. These are goodness-of-fit measures, and the suggested criterion is that they cross a threshold of at least 0.95. The last one is the Standardized Root Mean Square Residual. This is essentially an average of the squared residuals of all the elements in the covariance matrix. It is a badness-of-fit measure, so again we are looking for smaller numbers, and the suggested threshold is that this value be less than 0.08.
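For reference, common textbook forms of these fit measures are shown below. Exact conventions vary slightly by software (for example, N versus N minus 1, and whether mean residuals are included in the SRMR), so treat these as indicative rather than as JMP's exact formulas.

```latex
\chi^2_M = (N-1)\,F_{ML}, \qquad
\mathrm{RMSEA} = \sqrt{\frac{\max\!\left(\chi^2_M - df_M,\ 0\right)}{df_M\,(N-1)}}

\mathrm{CFI} = 1 - \frac{\max\!\left(\chi^2_M - df_M,\ 0\right)}
                         {\max\!\left(\chi^2_M - df_M,\ \chi^2_B - df_B,\ 0\right)}, \qquad
\mathrm{SRMR} = \sqrt{\frac{2\sum_{i \le j} r_{ij}^{2}}{p\,(p+1)}}
```

Here the subscript M denotes the fitted model, B the baseline (independence) model, N the sample size, p the number of observed variables, and r_ij the standardized residuals of the covariance elements.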
On top of all this, finally, do not forget to check the actual residuals, the standardized residuals. What we are looking for there are numbers beyond the minus 2 and plus 2 thresholds. The idea is to look at the totality of fit and not just any one measure.

Having discussed fit measures, let us now look at the longitudinal processes we want to study. Zhiwu described three different processes. The first is the success criterion, measured by the overall satisfaction rating from week 9 to week 12. Then we have two time-varying covariates, meaning they vary over time: one is the temperature setting at which the product was used, from week 5 to week 12, and the other is the amount of product used, also from week 5 to week 12. Finally, we have an indicator variable for what type of product it is; this is a time-invariant covariate that does not change with time. The modeling strategy is, first, to visualize the data using Graph Builder. Then we select a univariate latent growth curve model for each of the processes. Then we combine all of them into a multivariate LGCM. Finally, we test the hypothesis that Zhiwu proposed, which is how the product and the other growth factors impact overall satisfaction. We will choose the simplest model as we build.

I am going to get into JMP. I am running JMP 18, which is an early adopter version. I am going to show some scripts, and I will show you how I got to some of them from the JMP platforms. The first thing I want to do is visualize the overall satisfaction, and these are trajectories: one line per individual from week 9 to week 12. Here, the overall satisfaction is plotted for each of the 119 consumers, basically one trajectory per consumer. If you look at this particular consumer, row number 16, that person's trajectory is on a downward trend from week 9 through week 12: they started somewhere in the mid-50s, and by week 12 their satisfaction measure has come down to about 37.5 on a scale of 0-100. Let us look at another person. This person here, who used the Ariel product, has a trajectory on an upward swing, going from the mid-70s to probably the early 90s by the time they reach week 12; they are getting more and more satisfied week over week. Sorry for that; a bubble popped up on the screen. What we want to do is understand how different consumers are experiencing satisfaction over the weeks, and the change in these processes for these consumers is what we want to model using LGCM. What I'm going to do is turn on the script, LGCM of overall satisfaction. I have built three different models here, in which a latent variable corresponds to an intercept for the repeated measures of overall satisfaction from week 9 through week 12, plus a fourth model, which is a simplification of the first. The first is a no-growth model, which means different people have different levels of satisfaction at the beginning, which is week 9, but then their trajectories flatten out and do not grow over time.
The second model is a linear growth model, which means the trajectories change in a linear fashion over time. The third model is a quadratic model, which means the trajectories change in a quadratic fashion over time. Finally, I have a simplification of the first model, where I assume homoscedasticity, that is, no change in the variance across time. I'm going to look at the fit measures I talked about and choose the model that fits best. What I'm looking for is low values of chi-square, high values of CFI, which goes on a scale from 0 to 1, and low values of RMSEA, which also goes on a scale from 0 to 1. It looks like all my models, no-growth, linear growth, and quadratic growth, fit the data equally well. However, I'm going to take the simplest of the models, because if I look at the estimates in the path diagram, many of the coefficients relating to the slope, the linear slope or the quadratic slope, are not significant, as shown by the dotted lines. In the linear growth model, we have an intercept, which measures the initial level of satisfaction, and a slope, which measures the rate of increase or decrease of satisfaction over time. The slope measures the change; the intercept measures the initial level. We can see that all the estimates related to the slope are not significant, as indicated by the dotted lines, and the same is true for the quadratic model. Therefore, I'm going to take the simplest model, the no-growth model, for this process, overall satisfaction.

Let me show you how I do this. In JMP, go under Analyze, pick Multivariate Methods, and choose the Structural Equation Models platform. Choose the repeated measures, in this case OAR from week 9 through week 12, drop them in the Model Variables box, and click OK. We now have these four repeated measures available as modeling variables in the path diagram area. I could build this model from scratch using the path diagram, but JMP has made it easier with shortcuts. I'm going to go under the Model Shortcuts red triangle, choose Longitudinal Analysis, and check the linear latent growth curve or the intercept-only model. If I choose the intercept-only model, I get the path diagram you saw in my script. If I run the model, you get the estimates and the fit statistics for this model. If you want to add the linear growth model, do the same thing: come under Model Shortcuts, Longitudinal Analysis, and Linear Growth Curve Model. Now we have not only an initial level, represented by the intercept latent variable, but also the rate of growth of the process, represented by the slope latent variable. We can run this model: click Run, and you get the model estimates, which, as I showed you before, are not significant for the slope latent variable. You get the fit statistics right here under the Model Comparison table. These models are easy to fit in JMP using the shortcut menu available under Model Shortcuts. I'm going to close the one I just created.
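As a compact summary of the specifications being compared, a standard univariate LGCM for the four OAR measurements can be written as below. The time scores shown are one common coding choice, not necessarily the exact parameterization the JMP shortcuts use.

```latex
y_{it} = \eta_{0i} + \lambda_t\,\eta_{1i} + \gamma_t\,\eta_{2i} + \varepsilon_{it},
\qquad \eta_{ki} = \alpha_k + \zeta_{ki}

\text{no-growth: } y_{it} = \eta_{0i} + \varepsilon_{it};\qquad
\text{linear: } \lambda_t = 0,1,2,3,\ \gamma_t = 0;\qquad
\text{quadratic: } \lambda_t = 0,1,2,3,\ \gamma_t = 0,1,4,9
```

Here eta_0 is the intercept (initial level), eta_1 the linear slope, eta_2 the quadratic slope, and the zeta terms are the individual deviations around the mean growth factors.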
We have so far built a univariate LGCM for a single process. I'm going to repeat the same thing for the other two growth processes, and we're going to look at the wash temperature trajectories. Let me show you how to do this in JMP. Under Graph, click Graph Builder and open up the temperature variables. We want to look at temperature from week 5 through week 12, so drop them on the x-axis. For the type of graph, choose the last icon in the bar at the top; this is a parallel plot. There will be some smoothing associated with it, so drag the smoothness slider all the way to the left so there is no smoothing at all. Take the product variable, which is an indicator variable, and put it on Overlay. Now you get individual trajectories. If you want to add the average trajectory, choose the sixth icon on this toolbar from the left: hold the Shift key and click on it. Now you get the average trajectory of the temperature used over these eight weeks. Click Done to get the plot with more real estate. This is exactly the plot I showed using the script. You can clearly see that from week 7 onwards there might be growth in the temperature setting; it looks like people are increasing the temperature as time progresses from week 7 through week 12. I'm going to close this. We now have a graph to visualize the trajectories of the temperature setting.

We repeat the same process to choose a model for this process. As before, I built the same three models: a no-growth, a linear growth, and a quadratic growth. Looking at the fit statistics here, this time we definitely see a significant improvement going from the no-growth to the linear growth model. The quadratic growth is only a marginal improvement over the linear growth model, and again, for the same reason as before, all the estimates for the quadratic slope are not significant. To keep things simple, I'm going to choose the simpler model, the linear growth, for temperature. The last process is pod usage, the number of pods. Here we can clearly see an increasing trend, more so for the Ecolabel product, which means people are using more and more product when they use Ecolabel as compared to Ariel, which is a P&G product. I want to model this, so let me close that and click on the script for LGCM of pod usage. Looking at the fit statistics, I again see a good model fit, especially for the linear and the quadratic models. For the same reason as before, I'm going to choose the linear model. Here I want to look at the estimates for the quadratic slope, and this is what I mean by not choosing the quadratic model: all the parameter estimates pointing to it are not significant. Now we have a model for each of the three processes. We chose a no-growth model for overall satisfaction and a linear growth model for the wash temperature. Now I'm going to put them all together using a multivariate latent growth curve model; this is basically all three processes put together.
Here, I want to show you the similarity between a confirmatory factor analysis model and a latent growth curve model, as was pointed out in the previous slide. You can see that there is a mean structure added to it, a triangle with the number one, and there are lines going from that to each of the latent variables. If I right-click, use the Show option, and turn off the mean structure, you can see the familiar confirmatory factor analysis model with latent variables and the indicators associated with each one of them. We have a single latent variable, the intercept, for overall satisfaction. We have two latent variables for the temperature, the initial intercept and the slope. We have the same two latent variables for pod usage: the initial level, represented by int pods, and the rate of change of product usage, indicated by slp pods, which is basically the slope of pods. Let me turn the mean structure back on. Now we can actually look at the estimates, which are really one of the important parts of the latent growth curve model. What we have here is an estimate of the initial level of satisfaction at week 9, because that was the starting time period for overall satisfaction; it is about 71 on a scale of 0-100. This is the average temperature setting at week 9, which is 36 degrees Celsius. Here is the product usage, 1.4 pouches, and here is the rate of change of product usage, because there is a slope of product usage, the latent variable, which is about 0.02. People are using slightly more as time goes on. That is what we get. The overall fit of this model is also fairly good; I think we saw that: CFI exactly at the 0.95 threshold, and the upper bound of the RMSEA definitely less than 0.1.

Now we go to the last model, which is the hypothesis that Zhiwu wanted to test, where we want to see if the product, the indicator variable, and the other growth factors have a significant impact on overall satisfaction. To remove the clutter, I have not shown all the indicators; all we are seeing are the circles, which represent the latent factors for each of the growth curve models, and a single product variable indicating what type of product it is. Again, let us look at the fit of this model. The fit is indeed good: we have 0.95 for the CFI and less than 0.1 for the upper bound of the RMSEA. We will look at more fit indices after we interpret some of the estimates here. I'm going to interpret the solid lines, which are the significant coefficients. We have a significant product effect from the product variable to the intercept of overall satisfaction. This can be interpreted basically as a regression coefficient: the average level of satisfaction for the product coded 1 minus the average level of satisfaction for the product coded 0. Ariel is coded as product 0, so we have much more satisfaction with Ariel, a delta of negative 9 in favor of Ariel on a scale of 0-100. That is a big change, a delta in favor of the Ariel product. Let us look at the product effect on pods.
Again,  the  same  way, average  amount  of  product  used for  product  coded  1  minus  product  coded  0. This  time,  we  are  using  more of  the  Ecolabel  product. If  you  are  a  manufacturer  of  Ariel, this  is  good  news  for  you. A lso,  the  rate  of  change  of  product  use is  also  more  for  Ecolabel compared  to  Ariel, or  0.02  pouches  from  week  to  week. Finally, we  have  the  intercept  of  temperature having  a  negative  impact on  the  overall  satisfaction, which  means  higher  temperatures lead  to  less  satisfaction. Remember,  these  are  products which  are  marketed  as  cold-wash  products. That  means  they  should  work  better in  cold  temperatures and  not  higher  temperatures. I  also  want  to  show  you  where  you  can  look for  other  fit  statistics  beyond what  is  coming  out in  the  model  comparison  table. Under  the  S tructural Equation Model in  red  triangle, if  you  check  on  Fit  Indices, which  I've  already  checked, there  are  more  fit  indices that  can  be  shown  at  the  bottom. We  want  to  look  at  CFI  and  RMSEA, which  we've  already  seen, and  here  is  the  Standardized Root  Mean S quare  Residual, which  I  discussed. This  is  also  exactly at  the  threshold  of  0.08. All  in  all,  in  terms  of  fit  indices, our  model  does  fit  quite  well. Finally, I  told  you  not  to  forget  the  residuals. These  are  normalized  residuals in  terms  of  the  measured  variables. We  have  21  measured  variables, eight  for  pods,  eight  for  temperatures, four  for  overall  satisfaction, and  one  for  the  product  variable. This  is  a  21  by  21  matrix. What  we  are  looking  for is  numbers  which  are  outside the  plus  2  minus  2  range. There  are  just  too  many  numbers to  look  at  in  the  table, but  JMP  produces  a  heatmap. Heatmap  option  is  also under  the  red  triangle. What  we  are  looking  for is  dark  red  or  dark  blue. Here,  we  have  two  dark  reds which  are  relationship   between pod  usage  at  week 6 , temperature  at  week  12, pod  usage  at  week  6, and  temperature  at   week 9. Finally,  we  have  one, because  this  is  just  a  mirror  image of  the  one  that  is  here. This  is  the  relationship between  temperature  at  week  9 and  temperature  at  week  10, which  is  not  modeled. This  could  actually  be  modeled by  adding  an  error  covariance, which  I  did  not  do. If  I  did  this,  the  model,  in  fact, would  be  even  better. I  want  to  go  back  to  the  presentation and  summarize  what  we  have  found. Oops, sorry,   wrong  slide. In  terms  of  conclusion, we  started  the  Graph  Builder to  visualize  our  trajectories, and  we  built  latent  growth  curve  model using  the  SEM  platform. We  extended  from  univariate to  multivariate  models. A ll  our  models,  including  the  last  one, had  acceptable  fit,  in  fact,  good  fit. Product  had  a  significant  impact  on  OAR, which  means   Ariel  is  better  than  Ecolabel in  terms  of  its  overall  satisfaction and  significant  impact on  the  number  of  pods, which  means  less  product  was  used for  Ariel  compared  to  Ecolabel, and  also  from  week  to  week. Intercept  had  a  negative  impact  on  OAR, which  means  people  prefer lower  temperature  setting than  higher  temperature  setting. 
If you are a P&G manufacturer, this is good news for you, because Ariel works better than Ecolabel in the modeling framework we have used. I'm going to turn it over to Zhiwu to see what the next steps are from these model results. Zhiwu?

Thank you very much. Thank you, Narayanan, for an excellent presentation and a wonderful demo. As Narayanan mentioned, the modeling results show that the product has a significant impact on the overall satisfaction with the performance of the detergent products in our test. This result gives us the confidence to make a very clear claim: Ariel favors cold wash and can be used in smaller amounts than normal products. The modeling also confirms the consumer behavior change: if you use the Ariel product, more washing loads go to cold wash, using less energy and less product. We also plan to conduct a larger consumer study to include more covariates in a future modeling stage, such as additive usage, the washing cycle of every wash, and the load size per wash. This is our next step. Next slide. Now we would like to take questions if you have any. Thank you very much for attending the presentation. We look forward to your questions, probably at the JMP Summit.
Effective data analytics is critical for the success of product development. Product development processes for adhesives and sealants involve data analytics for tasks needed at various project phases. Too many formulation chemists still depend on Excel for daily data recording, formulating, and analysis. While some chemists use JMP for statistical analysis and experimental design, JMP’s powerful and versatile data analytics should be used more universally for many daily tasks in product development. When shown how widely JMP can be used, more formulation chemists should see the benefits of adopting JMP as a daily tool.

This presentation shares examples of how easily JMP can be used in daily data analytics, in addition to the statistical analysis and DOE that most chemists use. A variety of examples are given, including handling data from literature and patent searches, analysis of raw materials, and a variety of formulation-related tasks such as formula stoichiometry calculation, mixing volume balance, data recording, and tabulate/graph analysis. In addition, this presentation demonstrates how JMP is used for product selection, application troubleshooting, failure mode analysis, DOE, and more.

Hi, this is Stone Cheng. I'm a technical director at Henkel Corporation, and I've been using JMP for more than five years in product development. Today I'm happy to share our best practices in utilizing JMP data analytics in the product development of adhesives and sealants. My presentation has two parts. Part one is an application gallery, where JMP used at various stages of product development will be discussed, and in part two I will focus on using JMP as a formulation worksheet, with a demonstration. In my organization, folks either have not heard of JMP or have the impression that JMP is an advanced DOE software. For the latter, it is true, but it's not the whole truth. Since there are other DOE software options, it is hard to persuade folks to switch away from the tools they are familiar with. JMP is an excellent all-around data analytics tool, and to promote JMP adoption we need examples that demonstrate its power in data analytics. In my presentation, the examples are taken from my experience in adhesives and sealants.

About Henkel: Henkel is a 22-billion enterprise with two business units, adhesive technologies and consumer brands for laundry, home, and beauty care. I'm in the adhesive technologies unit. We are the global adhesive leader, serving 800 industries with more than 20,000 products.

Let's start with case number 1. When a customer has a need, they may call a service center asking for a product recommendation. For example, a customer may need a room-temperature-cure adhesive with a shear strength between 1,500 and 3,000, a work life between 15 and 30 minutes, and a package size of 10. What we need is a searchable product selection guide, which is what case 1 shows. Once the product information is entered into a JMP table, one can use Tabulate and a local data filter to down-select products based on the customer requirements, as shown here. This is a very, very powerful tool in the pre-formulation stage.
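Outside of JMP, the same down-selection can be sketched with a pandas filter. The table layout, column names, and criteria below are illustrative assumptions based on the example inquiry above, not an actual product guide.

```python
import pandas as pd

# Assumed selection-guide table: one row per product with its key specifications
products = pd.read_csv("product_guide.csv")  # product, shear_strength, work_life_min, package

candidates = products.query(
    "shear_strength >= 1500 and shear_strength <= 3000 "
    "and work_life_min >= 15 and work_life_min <= 30 "
    "and package == 10"
)
print(candidates.sort_values("shear_strength"))
```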
My team has been applying this tool to search formulations, patents, literature, and raw material specifications.

In case number 2, multivariate analysis is applied to silver fillers, which are used in making conductive adhesives. From the multivariate analysis, if we focus on the surface area, it has a moderate negative correlation with the tap density, and a stronger positive correlation with the lubricant amount, as measured by the weight loss at 530°C. There is a graphical view here and a table format here. With this analysis, the chemist can select the right silver package for the application.

Case 3 uses literature data. Literature is a great place for learning chemistry and formulation, and this particular cited paper illustrates how epoxy tensile strength is affected by a mixture of two amine hardeners. The results are described qualitatively by a table and a graph of the stress-strain curves here. For a formulator, a qualitative description is good, but learning via quantitative modeling is even better for prediction purposes. In case number 3, we take the literature data and create two factor columns right here. One is the epoxy/amine stoichiometry, which we call the index; the other factor is the fraction of one of the amines, PAE, relative to the total amine. With the Fit Model platform, using an RSM model, we showed that the quadratic effect of the index, together with the two main effects, are all significant. The prediction profiler and the contour profiler are used to quantify the learning and give predictions.

In case number 4, my group was assigned to support a technology platform that includes about 30 products. Since we were not involved in the original formulation development, studying the formulation family in this case is not trivial, and looking at a big Excel table with all the formulations is certainly not effective either. Case number 4 is an example of addressing this challenge. I selected the three most used ingredients in these 29 formulations, monomer 1, monomer 2, and an oligomer. Then, using hierarchical clustering analysis, the formulas right here were identified as having a very small distance, implying that they are closely related. In fact, they differ only in the photoinitiator, chosen for different wavelengths. We can add more ingredients one by one to the hierarchical clustering and learn the formulation family using this method.

Most chemists analyze formulation performance in Excel. Case number 5 is a JMP Tabulate that has the same data format as Excel: the various pieces of information for a formula are displayed in the same column. What you see here has a heading, a recipe, the processing and material characteristics, and all the results. To create such a data structure, we need to enter the data into the JMP table in a special format, which will be discussed in detail in part two of my presentation.
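For the clustering idea in case 4, a rough open-source analog is hierarchical clustering in SciPy on the ingredient columns. The file, the column names, and the number of clusters below are assumptions.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

wide = pd.read_csv("formulation_wide.csv", index_col="formula_id")
X = wide[["monomer_1", "monomer_2", "oligomer"]]       # top three ingredients, wt%

Z = linkage(X.values, method="ward")                   # hierarchical clustering on the recipes
labels = fcluster(Z, t=4, criterion="maxclust")        # cut the tree into 4 formulation families
print(pd.Series(labels, index=X.index).sort_values())
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree, analogous to the JMP report
```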
Case 6 is a silicone study involving 23 formulations and more than 10 measured properties. It is quite overwhelming to try to analyze the raw data in such a big system. We first construct a series of graphs with the property retention on Y and the initial property on X. For example, in the first graph here, the adhesion retention is on Y and the initial adhesion is on X. We also add reference lines, the acceptance limits, for each axis. When a formulation is selected, for example, I choose this point, which is in the quadrant with acceptable initial adhesion and retention, then all its associated properties, such as tensile strength, elongation, and hardness, show up, and the formulation shows up at the same time, all thanks to automatic highlighting. This is all thanks to JMP's dynamic linking capability. Visual analysis in this way is very effective for chemists to understand the overall behavior of the system.

In polymer science, we measure the modulus of a polymer as a function of temperature with an instrument called a dynamic mechanical analyzer, DMA. DMA data, the temperature, modulus, and tan delta, are typically transferred to Excel in a wide format for plotting. Overlaying several DMA curves for comparison is doable in Excel, but it is not a trivial effort. In case number 7, we stack 74 DMA results together, and by using Graph Builder we can compare DMA results very quickly, just by clicking. I cannot imagine doing the same in Excel with 222 columns, which is basically 74 samples times 3 signals per sample; it would be very difficult to handle in the Excel environment. Graph Builder is excellent at turning a very complicated Excel graph into a visually digestible analysis. In case number 8, the needle pull strength is illustrated in Graph Builder using four variables: eight adhesives across the top, three different radiation systems on the Y axis, and four radiation times combined with two needle hubs. See how easy it is to understand this JMP graph compared to the Excel graph right here.

In case number 9, we are conducting an accelerated aging study for four epoxy prototype formulations by measuring their initial adhesion on three substrates with three replicates. The aging conditions are two different temperatures and an eight-week aging time with a two-week testing interval. This aging design and the data were initially recorded in Excel, and we converted the Excel data into a JMP table with only seven columns in a stacked format, and then we made a graph. You will agree that the visualization in JMP Graph Builder makes it much easier to see the aging performance than looking at the busy Excel table here.
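The stacking step behind cases 7 and 9 has a simple pandas analog, shown below. The wide-file layout and the column-naming convention are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed wide DMA export: a temperature column plus modulus_<sample> and tandelta_<sample> columns
wide = pd.read_csv("dma_wide.csv")

long = wide.melt(id_vars="temperature", var_name="signal_sample", value_name="value")
long[["signal", "sample"]] = long["signal_sample"].str.split("_", n=1, expand=True)

# Overlay storage-modulus curves for a few samples, like Graph Builder with a local data filter
subset = long.query("signal == 'modulus' and sample in ['S01', 'S07', 'S19']")
subset.pivot(index="temperature", columns="sample", values="value").plot(logy=True)
plt.show()
```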
In the formulation stage, we frequently need to optimize compositions. Case number 10 is an example where two catalysts in a polyurethane are optimized with a DOE design. The factors are the catalyst ratio and the catalyst total amount. It is a 10-run face-centered central composite design, and the prediction profiler indicates that the catalyst total amount has a nonlinear effect on the work life. The white area in the contour profiler is the sweet spot of the design space with the desired work life, in this case 20-28. It is important for the chemist to select this green-highlighted white area for better production robustness, rather than the area [inaudible 00:12:10] in blue, because it has a higher tolerance for changes in the amount in case an operator makes a minor mistake. That is why it gives better production robustness.

We routinely see chemists perform statistical analysis of adhesion data like what you see here, but we hardly see anybody presenting results of failure mode analysis. In JMP, failure mode analysis can be performed in two places: one is the contingency analysis in the Fit Y by X platform, and the second is Graph Builder. Case number 11 is an example applied to a silicone sealant, where the change in failure mode before and after high-temperature aging is clearly shown.

Besides good adhesion, the adhesive rheology needs to be formulated so that it can be effectively applied to the substrate. We have a project to develop a seam sealant to protect metal joints by a spraying process. Case number 12 is an example illustrating the complication of the spraying process with respect to sealant viscosity variation. We have three spraying variables, the pressure, the nozzle type, and the head travel speed, plus one material factor, which in this case is the viscosity. Initially, we planned the experiment in a factorial DOE fashion, but one of the factors, the spraying pressure, was very hard to control. We ended up performing a regression on the 40 runs with an RSM model, using the width and thickness of the sprayed seam as the responses. We get a good model, with an R-squared of about 0.91 for width and 0.81 for thickness. The modeling result tells us that the spraying conditions need to be adjusted depending on the sealant viscosity, which is illustrated in the prediction profiler here. Each processing parameter has its own machine limits and desirable operating limits. If the predicted processing variables are outside those limits, then the chemist will need to redesign the formulation rheology and make sure that manufacturing has the processing capability to meet the viscosity requirement. This example shows that formulation design and application constraints need to be considered side by side, and JMP is a very good tool for facilitating this type of study.
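For reference, a 10-run face-centered central composite design like the one in case 10 can be written out directly in coded units. This is a generic sketch; the actual JMP design (run order, number of center points) may differ.

```python
import itertools
import pandas as pd

factors = ["catalyst_ratio", "catalyst_total"]

corners = list(itertools.product([-1, 1], repeat=2))   # 4 factorial corner points
axial = [(-1, 0), (1, 0), (0, -1), (0, 1)]             # 4 face-centered axial points
centers = [(0, 0)] * 2                                 # 2 center points, 10 runs in total

design = pd.DataFrame(corners + axial + centers, columns=factors)
print(design)   # coded -1/0/+1 settings; map to actual ratio and amount ranges before running
```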
Case number 13 is an example where JMP is used to handle huge instrumentation data sets. In testing thermal interface materials, the temperature at different locations and the power consumption data are collected and then uploaded to JMP. Once the data are in a JMP table, visualization and analysis of a data set with as many as 500,000 rows is still very manageable, with a fast response. That means JMP can indeed be used to handle instrumentation data.

We have a project to apply adhesive for a sensor application by a printing process, and this application is challenged by pinhole defect issues. A process engineer changed six processing variables somewhat randomly and collected results from 21 runs. His data analysis did not reveal any particular trend, so JMP was then used for the troubleshooting. In this case, the partition analysis identified factor F as the key factor. Later on, we applied the predictor screening analysis and identified an additional factor, D, that needed further investigation. From JMP training, we learned that predictor screening can identify predictors that may be weak alone but strong when used in combination with other predictors.

In the scale-up and manufacturing stage, when a production batch runs into an issue, raw material lot-to-lot analysis is one of the troubleshooting items used to isolate a potential raw material effect. This exercise is typically done in an Excel table, but when multiple raw materials and multiple lots of each raw material are involved, it is difficult to analyze the raw material effect from a huge Excel table. In case number 15, a polyester formulation with three raw materials and about 45 separate lots is plotted versus the date of manufacturing, with a color scale for the gel time. This heat map provides a visual analysis for the production engineer to determine whether a particular lot of raw material is the major cause of an out-of-spec batch. We turned the Excel table into a visual analysis.

Statistical comparisons with t-tests or ANOVA are performed routinely in product development. A product benchmarking exercise typically involves multiple products run under various testing protocols, aiming for a very comprehensive product comparison. Case 16 is an example of statistical analysis involving a large combination of 23 products and more than 10 testing protocols. In JMP, a large volume of statistical analysis is not a challenge, since creating sub-tables is not required, as it is in some other software. One can use the column switcher and the local data filter to create all the combinations of property and adhesive for statistical analysis, and the results of each analysis can be copied into a JMP journal to streamline the reporting. For case number 17, the needle bond testing of light-cure adhesives has historically shown high data variance. Case 17 uses JMP to summarize 18 reports of needle bond testing, which involve multiple lots of adhesive tested at various times. The needle pull strength and its COV are plotted in Graph Builder under various light irradiation conditions as well as substrate combinations. With the local data filter here, one can easily change the selection criteria to get a comprehensive comparison of these adhesives and their consistency. When these results were presented, everyone was amazed by the JMP capability; it is so versatile and so powerful.
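As a rough analog of the partition and predictor screening steps used in the pinhole troubleshooting example, one can rank factors with random forest importances; JMP's Predictor Screening is itself based on a bootstrap forest, so the spirit is similar. The data file and factor names below are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

runs = pd.read_csv("pinhole_runs.csv")          # assumed: factors A..F plus a pinhole_count response
X = runs[["A", "B", "C", "D", "E", "F"]]
y = runs["pinhole_count"]

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)   # factors that matter alone or in combination rise to the top
```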
This is the last case for the application gallery. In case number 18, we use the parallel plot feature in Graph Builder to give a visual comparison of 15 performance items across 10 adhesives. Each performance item has its own unit and scale, which provides a more quantitative visual comparison, in contrast to the qualitative comparison of the spider charts used in Excel.

So far, in the 18 application gallery examples, the data have come from literature, instrumentation, and processing, with not much emphasis on formulation. Now we will switch gears to discuss formulation creation using worksheets, and this is a JMP-based worksheet, not a traditional Excel one. Before we show the JMP worksheet, let's discuss adhesive types. Broadly speaking, adhesives can be divided into two categories, one-component or two-component adhesives, 1K or 2K. A 1K system, like the Super Glue everybody knows, requires no mixing, and it can be cured by moisture, light, heat, or other methods. If we are dealing with a one-component, heat-cure adhesive such as an epoxy, we will need to design and calculate the stoichiometry, or the index, to balance the proportion of epoxy to amine hardener. In a two-component, 2K system, the mixture reacts at ambient temperature, so the two parts are kept apart before use, and the stoichiometry needs to be designed and calculated based on the desired mixing ratio, either by weight or by volume. These are the formulation calculations we need to perform.
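To make the stoichiometry idea concrete, here is a minimal sketch of an epoxy/amine index calculation. The equivalent weights and the exact index convention are illustrative assumptions, not the formula used in the Henkel worksheet.

```python
def stoichiometric_index(epoxy_g, eew, amine_g, ahew):
    """Amine-hydrogen equivalents divided by epoxy equivalents (1.0 = balanced)."""
    epoxy_eq = epoxy_g / eew    # grams of resin / epoxy equivalent weight (EEW)
    amine_eq = amine_g / ahew   # grams of hardener / amine hydrogen equivalent weight (AHEW)
    return amine_eq / epoxy_eq

# Example: 100 g of an EEW-190 resin with 28 g of an AHEW-55 hardener
print(round(stoichiometric_index(100, 190, 28, 55), 2))   # about 0.97, close to balanced
```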
This type of calculation and design has historically been done in Excel. Everybody knows that an Excel spreadsheet allows mixed data types in the same column and that formulas can be applied at the individual cell level, which makes it very flexible as a formulation calculation worksheet. Formulas are typically organized in a column format like this: each column carries a full set of formulation information, the heading, which is the ID, the recipe ingredients, the formulation characteristics or processing parameters, and then the results. An Excel-based worksheet is very useful, and everybody uses it because it is easy to learn, but it does come with some shortcomings. First is row matching: when you have a new ingredient or new testing results, you need to match them to the right row, and that takes time. Second, you may need to hide and unhide columns for comparison. Third, it is harder to analyze the data when results are put in different tabs, a tab-to-tab format, and it is also very difficult to make a graph from that kind of data structure. JMP offers webinars on going beyond the Excel spreadsheet with the various features listed here, but worksheet calculation is not emphasized. Perhaps this is due to the inherent data structure: a JMP column cannot have mixed data types, and a column formula applies to the entire column, which is not as versatile or flexible as Excel.

Despite these constraints, we have developed a JMP worksheet with the following objectives in mind. It should have broad capability for formulation design, calculation, recording, and analysis; it is all in one, and we want to minimize cross-platform copy-pasting. It should be easy to operate, with easy data entry, and use JSL for much of the automation. And the final data set should be ready for machine learning exercises.

Let's look at our Gen1 worksheet, which is for one-component systems. It includes four data groups: a formulation ID, a recipe, the material and processing characteristics, and the testing results. The four data groups are the same as in the earlier Excel worksheet, but the structure is organized in columns from left to right, which is different from Excel, where it runs from top to bottom. The data from three of the groups, 2, 3, and 4, are shared and recorded in the same column, which has a numeric data type: all the recipe entries, all the testing results, and all the formulation characteristics are numeric, and they are documented in the same column here. The data are also stacked together: formulation 1 here, formulation 2 here. With a stacked format, one can freely enter a new ingredient or a new testing item without needing to match the row, as you would in Excel. JSL scripts were also created to enable data analysis in either a tabular or a graph format. This is the tabular view: a chemist can pick several formulation IDs and compare their recipes, characteristics, and performance in a very condensed format here. This is very different from Excel; there is no need to hide and unhide columns to bring formulations next to each other. It is much, much easier in the JMP format. Besides tabulation, one can make a graph of a property versus other properties or the sample ID, but not versus the ingredient percentage. The graph can be combined with the recipe table here into a group under the dashboard operation, which makes for a very effective visual analysis. As for testing that involves multiple replicates, we typically just record the average result, but one can enter the individual replicate data in the property column and then perform a t-test or one-way analysis using this worksheet. If people don't want to enter data this way, there is another option, virtually linking a data file with the replicate results to the worksheet; that will be shown later in the presentation. So far, what you have seen is our Gen1 worksheet, which involves no formulation calculation. Chemists in my group have been using this tool for more than one year, and they are used to its easy data entry and very powerful tabulation analysis.
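The same stacked-table idea can be sketched in pandas for readers without JSL. The layout (one row per formulation/item pair) and the column names below are assumptions.

```python
import pandas as pd

# Assumed Gen1-style stacked table: formula_id, group (recipe / characteristic / result), item, value
stacked = pd.read_csv("gen1_worksheet.csv")

# Tabulate-style comparison of two selected formulation IDs, side by side
compare = (stacked[stacked["formula_id"].isin(["F-002", "F-008"])]
           .pivot_table(index=["group", "item"], columns="formula_id", values="value"))
print(compare)
```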
Next we're going to look at the Gen2 worksheet, which carries over the Gen1 features. It has an additional feature for the formulation calculation for the 1K and 2K systems. This worksheet also links with another JMP file that has additional raw material information needed for the calculation. We have another worksheet, which we call Gen3, that is designed to deal with solvent-borne systems. It also allows the formulator to incorporate masterbatches, but due to the time constraint it will not be discussed here. This is our Gen2 worksheet. There are three sections. We have a heading and then the formulation input section right here. In the middle, we have the calculation output. The third section is the processing and material characteristics and also the testing results. Section 1 and section 3 are like the ones in Gen1, but section 2 here is newly added. The Raw Name column is used to link to the reference file that has the additional data needed for the calculation. You can see the symbol for the virtual link right here. After the chemist enters the formulation ID, they specify the Part, Raw Name, and Initial Weight columns. If they are doing a 2K system, they also need to specify the mixing ratio, either by index, by volume, or by weight ratio. Then the worksheet outputs the mixing ratio characteristics here, again by index, by volume, or by weight. It also provides a normalized composition, either by part or by total. By part means A and B each sum up to 100 by themselves; normalized by total means A and B mixed together sum to 100. After seeing this, the chemist can perform the experiment and then come back to enter the results right here. The other thing is that in the property and material characteristics section, we have another column called Lookup. This can extract information from the calculation, as well as the raw material fraction percentages, and automatically display it right here. Then chemists just need to copy the parameters into the Value Enter column, and these are automatically transferred to the two normalized-percentage columns for display purposes. We also have three JSL scripts there to facilitate the analysis. The first one shows you normalization by total. That means A and B are mixed together and sum up to 100. Here, I show you the formula, the characteristics, and the results. There is a second JSL script that normalizes by part. In this case, you can see your part A formulation and part B formulation, and then A and B have each been normalized to 100 by themselves. With the other JSL script, we can change the formulation worksheet format from the stacked to the wide format. In this case, the ID, the performance, the individual ingredients, and the characteristics all have their own individual columns. With this format, one can make a graph of a property versus the ingredient percentage, which cannot be done under the stacked format.
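A rough Python sketch of the mixing-ratio and normalization outputs described above is shown below. This is not the Gen2 JSL; the recipe rows, ingredient names, and weights are hypothetical, and the point is only the arithmetic behind "normalized by part" versus "normalized by total."

```python
import pandas as pd

# Hypothetical 2K recipe in stacked form: each ingredient belongs to Part A or B
# and has an initial weight in grams. Names and numbers are placeholders.
recipe = pd.DataFrame(
    {
        "part":   ["A", "A", "B", "B"],
        "item":   ["Epoxy resin", "Filler", "Amine hardener", "Accelerator"],
        "weight": [80.0, 20.0, 45.0, 5.0],
    }
)

# Mixing ratio by weight (A divided by B), one of the outputs the worksheet reports.
part_totals = recipe.groupby("part")["weight"].sum()
mix_ratio_by_weight = part_totals["A"] / part_totals["B"]

# Normalized by part: A and B each sum to 100 by themselves.
recipe["pct_by_part"] = recipe.groupby("part")["weight"].transform(lambda w: 100 * w / w.sum())

# Normalized by total: A and B mixed together sum to 100.
recipe["pct_by_total"] = 100 * recipe["weight"] / recipe["weight"].sum()

print(mix_ratio_by_weight)   # 100 g / 50 g = 2.0 in this made-up example
print(recipe)
```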
One can also look for correlations between performances, or between a performance and a formulation characteristic. At this moment, I'd like to show you the live demonstration. This is the formulation worksheet I just showed you in the PowerPoint. Basically, we have the heading. Then we have a formulation input section. We have the calculation section between N1 and N2; anything here is for calculation. Then we have the last section here, which is the performance and the property and material characteristics. I mentioned that we have JSL scripts that allow people to look at these results easily. Let's look at this one, the JSL by total. We can easily highlight any formulation, or compare 2 and 8, and then compare their formulations and their results. These are mixed together. We can look at it by part: part A here and then part B here. They each sum up to a hundred by themselves. Easily, we can compare... oh no, I need to remove this one first. I can compare formulations easily by manipulating the local data filter. Again with the JSL, we click Join All. We are turning the stacked format into a wider format. Each row belongs to one formulation, with the heading here, the properties, the formulation, and the formulation characteristics showing right here. For machine learning, we can highlight an ingredient column and just manually add zeros so that each formulation has a value, zero or whatever, for every ingredient, and then we can work with this. We can create a summation or something; it is easy to operate on. Next I'm going to show you how this works, assuming that we're going to create a formulation. I'm going to copy the heading. Sorry, I'm going to delete everything here because I created this one already before; I'm going to delete the demonstration one. I'm going to create it from scratch by copying the heading here. I change the name to Demonstration here. I will copy the formulation because I'm going to modify a formulation from this one, DOE 8. DOE 8 is based on a one-to-one mixing ratio by volume, but in this new one we could change it to one-to-two mixing. A divided by B is one divided by two, so it will be 0.5. Then I copy the heading, including the mixing ratio, all the way down. Now all the calculation has been done here. With the weight percentages I'm entering, it shows that the material has an index, the molar ratio of A to B, of 0.65, which is too low. Using our chemistry knowledge, we need to turn this around. In this case, for example, I make this one 2, and I can easily bring this up to 1.05. That is the range I'm looking for. Basically, assuming this is the design and the formulation that we want, the next thing we want to do is copy some of the testing items that we already had before, that we were monitoring before, but without the results, of course. We have a new result here, so I'm going to delete that one. But we also want to add an additional property, which for example is viscosity measured at room temperature.
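The Join All step in the demo reshapes the stacked worksheet into one row per formulation. A rough pandas analogue, not the JSL from the talk, is sketched below; the small stacked table and its values are hypothetical, and the zero-fill mirrors the manual "add zeros" step mentioned above so every formulation gets a value for every ingredient.

```python
import pandas as pd

# Hypothetical stacked worksheet rows (same layout idea as the earlier sketches).
stacked = pd.DataFrame(
    [
        ("DOE-8",         "Epoxy resin",    60.0),
        ("DOE-8",         "Toughener",       8.0),
        ("DOE-8",         "Adhesion",      450.0),
        ("Demonstration", "Epoxy resin",    55.0),
        ("Demonstration", "Adhesion",      430.0),
        ("Demonstration", "Viscosity",   20000.0),
    ],
    columns=["formulation_id", "item", "value"],
)

# "Join All" analogue: one row per formulation, one column per ingredient/property.
wide = (
    stacked.pivot_table(index="formulation_id", columns="item", values="value")
    .fillna(0.0)   # in practice you might zero-fill only ingredient columns, not test results
    .reset_index()
)
print(wide)
```

With the data in this wide shape, a property can be plotted or modeled directly against an ingredient percentage, which is the machine-learning-ready structure the talk refers to.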
With this section here, we then want to extend our heading to specify that those rows belong to this formulation. As soon as I specify the heading, the Lookup automatically gives me information such as the mixing characteristics: 1.5 or 0.5 are automatically copied here through the Lookup function, and then the filler loading in the formulation, normalized to total, is also extracted, summed up, and put right here. Now I can copy this information, put it in Value Enter, specify my mixer as number 2, and then start to enter my results: time, say 80; adhesion, 450; viscosity, 20,000. I'm pretty much finished with everything, so let's look at the result here. We just enter Demonstration. This one was based on DOE number five. DOE number five is one-to-one mixing and this Demo is one-to-two mixing, and we added the viscosity result right here. It's very easy: with one click you see the result, in a format that's very easy to understand for comparison. This is the end of my demonstration. Let me go back to the presentation here. We consider the JMP worksheet that I just showed you an integrated platform, and here is the summary. The worksheet in the stacked format, here, is used for formulation design, calculation, and recording of the results. The data entry of raw materials, which is needed for the worksheet, is minimized by virtually linking with another file that has the additional raw material information. JSL is widely used to automate the worksheet output into tabulation, graphics, and statistical analysis, and also to create a table in the wide data format. The wide data format already has the right data structure for modeling via machine learning and also allows graphical analysis using an ingredient as one of the axes. Then, since each row in this wide format is a unique formulation ID, it can actually be used as a reference table to join another JMP file that has testing results with replication. When these are joined together, we can plot the raw data and do statistical analysis, either as a function of the ingredients or as a function of the formulation ID. This JMP integrated worksheet platform truly illustrates that it is an all-in-one platform, very capable. In summary, JMP is not just advanced DOE software. JMP's data analytics have been effectively utilized in my group for product development at various stages to speed up the innovation process. The JMP-based formulation worksheet is an integrated platform that features broad formulation capability, all in one, easy operation, and a machine-learning-ready data structure, with more and more waiting to be further explored. With this, thanks for your attention. I'd also like to acknowledge the people I work with and learn JMP together with, and also our management for supporting the JMP adoption initiative. Thank you very much.
The development of innovative new products can be accelerated using statistically optimized DOE and regression modeling. With the goal of maximizing efficiency and reducing expense, it is often tempting to limit the collected data to key product attributes, such as customer specifications or internal quality metrics. However, increasing the number of available responses by including a wider range of more fundamental measurements in the analysis can often be critical to success.

This talk covers examples from projects at PPG's Coatings Innovation Center. We highlight the use of JMP from design through analysis to visualization of the results for a fractional factorial and a constrained mixture/process design. Using tools such as the column switcher, multivariate plots, pairwise correlations, and the mixture profiler, we demonstrate how the inclusion of ancillary responses helped to develop a deeper understanding of the systems being studied and pinpoint the causes behind unexpected results.

My name is David Fenn. I work for PPG Industries at their Coatings Innovation Center just outside of Pittsburgh. Today I'm going to be talking about some of our experiences with using DOE for novel coatings development, particularly focusing on the importance of using auxiliary responses. On the agenda, I'll talk a little bit about research methodology, and in case you're wondering what I mean by auxiliary responses, I'll define that in this section. Then we'll go on to two examples: the first one, a new resin development for architectural coatings, and the second one, a protective coating. Then we'll finish off with a few general observations. There are various frameworks that can be used to describe the new product development process. One that I particularly like is shown here: DMADV. Here we have five stages. The first stage is to define the goals of the project. What are we trying to achieve? Then we get into measurement. What are the critical characteristics we need to measure, and do we have suitable processes in place to measure them? Then we think about analyze: what factors can we change to make improvements? Then on to the design stage, where we deliberately manipulate those factors and the levels of those factors to try and effect an improvement and lead to optimum performance. Then once we have an advanced prototype, we get on to verification, thinking about whether our solution will work in the real world. What are the important things we need to think about when we apply this framework? Well, in terms of the define stage, the goal needs to be clear, so that the whole organization has the same understanding of the goals, and it needs to be impactful. If we're successful and we deliver a solution, will it fill a real unmet need in the marketplace and be a successful product? If we skip now to the end, any solution we apply needs to be cost-effective. It needs to be robust. Then, in the middle of this process, we want to get through this process as quickly and as efficiently as we can. We want to deliver the product to the marketplace as soon as we can, and we want to expend the minimum amount of cash and the minimum amount of resource to do that.
Clearly, DOEs and a lot of the tools that are available in JMP are well set up to make us succeed in this area. One of the tools that I like to use, particularly early on in a project, is a process map. This is a very particular type of process map: it's really mapping the process of carrying out the research and development. I'm showing here a simplified example of a process map to develop an automotive base coat. We have all the steps that are involved in our experiment. We make a resin, we use that resin to make a base coat paint, we spray-apply that base coat onto a substrate, we apply a top coat onto that base coat, then we cure those coatings together, and then we measure the properties that we get from all of that. For all of these steps, we list all of the factors that might play a role in these separate steps. This is useful for a number of reasons. First of all, it gives everybody in the team a unified understanding of what the process is we're dealing with and how we are going to affect it. It also allows us to capture all of the variables we can think of that might play a role in the various steps, so we don't overlook anything. Then it's a good starting point for thinking about which of these we are going to try and manipulate, which of these we are going to focus on to try and deliver a successful project. These factors are further subdivided and categorized. First, we have our Xs. These are the variables that we can manipulate to try and effect an improvement in our product or our process. Then we have our big Ys. These probably appear in the specification of the product. These are what we're really trying to achieve. This is what the customer really cares about, what the customer will pay for. Next, we have our Ns, noise variables. These could be variables that we are maybe not controlling, that we're not deliberately manipulating, but things that could introduce noise into the process, either during the experiments, during the new product development, or in the end application, in the manufacture of the product or the end use of the product. Then finally, the subject of today's talk, we have our auxiliary responses, which we label as little ys. These might not appear in the specification, and the customer might not even be aware of them, but they're measurements we can take at various stages of the process that might tell us something about what's going on. I said in the previous slide that one of our goals is to get through this whole process as quickly and efficiently as we possibly can. One question that raises is, why don't we just measure our big Ys? We have the ability to carry out DOEs. We could optimize for our big Ys, we could build predictive models. Isn't that all we need to do? Why should we spend time, why should we spend money, measuring some of these little ys when they're not the real goal of the outcome? Well, I hope in the next couple of examples that I can show you some cases where carefully selecting these little ys and doing some good analysis can be really critical to the success of a project.
Our first example here is the development of a new resin for architectural coatings. The goal was to come up with a single resin that could meet all of the performance requirements across several product lines in several countries. Our starting point was that we had no single resin that could meet all those requirements. We were using different resins in different products, different resins in different countries, and we needed to come up with a solution that allowed us to reduce the complexity there. Our early prototypes struggled in a number of areas, but one particular area was tint strength. The way these white base paints would be used is, if I go into a store and request a paint of a particular color to paint the walls of my house, the store will take that white base paint and add specified amounts of concentrated color toners to that paint to create a specific color. It's really critical to be able to hit a target tint strength, which is the measurement of how quickly that color will change as we add a certain amount of a particular toner. We need to be able to control that and hit it reproducibly to achieve the wide spectrum of colors we need to achieve. We also had a few issues in terms of poor heat-age stability and poor resin reproducibility. Our approach was to carry out some sequential DOEs to learn how to control tint strength and some of the other factors. I'm showing the progress on this plot at the bottom left-hand side of this screen. Before we started the DOEs, in some of the exploratory experiments, the orange bar represents the range of tint strengths we were able to achieve. We can see that is far below the target range of tint strengths shown by this green bar on the plot. As we carried out the DOEs, we learned how to control tint strength. We were able to increase it until, towards the end of the project when we were doing our optimization DOEs, we were nicely centered around this target tint strength. We were able to build predictive models and use those in conjunction with predictive models for some of the other key properties to identify white space where we met all of the target properties at the same time. But rather than talk about the whole project, I want to now focus on one particular DOE that we carried out. The goal of this DOE was to confirm and quantify something we'd observed previously: that the particle size of the resin we were making was a big factor in controlling tint strength. These resins are, in effect, dispersions of little particles of resin in water, and it was the size of those particles that seemed to be important. We were also using what we call a co-surfactant to help disperse those particles, and we had a few choices about where in the process we could add that co-surfactant. We wanted to look at a couple of candidates for the addition point of that co-surfactant to see if it affected the key properties. Then finally, up until this point, all the resins we'd made, we'd made at the Coatings Innovation Center.
We now wanted to check: could we make these resins reproducibly across three different locations? The DOE we carried out is shown on the right-hand side here. We have three levels for our target particle size. We have two levels for the addition point of the co-surfactant. That gives us a full factorial DOE with six runs. Then we replicated that DOE across three different laboratories. I'll go straight into JMP and I'll show you what the data table looks like. You can see here we have the original data table, the DOE, but now we have a whole collection of data that we gathered during the DOE. The first thing we'll do is look at what we learned about tint strength. I've already built here a reduced model for tint strength. If we have a look at the effect summary to start with, we can see that the location of addition of the co-surfactant wasn't a factor in determining tint strength; that dropped out of the model. But we do see that the target particle size and the reactor location were factors, as well as the interaction between target particle size and reactor location. If we look up at our actual by predicted plot, we can see it looks like a pretty nice model. We've got a nice R-square, and everything looks to be in pretty good shape. Then probably the best way of understanding what's happening and what this model is telling us is to look at the prediction profiler here at the bottom. We see our anticipated effect of target particle size on tint strength: as we increase target particle size, we get higher tint strength. Then if we look across at reactor location, what we see is that Lab A and Lab C are giving broadly similar results. But if we look at Lab B, first of all, we see that the tint strength that we get from Lab B is significantly higher than we were getting from Lab A or Lab C. We also see that the dependence on particle size is much less for Lab B than we saw for the other two labs. This was a problem for us. Whenever we see that different labs are producing different results with the same resin and the same process, it can be a really long task to work out what's going on. There are so many potential candidates for the cause of this poor reproducibility. At this stage, we were very concerned that it was going to take us a long time to resolve this, that it was going to derail the project, and that we were going to miss our target launch dates. Before we went into any specific activity to try and address this problem, the obvious first step was to look at the data that we'd already gathered in this data table and see if there were any clues that could maybe give us a hint as to why Lab B was giving different properties. Whenever I see a wide data table like we've got here, one of the first tools that I always go to is the column switcher. The way that I will implement this in this case is, the first step is to build a variability chart that best shows the problem that we're having.
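For readers working outside JMP, a rough Python analogue of fitting the reduced model described above is sketched here with statsmodels. The data file name and the column names are assumptions for illustration; the point is the model form, a continuous target particle size crossed with a categorical reactor location.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical export of the replicated DOE data table; column names are assumptions.
df = pd.read_csv("tint_strength_doe.csv")

# Reduced model: particle size, reactor location, and their interaction
# (the co-surfactant addition point dropped out of the model in the talk).
model = smf.ols(
    "tint_strength ~ target_particle_size * C(reactor_location)", data=df
).fit()

print(model.summary())   # coefficients, R-square, and so on
# The interaction terms are what capture Lab B's weaker dependence on particle size.
```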
I've pre-built a variability chart here where I've got target particle size and reactor location as my X-axis, and I've got the initial tint strength as my Y-axis. The first task is to get this into a format that best represents the problem we're dealing with. The first thing I'll do is swap over my target particle size and reactor location. I'll also add and connect the cell means to add some lines here. Now I'm pretty happy with this. I think this nicely reflects the problem that we're dealing with. We can see Lab A and Lab C give very similar results, but Lab B gives higher tint strength and less dependence of tint strength on particle size. Now I can use my column switcher, and what this will allow me to do is keep this plot in exactly this format but quickly switch out this Y-axis, the initial tint strength, for any other variable that I've got in my data table. I'll go into the redo platform and select the column switcher. Now I can select any of the other factors in my data table. I'm just going to select everything that I've got in my data table. Then when I hit OK, I now have this column switcher to the left of my plot. I can click on any of these factors, and it will change this axis but keep the plot in exactly the same format. If I select particle size, I can see now I'm plotting my actual measured particle size against target particle size and reactor location, in exactly the same format. It looks like in this case all three labs are giving pretty similar results. I'm not seeing anything that gives me a clue as to what's going on, but I can quickly just scroll through this whole data set. I'm seeing mostly noise right now. I'm not seeing any patterns that seem to be particularly helpful, but I'll keep going. When I get to this plot here, where I'm now plotting conductivity, I see, interestingly, that Lab B is making resins with much higher conductivity than Lab A and Lab C. That's one useful observation. I'll keep going. This next one is actually another measurement of conductivity after the resin has been in a hot room for a week, showing the same thing, still confirming that Lab B is giving higher conductivity. I'll keep going. Mostly noise, maybe a little bit of an indication that the molecular weight from Lab B is slightly lower. I'll keep going. Again, still not seeing anything that interesting, mostly noise. But then I get to this plot here, where we're now plotting the pH of the resins one hour into the process, so early in the process, the acidity or pH of the resin. Lab B, again, is different from Lab A and Lab C: it's giving me a much higher pH. I keep going just to check if there's anything else. This was the initial plot we started with, of initial tint strength, and then the last one is the paint viscosity, where everything looks pretty similar. Really quickly, using the column switcher, I found out that not only is Lab B making resins with higher tint strength, it's making resins with higher conductivity and a higher pH. What could that be telling us? What might be causing higher pH and higher conductivity?
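The column-switcher workflow above, keep the X structure fixed and cycle the Y variable through every candidate response, can be imitated in Python with a simple loop over the auxiliary columns. This is a hedged sketch, not JMP's implementation; the file and column names are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical export of the same DOE table; column names are assumptions.
df = pd.read_csv("tint_strength_doe.csv")

# Candidate responses to cycle through, column-switcher style: the X grouping
# (target particle size within reactor location) stays fixed, only Y changes.
responses = ["initial_tint_strength", "particle_size", "conductivity", "pH_1hr", "paint_viscosity"]

for col in responses:
    ax = sns.pointplot(data=df, x="target_particle_size", y=col, hue="reactor_location")
    ax.set_title(f"{col} by target particle size and reactor location")
    plt.show()   # same plot format each time; only the Y-axis is swapped out
```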
Well, these resins, as I said, are a dispersion of a polymer particle in water. Anything that's changing the conductivity or the pH is going to be in the water phase; it's not going to be in the resin phase. What we did was precipitate out the resin by centrifuge and just analyze the water phase. We carried out a lot of analysis, but one of the things we worked on I'm showing on this plot on the right-hand side: the PPMs, parts per million, of phosphorus and sulfur in that water phase. If I look at the orange bars to start with, the sulfur, I can see all of the resins from all three labs are very similar. But for the blue bars, the level of phosphorus, Lab B is making resins with about four times as much phosphorus as they were making at Lab A and Lab C. When we looked at the recipe for making these resins, there's only one raw material that brings in phosphorus. On a bit of further investigation, what we found out was that the supplier delivering this material to Lab B was mistakenly supplying something that was four times as concentrated as it should have been, and four times as concentrated as they were providing to Lab A and Lab C. Using the auxiliary data that we'd gathered in this DOE and the column switcher, we were able to really quickly pinpoint the cause of that problem. We didn't have to expend time to get there, the project stayed on track, and there was even a bonus: we learned that increasing the level of this material with the phosphorus was another tool we had to increase the tint strength. We would probably never have been aware of that if we hadn't carried out this analysis and had this happy accident. That's the first example. We go on to the second example now. In this case, we're dealing with a protective coating, a coating that's designed to go over metal substrates like iron and steel and protect against corrosion. We have five experimental resins that we want to look at: a resin that's designed to give good corrosion resistance, and then four resins that are designed to improve the flexibility of the coating. The first three of these resins are added in the first stage of the coating preparation, and then the last two get added in a separate, later step. We have two questions here we're trying to answer: how do the resins affect corrosion and flexibility, and what is the best combination of the levels of these resins to give us the best combination of corrosion and flexibility? Again, we used DOE, we were able to build predictive models, and here we were using the mixture profiler to identify some white space that we can work in. This DOE is a little bit more complicated than the first one, so I'm trying to represent pictorially what we were dealing with here. If we look at the first stage of our coating manufacture, in addition to our three experimental resins, we have a main resin at a fixed level. In effect, our three experimental resins are three mixture variables here because they form the rest of this 100%. They add up to a constant sum of 57.77%. So we have three mixture variables to deal with there.
Then in stage two, we can deal with our other two resins as independent variables because they're not part of that mixture. So we have three mixture variables and two independent variables. We also have some levels that the formulators decided they wanted to work in, based on prior experience, for all of these resins. Then we have some constraints on the combinations we're dealing with. For example, at the start here, we want the sum of Flex2 and Flex3 to be more than 10% but less than 30%. There are some other constraints as well. It's a fairly complicated DOE, but using custom design, it's relatively straightforward to build. There are definitely some tips and tricks in terms of how to build the DOE, what model to use, and how to analyze that data. I don't have time to go through that today, but I'd be perfectly happy to talk about that offline if anybody's interested. But let's go straight into JMP and we'll look at this example. Here we have the DOE that we carried out. It was a 16-run DOE. If we go right across to the left-hand side, we have our three mixture variables and our two process variables. We've measured our flexibility and corrosion, and then we have a lot of other auxiliary responses we've measured. I was able to build good predictive models for flexibility and corrosion. What I'm going to do is just show you those models in the profiler, to help us understand what we're learning and what's going on. I'll add those two predictive models that I built to my profiler, and then I get my profiler here. I can see, first of all, I'm plotting flexibility and corrosion here. Lower numbers are better for both of these responses: lower numbers for flexibility, lower numbers for corrosion are what we're targeting. I can see that as I add my corrosion resin, if I increase the level, I get better corrosion performance, but unfortunately, I get worse flexibility. The opposite is true for most of these flexibilising resins. As I add more of these, I'm getting better flexibility but worse corrosion. This is something that's very common in coatings development and lots of other areas. It seems like there's always a pair of properties where, if we improve one of them, we always make the other one worse. But if I come across to my Flexibiliser 4 resin, there's something really interesting here: as I add more of this resin, I get better flexibility, but I don't suffer at all in terms of corrosion. This is going to be a really useful tool for us to optimize the combination of flexibility and corrosion. But I'd like to understand a bit more about the science behind this. What's happening? What's unusual about Flex4 that allows us to improve our flexibility without degrading corrosion? Again, I want to use all of this auxiliary data that I've gathered in my data table to help me understand that. What I did is, I wanted to look through this table, and I'm going to use a different tool this time: multivariate. If I select that, this allows me to basically look at the correlation between all the combinations of factors that are in my data table.
I'll select everything that I measured, add it in the Y columns, and just hit OK. This generates my multivariate report. The first thing I see is this table here, where I've got all the correlations for all the pairs of combinations of the factors in my table. I can see there are some pretty nice correlations here. I'm seeing some fairly strong correlations, but it's a little bit difficult, a bit overwhelming, to go through all this and pick out any interesting patterns. I've also got my scatterplot matrix here, and if I add a fit line to these scatter plots, again I'm seeing some fairly strong correlations, but still I think this is a bit overwhelming to dive straight into. The tool that I like to use to start with here is pairwise correlations. If I select that, this generates a new table where I've got all the possible pairs of variables, and it gives me the correlation. I can sort this table based on any column. I'm going to sort by the significance probability, and I'll make it ascending because I want my low significance probabilities to be at the top of my table. Then if I hit OK, I can see that the first and strongest correlation I get, in fact, involves this Flexibiliser resin 4 that was giving us this interesting behavior. I can see a strong correlation with the secondary Tg, or Tg2. This is a glass transition temperature. The glass transition temperature is the temperature at which a coating changes from being a glassy, hard material to a soft, rubbery, flexible material. My Flex4 level is correlating with a secondary glass transition temperature that I'm measuring here. And I can see also, if I go a little bit further down, that my primary glass transition temperature, the main glass transition, correlates strongly with the corrosion. Scientifically, I think these are interesting observations. What I did based on that is I also built predictive models for my primary Tg and for my secondary Tg. Now I can look at my profiler, but I can include all four of my predictive models. I'll include the two I did before, flexibility and corrosion, but also my primary Tg and secondary Tg. Now what I can see is that the first two rows are exactly what we were looking at before. If I look at my primary Tg, I can see what happens whatever I do in terms of adding a resin. For example, if I add more of my corrosion resin, I'm increasing my primary Tg, and that's correlating with an improvement in corrosion. For the flexibilising resins, if I add more of those, I'm decreasing my primary Tg and making my corrosion worse. That primary Tg does seem to correlate very well with corrosion, as the multivariate showed. If I look at my Flex4 resin, it was having no effect on corrosion and it's having no effect on my primary Tg, so it's different from my other flexibilising resins. But I can see for my secondary Tg that, as I add more of my Flex4, it's rapidly decreasing the secondary Tg. The other resins really don't have much effect on secondary Tg. What does that mean? What can I learn from that?
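The pairwise-correlations step described above, all variable pairs with their correlations, sorted so the smallest significance probabilities come first, can be reproduced outside JMP with a short Python sketch. The exported file and its column contents are assumptions; the calculation itself is just Pearson correlations with their p-values.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Hypothetical export of the 16-run DOE table with all measured responses.
df = pd.read_csv("protective_coating_doe.csv").select_dtypes("number")

# All pairwise correlations with their p-values, sorted so the most
# significant pairs appear at the top (the "significance probability" sort).
rows = []
for a, b in combinations(df.columns, 2):
    pair = df[[a, b]].dropna()
    r, p = stats.pearsonr(pair[a], pair[b])
    rows.append({"var_1": a, "var_2": b, "correlation": r, "signif_prob": p})

pairwise = pd.DataFrame(rows).sort_values("signif_prob", ascending=True)
print(pairwise.head(10))   # strongest relationships first, e.g. Flex4 level vs. secondary Tg
```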
Well, any material that has multiple Tgs, glass transition temperatures, that's usually a sign that it's a multi-phase material; it's not a homogeneous material. That was the case here when we did some microscopy. What we saw was that our coating had a continuous phase, shown by this gray material here, but it had a secondary phase dispersed in it. The primary glass transition temperature was correlating with that primary continuous phase, and the secondary, lower glass transition temperature was correlating with this secondary phase that we have here. We had a hard, glassy primary phase and then a soft, rubbery secondary phase. Why that's important is that usually a high glass transition temperature does lead to better corrosion performance, because it inhibits the diffusion of anything through this layer and stops material getting to the substrate, the metal substrate, and causing corrosion. Usually, if I want to make flexibility better, I have to make this continuous layer softer, and that degrades corrosion. But with this type of morphology, I was able to keep my hard continuous phase and gain flexibility through a separate, dispersed rubbery phase. This meant that anything that wanted to diffuse through the coating and cause corrosion always had to diffuse through this high-Tg area. It's given me the combination of good corrosion and good flexibility together. The auxiliary data that I gathered, and the analysis of it, was really responsible for the learning of what was going on in this system. In conclusion, it's definitely possible to carry out successful DOEs where we only measure the critical responses, the big Ys. But I hope I've shown that including carefully selected auxiliary responses, the little ys, can often be really valuable, can bring clarity to unexpected results, and can help us to build scientific knowledge. I hope I've also shown that JMP provides some tools that really help us with this. I've shown a couple, but there are many more available. I'd finally like to finish off by thanking the many associates at PPG's Coatings Innovation Center who contributed to this work.
More than any other statistical software, JMP and JMP Pro make tremendous use of model selection criteria, such as the AICc and BIC. These tools can be used by practitioners in all industries and at all skill levels, from users choosing a distribution for a capability analysis to advanced users choosing input factors in a linear mixed model or a functional DOE analysis. Model selection criteria are incredibly flexible and powerful, yet make it easy to decide between very different sets of predictor variables, response distributions, and even correlation structures, all at the same time.

Unfortunately, the full story of how and when to use these criteria is not part of most standard data science courses in universities and professional training. One reason for this omission is that, unlike JMP, many software packages implement model selection criteria in an incomplete or arguably incorrect way, making it impossible to compare models with different input variables.

In this presentation, we give clear guidance on how and when to use model selection criteria. We describe their motivation and the assumptions they require. We compare model selection criteria to other better-known approaches to selecting models, such as hypothesis tests and holdout-based cross-validation procedures. We also give a brief story of how JMP Statistical R&D developers came to appreciate how useful these tools are, as we sought a general solution to the distribution dredging problem.

The most famous quote in all of statistics is George Box's "All models are wrong, but some are useful." I've heard this quote at almost every conference I've ever been to, and because of this, to my recollection, I've actually avoided using this quote in any talk before. But when I looked up the first time it was ever set in print, it was in a 1976 Journal of the American Statistical Association article. It's found in a section called Parsimony. Immediately after that first instance of the quote, he talks about the importance of finding the simplest model that describes the observed phenomena. This amounts to finding models that offer a reasonable balance of goodness-of-fit versus model complexity, and it is exactly what I'm going to be talking about today in this presentation. JMP and JMP Pro offer a lot of different modeling capabilities, each with a lot of output related to choosing a model. Today I'm going to go into some detail on some of the most important of these, highlighting their motivation and the assumptions behind them. A lot of the discussion will be about the AICc and BIC model selection criteria, which are direct and very data-efficient tools for addressing the problem Box had in mind with his quote, which is how to find a useful model from a set of flawed or wrong ones. As I was putting this presentation together, I went through the derivations of the AIC and the BIC. I wanted to get a clear understanding of what these similar-looking methods really are and what assumptions they make. Afterwards, out of curiosity, I did an Internet search of AIC versus BIC versus cross-validation. It was interesting to see, in all these Internet forums, that there is still so much debate, even though these methods have been around for 50 years.
Having recently reviewed the derivations of the methods, it looks like there are still a lot of misconceptions out there. I think the reason for this is that both model selection criteria have very deep and technical derivations despite the simplicity of their formulas: both of them are equal to minus two times the log-likelihood of the fitted model, plus a simple penalty based on the number of model parameters. You can't guess the reasons for the penalty terms from the formula alone, which makes them seem mystical and arbitrary. One of my goals today is to try to demystify these without going overboard on the math. To put this all in the context of an analysis workflow, we can think of an analysis project as having four major steps. We first have to acquire the data and get it organized and cleaned up. Then we fit several models to it, in a way that is either manual or automated by software like JMP or JMP Pro. Once we've done that, we need to choose one of them as the model that we're going to work with moving forward. This is a critical step in the process and is the one we'll be focusing on today. It's important that we get the model selection right, because the quality of the results and the conclusions we make at the end requires that we have a reasonably good model. Here are the main ways that I've seen people make decisions about models. Hypothesis testing is probably the first one people learn about. These are most commonly used to determine if a regression coefficient is statistically significantly different from zero, which sounds like a model selection problem. While they are often used in that way, hypothesis tests are derived under a specific set of assumptions that explicitly does not account for having changed the model or having used a model that was chosen as the best amongst several alternatives. Then we have the general empirical procedures that assess models based on data held out from the model fitting process. These techniques can be applied to both classical statistical models and machine learning models. In my opinion, holdout validation in particular is the way to go if you have a whole lot of data. Then we have what I'll call the small-data analytical procedures. These were derived for situations when you have to make a decision about which model to use, but you don't have enough data to hold out any observations. The most commonly used of these are the AIC and the BIC, but there are other well-known techniques like Generalized Cross-Validation and Mallows' Cp. It turns out that these two are actually asymptotically equivalent to the AIC, so in large samples you should get the same conclusions from GCV, Mallows' Cp, and the AIC, at least for least-squares-based models. Then we also have what I'll call model-specific approaches, like VIP in partial least squares models and the cubic clustering criterion in clustering models. These are pretty niche and I won't be talking about them any more here today. Then we also have visual tools like actual by predicted plots and ROC curves.
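For reference, the criteria mentioned above have the standard textbook forms, with the maximized likelihood written as L-hat, k the number of estimated parameters, and n the sample size:

$$
\mathrm{AIC} = -2\ln\hat{L} + 2k,
\qquad
\mathrm{BIC} = -2\ln\hat{L} + k\ln n .
$$

Both are "minus two times the log-likelihood plus a penalty"; they differ only in how hard the penalty grows with model size and sample size.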
Regardless of how you choose your model, these plots are good to take a look at before moving forward with a model, because they provide more interesting information than any individual statistic will, and they can tell us whether the best model that we've considered so far is still a good enough model for us to use. My own first encounter with model selection criteria in my professional life was back in the mid-2000s, around when JMP 5 and JMP 6 were out. JMP had added the ability to provide capability analyses for non-normal distributions. Capability analysis is a very important tool for assessing whether a manufacturing process is "capable" of delivering products that are within specification. JMP users wanted to determine the "best distribution" for the data so their process capability metrics would best reflect the reality of their situation. JMP customers understood that you could fit different distributions with JMP and knew that many of the distributions came with a goodness-of-fit test. In a case of having a hammer causing you to find nails everywhere, they were trying all the distributions they could find and were choosing the one with the largest p-value as the distribution for their capability analysis. They wanted us to codify this into a new fit-all-distributions feature that would automate this process for them. But we were rather uncomfortable with this request for a number of reasons. For one thing, the different distributions fit in JMP came with different kinds of goodness-of-fit tests. The normal had a Shapiro-Wilk test, the Weibull had a Cramér–von Mises test, and the lognormal had a Kolmogorov test. It's very strange to compare tests that are rather different from one another. Another problem with this approach is that distributions with more parameters are going to tend to have an edge over those with fewer. If we choose the distribution based on the largest p-value, it will always favor distributions with more parameters, as we see here with the two-parameter normal compared with the four-parameter Johnson Su distribution. Then, for some of the distributions, like the Weibull's Cramér–von Mises W test, we only had tabled p-values going up to something like p = 0.25. But even if we consolidated all the goodness-of-fit tests down to just one and got accurate p-values for all of them, there's still a larger philosophical issue at stake, and that's that hypothesis tests like these can only quantify evidence against the null hypothesis. If the null hypothesis is true, then the p-value is a uniformly distributed random variable. In other words, if the null hypothesis is true, then the probability that the p-value is between 0.1 and 0.2 is exactly the same as the probability that it is between 0.8 and 0.9. Seeing a p-value of 0.9 isn't more evidence that the hypothesis is true than a p-value of 0.3. Returning to our example: all four of these distributions have goodness-of-fit p-values larger than 0.05.
Through this lens, all four distributions fit the data reasonably well. But even though the goodness-of-fit tests say all the distributions are good, the conclusions about the process generating the data are different depending on the distribution. If you use a Ppk reference value of 1.33 to determine whether the process is capable, then choosing the Weibull indicates that the process is not sufficiently capable to meet the specifications, whereas the other distributions indicate that the process is capable. We recognized that there had to be a better way to determine the distribution automatically and came to the conclusion that this should be seen as a very basic kind of model selection problem. In our search for a sound method for choosing a distribution, we stumbled upon this very good book on model selection by Burnham and Anderson. They give careful derivations of the AIC from the perspectives of information theory and cross-validation. They also give a derivation of the BIC and show how the AIC can be derived in the same way with a different assumption about the prior distribution. Burnham and Anderson also carefully show that hypothesis testing is rather incoherent as a model selection strategy. The book had a pretty big impact on my own views of modeling and also on JMP's statistical modeling platforms. Returning to the distribution selection problem for the moment: when we went ahead and added a distribution selector, we ended up calling it Fit All, and we based it on the AICc. Here on the left, we have two distributions fit to the capability analysis data we were looking at before, the normal and the Johnson Su. The Johnson Su's goodness-of-fit p-value is larger than the normal's because it has two more parameters than the normal distribution. Now on the right, we see the results of a Fit All using the AICc. The normal comes out as the best-fitting distribution, but the Johnson Su is near the bottom. This is because the AICc is penalizing it for having those two extra parameters. This feature has now been used many, many times, and I believe people are generally pretty happy with it. Now I'm going to go through a somewhat mathy but hopefully accessible explanation of what the AICc really is. All right. Now I'm going to go into some basic theory behind the AIC. I'll be as brief as possible and use the best analogies I can, but I think it is important to be exposed to the underlying concepts so you can see that the AIC has a rigorous foundation that makes some sense. The AIC-type selection criteria are based on a distance-type metric between probability distributions called the Kullback-Leibler, or KL, divergence. It quantifies the amount of information lost by using probability distribution two when probability distribution one is the correct one. The KL divergence has the property of always being greater than or equal to zero, and it is only equal to zero when the two probability distributions are the same. This is to say that using the wrong distribution always leads to a theoretically quantifiable, strictly positive information loss.
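Written out, the KL divergence described above is the standard textbook quantity, with f the true distribution and g the model:

$$
D_{\mathrm{KL}}(f \,\|\, g)
= \int f(x)\,\ln\frac{f(x)}{g(x)}\,dx
= \mathbb{E}_f\!\left[\ln f(X)\right] - \mathbb{E}_f\!\left[\ln g(X)\right]
\;\ge\; 0 .
$$

The first expectation depends only on the truth f; it is the constant term discussed next.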
This is pretty heady, abstract stuff, so I'm going to translate it into the language of statistical modeling. When we are using data in statistics to learn about how something works, we are explicitly or implicitly fitting probability models to the data to approximate the true model that generated it. If we knew the true probability-generating mechanism, we could use the KL divergence to quantify how far, or how wrong, the model is from the truth. We could then try several models and find the one that is the closest to the truth. Akaike recognized this, plugged the true and the model probability formulas into the KL divergence formula, and used a little algebra to see that the KL divergence had two terms. The first term only contains the true probability-generating mechanism for the data, which we can never know, since we can only work with models. However, this is a constant that is the same for all models that you fit to the data, as long as we play by a couple of simple rules. The second term is what Akaike discovered is empirically estimable, and with a lot of math he found a simple formula to estimate this second term. In particular, he discovered that two times the KL divergence is estimated by a constant that is the same for all models, plus two times the negative log-likelihood of the data used to fit the model, plus two times the number of parameters. Everything has been multiplied by a factor of two just to follow the same convention as a likelihood ratio test. Since the constant term is the same for all models, as long as we don't change the response data, we can fit several models, and the one whose AIC is the smallest is the one that is estimated to have the smallest KL divergence from the truth and, in a sense, is the one that is the least wrong. Using the AIC for model selection is entirely analogous to there being a collection of islands and you wanting to know which of the islands you know of is closest to another island that you know you'll never be able to get to. The direct solution to this problem would be to calculate the distances from each of the islands to the one that we want to get close to. Now, what if the island we wanted to get close to was surrounded by a circular high fence that we could approach? The island is perfectly in the middle of the fence, so the distance from the center of the island to the fence is always the same. But the fence is far enough away from the island it encloses that we can't see the island or measure the distance from the fence to the interior island. We can still estimate the distance from each island to the fence. Because the main island is in the center of the fence, we know that the island closest to the fence is the closest island. This is exactly the situation with the AIC. With the AIC, we can estimate the distance from the truth to each of the models. Each AIC estimate is off by the same amount. While we can't estimate the absolute distance of the models from the truth, we can know which model is the closest in a relative sense.
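In symbols, the decomposition described above can be sketched as follows (standard notation, not a full derivation: f is the truth, g is the fitted model with maximized likelihood L-hat and k parameters, and the expectation is over data from f). The constant depends only on f and so drops out of any comparison between models fit to the same response data:

$$
2\,\mathbb{E}\!\left[D_{\mathrm{KL}}\!\bigl(f \,\|\, g(\cdot \mid \hat\theta)\bigr)\right]
\;\approx\;
\underbrace{2\,\mathbb{E}_f[\ln f(X)]}_{\text{constant for all models}}
\;+\;
\underbrace{\bigl(-2\ln\hat{L} + 2k\bigr)}_{\mathrm{AIC}} .
$$

This is why only differences in AIC between models are meaningful, never the absolute value, exactly as in the island-and-fence analogy.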
The original AIC is based on the likelihood of the training data plus a parameter penalty. The training likelihood assesses the goodness of fit of the model. We can't use this term by itself, though, because it is biased downward, as the model parameters were chosen to minimize the negative log-likelihood. With a lot of math, Akaike derived a very simple expression that corrects for this bias. The original penalty is just 2K, where K is the total number of estimated parameters. For linear regression with a slope and an intercept, we also have to count the variance. For that case you would have K equals three and not two. There are important assumptions that led to the 2K penalty. We can characterize them loosely as saying that the model has to be reasonably good. The AIC is still going to be robust, however, because if a model is bad, then the likelihood component will be large and will dominate the penalty. Amongst the good models, the 2K term will favor the smaller models as long as the sample size is large. However, it didn't take long for people to find that this original AIC often chooses models that overfit in small samples, so a more accurate, higher-order approximation to the bias was derived. When this extra term is added, the criterion becomes known as the AICc, or the corrected AIC. Unfortunately, the reputation that the AIC overfits had become commonplace before the correction was discovered and widely known about. The correction becomes infinite as K approaches N, pushing the model selection criterion away from models that are nearly saturated. Notice also that the correction term goes to zero as N goes to infinity. In large samples the AIC and AICc are equivalent. The AICc is what we report in JMP because it works well for small samples, and although it was derived for Gaussian distributions, experience suggests that it's good enough with other commonly used distributions. Now I'm going to illustrate the AICc in a real example that was a five-factor central composite design with 31 runs, where the response was the amount of pDNA produced by a bioreactor. I'll illustrate the AICc using the Generalized Regression platform, giving it a full response surface model with all main effects, interactions, and second-order terms. I fit four models to the data. One is a full response surface model using least squares that was fit automatically. Then I used forward selection under the normal, logNormal, and exponential distributions. I chose the exponential distribution to illustrate poor model fit. The models had 22, 9, 9, and 1 parameters, respectively, and the model with the lowest AICc was the logNormal with an AICc of about 334.8. We can break the AIC and AICc calculations down to see how different parts of the penalty are contributing. The full least squares model has the lowest negative log-likelihood, but the highest AICc overall. When we look at the second-order corrections and the original AIC values, we see that it's the second-order correction term that is pushing the model selection criterion to be very large for this model.
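To see how large that correction gets, here is a small sketch of the AICc correction term 2K(K+1)/(N-K-1) for parameter counts like the four models above, at the design's 31 runs and at a much larger assumed sample size. The numbers are illustrative and are not taken from the JMP report in the talk.

```python
# Behavior of the AICc correction 2K(K+1)/(N-K-1): it blows up as K
# approaches N and fades away as N grows.
def aic(neg2_loglik, k):
    return neg2_loglik + 2 * k

def aicc(neg2_loglik, k, n):
    return aic(neg2_loglik, k) + 2 * k * (k + 1) / (n - k - 1)

n = 31                    # runs in the five-factor central composite design
for k in (1, 9, 22):      # parameter counts like the fitted models
    correction = 2 * k * (k + 1) / (n - k - 1)
    print(f"K={k:2d}  N={n}   correction={correction:7.2f}")

# The same 9-parameter model in a sample of 1,000: the correction is ~0.18,
# so AIC and AICc are essentially the same there.
print(aicc(300.0, 9, 1000) - aic(300.0, 9))
```

With 22 parameters and only 31 runs, the correction alone is over 125, which is what pushes the full least squares model to the bottom of the AICc ranking.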
The logNormal forward selection negative log-likelihood is a little lower than the normal forward selection one. They both have nine parameters, so their penalties are the same, and the logNormal forward selection model has the lower AICc. The exponential forward selection model has the poorest model fit as measured by the log-likelihood, but it also only has one parameter in the model. Overall, it has the smallest penalty contribution to the AICc. But the poor fit of the model is such that the likelihood dominates, and the exponential model is the second from the worst as measured by the AICc. If you review the general derivation of the AIC in the Burnham and Anderson book, you'll see that what it's actually estimating is the expected value of a hypothetical test set likelihood for a data set that has the same size and response structure, but not the same values, as the training set. The expected value also takes into consideration the variability in the estimate of the MLE. I find this cross-validation interpretation of the AIC to be pretty compelling. I think it's also important to point out that this cross-validation derivation of the AIC does not assume at all that we have the correct model. To show that this cross-validation interpretation really works, I created a simulation formula using an average of the models I've shown in the previous slides as well as some other ones. This way we knew that none of the models were actually the correct one. I fit each of the four models to new training data a thousand times and set it up so that JMP would report an independent holdout likelihood using another new data set. I kept each of the four models' structures and distributions intact and did not apply variable selection. This was to perfectly mimic the exact cross-validation interpretation of the AIC. From there, I created a table of simulated holdout likelihoods and computed their average for each of the four models. This is the AIC and AICc summary table from before, with the simulation-based average holdout log-likelihoods added over here to the right. You can see that the full normal model holdout likelihood is very close to its AICc value and that the second-order correction term was essential for this match to happen. You can also see that the simulated average exponential holdout log-likelihood is very close to its AICc. Both the normal and logNormal holdout likelihoods are close to the original logNormal model's AICc. The normal holdout likelihood is a little smaller. I attribute this to averaging a bunch of simulation models, making the simulated data a little bit more normally distributed than the original data was. There are a couple of simple rules that are needed to make AICc comparisons really valid between different models. The most important is that the stochastic part of the data has to stay the same: the same rows have to be used, and it is the Y's in particular that must be the same.
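The same kind of check can be sketched on a much smaller scale than the talk's pDNA simulation. The example below is a stripped-down stand-in with a made-up straight-line normal model and made-up coefficients, not the talk's models: fit by maximum likelihood on training data, score the log-likelihood on an independent holdout set of the same size, repeat, and compare the average to the average AICc.

```python
# Cross-validation interpretation of the AICc, illustrated with a simple
# normal straight-line model (assumed truth, not from the talk).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, n_sim = 31, 1000

def simulate(n):
    x = rng.uniform(-1, 1, n)
    y = 2.0 + 1.5 * x + rng.normal(0, 0.8, n)   # assumed data-generating model
    return x, y

aicc_vals, holdout_neg2ll = [], []
for _ in range(n_sim):
    x_tr, y_tr = simulate(n)
    x_te, y_te = simulate(n)

    # MLE for slope, intercept, and sigma: K = 3 parameters in total
    b1, b0 = np.polyfit(x_tr, y_tr, 1)
    resid = y_tr - (b0 + b1 * x_tr)
    sigma = np.sqrt(np.mean(resid ** 2))        # MLE divides by n, not n - 2
    k = 3

    ll_tr = stats.norm.logpdf(y_tr, b0 + b1 * x_tr, sigma).sum()
    aicc_vals.append(-2 * ll_tr + 2 * k + 2 * k * (k + 1) / (n - k - 1))

    ll_te = stats.norm.logpdf(y_te, b0 + b1 * x_te, sigma).sum()
    holdout_neg2ll.append(-2 * ll_te)

print("mean AICc              :", np.mean(aicc_vals))
print("mean -2 * holdout loglik:", np.mean(holdout_neg2ll))
```

Under these assumptions the two averages should come out close to each other, which is the point of the cross-validation interpretation.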
The X's can be different, of course, even if they were originally random. Not only must the Y's be the same, but they can't be changed or transformed; the transform would have to be built into the model appropriately. The AIC is also only defined for well-behaved maximum likelihood estimators and other closely related methods. This explains why you don't see the AICc for neural networks and other machine learning models. Also, you have to keep in mind that just because you found a model that the AICc says is the best, it doesn't mean that it is a good model. Use your past experience and model diagnostic plots to ensure that the model is right enough to be useful. Returning to the pDNA data, we see two equivalent models. On the top, we have a logNormal model, and on the bottom we have a normal fit to the log-transformed response. You can see that the generalized R-squares are the same for these two models, but the AICcs are very different. This is because the logNormal fit implicitly builds the transform into the likelihood, but the log-scale normal fit does not. In this case, the right thing to use is the logNormal. Here's a quick demonstration that you have to decide the distribution and the input variables at the same time. Here is simulated data from a t-test type model: two groups of normally distributed data with the same variance but different means. If you run Fit All in the Distribution platform, it chooses the Normal 2 Mixture with an AICc of 1036. This is the correct distribution if you don't know the group identity of the rows. Once you include the grouping variable, though, you see that the normal comes out on top with an AICc of 717 or so. We also tried the Weibull, logNormal, and gamma, and the normal still came out on top, even though those distributions did better in Distribution without including the grouping variable. You'd have to try different model structures and distributions together to find the right combination. Now I'm going to change gears and talk a little bit about the BIC, which is the other main analytical model selection criterion in JMP. The BIC is motivated in a completely different way than the AIC. Schwarz used a large-sample argument in a Bayesian context to approximate the log probability of the data after having integrated the model parameters out. Assuming a flat prior on the parameters, an expression similar to the AIC pops out, with a K log N type penalty term rather than two times K. There were also other terms in the integral that are always ignored. One is K log 2 pi, which was considered too small to deal with, and the other is a normalized variance of the MLE, which would also be of order K. I didn't study the AIC or BIC in any depth in school. I just remember hearing the refrain "AIC overfits, BIC underfits" several times in different classes, which I interpreted as a strong skepticism about both of them. Comparing the AICc and BIC penalties, we see that the AICc will prevent big models from being chosen when the sample size is small, whereas the BIC will still allow large models.
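A quick way to see that small-sample difference is to tabulate the two penalty terms side by side. The sketch below assumes the usual formulas, an AICc penalty of 2K + 2K(K+1)/(N-K-1) and a BIC penalty of K log N, with parameter counts and sample sizes chosen for illustration.

```python
# Compare AICc and BIC penalty terms across parameter counts and sample sizes.
import numpy as np

def aicc_penalty(k, n):
    return 2 * k + 2 * k * (k + 1) / (n - k - 1)

def bic_penalty(k, n):
    return k * np.log(n)

for n in (31, 1000):
    print(f"N = {n}")
    for k in (3, 9, 13, 22):
        print(f"  K={k:2d}  AICc penalty={aicc_penalty(k, n):8.2f}"
              f"  BIC penalty={bic_penalty(k, n):8.2f}")
```

At 31 runs, the AICc penalty for a 22-parameter model is more than twice the BIC penalty, so the AICc pushes harder against nearly saturated models; at N of 1,000 the ordering flips and the BIC becomes the stricter of the two.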
I see the K log N and normalization-constant penalty in the BIC as somewhat less compelling than the cross-validation interpretation of the AIC-type penalties. Something that leads to a marginal probability of the data is more abstract to me than something that is directly interpretable as a cross-validation metric taking into account parameter uncertainty. I'm fully aware that I'm editorializing here, but this is what's worked well for me so far. Returning to the pDNA DoE one more time, here are the same models fit in the pDNA example using the BIC for forward selection on top and the AICc on the bottom. Notice that the BIC of the full normal model is not as far away from the other models as with the AICc. The best model overall as rated by the BIC is a logNormal, but with 13 parameters this time around rather than nine. The forward-selected BIC normal model also has a couple more parameters. Contrary to the refrain "AIC overfits, BIC underfits," the AICc can choose smaller models than the BIC in small samples. Here we see the effects chosen by the BIC and the AICc. The set of BIC-selected effects is a superset of the ones chosen by the AICc. Also notice, interestingly, that all four effects not chosen by the AICc are statistically significant under the BIC. Under the BIC, the pH squared term is highly significant, but it isn't present in the AICc model, for example. I would say that all the significant effects should have asterisks by them, but all significant p-values already have asterisks by them in JMP reports. Instead, I'll just say that I take p-values of effects after selection with a grain of salt. Although the two models choose different effects, some of them highly statistically significant, if we look at the profiler or variable importance from these two models, they tell a very similar story. Feed rate is by far the most important, and after that the ordering is the same between the two models. pH only impacts 3% of the variation in the response surface under the BIC best model and isn't included at all in the AICc best model. This is a very clear example of statistical significance and practical relevance being two different things. There are a lot of opinions out there about the AICc and the BIC. For example, Burnham and Anderson say that both methods are consistent for the quasi-true model as N goes to infinity, but then there are others that say that the BIC is the only one consistent for the truth. Burnham and Anderson say that you can set up simulations to make one look good, then change the way it's set up a little bit and it'll flip the results. Burnham and Anderson, who are about the most diehard AICc fans out there, found in their simulations that the AICc chooses fewer really bad models than the BIC. I think it's not a bad idea to look at both the BIC and AICc after applying variable selection.
If the best models under both are pretty much the same, which is often the case, you can feel pretty good about either of them. If they're different, it's good to think about the reasons why and use your subject-matter expertise to help make a decision. My last topic is model selection criteria for linear mixed models. This is a pretty complicated situation, especially because there isn't consensus between software vendors on how to compute the model selection criteria. To illustrate this, I created a split-plot design with four factors. There are two whole-plot effects and two split-plot effects. If you take the same data and fit the same model in JMP Pro and SAS using Fit Mixed and proc mixed, you will see that the likelihoods and model selection criteria don't match, but the variance estimates do. You get different fixed-effects parameter estimates, but the fixed-effects tests agree. One of the reasons for this is that the JMP and SAS fixed-effects design matrices use different coding strategies for categorical effects. On the left I have the JMP design matrix for the split-plot example, and on the right you see the SAS one. JMP creates a row of minus ones for the last level of categorical effects, which is seen in blue here, whereas SAS creates a row of zeros. Neither one of these is right or wrong. It's like changing units or changing coordinate systems. JMP categorical effects sum to zero, whereas SAS categorical effects can be interpreted as differences from the last level. Although the raw parameter estimates differ, predictions will be the same between the two codings because the models are fundamentally equivalent. Most things that matter won't be different between the two software products. However, REML, the method used to estimate mixed-effects models, has an ambiguity in it. The base Gaussian likelihood at the top will be the same in either software because it's a real likelihood. But the REML or residual likelihood reported by proc mixed and JMP Pro's Fit Mixed isn't a real likelihood. If it were a real likelihood, then we would get the same values regardless of which coding or software we used. This is because there's an extra penalty added to the Gaussian likelihood for REML that reduces the bias of the variance estimates, but this penalty depends on the design matrix in a way that is sensitive to the coding used. JMP reports the raw Gaussian likelihood, and the AICc and BIC that it reports are based on that rather than the residual likelihood. The number of parameters Fit Mixed counts is the total, including both fixed effects and variance parameters. We did it this way so that you can use JMP to compare models with different fixed-effect structures as well as different variance models. SAS only reports the residual or REML log-likelihood and reports model selection criteria based on it. You can see here that it also only counts variance parameters, because the difference between the SAS likelihood and its AIC is four, implying two parameters: a variance component and a residual variance.
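The coding point is easy to reproduce outside either product. Here is a small sketch with a hypothetical three-level factor and made-up responses, fixed effects only, showing that sum-to-zero coding in the JMP style and last-level-zero coding in the SAS style give different parameter estimates but identical predictions.

```python
# Sum-to-zero (JMP-style) vs. last-level-zero (SAS-style) coding for a
# three-level categorical factor: different estimates, same fitted values.
import numpy as np

levels = np.array([0, 0, 1, 1, 2, 2])            # hypothetical factor levels
y = np.array([5.1, 4.9, 6.2, 6.0, 7.8, 8.1])     # hypothetical responses

# JMP-style effect coding: the last level gets -1 in every indicator column.
X_jmp = np.column_stack([
    np.ones(6),
    np.where(levels == 0, 1.0, np.where(levels == 2, -1.0, 0.0)),
    np.where(levels == 1, 1.0, np.where(levels == 2, -1.0, 0.0)),
])

# SAS-style reference coding: the last level gets 0 in every indicator column.
X_sas = np.column_stack([
    np.ones(6),
    (levels == 0).astype(float),
    (levels == 1).astype(float),
])

b_jmp, *_ = np.linalg.lstsq(X_jmp, y, rcond=None)
b_sas, *_ = np.linalg.lstsq(X_sas, y, rcond=None)

print("JMP-coded estimates:", b_jmp)             # level effects sum to zero
print("SAS-coded estimates:", b_sas)             # differences from the last level
print("same predictions?  ", np.allclose(X_jmp @ b_jmp, X_sas @ b_sas))
```

The fitted values agree exactly, which is the sense in which the two codings are just different coordinate systems; the trouble only starts when a quantity like the REML penalty is computed directly from the design matrix.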
All this means is that you can only use proc mixed for comparing variance models with the AIC, because its model selection criteria include the REML penalty and it only counts variance parameters. With all due respect, I can think of some good reasons for the SAS approach, and there are probably some other good reasons I don't even know of. But I personally prefer the flexibility afforded by the JMP approach. To summarize, if you compare results across software for non-mixed models, the mean parameter estimates may differ, but otherwise everything else should be the same, as long as the software computes the constants in the likelihood correctly, as JMP does. When we get to Gaussian mixed models, there are very important software differences, and the scope of the decisions you can make about the models using the software may be very different depending on the details of how its likelihood is calculated. JMP model selection criteria are comparable both within the same platform and across other modeling platforms. I'll close with this slide, which gives my basic recommendations for applying the tools discussed today. Hypothesis testing is a tool for when you need to prove something and is best used in situations when you have a good idea of the model structure in advance. When you're working on a problem in industry and the sample size is small, I would stick to classical statistical models and use the AICc as the primary tool for choosing between them. With larger data sets, when I have enough data to hold out at least a third of the observations, I use holdout cross-validation to compare classical statistical models as well as machine learning models. In my own work, I tend to avoid K-fold cross-validation and its variants. The model selection criteria are equivalent to it in larger samples, and I tend to stick with simpler models with smaller data sets. I know that not everyone is going to agree with me on this, but this is what works for me and is a pretty safe way to approach model selection. Choosing the most useful model from a set of alternatives that must all be wrong on some level is an important decision, and these are the main considerations I have when deciding upon a model selection strategy. Thank you for your attention, and I look forward to talking with you in the Meet the Expert sessions.