Quality is a top concern for most manufacturers. Within an established sampling mechanism, it is vital to be able to tell how likely it is that a set of good samples ("heroes") indicates that the entire batch or crate is good. In this presentation, we provide a distribution analysis strategy to help answer this question through methods such as modeling, simulation, probability analysis, and data visualization. We also demonstrate how to perform this analysis and develop an end-to-end application using JMP scripting and the user interface. The strategy is evaluated on an induced data set of real-world product samples. It provides a valuable strategy and tool for evaluating current product quality and for decision making so that the process can be improved.

Hello everyone. Today my topic is Heroes or Zeros: A Product Distribution Analysis Using JMP.

First, a little background. An organization with an established process may decide to implement process control and process discipline. For example, when a product moves from the development stage to the mass production stage, one problem that can arise at this juncture is a process variation issue: the variation may be too large, or it may not meet expectations. The variation here can be variation of the mean, variation of the standard deviation, and so on. We want to find the root cause of such a variation problem and fix it. But before that, we need to figure out what type of variation we are facing, because the type of variation dictates what kind of action and investigation strategy we should take.

Today's demonstration investigates this through an exploratory analysis, using the standard deviation as the statistic of interest. The issue is that we have a process with a high overall standard deviation, but we can also observe some batches with a lower standard deviation. We call these the hero batches. We want to find out what caused the high overall standard deviation, but first we need to figure out what kind of process variation we are facing, that is, what kind of process variation could produce what we observed.

In general, there are two types of situation. One is that we have a completely random process and the variation is systemic. Although the process is random, depending on how we batch it and how we sample it, some batches may have a lower sample standard deviation than others. The other situation is that our process is not random: it goes up and down and has mean shifts. Depending on how we batch it, some batches that fall in a stable period will have a relatively low sample standard deviation, while batches that fall in an unstable period may have a larger standard deviation. We can also define a threshold on the standard deviation, such as point A here. Comparing this threshold to the batch standard deviations tells us how many batches satisfy the criterion.
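As an illustration of these two scenarios (not the presenter's JSL application), here is a minimal Python sketch with made-up numbers: a purely random process and a process with mean shifts can both show a high overall standard deviation, yet differ sharply in how many batches pass a standard-deviation threshold.

```python
# Illustrative sketch: "hero" batches under a random process vs. a mean-shifting process.
# All values (threshold, means, sigmas) are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n_batches, batch_size = 40, 25
threshold = 0.8  # "point A": maximum acceptable batch standard deviation (assumed)

# Scenario 1: completely random process with large systemic variation
random_process = rng.normal(loc=10.0, scale=1.0, size=(n_batches, batch_size))

# Scenario 2: non-random process whose mean drifts up and down over time.
# Batches inside a stable period have low SD; batches straddling a shift do not.
t = np.arange(n_batches * batch_size)
mean_profile = 10.0 + 4.0 * np.sin(2 * np.pi * t / 200)
shifted_process = rng.normal(loc=mean_profile, scale=0.5).reshape(n_batches, batch_size)

for name, data in [("random", random_process), ("mean-shift", shifted_process)]:
    batch_sd = data.std(axis=1, ddof=1)           # sample SD of each batch
    passing = np.mean(batch_sd <= threshold)      # fraction of "hero" batches
    print(f"{name:10s} overall SD = {data.std(ddof=1):.2f}, "
          f"batches passing threshold = {passing:.0%}")
```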
With these two scenarios in mind, we can formulate a statistical hypothesis test to determine which kind of process variation we are dealing with. We assume our process is random and ask: how likely is it that we would observe what we actually observed? A more detailed statement is this: assuming that batches with low standard deviation are just due to sampling luck, and that the historical data is representative of the population, then simulated batches generated from the same distribution should have a passing rate that is statistically indistinguishable from the actual passing rate of the historical data.

On the right-hand side you can see a wheel showing the procedure we went through to make this test happen. First, we define a threshold. With this threshold we can calculate the passing rate: we compare the batches in the historical data to the threshold to get the percentage of historical batches that are good. Because we also assume the process is random, we fit the historical data to several distributions and pick the best-fitting one. Using this fitted distribution, we generate a set of K samples, where K is the number of samples in each batch of the historical data. We repeat this procedure N times, where N is the number of batches in the historical data. For each simulated batch we calculate its sample standard deviation. Comparing these sample standard deviations to the threshold gives us a set of binomial (pass/fail) data. With this binomial data and the passing rate we already have, we can perform a one-sample proportion test to test our hypothesis.

Using JMP, we were able to integrate this entire procedure into an application. Here is a quick demonstration of how it works. The application can import any data file with a value column and an index column that indicates the batch index. With the click of a button, it automatically fits the data to several distributions and picks the best one; right now, the best-fitting one is a normal distribution. We can then set the number of simulated data sets we want, the size of each set, and the threshold. When we click, it performs the hypothesis test I mentioned before. It also shows the percentage of historical batches that are good and the percentage of simulated batches that are good. Finally, it shows a histogram visualizing the proportion of simulated batches that are good.

Now we go back to the hypothesis test. The data here shows that we reject the null hypothesis: checking the p-value, we reject the null hypothesis at 95% confidence, which is the default setting. This conclusion suggests the process is not random and that the good batches do exist in the stable period of the process. It can lead to several action items; for example, we can investigate the process variables and process parameters between the stable period and the unstable period and see what changed.
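For readers who want to see the procedure end to end, here is a hedged Python analogue of the workflow described above (the presenters implemented it as a JSL application with a user interface). The data, the threshold, and the choice of a normal distribution as the "best fitted" one are all assumptions for illustration.

```python
# Sketch of the procedure: threshold -> historical pass rate -> fit distribution ->
# simulate N batches of K samples -> one-sample proportion test. Data are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
threshold = 0.8  # hypothetical SD threshold

# Hypothetical "historical" data: 40 batches of 25 samples; half of the batches
# come from a period with a shifted mean, i.e., the process is NOT random.
stable = rng.normal(10.0, 0.5, size=(20, 25))
unstable = rng.normal(12.0, 1.5, size=(20, 25))
historical = np.vstack([stable, unstable])
N, K = historical.shape

hist_pass_rate = np.mean(historical.std(axis=1, ddof=1) <= threshold)

# Assume the process is random: fit one distribution (normal, for this sketch) to all data
mu, sigma = stats.norm.fit(historical.ravel())

# Simulate N batches of K samples from the fitted distribution
simulated = rng.normal(mu, sigma, size=(N, K))
sim_pass = simulated.std(axis=1, ddof=1) <= threshold   # binomial pass/fail data

# One-sample proportion test of the simulated pass count against the historical rate
result = stats.binomtest(int(sim_pass.sum()), n=N, p=hist_pass_rate)
print(f"historical pass rate = {hist_pass_rate:.2f}, "
      f"simulated pass rate = {sim_pass.mean():.2f}, p-value = {result.pvalue:.3g}")
# A small p-value (< 0.05) rejects the "purely random process" hypothesis.
```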
Of course, we can also get a different test result where we cannot reject the null hypothesis. This suggests our process might be random and that we might have systemic variation, which leads to a completely different investigation and action plan. In the worst-case scenario, in order to reduce the systemic variation, we might need to completely change the manufacturing environment.

With this, I conclude today's presentation. I also want to thank John Daffin, a colleague of mine, who brought this interesting question to my attention during a project meeting. Thank you for listening to my presentation; I very much appreciate it.
Disease often shows a predilection for specific site locations. To demonstrate the body regions involved, researchers may use infographics or tables. While tables may be more descriptive and universal to any statistical software, heat maps rapidly communicate disease involvement across a cohort. At our institution, software to portray data within the confines of a human body is not readily available. Additionally, we needed the human body to be segmented into specific regions. While a graphic editor can be used, it is labor intensive, limits generalizability, and cannot be repeated with different data sets.

To create a scalable solution, we used the Custom Map Creator in JMP, an add-in to Graph Builder that allows users to create custom shapes. With this program, we outlined a human body, created anatomic regions, and colored each region based on the percentage of occurrence. The selected color range is scaled to correlate directly with the data. Each anatomical subunit is then filled and shaded by the data in the respective column. Heat maps provide easy interpretation and clearly portray the most involved areas for rapid communication to readers.

Today I want to give a short presentation about using the JMP Custom Map Creator and how we used it to develop a human heat map. I'd like to give a shout-out to my mentor, Dr. Benjamin Caffender, who helped push this project along.

A little bit about the Custom Map Creator: it allows you to create custom shapes that can be displayed in Graph Builder. This lets you take statistical data, percentages in particular, and present them in a color-coded, color-referenced way. JMP comes with built-in maps of the United States and other places in the world, so you can show, for example, what percentage of some outcome is occurring in each state. That gives you the heat-map approach, but the add-in also allows you to create other objects as well.

For us, this was a dermatology project on cutaneous disease, so we wanted to show lesion involvement, that is, which parts of the body were affected by the disease. We had the face at 90%, the scalp at just under 20%, and so on, and we wanted to show this dynamically. We ran into the question of the best way to go about it, did some digging, and found the Custom Map Creator. What we did was take an outline of an image; you can see all these dots, where we traced the image and created our different body areas, and JMP sets those into specific shapes where you can input data that will appear in those areas. You can use whatever picture you want; this is an example of one that we used. Then the data gets incorporated into those areas. You can see on the left of our image that the face has significant involvement, shown with the darker red, which corresponds to the 90% involvement, versus lighter areas such as the trunk or upper extremities. The Custom Map Creator lets you create those areas and then enter those percentages very easily.
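JMP's Custom Map Creator does all of this inside Graph Builder; purely to illustrate the underlying idea (a vertex table defining each named region plus a value table, with regions filled on a shared color scale), here is a rough Python/matplotlib sketch with invented coordinates and percentages.

```python
# Conceptual analogue only: region outlines + per-region values -> filled heat map.
# Coordinates and percentages are made up for illustration.
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from matplotlib import cm, colors

# "Shape" table: vertex coordinates for each named region (hypothetical outlines)
shapes = {
    "face":  [(4, 9), (6, 9), (6, 11), (4, 11)],
    "trunk": [(3, 4), (7, 4), (7, 9), (3, 9)],
    "legs":  [(3.5, 0), (6.5, 0), (6.5, 4), (3.5, 4)],
}
# "Name"/value table: percentage involvement per region (hypothetical data)
involvement = {"face": 90, "trunk": 35, "legs": 15}

norm = colors.Normalize(vmin=0, vmax=100)
cmap = cm.YlOrRd                     # light-to-dark scale, like the yellow-to-red gradient

fig, ax = plt.subplots()
for name, verts in shapes.items():
    ax.add_patch(Polygon(verts, facecolor=cmap(norm(involvement[name])),
                         edgecolor="black"))
ax.set_xlim(0, 10); ax.set_ylim(0, 12); ax.set_aspect("equal")
fig.colorbar(cm.ScalarMappable(norm=norm, cmap=cmap), ax=ax, label="% involvement")
plt.show()
```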
Rather than just talking about it, I'd like to show you a little demo of how it works. I already have JMP open. Once you install this add-in, it appears under Add-Ins, then Map Shapes, then Custom Map Creator. When we open the Custom Map Creator, we get these data tables that pop up as well as the map creator itself.

What you do is paste an image in. I have a file over here that I will just drag and drop in; you can also copy and paste, however you like. You can adjust and zoom to get it exactly how you want, and give it a name. Now you start drawing and tracing; I'll just quickly click through here. That's one shape. We hit Next Shape. You can change the names of your shapes to whatever you like; when we were using it, we had different body areas. If you look on the right at the data tables, in this example X-Y table we see the specific data points, the dots being graphed, and over here the shape IDs are being recorded. You can see up here I'm working on shape five. You hit Next Shape, and as you do, you get the next shapes. These two data tables are linked.

Let's say I've finished my image and have everything traced. You can hit Finish, which just saves it, or you can finish with random data, which I find is an easy way to show it and to confirm the data shows up in the right place. Hit Open. Now we have our graph. We can see the data it gave us over here: shape one, I'll drop that down to 1, and shape five I'll set to 100. You can see that shape five, at 100, is our red and has the most involvement, and down here shape one, at the lower end, is the darker blue. So you're able to get that stratification. You can either use your own data table or just type your own data into these shapes, and it will adjust in real time.

What's really nice about this add-in is that you can go right here and change the gradients. Right now we have red to blue; you can change it to yellow to red, so the higher values get the darker colors and the lower involvement gets the lighter colors, really easily.

There are two big things that are important to know as you do this. If you are using your own data table, the shape-name column needs the Map Role column property. We see here that this map-name data table is connected to our example -Name.jmp file, which is this file over here. It is essential that those are linked. If you finish with random data, this is done for you; if you have your own data table, you have to set it manually. Also notice that if you manually create your own data table, shapes 1 through 5 here must match shapes 1 through 5 there exactly: the names have to be exact and the roles have to be exact.
If you do that, the data will populate really easily. You can also get these data tables through Graph Builder as well. That's a quick overview of how to use this add-in. It's very versatile, and as I said, we were able to use it for our project. I hope this gave you a brief bit of guidance. I've also uploaded some instructions on how to use it, which you can find as well. I hope you have a great day and enjoy the rest of the presentations.
X-ray photoelectron spectroscopy (XPS) analyzes the surface chemistry of materials. It is also known as electron spectroscopy for chemical analysis (ESCA) and is commonly used to measure the elemental composition/stoichiometry of thin-film coatings in different industries. We have applied this technique to develop Atomic Layer Deposition (ALD) AlOx coating processes (< 50 nm thick) by testing the deposited film composition to identify the O/Al ratio.

Our pharmaceutical program customers questioned whether XPS is an appropriate metrology technique to detect process variation in the coating composition. This presentation demonstrates the adequacy of XPS by using Gauge R&R in JMP 17. We designed our testing experiments using the measurement systems analysis (MSA) design platform, and a fast-replicate crossed model was used with six sample coupons on two different kinds of substrates (silicon coupons and active pharmaceutical ingredient [API] pellets). Each sample was split into four parts, and all 24 (6 x 4) samples were measured independently and blindly by two different vendors.

The data distribution was reviewed using a variety of methods: X-bar and R control charts; repeatability, reproducibility, and part-to-part variation testing; and Gauge R&R ratios (P/TV, P/PV, P/T) calculated in the MSA Gauge R&R platform. The Evaluating the Measurement Process (EMP) platform was also used to determine the intraclass correlation (ICC) and to identify whether any interaction exists with either substrate type or vendor. Both MSA platforms confirmed that part variation is significantly higher than the precision level; hence, XPS is adequate to detect variation in the process.

Hello. Good morning, everyone; good evening, everyone. I'm Sukti Chatterjee. Before starting my presentation, I would like to introduce myself in a few words. I'm Sukti Chatterjee from the CTO team of Applied Materials. It is an advanced technology team, and our goal is to develop products adjacent to the semiconductor industry or outside it. For example, we are working for the aerospace industry, the pharmaceutical industry, and industrial coating. This example is taken from the pharmaceutical industry.

My topic today is Gauge R&R of X-ray photoelectron spectroscopy to monitor a coating process. The agenda of my talk: first, the background and problem statement; then the operational definition and the data collection plan; next, the MSA component analysis; and finally, the plan for MSA component improvement.

Let's start with the background. In the pharmaceutical industry, in some therapeutic areas, for example antibiotic drugs, alcohol addiction, or cancer treatment, patients need an injection every day because the drug level in the blood spikes and then, within a few hours, moves outside the therapeutic window. That is why they need a daily injection, which is painful and causes side effects; as a result, patients skip or stop their medications. To solve this problem, our customer needs an approach to tailor the release of the drug.
Our team developed a barrier layer, an aluminum oxide barrier layer, that forms a shell around the pharmaceutical particles. The properties of this barrier layer can control the release of the drug into the blood; it is even possible for the drug to release over a few weeks instead of a few hours. Here we will talk about the composition analysis and the noise analysis of this composition measurement.

Our problem is the measurement of the AlOx coating composition. The customer spec limit for the O/Al ratio in the aluminum oxide film is 1.2-2.3. Our objective is to determine whether the XPS method is adequate to differentiate AlOx process variation; we will determine the gauge R&R measurement error of XPS for AlOx composition analysis.

X-ray photoelectron spectroscopy can quantitatively measure the atomic percentages of the composition; it can measure the aluminum and oxygen percentages. XPS measures the kinetic energy of photoelectrons emitted from the elements and counts the electrons. By counting the electrons, it detects the presence of each element and also the elements it is bonded to. That is why we can get information about aluminum and oxygen in the aluminum oxide film. Most of the sources of error for XPS can contribute to the gauge R&R: reproducibility error comes from the calibration, electron counting can add both repeatability and reproducibility error, and the analysis can add reproducibility error. We will talk about this in more detail on the next slide.

In our operational definition, we describe the different steps of the XPS measurement and how each can introduce error into the gauge R&R. Our objective is to measure the aluminum oxide coating composition. To measure it with XPS, first we need to do a baseline correction, which is automatic, and then we need to do the calibration. For calibration, Applied Materials normally has calibration samples, especially for developed technologies like aluminum oxide. In our case, however, we are coating pharma particles, and our process window is totally different from the Applied Materials core-technology process window for aluminum oxide, because the coating needs to be compatible with the pharma particles. While coating the particles, we also coat a silicon wafer and an API pellet, because XPS cannot measure particles; it needs a planar substrate. Since we don't have a calibration sample, we use the second option for calibration, carbon peak calibration. In the picture on the left-hand side you can see the carbon peak calibration; it has to be done manually, which is why it impacts reproducibility. After calibration, we run an XPS survey or high-resolution scan to get the spectra, in which you can see the oxygen peak and the aluminum peak. Since we do a manual calibration and an automatic baseline correction, these steps can introduce error into both repeatability and reproducibility.
Next we need to do the analysis, which is peak fitting. We fit the peaks, and from the peak areas we calculate the oxygen and aluminum percentages. Since this step is semi-automatic, it can add error to reproducibility. From the XPS measurement we calculate the O/Al ratio, and our customer spec limit is 1.2-2.3.

Next, the MSA cause-and-effect diagram. In the cause-and-effect diagram we did a detailed analysis and found several factors that can affect the gauge R&R. We highlighted the major ones, electron counts, calibration, and analysis, which we discussed on the earlier slides. Now we add another one, sample loading, and how it adds error to the gauge R&R. Sample loading is manual and we are not using full wafers but coupons, so we need to place each coupon; if the location or the angle is slightly different, it can affect the measurement, and this impacts reproducibility. The other major contributors were already discussed on the previous slides. The remaining one is the sample itself, which depends on the process; for this presentation that is out of scope. We will cover these four factors here.

Next, our sample collection plan. We used six samples for the MSA analysis, each with four replicates, which you can see here. We measured those samples at two sites. Since we have four replicates, we measured them sequentially. Could sample degradation be a risk if the samples degrade between replicates? We will come back to this risk in more detail later. Our expected outcomes: we need to find out whether the XPS method is adequate to differentiate process variation, we would like to see that the two sites give similar results, and we would like to see that the sample (part) does not interact with the site.

Now we set up the MSA design. In the MSA design we use substrate and site as crossed factors. This is the site, this is the part, and we have two different substrates, mentioned before: API pellets and silicon wafers. The S numbers are silicon wafers and the A numbers are API pellets. We were not able to use the completely randomized option, so we used the fast-replicate option, in which the replicates of each sample are measured consecutively. That could be affected by a sample degradation problem, which is why later on we compare the first and fourth replicates to check this sampling risk.

For this MSA analysis, we sequentially used several JMP platforms. We examined the data distribution of the MSA samples using descriptive and inferential statistics (Distribution and Fit Y by X). Then we checked the data variability using control charts and one-way ANOVA. Then we analyzed the gauge R&R components using the Variability Chart platform. As an aside, the sketch below shows how the peak-fitting step turns into an O/Al number.
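This is a minimal sketch of the quantification step described above: converting fitted peak areas into atomic percentages and an O/Al ratio using relative sensitivity factors (RSFs). The peak areas and RSF values below are placeholders (RSFs are instrument- and library-specific), not values from this study.

```python
# XPS quantification sketch: atomic fraction of element i = (I_i / S_i) / sum_j (I_j / S_j)
peak_area = {"O 1s": 82000.0, "Al 2p": 10000.0, "C 1s": 9000.0}   # fitted peak areas (hypothetical)
rsf       = {"O 1s": 2.93,    "Al 2p": 0.537,   "C 1s": 1.0}      # relative sensitivity factors (assumed)

corrected = {el: area / rsf[el] for el, area in peak_area.items()}  # sensitivity-corrected intensities
total = sum(corrected.values())
atomic_pct = {el: 100 * v / total for el, v in corrected.items()}

o_al_ratio = corrected["O 1s"] / corrected["Al 2p"]
print({el: round(p, 1) for el, p in atomic_pct.items()})
print(f"O/Al ratio = {o_al_ratio:.2f}   (customer spec: 1.2-2.3)")
```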
Returning to the analysis workflow: we also want to see how process capability relates to the gauge R&R, which we can see in the intraclass correlation (ICC) versus P/T plot. Next we did a root-cause analysis to plan for improving the GRR: we want to identify the major GRR error, using box plots, density ellipses, matched pairs, and line fits from the various JMP platforms.

Let's start with the data distribution. We initially developed the process at the two spec limits, the upper spec limit and the lower spec limit. At the upper spec limit we have two parts with four replicates each, all measured at both sites using, as mentioned, fast replicates; similarly, at the lower spec limit we have four parts, four replicates, and two sites. Since we developed the process at the two ends of the spec range, our distribution is bimodal, completely bimodal. The problem with a bimodal distribution is that it can affect the GRR components: the P/TV ratio, the P/PV ratio, and the misclassification rates. The P/T ratio does not depend on the parts, which is why it is not affected by the sample distribution, and that is why we use the P/T ratio as our figure of merit in the following slides. For misclassification, there are five probabilities; the last three can be strongly affected by the sample distribution, and the first two less so. To minimize this risk, again, we focus on the P/T ratio as the figure of merit. Next time, our plan is to do the MSA analysis with a uniform sample distribution.

Now let's check the variability of the data. Here we used an I-MR (individual and moving range) chart, and we saw that many data points are outside the control limits in the upper chart, and in the lower moving range chart three data points are outside the control limits. These three points are a sudden shift: the process shifts and comes back rather than staying there, which indicates a type II shift and a mixture of common-cause and special-cause variation in the control chart. That is why the control limits here are meaningless; we need to subgroup by the special cause, and only then can we trust the control limits.

Now we want to find the special causes. First we check whether part variation could be a special cause. We did this using one-way ANOVA, and in the one-way ANOVA we can see the variation among the samples. We ran the process near the upper spec limit and near the lower spec limit, so the samples are different, and that is what the one-way ANOVA found: the within variation is very small compared to the part variation, and the analysis of variance table shows the same.
Here our hypothesis is that all parts are the same, but the hypothesis is rejected because the p-value is less than 0.05, telling us the parts are significantly different. That means part variation is a special cause, so we can use it as a candidate for subgrouping. Similarly, we checked whether site variation is a special cause, considering the two-site measurements near the upper spec limit as well as near the lower spec limit, with the hypothesis that the two sites' measurements are similar. For the upper spec limit the p-value is higher than 0.05, so there is no evidence to reject the hypothesis; for the lower spec limit, on the other hand, it is marginally rejected because the p-value is just below 0.05. So site variation is either marginally rejected or not rejected at all, which is why site variation is not a good subgrouping candidate and part variation is the better candidate.

Next, we rebuilt the control chart with the phase option, using the sampled part as the phase. When we do this, the moving range chart captures the change in measurement variation, and the control limits for the bottom and upper charts are calculated accordingly. Now all the measurement points are inside the control limits. The variation within each sample is the repeatability. When we compare site A and site B, site B also shows repeatability, but there is some difference in repeatability between site A and site B; that difference is the reproducibility.

Now we calculate all the gauge R&R components and find the dominant error. First we fit main effects only, without the part-by-site interaction. Here we see the repeatability and reproducibility: repeatability is 22% and reproducibility is 15%. As I mentioned, we use the P/T ratio as the gauge R&R figure of merit because our sample distribution is bimodal, and the P/T ratio is 26%. It passes, being less than 30%, though only marginally, and the major error is the 22% repeatability. One more thing I should mention: we use the P/T ratio, but the P/TV and P/PV ratios are very close to it in our case, because our sample distribution is bimodal and sits at the two ends of the spec range; that is why these figures of merit are so close.

I should also mention the type I error alpha and the type II error beta. For the type I error, all our data points are within the control limits, so the probability that a good part is falsely rejected is very small, less than 6%. The type II error, on the other hand, fails the criterion: it is more than 10%. Why is the type II error higher? Our repeatability is the major issue.
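To make the figures of merit discussed above concrete, here is a small sketch that computes P/T, P/TV, P/PV, ICC, and a Cp estimate from variance components. The variance components, spec limits in this form, and the Cp definition used here are illustrative assumptions, not the study's actual numbers or JMP's exact internals.

```python
# Gauge R&R figures of merit from variance components (all inputs hypothetical).
import numpy as np

usl, lsl = 2.3, 1.2                      # customer spec limits on O/Al
var_repeat = 0.0016                      # within (repeatability) variance
var_repro  = 0.0008                      # site-to-site (reproducibility) variance
var_part   = 0.0350                      # part-to-part variance

var_gauge = var_repeat + var_repro       # measurement-system variance
var_total = var_gauge + var_part

p_t  = 6 * np.sqrt(var_gauge) / (usl - lsl)    # precision-to-tolerance ratio
p_tv = np.sqrt(var_gauge / var_total)          # %GRR: precision / total variation
p_pv = np.sqrt(var_gauge / var_part)           # precision / part variation
icc  = var_part / var_total                    # intraclass correlation
cp   = (usl - lsl) / (6 * np.sqrt(var_part))   # one common Cp definition (assumed here)

print(f"P/T = {p_t:.0%}, P/TV = {p_tv:.0%}, P/PV = {p_pv:.0%}, "
      f"ICC = {icc:.2f}, Cp = {cp:.2f}")
# P/T below ~30% is the usual "marginally acceptable" cut-off referenced in the talk.
```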
Whenever we measure the samples, they are within the spec limits, but it is possible that when a customer measures them, the results could be beyond the spec limits, because the repeatability is high. At this point, since we are developing the product and are in the initial feasibility-check phase, the customer is comfortable with this beta (type II) error, but we have the option: if we can improve the repeatability, this will improve as well. On the other hand, when we include the part-by-site interaction, we see the interaction is 6%, not large, but present. When we did not include the interaction in the main-effects model, that interaction was absorbed into the repeatability; that is why, with the crossed model, the repeatability decreases slightly. Since the interaction is small, our figures of merit change only a little.

From here we know that our dominant error is repeatability. Before discussing repeatability further, I would like to mention process capability and the effect of gauge R&R on it. Here we plot process capability on an ICC versus P/T plot. ICC is the ratio of part variation to total variation, and P/T is six sigma of the gauge divided by USL minus LSL. From this we calculated Cp, and under our current process conditions Cp is 0.93. It is less than one and in the red zone; for good process capability we need Cp between 1.33 and 2, the yellow zone. To improve Cp, what do we need to do? In this direction is the process contribution and in this direction is the measurement contribution. The process (part) variability is very high. For our measurement, P/T is 24%; if we want to improve P/T from 24% to, say, 15% or 10%, we would have to improve by 30-50%, and within that, repeatability is the main issue. That is why we need to improve the repeatability.

Now the question: to improve repeatability, do we need to change our measurement tool? That depends on the ROI and is a question for our management; alternatively, we can address the repeatability in a different way. That is why we want to find the root cause of the high repeatability. Here we use a variability chart with analysis of variance, plotting the variability of all the samples together with the site A and site B measurements. You can see that, for sample A0 for example, this is the measurement repeatability, and it changes; this repeatability changes across all the parts. The repeatability also changes from site to site: here the repeatability is a 0.06 standard deviation, but in this case it is 0.03.
So the repeatability changes from part to part and also from site to site; when it changes from site to site, it is called reproducibility. If we look at the analysis of variance, the site-to-site variation is much smaller than the within variation: the within variation is the repeatability, and the site-to-site variation, the reproducibility, is much smaller. So from here as well we find that repeatability is the bigger problem.

Next, to find the root cause, we plotted all the repeatabilities side by side for both cases, the upper spec limit and the lower spec limit, and in all cases the repeatability is different. Then we looked for a relationship between the site A and site B measurements. Ideally, the site A measurement should equal the site B measurement. In our case we did a linear fit and found both an intercept and a slope: the slope is not one and the intercept is not zero. The slope is less than 0.4 and the intercept is higher than 0.9, and the fitted points are widely scattered, so the R-squared is also poor. We also did a density ellipse, which shows the correlation is less than 0.5. A very good correlation would be around 0.9, and 0.6 would be moderately correlated; in our case it is neither. So we know that the site A measurement does not match the site B measurement: the repeatability problem is affecting the reproducibility.

Now we check more closely how they differ by comparing the site A and site B measurements with a matched-pairs test. Our hypothesis is that site A equals site B, i.e., site A minus site B equals zero. The p-value for this hypothesis is less than 0.05 in both cases, at the upper spec limit and at the lower spec limit, which means the site A and site B measurements are different; and the mean difference with its confidence interval lies above the zero line, which means the site A measurement is always higher than the site B measurement.

From here a question arises: since our MSA design used fast replicates, it could be that the samples degraded, i.e., the O/Al composition changed. So we also ran a matched-pairs test of the first versus the fourth measurement, at both site A and site B, for all six samples, with the hypothesis that the first measurement minus the fourth measurement equals zero. The p-value is higher than 0.05 in both cases, which means sample degradation is not an issue: there is no evidence that the first and fourth measurements differ.
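Here is a minimal sketch of the two matched-pairs comparisons described above (site A vs. site B on the same parts, and first vs. fourth replicate to check degradation), using simulated placeholder data rather than the study's measurements.

```python
# Matched-pairs comparisons with paired t-tests; all data below are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_ratio = rng.normal(1.7, 0.2, size=24)             # hypothetical per-part O/Al values
site_a = true_ratio + rng.normal(0.05, 0.06, size=24)  # site A assumed to read slightly high
site_b = true_ratio + rng.normal(0.00, 0.03, size=24)

t_site = stats.ttest_rel(site_a, site_b)               # H0: mean(site A - site B) = 0
print(f"site A vs site B: mean diff = {np.mean(site_a - site_b):+.3f}, "
      f"p = {t_site.pvalue:.3f}")

first  = true_ratio + rng.normal(0, 0.04, size=24)     # first replicate (no degradation simulated)
fourth = true_ratio + rng.normal(0, 0.04, size=24)     # fourth replicate
t_rep = stats.ttest_rel(first, fourth)                 # H0: no change between replicates
print(f"first vs fourth replicate: p = {t_rep.pvalue:.3f}")
```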
So the difference between sites is a measurement issue rather than sample degradation. This is summarized in the dashboard table: our gauge R&R figure of merit is 24%, repeatability is 21%, the repeatability changes from part to part and site to site, and site A always has higher repeatability than site B.

For our next steps, we plan discussions with each site as well as with the process team. The sites have a repeatability problem as well as a part-site interaction. We know which errors could be introduced in the measurement, such as background/baseline correction, electron counts, and peak deconvolution; we will discuss these sources of error with the site A facility and find out how to streamline the process to improve our MSA. We also plan to set up a calibration sample, or a reference sample, that we can measure at regular intervals at both sites. With the process team, we will work to make the next MSA data collection uniform: instead of bimodal, we should collect the data uniformly. We also saw part-to-part variation in repeatability; one reason could be a measurement issue, another could be that the process is not uniform. We need to validate our thermal model to check process uniformity.

Finally, I would like to mention the impact of this MSA analysis on my learning. Several JMP platforms and applications helped me separate the signal variation from the noise variation and identify which figure of merit to use to justify our measurement method; in our case, P/T is the best figure of merit. We also learned how misclassification risk relates to the MSA components as well as to the sample distribution. For the root-cause analysis, several JMP applications helped us plan for improving the MSA. Since this is so helpful for this particular program, I would like to introduce data-driven decision making to all the programs I am involved in, to improve project quality, cost, and time, and to promote data-driven decisions using JMP in our advanced technology group, the CTO team, and other projects.

On my final slide, I would like to mention my journey. I started learning JMP at the beginning of the year, and at that time we did A0, A1, and A2; that is my foundation. After that I worked on MSA analysis and SPC. I received my JMP STIPS certificate in May 2023. I am now a JMP instructor at AMAT, I am planning an in-person presentation in October 2023, and I am working toward my Black Belt in 2024. Thank you for listening.
Functional or curved responses frequently occur in industry. Thanks to new features in JMP, we can now model and predict functional responses using key DOE or product design factors with JMP Functional DOE or Curve DOE modeling. A Functional DOE model is purely empirical; a Curve DOE model, however, can incorporate mechanistic or expert knowledge about the functional form of the curved responses. In this presentation, the methods and results of predicting functional responses using Functional DOE and Curve DOE modeling are compared using case studies from the consumer product industry.

Hello, my name is Fangyi Luo and I'm from Procter & Gamble. Today I'm presenting with Chris Gotwalth from JMP. We're going to talk about how to model data from designed experiments when the response is a functional curve. Functional or curve responses occur very often in industry. Thanks to new developments in JMP, we can now model and predict functional responses as a function of key DOE or product design factors using both Functional DOE and Curve DOE modeling. A Functional DOE model is purely empirical, whereas a Curve DOE model can take into account mechanistic or expert knowledge about the functional form of the curve responses. In this presentation, the methods and results of predicting functional responses using Functional DOE and Curve DOE modeling will be compared using case studies from the consumer product industry.

This is the outline of the talk, which has two parts. In the first part, Chris will explain what functional data is, show examples, and build a fundamental understanding of Functional DOE modeling, including functional principal component analysis, as well as Curve DOE modeling. In the second part, I will use two examples from Procter & Gamble to compare the results of Functional DOE and Curve DOE modeling. The first example is modeling viscosity-over-time data from a formulation experiment; the second is modeling absorption-volume-over-time data from a diaper design of experiments. I will finish the talk with a brief summary and conclusions.

Thanks, Fangyi. Now I'm going to give a quick introduction to functional and curve data analysis. First, I want to point out that there is a lot of this kind of data out there, and JMP really has made analyzing curve response data as fast, easy, and accurate as possible. If you haven't heard of functional data analysis before, you have certainly seen it; it's all over the place, and I'll show you some examples to make that clear. For example, here are annual home price indices from 1992-2021 for all 50 US states. Each function has a beginning measurement followed by a sequence of other measurements and then a final measurement: they all have a beginning, a middle, and an end. The functions don't have to share the same start and end points or have measurements at the same times. In a time series analysis, we are interested in using data observed in the past to predict forward into the future.
In a functional data analysis or a curve data analysis, we are generally more interested in explaining the variation internal to the functions than in predicting beyond the range of times we've observed. In product and process improvement in industry, we are often working with non-financial curves. I'm going to show you some examples that our customers have shared with us. Here we see a set of infrared spectra of gasoline samples used to develop an inexpensive tool to measure octane in gasoline. The green curves had high octane and the red ones were low in octane; the height of the left peak turned out to be critical for predicting octane level. Microbial growth curves are another common type of functional data in the biotech industry.

Today, Fangyi will demonstrate two methods in JMP that can be used for analyzing DOEs where the response is a set of measurements. The first method is called Functional DOE analysis and is best for complicated response functions like spectra, when you need the model to learn the curves in the data from scratch. The second is Curve DOE analysis, which is based on nonlinear regression models. When you can use Curve DOE analysis, I've found that you get more accurate results with it; but if you can't get it to work, you can always fall back on Functional DOE analysis, as it is more general.

The critical step in functional data analysis that will be new to most people is called functional principal components analysis, or FPCA for short. This is how we decompose the curves into shape components that describe the typical patterns we see in the curves, along with weights that describe how strongly each individual curve correlates with those shape components. It's a kind of dimension reduction and data compression technique that reduces all the information in the curves to the most compact representation possible.

To illustrate FPCA, take a look at the set of curves in the plot here. What do they have in common? How do they differ from one another? What I see in common is a set of peak shapes, one peak per curve, with the shapes going to zero away from the peak; they also appear to be symmetric around the center of the peak. In terms of differences, I see variation in peak heights, clear horizontal shifts from left to right, and some curves are narrower than others.

In a functional data analysis, the first thing we do is find a smoothing model that converts or approximates the discrete measurements into continuous functions. There is a variety of smoothing models in FDE. I don't have a firm rule as to which is best in general, but here are my observations about the most common ones. Wavelets and splines have different strengths. Wavelets are new in JMP Pro 17, are very fast, and are generally best for complicated functions such as spectra, as long as the X coordinates of the data are on a grid.
On the other hand, there are B-splines and P-splines, which are slower computationally but are better for data with irregularly spaced Xs, and are also often better when there are only a dozen or fewer measurements per function. If the data aren't large, I would try both splines and wavelets and see which gives the best fit by looking at the graphs. The main graphs I use to make decisions about smoothing models are actual-by-predicted plots, and you want the one that hugs the 45-degree line more closely. In this case, I would choose the wavelet model on the right over the spline model on the left, because those points are tighter around the 45-degree line.

Immediately after JMP Pro fits a smoothing model to the data, it decomposes the signals into the dominant characteristic shapes it found in the data. In mathematical language these shapes are called eigenfunctions, but a better and more approachable name is shape components. Here we see that JMP has found that the overall mean function is a peak shape and that there are three shape components that explain 97% of the variation in the data. The first shape component appears to correspond to peak height; I've learned to recognize that the second shape is a type of left-right peak-shift pattern; and the third shape component is something that would control the peak width. Remember that these are shapes learned from the data, not something I gave JMP from outside the data.

What has happened is that the observed spectra have been decomposed into an additive combination of the shape components, with unique weights for each individual curve. Functional PCA is like reverse engineering the recipe of the curves in terms of the shape components: the mean function is what they all have in common, the shape components are the main ingredients, and the weights are the amounts of the ingredients in each individual curve. A Functional DOE analysis is mathematically the same as extracting the scores (weights) and modeling them in Fit Model with the Generalized Regression platform. Fortunately, there is a red-triangle option in the Functional Data Explorer that automates the modeling, linking the DOE models to the shape functions for you and presenting a profiler that connects them, so you can directly see how changing the DOE factors changes the predicted curve or spectrum. There are many potential applications of Functional DOE analysis, some of which Fangyi will present later in this talk.

There is another approach in JMP called Curve DOE modeling. It answers the same kind of question as Functional DOE, but it is based on nonlinear regression rather than splines or wavelets.
What that means is that if you have a good idea of a nonlinear model, like a three-parameter logistic model, and that model fits your data well, you can get models and results that generalize better than a Functional DOE model, because the general shape of the curve doesn't have to be learned from scratch using splines or wavelets. The idea is that if you can make assumptions about your data that reduce the modeling effort needed, your predictions will be more accurate, especially from small data sets. Curve DOE analysis has a very similar workflow to Functional DOE analysis, except that you go through the Fit Curve platform instead of the Functional Data Explorer, and instead of choosing wavelets or splines, you choose a parametric model from the platform. Just as in a Functional DOE analysis, you want to review the actual-by-predicted plot to make sure that your nonlinear model is doing a good job of fitting the data. A Curve DOE analysis is the same as modeling the nonlinear regression parameters extracted from the curves with the Generalized Regression platform, which is exactly what happens in a Functional DOE analysis with the FPCA weights. Fit Curve automates the modeling and visualization just as FDE does. Once you know Functional DOE analysis, it's really not hard at all to learn Curve DOE analysis. Now I'm going to hand it over to Fangyi, who has some nice examples illustrating Functional DOE and Curve DOE.

Thanks, Chris. Next I'm going to talk about two examples from Procter & Gamble. The first example is viscosity-over-time curves collected from a number of historical formulation experiments for the same type of liquid formulation. There are six factors we would like to consider for the modeling; they are all formulation ingredients, and we call them factor one to factor six. The goal of our modeling is to use these formulation factors to predict or optimize the viscosity-over-time curve, so the response is viscosity over time. This slide shows some viscosity-over-time data. For the majority of our formulations, the viscosity increases with time at first and then decreases later on.

Next, we perform Functional DOE analysis on the viscosity-over-time data. Before the Functional DOE analysis, we need to perform functional principal component analysis on the curves, smoothed using different methods. Here we apply functional principal component analysis to the curves first using B-splines and find five functional principal components that cumulatively explain about 100% of the variation in the curves. Each curve is then expressed as the mean function plus a linear combination of the five functional principal components, or eigenfunctions, also called shape functions. We also apply Direct Functional PCA to the data, which finds four functional principal components that cumulatively explain about 100% of the variation across the viscosity-over-time curves. Each curve is then expressed as the mean function plus a linear combination of the four functional principal components.
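As an aside, here is a small numerical sketch of that decomposition: each curve is re-expressed as the mean function plus weighted shape components. The curves are simulated Gaussian-peak-like profiles standing in for the viscosity data, and the decomposition is done with a plain SVD; JMP's Functional Data Explorer adds smoothing and other refinements on top of this idea.

```python
# FPCA-style decomposition sketch: curve_i(t) ~ mean(t) + sum_j score_ij * eigenfunction_j(t)
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 60)
# Hypothetical curves: peak-shaped viscosity profiles with varying height, peak time, and width
curves = np.array([
    a * np.exp(-0.5 * ((t - b) / c) ** 2)
    for a, b, c in zip(rng.uniform(50, 120, 30),
                       rng.uniform(3, 6, 30),
                       rng.uniform(1.0, 2.5, 30))
])

mean_fn = curves.mean(axis=0)                     # the mean function
centered = curves - mean_fn
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

eigenfunctions = Vt                               # shape components (one per row)
scores = U * S                                    # per-curve weights (FPC scores)
explained = S**2 / np.sum(S**2)
print("variation explained by first 3 components:", explained[:3].round(3))

# Reconstruct curve 0 from the mean function plus its first 3 shape components
recon = mean_fn + scores[0, :3] @ eigenfunctions[:3]
print("max reconstruction error for curve 0:", np.abs(recon - curves[0]).max().round(2))
# In a Functional DOE analysis, these scores would then be modeled as functions
# of the DOE factors (in JMP, via Generalized Regression).
```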
This slide compares the functional principal component analysis model fits using the two options: the B-spline option and Direct Functional PCA. With the B-spline option, the model fit is smoother than the fit from Direct Functional PCA. This slide shows the diagnostic plots, observed versus predicted viscosity, from the functional principal component analysis using the two options. With Direct Functional PCA, the points are closer to the 45-degree line than with the B-spline option, indicating that Direct Functional PCA fits the viscosity-over-time data slightly better.

After performing the functional principal component analysis, there is an option in JMP to perform Functional DOE modeling and obtain a Functional DOE profiler. Functional DOE modeling essentially combines the functional principal component analysis with models for the functional principal component scores in terms of the formulation factors. With this profiler we can predict the functional response, in our case the viscosity-over-time curve, for different formulation factor settings: you select a combination of the formulation factors and it predicts the viscosity-over-time curve. This slide shows the diagnostic plots, observed versus predicted viscosity and the residual plots, from the Functional DOE modeling. As you can see, the residuals from the Functional DOE modeling are larger than those from the functional principal component analysis alone.

Our colleagues at Procter & Gamble found that a Gaussian Peak model fits the individual viscosity curves very well. This Gaussian Peak model has three parameters, A, B, and C: A is the peak value of the viscosity-over-time curve, B is the critical point, the time at which viscosity reaches its maximum, and C is the growth rate, the rate of viscosity increase during the initial phase. This is the fit of the viscosity-over-time curves using the Gaussian Peak model, obtained with the Fit Curve feature in JMP. These are the diagnostic plots of the viscosity curve fits using the Gaussian Peak model. The fits are not bad; however, the errors appear larger than the errors from the functional principal component analysis fits.

After fitting the curves with the Gaussian Peak model, there is an option in JMP to perform Curve DOE modeling. A Curve DOE model combines the parametric model for the curves, the Gaussian Peak model, with models for its parameters, expressing each parameter as a function of the formulation factors using generalized regression. Then you get the Curve DOE model, and this is its profiler. Using this profiler you can predict the shape of the curve by specifying a combination of the formulation factors.
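As a side note, here is a rough sketch of fitting that Gaussian Peak form to a single, made-up viscosity curve with scipy. In JMP this is done in the Fit Curve platform, and in a Curve DOE analysis the fitted A, B, and C would then be modeled as functions of the six formulation factors.

```python
# Gaussian Peak model fit to one hypothetical viscosity-over-time curve.
# A = peak viscosity, B = time at which the peak occurs, C = width/rate parameter.
import numpy as np
from scipy.optimize import curve_fit

def gaussian_peak(t, a, b, c):
    return a * np.exp(-0.5 * ((t - b) / c) ** 2)

t = np.array([0.5, 1, 2, 3, 4, 5, 6, 8, 10])
viscosity = np.array([22, 41, 78, 96, 99, 88, 70, 38, 16])   # made-up measurements

(a, b, c), _ = curve_fit(gaussian_peak, t, viscosity, p0=[100, 4, 2])
print(f"A (peak value) = {a:.1f}, B (time of peak) = {b:.2f}, C = {c:.2f}")
```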
Actually, this profiler is somewhat different from the functional DOE profiler we got previously. These are the diagnostic plots from the curve DOE model. As you can see, the curve DOE model does not fit the data well, and it's much worse than the functional DOE model. These are the curve DOE model fits on the original data. As you can see, for a number of formulations, the curve DOE model does not fit the data well. This is a comparison of the profilers from the functional DOE model and the curve DOE model. As you can see, the profilers look quite different. This compares the diagnostic plots from the functional DOE model and the curve DOE model. As you can see, the functional DOE model fits the data much better than the curve DOE model, with a smaller root mean square error. Now I'm going to show you the second example. This example is from a diaper design of experiments with four different products, A, B, C, and D, at three different stations labeled S1, S2, and S3, so it's a factorial design. Diaper absorption volume was measured over time for these four different products at the three different stations. The response is diaper absorption volume over time, and the goal is to understand the differences in diaper absorption curves across different products and stations. These are a few examples of diaper absorption volume over time curves, where the fitted lines are smoothing curves. We performed functional principal component analysis on the diaper absorption volume over time curves, and it was able to find five functional principal components that cumulatively explain almost 100% of the variation among the curves. These are the functional principal component analysis model fits. As you can see, for almost all the curves, the fitted curve plateaued after a certain time point. The functional principal component analysis model fitted the curves really well, as you can see from the diagnostic plots. We performed functional DOE modeling from the functional principal component analysis, and this is the profiler of the functional DOE model. This model allows us to evaluate the shape of the curve for different diaper products at different measuring stations. The product comparison at station two seems to be different from the product comparisons at station one and station three. These are the diagnostic plots of the functional DOE model. Next, we would like to perform curve DOE modeling. Before curve DOE modeling, we would like to find some parametric model that fits the diaper absorption volume over time data well. I found that there's a function in JMP called the Biexponential 4P model. This model is a mixture of two exponential models with four unknown parameters. This model fits all the diaper absorption volume over time curves really well. These are the diagnostic plots of the curve fitting, and you can see that the Biexponential 4P model fits all the curves really well. After fitting the diaper absorption volume over time curves using the Biexponential 4P model, we performed curve DOE modeling using JMP, and this is a profiler of the curve DOE model.
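Before looking at that profiler, it may help to see the Biexponential 4P form in code. The sketch below assumes the common sum-of-two-exponentials parameterization with two scales and two rates, fit to made-up absorption data; JMP's Fit Curve parameterization may use different labels.

```python
import numpy as np
from scipy.optimize import curve_fit

# Biexponential 4P as described in the talk: a mixture of two exponential terms
# with four unknown parameters (two scales, two rates).
def biexponential_4p(t, a1, b1, a2, b2):
    return a1 * np.exp(-b1 * t) + a2 * np.exp(-b2 * t)

# Made-up absorption-volume-over-time data that rises and then plateaus.
t = np.linspace(0, 60, 30)
y = 300 - 280 * np.exp(-0.15 * t) + np.random.default_rng(2).normal(0, 5, t.size)

# A rising, saturating curve corresponds to one near-zero rate and one negative scale.
p0 = [300.0, 0.001, -250.0, 0.1]
params, _ = curve_fit(biexponential_4p, t, y, p0=p0, maxfev=10000)
print("a1, b1, a2, b2 =", params)
```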
Using this profiler, you are able to see the shape of the curve as a function of diaper product as well as measuring station. This is the profiler for product A at station two and then station three. These are the diagnostic plots of the curve DOE model, and you can see that the curve DOE model fits the data well, except that at higher diaper absorption volumes, the residuals are getting larger. These are the curve DOE model fits on the original data. As you can see, for most of the curves, this model fits the data really well. This compares the profilers of the functional DOE model versus the curve DOE model. As you may notice, there's some difference between these two profilers at the later time points. The predicted diaper absorption volume at the later time points tends to plateau in the functional DOE model, but it continues to increase at later time points in the curve DOE model. This compares the diagnostic plots from the functional DOE model versus the curve DOE model using the Biexponential 4P model. As you can see, both of these models fit the data really well, with functional DOE being slightly better, with a slightly smaller root mean square error. Now you have seen the comparison of functional DOE modeling versus curve DOE modeling using two P&G examples, and this is our summary and conclusions. Functional DOE modeling is always a good choice. When the parametric model fits all the curve data well, curve DOE modeling may perform really well. However, if the parametric model does not fit the curve data well, then curve DOE modeling may perform poorly. The functional DOE model is purely empirical. However, the curve DOE model may take into account mechanistic understanding or expert knowledge in the modeling, so it can be a hybrid. It's good to try different methods, like different smoothing methods, before functional principal component analysis. In functional DOE modeling, try the functional DOE model versus the curve DOE model and see which one performs best. This is the end of our presentation. Thank you all for your attention.
Plasma enhanced chemical vapor deposition (PECVD) is extensively used to deposit thin films in semiconductor manufacturing. PECVD processes are gas phase processes typically conducted at low pressures in specially designed deposition chambers. Many process parameters influence the thin film properties in ways that are difficult to characterize.   This presentation shows: How the Generalized Regression platform was used to "rescue" a definitive screening design that had runs that couldn’t be completed. How the original design was augmented. How this complex design space was finally understood well enough to find a new precursor, effectively identifying a thin film that maximized the targeted film property (high mechanical strength) in the fewest number of experiments.   This presentation is predominantly hands-on, using JMP in real time.     Today  we 're  going  to  talk  about  how to  use  JMP  to  do  thin  film  development using  plasma -enhanced chemical  vapor  deposition . This  is  a  bit  different ,  I  think , than  some  of  the  typical  work that 's  done  in  industry , where  it 's  a  continuously  stirred  reactor and  you  can  always  mix  things . Regardless  of  what  happens , you  can  always  measure  outputs . But in  plasma-enhanced chemical  vapor  deposition , it 's  really  discrete  pockets  of  stability that  you  have  to  work  with . Even  though  we  can  set  up a large  parameter  space , there  can  be  spots within  that  parameter  space where  you  may  not  be  able to  strike  the  plasma or  it  could  arc because  the  power  density  is  too  high . Since  we  have  a  large  number of  deposition  parameters , we  need  to  use  a  design  of  experiments to  effectively  explore that  parameter  space . Even  if  we 're  able to  strike  the  plasma , there  are  still  issues with  thin  film  uniformity . We 're  depositing  nanometer  films with  nearly  perfect  uniformity across  a  12 -inch  wafer . Once  we  get  that ,  we  still  have to  hit  the  targeted  film  properties . We 're  going  to  talk  about   how  to  use  PECVD to  develop  new  thin  films from  new  precursors . The  first  thing  we 're  going  to  do is  talk  about  Precursor  1 . From  what  I  was  able  to  read from  the  JMP  tutorials , the  Definitive  Screening  Design is  a  very  effective  way to  screen  a  large  number  of  main  effects in  the  fewest  number  of  experiments . That 's  key  to  the  work  that  we  do. We  want  to  get  the  right  answer in  the  shortest  amount  of  time with  a  data -driven  approach . We  used  a  Definitive  Screening  DOE to  explore  seven  factors  in  26  runs . What  we 'll  do  is  just  open  up that  initial  DOE . This  is  the  setup  we  came for  the  Definitive  Screening  DOE . Here  are  the  seven  different  factors we're  varying  for  the  deposition, and  our  output  is  going to  be  this  parameter  Y, and  we 're  trying  to  maximize  it . If  we  look  at  the  range  of  parameters for  this  type  of  PECVD  processing , this  is  a  very  wide  range of  initial  parameters . Again ,  we 're  trying  to  screen for  main  effects , and  our  outputs  are  ranging  from,  say , 9-34 ,  and  our  baseline  was  21 . We  do  see  an  improvement  there . 
One  of  the  things  that  I  always  like to  do  when  we  do  a  DOE is  include  a  center  point  replicate   or  a  repeat  run to  see  how  reproducible   the  instrument  is , as  well  as  to  make  sure that  the  statistics  we  generate within  the  design  are  valid . These  are  the  two  center -point  runs , and  you  can  see  we  get excellent  reproducibility . The  other  thing  that 's  really  nice  for  us to  do  before  we  get  into  fitting  the  model is  just  to  look  at  the  output  variables and  try  to  identify any  trends  that  we  can  see . Is  there  anything  we  can  identify  quickly that  we  can  attribute  the  main  factors  to ? Here,  there 's  four  points  with  a  Y  value of  greater  than  30  within  the  DOE . If  we  select  those  points , it 's  nice  to  see  if  we  can  see any  trends  associated  with  these  data . One  of  the  fastest  ways  I  found  to  do  that is  to  quickly  do  a  multivariate  analysis , and  we  can  do  this  graphically . What  we 're  going  to  do is  take  all  our  factors and  then  our  output  variable , and  we 're  going  to  generate a  multivariate  analysis . Here  in  this  graph , this  is  our  Y  value . You  can  see  as  we  go  from  10 to  30 , the  four  values that  are  the  highest  are  highlighted , but  the  rows  are  the  various  factors . Here  we  can  see  for  helium , we  have  the  highest  values at  the  high  and  low  splits . For  precursor ,  we  have the  high  and  low  splits . But for  temperature  and  pressure , we  have  the  highest  values at  the  lowest  splits . It 's  really ,  I  think , a  good  indication  initially before  we 've  done  any  model  fitting , that  temperature  and  pressure  could  be important  variables  for  us  to  look  at . If  we  go  back  to  the  table, the  other  nice  thing  about  JMP , it 's  very  powerful  because... Again ,  before  we  do the  definitive  screening , we  can  use  a  predictor  screening to  identify  what  are the  most  important  factors . Again ,  we  use  the  standard  analysis , input  our  factors ,  our  response  is  Y , and  you  can  see what  the  predictor  screening  is  telling  us is ,  yes ,  pressure and  temperature  are  very  important . But one  thing  that  we  didn 't  catch in  that  multivariate  analysis is  the  precursor  flow . These   three  factors ,  pressure , precursor  flow ,  and  temperature appear  to  be  dominant  in  giving  us the  highest  values  of  Y . Now  I  wanted  to  fit  the  model   because  I  think the  real  power  of  the  DOE is  not  the  runs  in  the  table , but  it 's  the  response  surface  model that  you  can  use to  get  predictions  for  improvement as  well  as  directions  to  further  explore . But when  I  went  to  analyze  it , it  wouldn 't  work . It  turns  out  we  were  only  able to  complete  25  of  the  26  runs , and  I  was  not  aware that  the  Definitive  Screening  DOE , the  default  analysis  would  not  work if  you  did  not  complete  all  of  the  runs . At  this  point ,  I  contacted  Jed  at  JMP to  help  me  understand how  I  could  get  some  models   out  of  this  data that  we  carefully  collected over  a  period  of  time . I 'll  turn  it  over  to  Jed . When  Bill  called , like  he  said , when  he  hit  that  script that  saved  to  the  data  table of  Fit  Definitive  Screening , nothing  happens . 
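For readers following along without JMP, a rough analog of that quick predictor screen is a random-forest importance ranking; the sketch below uses hypothetical file and column names and is not the Predictor Screening platform's actual algorithm.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Rough analog of a quick predictor screen: rank the seven deposition factors
# by random-forest importance for the response Y. File and column names here
# are hypothetical placeholders, not the actual DSD table from the talk.
df = pd.read_csv("pecvd_dsd_runs.csv")
factors = ["Pressure", "Temperature", "PrecursorFlow", "HeliumFlow",
           "Spacing", "HFPower", "LFPower"]
X, y = df[factors], df["Y"]

rf = RandomForestRegressor(n_estimators=500, random_state=1).fit(X, y)
ranked = pd.Series(rf.feature_importances_, index=factors).sort_values(ascending=False)
print(ranked)
```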
If  you  look  over  here , the  log  is  saying ,  there  are  runs that  are  not  fold  over  center  point  run , and  it 's  run  17 ,  which  is  obviously   the  run  that  was  missing and  couldn 't  be  completed in  the  experiment . What  Bill  wanted  was  a  way  to  still fit  that  Definitive  Screening  model . We  came  up  with  two  different  approaches and  had  three  models  that  came  about from  those  two  different  approaches . The  first  one  is  related to  the  Definitive  Screening . The  designs  of  these types  of  models,  of  these  experiments, are  always  fold -over  pairs , where  there 's  a  pair  of  opposites . If  we  can  find  that  fold -over  pair , or  the  twin ,  I  guess ,  of  this  row  17 , we  should  be  able  to  exclude  both  rows and  then  fit the  definitive  screening  design . We  just  needed  a  simple  way  to  do  that . What  we  came  up  with  was  basically   to  use  a  couple  of  shortcuts . I 'm  going  to  first  standardize the  attributes  of  these  columns and  change  the  modeling  type  to  ordinal . As  I  do  this ,  you 'll  notice that  my  ability  to  select  has  changed . That  helps  when  I  look  at  a  data  filter . Now  I  have  boxes  rather  than  histograms , so  it  just  makes  it  faster  to  select . What  we  need  to  do  is  find the  opposite  row  of  this . I  have  this  row  17  selected , and  you  see  that  it 's  high ,  high ,  high , low ,  low ,  low ,  and  then I 'm  out  of  memory  space . I 'm  remembering  high , high ,  high ,  low ,  low ,  low . I  need  to  find  the  opposite  of  that , which  is  going  to  be  low , low ,  low ,  high ,  high ,  high . If  I  just  come  over  here  and  start  working my  way  down ,  low ,  low ,  low ,  high , by  the  time  I  get to  just  four  of  the  runs so  just  more  than  half of  the  factors  selected , now  I 'm  down  to  just  one  matching  row . It  just  so  happens  that  the  very  next  run was  the  fold-over  pair  in  this  experiment . We  can  select  both  of  those  runs , exclude  them , and  then  go  back  in into  the  column  properties and  change  that  modeling  type back  to  continuous . Now  when  we  hit that  Definitive  Screening  button , it  works . We  can  run  that  model and  see  that  it 's  predicting  fairly  well . We  can  see  the  profiler , but  we  also  were  really  aware that  one  of  the  runs out  of  26 ,  that 's  almost  4 %. We 're  throwing  away  roughly  4 % of  the  information  by  excluding  this . What  we 'd  really  like  to  do is  not  throw  that  information  away , find  a  way  to  use  that . We  used  the   Model Screening  platform in  JMP  Pro  to  run  a  bunch  of  models and  then  select  the  best . The  two  that  came  out  the  best were  a  Neural  and  Stepwise  model , and  I  can  walk  through  those really  quickly . The  Neural  model  was  with  our  response and  our  factors . Since  this  was  a  DOE , we 're  going  to  do  the  minimum  holdback , and  I 'm  just  going  to  choose a  random  seed  so  this  is  repeatable . The   Model Screening  platform generally  suggests  about  20  boosts . If  I  hit  Go  here , I  get  a  pretty  good   Rsquare  across  this. Maybe  I  might  try  to  tune  this  model by  adding  some  more  parameters , but  when  I  do ,  I  can  see  that  R square is  not  really  changing . I  don 't  think  I  want to  add  more  parameters and  risk  overfitting  the  model . 
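For reference, the fold-over twin that Jed locates with the data filter can also be found programmatically. The sketch below assumes the factors can be coded to -1/0/+1 from their ranges and uses the same hypothetical file and column names as above.

```python
import pandas as pd

# Find the fold-over twin of a given run in a definitive screening design.
# File and column names are hypothetical stand-ins for the 26-run table.
df = pd.read_csv("pecvd_dsd_runs.csv")
factors = ["Pressure", "Temperature", "PrecursorFlow", "HeliumFlow",
           "Spacing", "HFPower", "LFPower"]

# Code each factor to -1/0/+1 based on its range.
center = (df[factors].max() + df[factors].min()) / 2
half_range = (df[factors].max() - df[factors].min()) / 2
coded = (df[factors] - center) / half_range

target = coded.loc[16]                       # JMP's run 17 is label 16 with 0-based indexing
mask = (coded.round(6) == (-target).round(6)).all(axis=1)
mask.loc[16] = False                         # don't report the run itself
print("Fold-over twin(s) of run 17:", df.index[mask].tolist())
```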
That  was  one  extra  way  to  do  it . Then  the  second  model that  showed  up  easily  for  us using  the   Model Screening  platform was  to  do  Stepwise . The  way  we  did  that  was we  put  our  output  here and  then  use  the  shortcut to  do  a  response  surface . That  includes  all  main  factors , all  squared  terms , and  all  two -way  interactions . Then  if  we  change  this  to  Stepwise  here , we  can  hit  the  Run  button  and  Go . Now  JMP  is  going  to  enter and  exit  everything until  it  finds the  model  that  fits  the  best . We  can  go  ahead  and  run  that . Now  we  have  three  models that  we  want  to  compare . What  I 'm  going  to  do  is  I 'm  going to  take  this  first  model  and  save . I 'm  going  to  publish that  prediction  formula to  the  Formula  Depot . I 'm  going  to  give  it  a  really  quick  name , and  we  will  call  this  DSD . Whatever ,  I  can 't  type . We 'll  call  it  DSD ,  and  then  close  it . We 'll  do  this  with  the  Neural  as  well . We  will  publish  that  prediction  formula , give  it  a  name , and  do  the  same  with  this  final  model where  we  will  publish that  prediction  formula . This  last  one  was , we  called  it  Stepwise . Now  I  have  these  three  models and  I  can  compare  them . We  can  run  the  Model  Comparison  platform for  all  three  of  them from  within  the  Formula  Depot. We  can  get  a  rank of  the   Rsquares  of  those  models . We  can  look  at  the  actual   versus  predicted , and  we  can  see  that  they 're all  predicting  about  the  same . We  can  also  look  at  the  predicted  by  row , and  we  can  see  this  one  point from  the  Definitive  Screening  Design is  the  one  that  was  left  out  when  we  fit the  original  Definitive  Screening  Design . It  seems  to  be  important , and  probably  most  importantly , we  can  look  at  a  profiler for  all  of  these  against  each  other . If  I  turn  off  the  desirability ,  we  can  see how  these  models  compare  to  each  other . For  example ,  we  can  look at  that  Definitive  Screening  Design and  see  that  it 's  showing  some  curvature where  the  other  two  models  are  not . Maybe  we  can  look  and  see  over  here that  the  curvature  is  different for  each  of  those . Then  the  question  becomes , which  model  is  best  and  how  do  I  know ? Then  that  brings  us  back  to  Bill . Thanks ,  Jed . Let  me  share  my  screen . Can  you  see  my  screen ? Yes . I 'm  going  to  just ... Jed  saved  all  that  to  the  Formula  Depot, so I 'm  just  going  to  execute  the  script that  will  take  us  to  the  Formula  Depot that  he 's  already  saved . Again ,  we 'll  then  just go  right  to  the  profiler . We 're  going  to  fit  all  three  of  these . Then  we  have  the  profiler . This  is  what  the  real  power ,  I  think , of  the  DOE  is, because  the   Prediction Profiler ,  we  can optimize  and  maximize  the  desirability . The  response  surface  models  will  tell  us what  combination  of  factors  we  need to  get  the  highest  elastic   or  highest  value  of  the  output  parameter . What  was  really  eye -opening  for  me is  if  you  look  at  the  values  that  we  get when  we  do  this  optimization , two  of  the  predictions , the  Neural  net and  the  Definitive  Screening  DOE, are  giving  us  values  of  the  Y  parameter that  are  greater  than  anything we  saw  in  our  initial  data  table . 
We  had  a  maximum  value  of  34 . Let 's  see. I 'm  sorry , I  just  got  to  get  that  screen  back . Typically ,  it 's  very  unusual for  me  to  see  this  with  the  DOE . Typically ,  the  model ,  if  you  maximize  it , is  generally  close  to  what you  see  as  in  the  table . But in  this  case ,  it  looks  like we  really  have  some  low -hanging  fruit . We  needed  to  test this  combination  of  parameters and  really  see  if  that  prediction was  valid  or  not . If  we  go  back  to  our  JMP  journal... I  just  want  to  show  you  what  happened . We  took ,  I  think  this  is  the  prediction from  one  of  the  Neural  network  fits . Again ,  the  highest  value in  the  Definitive  Screening  DOE  was  34 . The  model  prediction  was  42 , but  when  we  actually  ran  it , we  saw  some  artifacts  in  the  film that  were  not  acceptable . The  plasma  itself  was  stable . There  was  no  way  to  see  this until  the  wafer  came  out  of  the  reactor . But you  can  see there 's  a   bullseye  pattern , which  is  due  to  film  non -uniformity . In  this  case ,  it 's  very  thick in  the  middle  and  thin  at  the  edge , which  gives  us  this  bullseye  pattern . Then  if  you  look  carefully , you  can  see  all  these  small  dots over  the  wafer , which  are  actually  the  holes under  the  shower head . The  shower head  has  thousands of  small  holes  where  the  gas  comes  out . In  this  case ,  we  have  a  shower head  pattern  and  a  bullseye . I  think  the  model  is  telling  us what  direction  to  go . But again ,  plasmas  are  challenging  to  use . Even  though  the  model  was  telling  us this  film  should  have the  highest  value  of  Y , the  film  itself  was  unacceptable . Then  we  have  to  rely  on  our  process and  theoretical  knowledge  of  the  process . We  know  that  argon has  a  lower  ionization energy, and  if  we  substitute  argon for  helium  and  the  plasma , we  can  get  a  higher  plasma  density , which  may  help  us overcome  these  challenges . What  we  did  is  switch  to  argon ,   and  you  can  see , although  the  film  is  not  perfect , it 's  much  more  uniform and  certainly  good  enough  for  us to  get  the  physical  properties of  the  film  that  we  can  use . In  this  case ,  we  were  able to  hit   a  Y  value  of  46 , which  again , is  much  greater  than  34 . We 're  certainly  trending in  the  right  direction . What  we  really  wanted  to  do  is , are  there  any  opportunities  for  us to  further  improve  the  film ? Again ,  that 's  where the   Prediction Profiler or  the  response  surface  models are  very  powerful . If  we  just  look  at  the  trends that  we  see  here... I 'm  just  going  to  blow  these  up so  we  can  see  them  a  little  bit  better for  each  of  these  cases . The  data  is  really  telling  us, in  certain  cases ,  there 's  things that  we  really  want  to  investigate . Lower  temperatures look  like  it 's  definitely favoring  the  highest  value  of  Y , pressure  appears  to  be  a  key  parameter , and  low -frequency  power for  this  initial  DOE looks  like  in  two  cases , you  want  to  go to  higher  low -frequency  power . The  Stepwise  is  giving  us  the  opposite . But you  can  see that  this  is  really   a  blueprint for  us  to  do  an  additional  design to  see  how  far  we  can  push  it . 
Can  we  go  to  lower  flow  rates, can  we  go  to  lower  temperatures, can  we  go  to  lower  pressures and  further  improve  the  film  properties ? It 's  really  sequential  learning , and  that  for  me , is  the  real  power  of  the  DOE . We  don 't  really  have  time to  go  through  all  of  that , but  what  I  did  is  put  together a  new  JMP  table  with  the  results from  our  sequential  learning for  this  set  of  experiments . Here  is  the  same  data  we  saw in  the  Definitive  Screening  DOE . Here  are  our  Y  values , and  we 're  ranging  from  9 to  34 . The  different  colors are  the  different  DOE . Here 's  the  next  DOE  that  we  did . What  you  can  see  is  based on   the  trends  in  the  Prediction  Profiler and  the  response  surface  models , we  fixed  the  low -frequency  power at  the  highest  setting  we  could . It  turns  out  physical  limitations for  this  plasma  chemistry  prevented  us from  adding  any  more than  20 %  low -frequency  power . We  also  fixed  the  temperature . We  can 't  operate for  this  chemistry  below  200 . We  know  that  lower  temperature gave  us  the  highest  value , so  we  fixed  these  two and  then  did  a  five -factor  DOE focusing  on  lower  precursor  flows , various  spacings ,  the  higher  powers , and  certainly  the  lower  pressures , which  was  indicated  as  one of  the  most  important  parameters . If  you  look  at  the  Y  values  here , you  can  see  we 're  definitely trending  in  the  right  direction . Now  we 're  going  from  mid  20s  up  to  56 . We 're  certainly  above the  46  we  saw  there . Then  we  did  the  same  learning . Again ,  the   Prediction Profiler  indicated what  parameters  we  should  explore . We  did  another  DOE . In  this  case , we  fixed  different  parameters , but  you  can  see that   the  trend  is  the  same . Now  we 're  hitting  up  to  66 in  terms  of  our  Y  value , and  we  did  one  final  experiment , and  in  this  case  you  can  see basically  the  sum  of  all  the  knowledge that  we  gained . It  turned  out  when  we  switched  to  argon , you  could  add  more  low -frequency  power . You  could  go  from  zero  to  40 %. In  our  final  analysis , this  DOE  showed  that  unlike  the  first  DOE , after  we  finetuned  everything and  switched  gasses , the  low -frequency  power had no  statistical  impact  on  the  Y  value . We  set  that  to  zero . We  found  out  the  lowest  spacing was  the  most  important . Our  sweet  spot  for  pressure  was  2 .3  torr , and  we  did  want  to  operate at  the  lowest  temperature . Really  we  had  a  three -factor  DOE   between  total  power ,  precursor  flow , and  argon  dilution to  really  dial  in  the  films , and  we  could  hit a  maximum  value  of  84 . I  summarized  all  that  in  a  box  plot  here , which  I  think  really  shows the  ability  of  the  DOE and  sequential  learning , where  we  started  out  with  a  seven -factor Definitive  Screening  DOE  with  26  runs and  ended  up  with  a  three -factor I -optimal  design  with  16  runs , but  you  can  see our  continued  improvement . This  was  our  reference  target , so  we  still  have  more  work  to  do . But this  is  a  very  powerful  way  for  us to  screen  seven  factors with  three -level  designs in  a  very  short  period  of  time . I  do  think  it 's  worthwhile just  to  point  out how  efficient  these  new  modern  DOE s  are . 
If  we  looked  at  what  we 're  really  doing , we  have  seven  factors that  we  started  with . All  of  these  are  three -level  designs . For  a  three -level ,  seven -factor  design , that  would  be  over  2 ,100  runs . We  could  run  90 out  of  the  experimental  designs and  achieve  this  increase  in  the  Y  value . I  think  these  modern  designs , the  optimal  designs  combined with  the  Definitive  Screening  DOE s are  a very  powerful  tool  for  us to  get  the  most  value with  the  fewest  number  of  experiments . The  final  thing  I  want  to  touch  on is  when  we  switched to  a  different  precursor . This  is  really   a  different  challenge  we  faced . The  goal  here  was  to  evaluate   different  precursors to  compare  how  it  stacked  up against  the  initial  baseline  film . What  we  tried  to  do  is  use all  of  our  learning  from  those  four  DOEs and  become  even  more  efficient. Instead  of  90  runs, can  we  do  this  in  52  runs ? With  Jed 's  help ,  we  put  together an  eight -factor  A -optimal  design , but  what  we  found  is  that the  chemistry  was  shockingly  different . All  of  the  parameter  space  that we  could  operate  easily  with  Precursor  1 was  not  the  case  here . In  fact ,  I  put  together  a  slide to  show  you how  bad  some  of  these  films  could  look , so  we  could  get  perfect  films . You 'd  be  hard -pressed to  tell  there 's  a  film . This  is  a  nanometer -thick  film , edge  to  edge  on  a  12 -inch  silicon  wafer , perfectly  uniform . Then  we  would  have  films that  look  like  this . Obviously ,  that 's  not a  design  that  we  wanted . But the  challenge  that  we  faced  was , we 're  doing  an  eight -factor  DOE , and  we 're  trying  to  do  this quickly  and  efficiently , and  30 %  of  the  runs  failed . I 'm  looking  at  a  table with  eight  different  factors . How  do  I  pick  out  the  factors that  are  contributing  to  this ? What  we  did  is  we  created  a  table for  our  eight  input  factors and  then  identified  all  of  the  films that  had  delamination ,  arcing , or  other  issues , and  then  created  a  metric , just  a  film  metric,  pass  or  fail . It  turns  out   we  can  fit  this  categorical  variable and  see  if  we  can  get  a  model that  will  help  us  understand what  is  really  causing these  issues  through  all  these  films . The  first  thing  we  can  do  quickly is  again  go to  our  Model Screening platform, Predictor  Screening , and  get  a  handle  on  what  factor ,  if  any , is  really  controlling  film  quality . If  we  look  at  this , it 's  pretty  clear  and  it 's  shocking, because  this  was  not  the  case with  Precursor  1 . The  flow  rate  of  the  precursor was  by  far  and  away the  most  dominant  factor   impacting  film  quality . But what  we  needed to  run  these  experiments is  not  just  knowing  this  factor , but  what  value  can  we  safely  run to  generate  quality  films. That 's  where  we  did  the  Neural  net . Again ,  we 'll  go  into  Predictive  Modeling, Neural  net, we 'll  take  our  factors , and  we 're  going  to  fit  film  quality   as  a  categorical  variable . I 'm  going  to  go  to  boosting  of  20 , as  Jed  mentioned . That 's  typically  what  the   Model Screening comes  up  with , and  we 'll  generate  our  model , and  you  can  see  we  get excellent   Rsquare  values . This  is  a  categorical  model . 
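As a rough outside-of-JMP analog of that categorical screen, a random-forest classifier can rank the factors against the pass/fail film metric; the file and column names below are hypothetical, and this is not the algorithm behind JMP's Predictor Screening or Neural platforms.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# (As an aside, a full three-level factorial in seven factors would be 3**7 = 2187 runs.)
# Hypothetical eight-factor table for Precursor 2 with a Pass/Fail film-quality column.
df = pd.read_csv("precursor2_runs.csv")
factors = ["Pressure", "Temperature", "PrecursorFlow", "ArgonFlow",
           "Spacing", "HFPower", "LFPower", "TotalFlow"]
X, y = df[factors], df["FilmQuality"]        # y holds "Pass" / "Fail"

clf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X, y)
print(pd.Series(clf.feature_importances_, index=factors).sort_values(ascending=False))
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```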
I  believe  that  looking  at  the  ROC  curve provides  insight into  how  well  the  model  fits . If  the  curve  is  along  this  diagonal , it 's  just  basically  a  guess . This  looks  like  a  half  square  wave . These  are  basically  perfect  fits in  the  ROC  curve . Then  the  question  is , how  can  we  utilize  that  data ? The  nice  thing  about  the  Neural  net is  that  you  have  a   Categorical Profiler. I  can  execute  the   Categorical Profiler , and  now  I  set  this  up  where  we know we  want  to  operate  at  lower  pressures . We  know  from  the  previous  work we  want  to  operate  at  lower  spacings . We  want  to  go to  the  lowest  possible  temperature. We 'll  just  set  this  in  the  middle . Then  basically  we  have this  profiler  that  tells  us the  primary  factor  affecting  this is  really  the  argon  flow  rate . If  we  can  keep  our  flow  rate below  400  sccms , we  can  have  a  100 %  success  rate for  the  films that  we 're  trying  to  optimize . With  this ,  we  set  up  a  new  DOE , limiting  the  total  flow  rate  to  400  sccms . We 'll  go  back  to  our  JMP  journal . We  were  able  to  come  up with  a  new  design and  complete  41  of  the  42  runs , and  we 're  still  executing  that  study . But it  just  shows  how  powerful the  Neural  network  is for  a  categorical  variable where  we  can  do  this  in  an  afternoon, where  at  one  o 'clock  we  found  these  films  weren 't  working . Three  hours  later ,  we  had  a  model that  told  us  how  to  set  up  a  new  design and  we  were  executing  that  later  that  day . I  think  that 's  the  material we  wanted  to  cover .
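As a footnote on reading ROC curves, the same kind of plot can be drawn for any classifier that outputs probabilities. The sketch below reuses the hypothetical pass/fail table from the earlier block; in practice you would plot holdout or cross-validated predictions rather than training fits.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score

# Same hypothetical pass/fail table as in the earlier sketch.
df = pd.read_csv("precursor2_runs.csv")
factors = ["Pressure", "Temperature", "PrecursorFlow", "ArgonFlow",
           "Spacing", "HFPower", "LFPower", "TotalFlow"]
X, y = df[factors], (df["FilmQuality"] == "Fail").astype(int)

clf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X, y)
p_fail = clf.predict_proba(X)[:, 1]          # training-set probabilities, for illustration only

# A curve hugging the top-left corner (AUC near 1) is the "half square wave" shape;
# the diagonal reference line is no better than guessing.
fpr, tpr, _ = roc_curve(y, p_fail)
plt.plot(fpr, tpr, label=f"model, AUC = {roc_auc_score(y, p_fail):.2f}")
plt.plot([0, 1], [0, 1], "--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```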
JMP has long been well-suited for producing production quality tables and graphs. Users have many options, such as taking screen shots or saving JMP reports in specific formats or even using JSL to place pieces of output reports in specific MS Word or Excel files. We have had a strong demand in recent customer engagements to go beyond providing solid analytical solutions accompanied by the excellent JMP tabular and graphical output as the teams look to operationalize a solution.   Operationalization often means deploying results to nonstatistical audiences using templates or special formats tailored more toward general consumption. We demonstrate a method integrating streamlined Python code generated with AI apps to effectively create output documents with advanced MS Word, which is not possible with JSL alone and doesn't require difficult Visual Basic code. We also show several other deployment tips and tricks that don't require Python.       Hello,  Team  Discovery.   I'm  Jim  Wisnowski, along  with  Andrew  Karl  from  Adsurgo, and  we're  here  to  talk a  little  bit  about  operationalization and  how  you  can  effectively  employ some  JMP  capabilities  to  do  so. The  genesis  of  this  presentation  came from  engagement  with  a  customer  last  year around  this  time  who  said, "Our  company  just  loves  JMP as  a  sandbox  tool and  maybe  in  the  lab, but  it  really  doesn't  do  anything for  operationalization  for  us." At  which  point,  it  struck  me  as, "I'm  not  really  sure  I  understand what  the  word  operationalization nor  really  know how  to  say  it  necessarily." S omehow  there's  a  disconnect because  all  of  our  projects, we  seem  like  we  deliver  a  product, and  it  is  finished. What  we  did  is  we  figured  out, "What  exactly  is  operationalization and  then  how  can  we  do some  neat  things  in  JMP  with  it?" What  we're  going  to  do is  I'm  going  to  talk  about  what  it  is and  then  give  you  some  quick  wins in  the  form  of  the  depot. Then  Andrew,  as  always,  will  do  the  work where  it  takes  the  high  intellect, and  he'll  also  show  you the  beautiful  part  of  operationalization. Key  here  is  we  all  know  all  models are  wrong  and  some  are  useful, but  the  other  side  of  that is  your  boss  or  your  boss's  boss, she  is  thinking  all  models  are  useless, whether  she  says  it  explicit  or  not. They're  useless  unless  somehow   they're  supporting  a  decision  of  mine or  we  get  them into  the  production  environment. We  need  to  make  sure  that even  though  we  have  a  good  model, that  it  can  go  to  the  next  step and  be  implemented. By  the  way,  I  do  want  to  show  George  Box over  here  on  the  right, and  then  Stu  Hunter  on  the  left, who  just  celebrated  his  100th  birthday on  the  3rd  of  June. By  definition,  what  is  operationalization? When  we  think  of  things, we  can  think  of  ideas such  as  the  leadership  saying  the  usual, What  is  the  problem? What  do  I  do  about  it? How  does  it  impact  me? And  then  the  now  what? What  is  the  solution? The  solution  isn't  necessarily all  the  things  that  we  do in  the  data  analytics,  data  science, world of  the  discovery, data  prep,  and  all  that. It  really  lies in  the  operationalization  piece for  that  senior leader  to  understand how  you're  going  to  solve  the  problem. 
In  other  words,  it's  really  how  do  we  get from  that  lab  environment to  the  production  line where  we  have  a  solution that's  going  to  be  useful  to  us. As  we  do  that,  let's  not  forget SAS's  recommendation  here that  we  better  make  sure  that we  have  a  good  disciplined  approach as  well  as  automated  in  that  world. Next  up,  we  can  dig  a  little  bit  deeper into  what  operationalization  is on  a  micro  level  kind of thing. I  asked  Chat GPT-4,  I  said, "Can  you  give  me  an  image of  what  operationalization looks  like  in  data  science?" This  is  essentially  what  they  did. I  had  to  do  the  graphics, but  they  gave  me  all  the  words. If  we  think  about  our  usual  data, do  our  data  development, and  then  we  figure  out  what  model  we  like, and  then  we  have  to  figure  out how  we're  going  to  employ  or  deploy  that, what  language? Is  it  going  to  be JavaScript,  C,  or  Python? Then  we  do  the  deployment, and  then  we  do  maybe perhaps  an  API  integration. Good  news  is  JMP  has  a  lot of  tools  for  us  to  do  that. We're  not  left  in  just  that  lab environment  as  suggested. Then  on  the  bottom  here, we  got  the  idea  that once  we  have  that  model  out  there, it's  not  a  launch  and  lead  kind of  thing. We  have  to  babysit  it and  perhaps  update  hyperparameters or  add  new  data and  see  if  it's  still  valid. Then  we  have  this  idea  here that  you  know  what? Not only are  our  users  liking  it, they  want  to  add  more  capabilities, so  we  start  scaling  up. We  have  to  make  sure  that  we  continue our  good  configuration  management and  data  compliance  and  documentation, ultimately  resulting  in  business  value. The  bottom  line  is  how  do  I  go from  this  lab  and  sandbox  environment to  having  business  value? That's  what  we're  looking  for in  our  operationalization. Forbes  gives  the  five  steps  here. Important  for  us  is  to  think  about, first,  you  have  to  have  a  good  story so  that  management  believes that  you  have  a  credible  way to  approach  this  and  solve  the  problem. Then  the  last  part  here  is, once  you  do  have  a  model, and  deployed, and  you  have it  operational ized make  sure  that  you  have  some  metrics  to  make  sure that  it  is  in  fact  performing. But  this  is  like  that  last  mile  idea is  we  take  all  of  this  work that  we  do  to  create  the  models, but  getting  it to  that  operationalization  piece is  the  tough  part. In  fact, we  can  see  that  the  data  out  there doesn't  suggest that  we're  doing  all  that  great. Overall,  we're  maybe  even  fewer than  half  of  these  models  make  it. Then  if  they  do,  it  takes  upwards of  three  months  or  so  to  do  so. DevOps,  we're  all  familiar  with in  terms  of  a  good  disciplined  approach for  software  development. When  we  take  that  step into  our  model  deployment  world, we'll  call  it  ModelOps, where  we  want  to  have  some  a  culture or  process  and  technologies to  effectively  employ  them. If  we  look  at  these  three  circles  here, it's  really  this  intersection between  two  and  three that  we're  focused  on to  make  sure  that  the  deployment   is influencing  those  business  decisions. I'd  like  to  go  and  do a  demonstration  here  in  JMP. 
Before  I  do  so,  I  do  want  to  point  out two  fantastic  presentations from  previous  discoveries that  do  talk  about  this  idea of  deployment  and  operationalization by  Nascif  and  Dan  Valente in  2016  as  well  as  2017. You  can  see  over  here, they  have  this  whole  idea that  if  you  give  a  man  a  fish, he  eats  for  a  day, and  if  you  teach   him to fish, he'll  eat  for  a  life, that's  this  operationalization  piece, which  they  also  call the  dark  side  of  analytics. That's  what  we're  going  to  get  into. M eijian  I  also  wrote  a  decent  paper as  well  on  how  you  can  do  that. But  for  us,  what  I  want  to  show  you is  using  the  Formula  Depot, and  I  got  a  little  plus  there  because it's  going  to  go  a  little  bit  beyond just  using  the  Formula  Depot because  that  is  not  enough. We'll  use  our  well-traveled Boston  housing  data. We'll  look  at  what's  the  price  of  it as  a  function  of  rooms,  distance, and  Lstat, which  is  an  economic  indicator. We'll  create  a  main  effects  model as  well  as  a  Bootstrap. Then  we'll  look  at  the  profiler, and  I'll  show  you  a  quick  trick that  could  be  helpful  for  you. Then  we'll  look  at  how  do  I  convert  this to  an  operational  solution and  being  able  to  deploy  it in  a  Python  environment. Certainly,  this  is  a  very  small  data  set, but  we  could  easily  have  done  this plugging  into  some  data  source and  using  query  builder and  things  like  that. But  just  want  to  show  you  some  quick  wins that  you  can  go  to  the  next  step. Because  often  we  hear  that  it's  great that  you  do  all  this  work, but  the  actual  implementation has  to  be  on  this  server, and  this  server can  only  use  JavaScript  or  C++, whatever  it  happens  to  be, Python,  maybe. How  can  we  take  our  good  work and  transport  it  into  use  in  operation? I'm  going  to  transition  over  to  JMP. Here's  a  journal  that  we  have  for  you, and  it  goes  through a  few  of  our  different  options and  what  we're  doing  here. But  here's  Boston  Housing that  we  know  and  love. Here  is  my  Least  Squares  model. What  you  do  in  the  Least  Squares  model, or  any  model  for  that  matter, is  under  Save  Columns, you're  going  to  publish that  prediction  formula. Then  the  Formula  Depot  comes  up  for  you. Let's  go  ahead  and  do the   Bootstrap Forest  as  well. Now we  have  the   Bootstrap Forest, which  we  can  also,  under  Save  Columns, publish  that  prediction  formula. If  we  come  back  up  here and  we  check  out that  Formula  Depot  report, we  can  see  that  it  is  something that  we  are  familiar  with  in  this  fashion. That  if  I  come  down  here off  of  the  red  triangle next  to  Formula  Depot, I  can  look  at  the  profiler . I  want  to  look  at  the  profiler of  both  these  models. I  do  OK, and  there  it  is. The  main  effects  model,  of  course, doesn't  have  any  interaction, but  maybe  you  want to  make  the  point  somehow that  when  you  have  very  short  distance, you have a real steep economic  impact kind of  a thing. What  we  could  do  is  we  could  show  this dynamically  in  PowerPoint, because  often, that  is  the  endpoint  of  some  of  our  work, and  an  analysis is  a  PowerPoint  presentation that  we'd  like  to  have some  sort of a  dynamic  display. 
There  are  certainly many  screen  capture  capabilities that  you're  probably  all  familiar  with. But  if  we  just  go  back  to  PowerPoint, we're  able  to  do  this  inherently in  the  software,  and  it's  pretty  quick. Here  is  a  blank  slide. What  I  want  to  do is  I  want  to  show  that  dynamic  display. I  just  come  under  Insert, and  I'll  do this  screen  recording  right  here. What  I'll  do  is  I  will  get  out  of  that and  I'll  come  back  to  here, and  then  I'll  come  back  to  JMP, go  to  PowerPoint. Now I do  the  screen  recording, and  I  want  to  be  at  the  right  spot if  it's  a  problem. I'll  just  go  ahead and  take  a  snapshot  of  what  I  want. I  will  go  ahead  and  hit  the  Record  button, and  it'll  give  me the  3-second  countdown. Then  now  I  can  narrate  and  say that  we  don't  see  any  interaction in  our  linear  your  model, but  you  can  see  a  Boosted Tree   does have  quite  a  bit  of  interaction. You're  happy  with  that. You  go  ahead  and  stop  the  recording. Then  now  you  have  this  embedded in  your  PowerPoint  that  easily. You  can  go  ahead  and  run  it  here, and  you  can  hear  me  in  the  background that  I  was  explaining  it, so  the  audio  is  in  there. Then  clearly, as  you're  in  presentation  mode, it  will  run  automatically  as  well. Now back  to  the  task  at  hand is  what  we  want  to  do is  deploy  this  solution in  a  model  that  is  Python. What  we  can  do under  the  Formula  Depot  options  here is  we  can  go  ahead and  generate  the  Python  code, and  we'll  do  it  for  the  sake of  argument  for  both  of  these. We  can  see  that  the  Boosted  Forest  here has  Python  code that  consists  of  55,000  lines  of  code. Good  to  know. But  we'll  put  that  aside  for  now. What  we're  going to  ultimately  decide  on,  we'll  say, is  t his  main  effects  model only  is  what  we're  going  to  use that  we  want  to  deploy  to  our  system. I'll  go  ahead  and  hit... By  the  way,  before  I  do  that, there  is  this  one  particular line  of  code  here that  says  import  JMP  score. That  is  a  Python  file  that  is  inherent, and  all  of  you  have  it on  your  install  directory, and  you're  going  to  need  that  to  use  it. What  it  is, it's  really  just  a  helper  menu. It  tells  you here's  how  you  do  vectorization, here's  how  you  do  power and  things  like  that, but  important  that  it's  there. I'll  go  ahead  and  do  File,  Save. What  that  creates  is  this  Least  Squares, and  then  we'll  call  it BH  for  Boston  House. Now I've  got  that  Python  code. What  would  be  nice  is  if  I  could  just  go to  Spyder  off  of  Anaconda and  just  run  it and  then  score  all  my  new  data. The  problem  is  it's  not  that  easy. It  doesn't  run  you  yet. It  takes  you  a  good  way  there, but  you  still  have  to  put some  wrappers  and  things  around  it. You  have  to  import your  Excel  file  or  whatever  it  is. I'm  not  a  Python  coder. In  fact,  people  say,  "What  do  you  do?" "I'm  a  data  scientist." "Oh,  you  must  be  good  at  Python." "No,  I've  never  actually  used  it." I'm  an  impostor  in  that  fashion. But  maybe  there's  a  way  that  we  could  use this  latest  technology,  ChatGPT-4, and  have  it  create a  Jupyter  Notebook  for  me. If  I  come  under  my  ChatGPT- 4  here, I  have  this  question  I  ask  it. 
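What such a wrapper has to do is roughly the following; this is only a sketch with placeholder module and function names, since the real names come from the .py file the Formula Depot writes out (and the jmp_score.py helper from the JMP install directory must be importable).

```python
# Minimal scoring wrapper around a Formula Depot export. The module name and
# the scoring call below are placeholders; use whatever the generated .py file
# actually defines, and keep jmp_score.py from the JMP install on the path.
import pandas as pd
import least_squares_bh as model      # hypothetical name for the exported scoring module

newdata = pd.read_excel("BostonHousing_new.xlsx")   # rooms, distance, lstat columns
outdata = pd.DataFrame(index=newdata.index)

model.score(newdata, outdata)                       # placeholder call; see the generated file
outdata.to_excel("BostonHousing_scored.xlsx", index=False)
print(outdata.head())
```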
"I  have  Python  code  generated from  a  model  in  the  JMP  Formula  Depot. Go  ahead  and  create a  Jupyter  Notebook  for  me that's  going  to  import  data  from  Excel." I  say,  "Can  it  do  it?" And  it  says,  "Absolutely." Then  what  it  does is  it  gives  me  the  code  right  here. I  copy  the  code, and  I  put  it  in  my  instantiation   of Jupyter from  Anaconda,  all  three. Then  I  run  it,  and  lo  and  behold, it  doesn't  work. Nothing  works  for  me  the  first  time. But  I  say,  "It  didn't  work, and  here  was  the  error  message  I  got." It  says,  "Oh,  well,  try  this." Then  I  tried  that,  and  then  it  worked. What  did  it  work? If  I  come  back  over  here, this  is  my  Jupyter  Notebook that  was  given  to  me  by   ChatGPT-4. Again,  I  know  nothing  about  Python, but I do  know  that  it  gave  me  these  lines. I  just  go  ahead  and  say, I'm  going  to  import  Pandas because  I  need  that to  get  my  Excel  file  in. Then  here  is  that  Fit  Least  Squares. That's  what  I  got  from  the  Formula  Depot. It  does  that. I'm  running  each  one  of  these,  by  the  way. Now it  says  go  ahead  and  import. I'm  going  to  import  an  Excel  file that  has  what  I  want  to  score, and  that's  going  to  be under  Boston  Housing  data. It's  new  data  that  I  want  to  score. Then  here's  this  outdata  thing that  it  told  me  was  my  error. I  said,  " I'll  do  that." Then  this  says,  "Hey,  just  let  me  know that  I  brought  in  some  data,"  and  it  does. Then  now  I'm  going to  go  ahead  and  score  it. I  go  ahead  and  score  it. Then  did  it  score? Sure  enough,  it  did. There's  the  first  five  values  of  it. Then  I  can  go  ahead  and  save  that, and  we'll  just  call  this  For  Demo. Maybe  I'll  even  add  100  in  there. Since  200  is  100, I'll  go  ahead  and  put  that  out. Then  I'll  say,  "W here  does  that  live?" Maybe  I'll  see  it  right  here. Here  it  is. Here's  the  scored  data  Excel  file. There  are  all  800  or  so predicted  values  from  that  linear  model. This  is  easy  as  that. Next  up, what  we  want  to  do  is  set  Andrew  up to  why  everything is  beautiful  in  this  world. Coming  back  to  PowerPoint  here, the  scenario  is  this. It's  that  we  were  working with  a  group  of  doctors  across  the  US, and  they  wanted  reports for  every  one  of  them how  productive  they  were  in  2022. They  thought  that  perhaps they  could  be  more  productive by  changing  some  of  the  codes   that they use for  certain  procedures. They  gave  us  millions  of  rows  of  data, and  we  came  up with  exactly  what  they  asked  for. We  created  a  template of  10  pages  or  so  of  the  report with  nice  JMP  graphics  in  there, and  it  was  virtually  perfect, except  for  one  thing. The  one  thing  is  that  this  data  table at  the  very  end  that  gave  the  results, we  couldn't  get  it  sized  properly or  put  the  borders  on  it. It's  as  simple  as  just  selecting  it and  then  right-clicking  and  do  AutoFit and  then  hitting the  border  symbol  next  to  it. That's  what  I  told  Dr.  Jay  right  here. This  is  for  Dr.  No. You  can  see  his  annual  report. Essentially, Andrew  swam  the  English  channel and  gave  them the  99.8%  solution  to  their  problem, but  they  weren't  quite  happy. It  wasn't  until  we  went  and  took the  step  to  make  this  automated. 
Again,  this  is  a  two-second  process, but  because  it  took  that by  100s  of  things, they  weren't  happy. But  then  we  ended  up  fixing  that, and  that's  when  the  customer  said, "This  is  absolutely  beautiful," hence  we  have beautiful  operationalization. With  that, I'm  going  to  turn  it  over  to  Andrew to  let  you  in  on  a  few  secrets of  how  you  can  get  some massively  increased  productivity. Thanks,  Jim. I'm  Andrew, and  I'm  going  to  show  you how  we  put  together  this  report. The  assumption  is  we  have  some  suppliers, and  here  are  our  suppliers. We've  got  four  different  suppliers, and  we've  got  some  metrics  that  we  track in  this  data  set that  we  update  frequently. We  want  to  be  able to  report  this  to  the  suppliers so  we  can  track  how  they're  doing and  have  a  record of  long-term  performance. What  we'd  like  to  do is  to  get  something  like  this is we  have  this  template  we  like where  we  have  our  header  up  top,  our  logo. We  have  in  bold, we've  got  the  company  name. Then  each  of  these,  we  substitute  in these  calculated  values  from  the  data  set. The  standard  JMP  functions  of, we  export  a  report  to  Word, we  get  everything, all  the  outline  boxes  are  open, but  it  doesn't  have  our  template  function, and we have to  fill  in  the  blanks. We  can  manually  copy  out  the  pictures, right-click  those  little  gray  triangles, and  say,  edit  copy  picture and  get  a  nice  version of  the  pictures  out. But  it's  still  a  manual  process that  has  to  be  done. This  little  application  here, I've  got  four  suppliers. What  if  you  have  400, and  they're  doing  that  once  a  month? That  becomes  unwieldy. How  can  we  do  this? Not  natively  within  JMP  or  JSL. You  can  get  close  sometimes to  being  on  your  application, but  a  more  flexible  example is  to  take  something  like  this where  we  have  a  template  file, we  write  the  report  once the  way  we  want  it. Every  time  we  have  something we  want  substituted  in  by  JMP, we  have  these  double  brackets, and  we're  going to  put  a  little  keyword  in  there. In  this  case,  team_p   is going  to  be  the  team  name. Then  down  here,  I've  got  mill_time_ p with  an  underscore, and  now, I've  got  that  bold  and  highlighted. We  put  whatever  format  we  want, and  anything  we  want  substituted  in, we  just  put  in  brackets. It's  a  natural  solution, so  it'd  be  great  if  we  get  this  to  work. Then  also,  in  addition  to  text, we  can  also  do  that  with  pictures. We've  got  the   q_pic11. If  we  want  a  JSL  line  of  box  equivalent, then  we  can  get this  table  structure  within  Word, and  we  just  put  in  our  keywords where  we  want  to  substitute  things  in. Also,  we're  going  to  have  a  table at  the  end  that  doesn't  have  a  key, that  doesn't  have  a  holder  here. I'll  get  to  that  in  a  second. Come  back  to  my  journal. At  the  end,  what  we  want  it  to  look  like, the  intended  document  result, is  we  also  have  this  table  at  the  end that  goes  through  Tabulate. We  have  this  nice  color  gradient we  can  put  it  into  JMP, and  we  want  to  get  this  into  Word. But  also,  as  Jim  mentioned, we  want  to  AutoF it  this  table and  we  want  it  to  look  nice, because  a  lot  of  times, the  tables  don't  fit  automatically. 
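One way to do the double-bracket substitution described here is the docxtpl package, which happens to use the same {{ keyword }} convention; the sketch below, with hypothetical file names, is only an assumed approach, and the script shared with this talk remains the authoritative version.

```python
from docxtpl import DocxTemplate, InlineImage
from docx.shared import Inches

# Hypothetical example: fill a Word template whose placeholders look like
# {{ team_p }}, {{ mill_time_p }}, and {{ q_pic11 }}.
doc = DocxTemplate("supplier_report_template.docx")
context = {
    "team_p": "Supplier A",
    "mill_time_p": "12.4",                              # value computed in JMP, passed to Python
    "q_pic11": InlineImage(doc, "output/q12_graph.png", width=Inches(3)),
}
doc.render(context)
doc.save("output/supplier_A_report.docx")
```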
We  can  go  through  all  the  work and  create  the  tables, but  if  we  don't  make  them  fit, then  we're  going  to  have  a  lot  of  work ahead  of  us  to  go  through and  do  that  manually. It's  not  something we  can  program  natively  within  JMP. What  can  we  do  is  we  will  have  a  script that,  and  this  is  shared on  the  Discovery  website, where  we  can  open  up  the... We  have  a  reference  for  the  data  table, the  container  data,  and  this  gets  updated. Every  time  we  run  this to  generate  the  reports, it  pulls  in  the  new  data. We  have  the  output  path, which  is  a  folder  where all  the  temporary  files  get  written  to that  we  normally  clean  up plus  the  output  reports. Then  also  template  file,  that  Word  file that  contains  those  keywords. A ll  the  rest  of  the  script  is  going to  be  going  down  and  calculating the  individual  values that  gets  substituted  in. A t  the  end,  we  have  this  Python  script that  does  the  actual  substitution and  then  also, auto  fits  and  pulls  in  this  table. If  you're  saying, "I  don't  know  Python, I  don't  know  how  applicable this  is  going  to  be  to  me," we  don't  know  Python  either, but  we  got  this  to  work  pretty  quick because  GPT-4 wrote  the  Python  code  for  us. I'm  going  to  show  you an  example  of  how  we  did  that. What  this  script  will  do is  we'll  write  the  temporary  files to  the  output  folder. For  example,  here's  our  q12  graph, and  the  temporary  files  get  written  here. The  images  get  written to  the  output  folder. Then  with  the  static  code  in  Python, it  knows  to  pull  these  files  in. The  individual  calculations will  be  run  within  JMP, and  then  saved  within  the  script, and  then  passed  along to  Python  at  the  end. Here  we  run  all  of  our  JMP  code. We  load  any  of  our  values  into  variables. An  example  of  that  is, here  we  have  a  tabulate  function. We're  going  to  make  into  data  table. From  that  data  table,  we're  going  to  take the  mean  mill  time  column and  take  the  first  entry  in  that  column, and  then  that's  going to  become  our  mill_ time_ p  variable. That  is  what  will  get  sent  down in  the  Python  code. When  we  initiate  Python, we  tell  it  whatever  variables  we  wanted to  send  over  from  JMP  to  Python. Here's  that  mill_ time_ p. That  will  hold  any of  the  mill time  calculations, and  that  is  what  gets  substituted  in to  the  mill_time_ p  underscore  area within  the  template. I'm  going  to  go  back  to  my  template  file. Here's  my  mill_time_ p  area. That's  what  gets  substituted  in. The  intermediate  output from  the  Python  code  for  doing  this is  the  temp  report. We  can  see  these  values get  substituted  in. The  graphs  get  placed  in. We  get  our  nice  grid  of  graphs. At  the  end,  we  don't  have  our  table  yet. The  reason  we  don't  have  our  table  yet is  because  we  like  the  way  that if  we  move  a  table  from  JMP using  the  get  as  a  report  to  a  journal and  then  export  that  journal  to  Word with  Save  MS  Word, we  like  the  way it  keeps  the  table  structure, but  we  still  need  the  AutoF it. What  we  do  is,  in  addition  to  the  report that  gets  written  out  from  the  template, is  we  also  write  out this  other  temporary  table  file. We  get  the  table  import. 
Here  it  is,  and  what  we  need  to  do when  we  want  to  automate is  this  AutoF it t o  Window, and  then  also  reducing the  font  size  where  it  actually  fits. What  we  need  to  do  is, after  the  Python  code that  substitutes  out  into  the  template, what  we  found  is  we  have  to  convert that  DOC  file  created  by  JMP, convert  it  to  a  DOC  X  file, and  then  we  have  Python  code that  will  open  up  that  DOC  X  file, take  the  first  table, it  will  apply  AutoF it, change  the  font  size  to  7, and  then  it  will  append  it to  the  original  report  file. How  did  we  know  to  create  this? Again,  you  can  copy  what  we've  done. We've  got  this  file  available. But  how  can  you reproduce  this  for  your  own and  create  your  own  Python  to  do  this? I recorded  a  video  of  going  through on  GPT-4,  how  I  did  this, and  I'm  going  to  show that  now  and  narrate  that. The  prompt  I  give  is  that  I've  got a  Word  document  that  I  want  to  use. I'm  so  bad  at  Python. I'm  going  to  go  ahead and  give  GPT-4  my  actual  path  here because  I  don't  know how  to  substitute  out  the  backslashes, and  the  spaces,  and  everything. I  say,  "G o  to  this  path,  open  it  up, take  the  first  table  you  see  in  here, and  then  give  me  back  that  table  AutoFit." It's  going  to  tell  us,  "You  need to  install  this  other  package." I've  just  got  a  base Ana conda  installation. You  can  ask  it  for  instructions  and  help. You  make  sure  you  install  that  package and  then  you'll  be  able  to  run  this  code. Whenever  it  gives  me  this  thing, the  first  thing  I  noticed  is  it  tells  me, "Hey,  I'm  going  to  overwrite your  report  file,  be  careful." I  say,  "No,  I  don't  want  that." It's  interactive  the  way it  gives  you  the  code. I  say,  "F ix  this  code. I  want  to  write  to  a  different  file in  the  same  directory." It's  going  to  modify  the  code to  give  me  a  different  output  file in  the  same  directory. What  I'm  going  to  do is  I  just  copy-paste  this  over  to  Spyder, which  is  basically the  JSL  editor  for  Python. It's  the  Python  editor. I  hit  Run,  I  hit  the  green  arrow, and  I  get  this  note about,  "There's  an  error." I  don't  know  what  the  error  means. I  don't  want  to  spend  time on  Stack  Exchange  or  Stack  Overflow or  anything  looking  that  up. I  paste  the  error  back  to  Python, and  it's  nice  enough  that  it  apologizes and  explains  what  it  did  wrong. I'm  not  that  worried about  what  it  did  wrong. I  just  want  something  that  works and  gives  me  my  report. I'm  going  to  copy- paste the  code  it  gives  me and  then  go  back  to  the  Spyder, run  that. I  get  one  more  error. It  says,  "Table  is  not  subscriptable." Not  really  sure  what  that  means. I  tell  GPT-4  about  that. It  apologizes  again. It  thought  that  the  tables were  zero- based  index. It  turns  out  it's  one- based. It  fixes  that. I'm  going  to  copy  this  code  over. This  time,  it  runs  without  error. I  go  to  the  folder  I  specified  to  it, and  here's  the  modified  folder. Now you  can  see AutoF it's  been  applied  to  this  table. We  just  made  a  Python  code without  really  knowing  Python. You  don't  have  to  stop  there. If  you  have  any  version  of  modification that  you  want  to  give  to  it, you  can  change  the  font  size. 
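For readers who would rather see the finished table step than the chat transcript, it looks roughly like this in python-docx; the file names are hypothetical, the append uses a low-level element copy, and the script posted with the talk is the authoritative version.

```python
import copy
from docx import Document
from docx.shared import Pt

# Open the temporary file that holds the exported table, take its first table,
# turn on autofit, shrink the font, and append the table to the main report.
table_doc = Document("output/temp_table.docx")
report = Document("output/temp_report.docx")

table = table_doc.tables[0]
table.autofit = True
for row in table.rows:
    for cell in row.cells:
        for paragraph in cell.paragraphs:
            for run in paragraph.runs:
                run.font.size = Pt(7)

# Add an anchor paragraph at the end of the report, then insert a copy of the
# table's underlying XML element right after it.
anchor = report.add_paragraph()
anchor._p.addnext(copy.deepcopy(table._tbl))
report.save("output/final_report.docx")
```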
Here I'm going to ask it to make it Wingdings. Let's see if it knows how to do that. It gives me some new code, and I run it. Yes, I can get Wingdings output. Just to make sure it didn't give me gibberish, I copy it all and make sure that it translates back into more easily readable text, and it does. That's what we're looking for. Not only Python, but also Visual Basic. Another thing we run into is with the output: you might have a 500-page report, and you want the headers mapped to Heading 1 and Heading 2 in Word so that you can get a table of contents. What we told GPT-4 is: I'm going to use these delimiters and script them into my outline box titles. I get h1 and close h1, h2 and close h2, for Heading 1 and Heading 2, and I want to substitute those in. When we gave that to GPT-4, it gave us this macro file, which I'm going to copy. I'm not sure if it was retained in my Microsoft Word; we'll find out, and if not, I'll paste it in. You have to enable the Developer tab, which is a Word option. When you do that, we get this, and we just paste in our macro here, which was generated by GPT-4. I don't like programming in VBA, but GPT-4 is good at it. When I hit Run on this, it takes all these headers and fills them in. That way I can go up here, choose References, Table of Contents, and insert my table of contents. If I generated a 500-page report, now I've got the right structure for all of it. It's easy to go in here, too: if you want to add numbers to these headers, you just right-click, go to Modify, and turn on the numbering structure if you want. The last thing I'm going to show for GPT-4, specifically for JSL, is that you might have noticed in my script back here, I'm really bad about commenting my script. I've got a couple of comments in here, but for the most part I don't have a lot of comments describing it. If you have code, and you don't want to give it anything proprietary, but if you give code to GPT-4, even JSL, it will add a bunch of nice comments for you and explain what each code block is doing, so that other readers can follow it. Also for Python: if you don't know Python and you're taking a look at the script we've provided and wondering what it does, you can give the code to GPT-4 and say, "Explain this Python code to me." It will give you a long narrative and say, "Here is exactly the workflow of what's happening, and here are the key functions doing it." My favorite part is that you can say, "Do you have any suggestions for this code?" It will say, "You don't have any error handling. You've got some memory leak issues." It goes through and makes a bunch of suggestions, and then it's nice enough to give you new code with all the suggestions implemented. With all this in place, you can go from doing all of your calculations in JMP to actually getting a nice output file in the format you want. Everything looks right, and you're not going through making manual changes.
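The macro from the talk is VBA, but the same delimiter idea can be sketched in Python with python-docx. This is a hypothetical illustration, not the generated macro: it assumes the outline box titles were scripted out with literal <h1>...</h1> and <h2>...</h2> markers around them, and it applies the built-in Word heading styles.

```python
# Hypothetical Python analogue of the VBA heading macro described above.
# Assumes headings were exported wrapped in <h1>...</h1> and <h2>...</h2> markers.
from docx import Document

doc = Document("final_report.docx")

markers = {"<h1>": ("</h1>", "Heading 1"),
           "<h2>": ("</h2>", "Heading 2")}

for paragraph in doc.paragraphs:
    text = paragraph.text
    for open_tag, (close_tag, style_name) in markers.items():
        if text.startswith(open_tag) and text.endswith(close_tag):
            # Strip the delimiters and apply the corresponding Word heading style
            paragraph.text = text[len(open_tag):-len(close_tag)]
            paragraph.style = doc.styles[style_name]

doc.save("final_report_with_headings.docx")
```

Once the heading styles are in place, Word's References > Table of Contents picks them up the same way it does after the macro runs.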
With this in place, the customer took a look at it and said, "This is beautiful." With that, we hope you can take this same idea and go make some beautiful reports yourself.
Tennis is one of my favorite sports. The 'big three' of Federer, Nadal, and Djokovic are my favorites; they elevated modern tennis to new heights through their rivalry. But it looks like Alcaraz is in line to take the throne. It is fun to compare their records and present the results at the JMP Discovery Summit in Indian Wells, which is home to the Indian Wells Open.   I started with Association of Tennis Professionals (ATP) data from 2022 for my project. Next, I combined over 25 years of data that include all match records of Federer, Nadal, Djokovic, and Alcaraz. Using JMP's powerful analytical and visualization tools, this report provides insight into questions such as: Who won the most matches in a year? What factors influence players winning matches? And most importantly, who is the GOAT?     Hi, everyone. My name is Jianfeng Ding. I'm a research statistician developer at JMP. Today, I'm going to show you how I used JMP to explore tennis data and find out who is the GOAT, the greatest of all time in men's tennis. First, I would like to give you some background on why I chose this topic. When I heard that the JMP Discovery Summit 2023 would be held in Indian Wells, I got excited, as tennis is one of my favorite sports and my youngest son plays varsity tennis at his high school. I have watched a lot of tennis over the years. Indian Wells is home to the Indian Wells Masters, which is often called the fifth Grand Slam. I thought it would be fun to use JMP to explore and analyze the tennis data and present the results to our users at the Indian Wells Discovery Summit. The second motivation comes from JMP itself. JMP has grown bigger and richer in many ways, and there are so many wonderful features created by my colleagues. I would like to keep myself up to date with these cool new features by applying them to a project. Right now, you are seeing one of them, Application Builder: instead of using PowerPoint, I'm using JMP Application Builder for today's presentation. My presentation mainly includes two parts. I will take you on a tour to explore the ATP data from the year 2022. ATP stands for the Association of Tennis Professionals, which is the world governing body for men's tennis. Then we will look at 25 years of combined ATP data to find out who is the GOAT. First, where do I get the data? I got the data from the web, from a GitHub repository created and maintained by Jeff Sackmann. He is a software developer working in the field of sports statistics. The site contains ATP match data from 1968 to the current year, and women's tennis data is available there as well. What does the data look like? Here is the data from the year 2022. It consists of about 49 variables with about 3,000 observations. Each observation represents a match played on the ATP tour. The yellow section contains the variables about the tournaments, and the blue section contains the variables about the players. Each observation is a match, so usually the variables come in pairs, one for the winner and one for the loser. Let's look at the variables about the tournaments first.
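As a quick aside before the tournament variables: if you want to pull the same files outside of JMP, a short pandas sketch like the one below works. The repository layout, the atp_matches_<year>.csv naming, and the column names are my assumptions based on how Jeff Sackmann's files are usually organized, so check the repo before relying on them.

```python
# Sketch: load one season of ATP match data from Jeff Sackmann's GitHub repository.
# File naming and column names are assumptions; verify against
# https://github.com/JeffSackmann/tennis_atp before use.
import pandas as pd

url = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2022.csv"
matches = pd.read_csv(url)

print(matches.shape)                            # roughly 3,000 matches x ~49 columns
print(matches["surface"].value_counts())        # hard vs. clay vs. grass tournaments
print(matches["tourney_level"].value_counts())  # tournament level codes
```

Now, the tournament variables.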
I built a Graph Builder view of tournament name, tournament surface, and tournament level. From the tournament names, the tournaments with more players sit at the top. The Grand Slams, the Australian Open, Roland Garros, the US Open, and Wimbledon, are the largest and most prestigious tournaments. Last year there were about 145 tournaments. We can also see the typical surfaces for the tournaments: clay, grass, and hard, and usually there are more hard-court tournaments than grass or clay. There are also five levels of tournaments, with the definitions shown here: A, D, F, G, and M. G stands for Grand Slam and M stands for Masters; the Indian Wells Masters is a Masters-level tournament. D stands for Davis Cup, and A is the ATP Tour. Next, let's look at the variables about the players. I ran Graph Builder again. The plot on the left shows me which country has the most players. On the right, it shows the players' hands: do they play right-handed or left-handed? You will see that most players are right-handed. I also wanted to find out which country has more top-ranked players, so I created this winner-rank filter that I can slide, and the countries with more top-ranked players pop up. I was interested in the top 100, and the US sits at the top; that means the US has more top-ranked players than any other country. Then what about the top 10? You can either slide or type in the number. From this, Spain popped up at the top, and as I hover over, I see Carlos Alcaraz and I also see Nadal. When I click the US, I see the player Taylor Fritz, who is ranked number nine. You can also see from the handedness side that Nadal is the one left-hander among these top 10 players. Now let's move on to check the players' age, height, and ranking. For last year, the players' ages actually range from 17 to 42. In this graph, I only listed the top 10 with their average ranking. From this I find their average height is around 6'2", which is very common for male tennis players. I also find that Rafael Nadal and Novak Djokovic are the oldest on this list. Now, let's look at the winning statistics, because I would like to see who won the most matches in 2022. I find that Tsitsipas is listed as number one. But something is missing: where are Rafael Nadal and Djokovic? I couldn't find them in this top 10 list of who won the most matches. This reminded me that maybe I should look at their winning ratio instead of just the number of matches they won. I did some summary statistics and found their winning ratios. Yes, you immediately see that Novak Djokovic, Rafael Nadal, and Carlos Alcaraz are the top three players with the highest winning ratios, although the number of matches they won is not as high as Tsitsipas's. I also noticed there are two players who have a pretty decent winning ratio, but they don't play many matches.
They  only  won  three  matches. Who  are  they and  what  type  of  tournament do  they  play? I  drilled  down  into  the  data and  I  find  out  one  player's  name  is  Kovacevic and  all  his  three  matches  coming  from  tournament  A  level and  the  player, Safwa his  all  three  matches coming  from  Davis  Cup. From  this  graph,  you  definitely  know the  tournament  level   will  affect  the  winning. Ultimately,  you  care  about  who  won the  most  championship  or  tournament  wins. This  graph  put  all  three   relative  statistics  in  one  plot. The  down  you  will  see   how  many  matches  they  win and  the  second, the  green  bar  means   what  are  their  winning  match-win  ratio? The  top  will  show  you how  many  total  championship  they  won  in  2022. I  see,  Djokovic,  Carlos Alcaraz   and  Rafael  Nadal. I  also  see  one  guy who  I'm  not  familiar  with,  and  his  name, hard  to  say,  but  let  me  call  him  FAA. FAA  doesn't  have  amazing  winning  ratio,   but  he  did  won  five  titles. Again,   I  drill  down  to  the  data  and  find  out all  FAA's  winning  title  coming  from  A-level  tournaments. You  look  at  Djokovic  or  Alcaraz  and  Nadal, they  are  championship  not  only  from  A-level  tournaments and  also  from   grand slam  and  a  Master  level. Again,   we  show  tournament  level  effect  winning. Let's  look  at  the  seed. What  does  seed  play in  the  players'  winning? I  have  to  point  out  the  players'  seeds actually  will  vary  over  the  years. But  in  general,  the  higher  seeded  players tend  to  win  more  matches and  more  tournaments. Grand slam  winner  usually are  highest  seeded  players. But  in  2022,  only  two  people  are  exception. One  is  Carlos  Alcaraz and  the  other  is  Taylor  Swift. Sorry,  it's  about  Taylor  Fritz. You  can  see  here,   Carlos, he  succeed,  start  low, but  he  won  the  Miami  Masters. This  helped  him  move  to  the  top. In  the  end,  year  of  2022, he  was  ranked  as  the  number  three  seed, and  he  was  able  to  win  the  US  Championship. Taylor  Fritz,  he  actually  won  the Championship  of  Indian  Well,  Master  2022. We  can  see  the  seeds  definitely  affect  the  winning. Now,  let's  look  at  the  comparison between  the  winner  and  the  loser. In  this  ATP  data,  there  is  a  section  list about  to  serve  statistics and  come  with  a  winner  and  a  loser. There  are  seven  variables related  to  the  serve  statistics. I'm  interested  in  this  first  one. What  it  is? The  first  one  means number  of  points  won  on  first  serve. I  click  and  build  a  plot. Instead  of  I  plot  all  those   absolutely  the  number  of  the  point I  use  the  ratio because  the  point  will  depend  on   how  long  you  played  your  matches. With  the  ratio  would  make  more  sense. The  blue  colored  represent  the  first  serve  percentage  won coming  from  the  winner and  the  pink  is  coming  from  the  losers. Actually,  majority  of  the  first  serve percentage  won  between  60%  and  90%. But  the  blue  color   shaded  more  to  the  right, indicating  winner  have  higher   first  serve  percentage  won . Next  I  would  like  to  be  interested  to  see the  variable  is  BPs  saved  and  BP  faced. BP faced  means  a  breaker  point  faced. For  if  you  serve  and  you  face  the  breaker  point, that  means  you  give  your  opponent  opportunity  to  break  you. 
You'd better not face break points. Instead of plotting them separately, my son suggested I convert them into break points converted, a variable defined as the difference between break points faced and break points saved. Then again, we can see the blue color shaded more toward the left, indicating that winners face fewer break points and save more break points, while the pink indicates that losers tend to face more break points and save fewer. With all these statistics and variables I have shown you, ultimately I would like to know: can I build a model? Can I predict who is going to win and how much they will win? I built a summary table. As I showed you, all this ATP data comes as matches, and a player can have many matches, so I just used Tabulate to do the summary statistics. I got the tournament wins for each player, their average match-winning ratio, their height, and their average seed. I wanted to find the correlation between the variables and tournament wins. Clearly you can see the match-winning ratio is highly correlated with tournament wins, and so is the winner's seed. I also defined a variable I call div rank: when you face a weak opponent versus a strong opponent, your winning odds can be different, so I take the difference in rank and introduce this variable into the model. You also notice correlations among the variables involving height. I just happened to notice that when you're taller, you tend to have a better ace rate and better first-serve points won; the taller player definitely has an advantage at serving. I brought all of this into the Fit Model platform. I first ran a least squares model and got the conclusion that the winning ratio and the winner's seed definitely affect how many tournaments you can win. I also thought, this is count data, how many tournaments you will win, so maybe I should use a Poisson distribution. I ran that and got a similar conclusion: the winning ratio and the winner's seed are very important variables. But I have to point out that although I showed you earlier that tournament level plays a very important role in winning, the format of the data made it hard to put into the model; it would need a lot of data manipulation. Plus, instead of looking at just one year of ATP data, maybe I should look at more in order to build a complete, good predictive model. I will keep this in mind for future research. With all the statistics and variables I have shown you so far, let's get back to the topic: who is the GOAT? I created a script to get the data for the past 25 years, since Federer started early, and I wanted to include all the matches these players have played. I would like to find out who won the Grand Slam titles and who won Indian Wells. This script goes to Jeff Sackmann's site, fetches the data, does the analysis, and generates the report.
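The script in the talk is JSL, but the fetch-and-summarize step can be sketched in pandas as below. The year range, the player name strings, and the column names are assumptions for illustration, based on the usual layout of the same GitHub repository.

```python
# Sketch: combine ~25 years of ATP matches and summarize match wins and win ratio
# for the four players discussed. Column names are assumed from the repo layout.
import pandas as pd

base = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{}.csv"
matches = pd.concat((pd.read_csv(base.format(y)) for y in range(1999, 2024)),
                    ignore_index=True)

players = ["Roger Federer", "Rafael Nadal", "Novak Djokovic", "Carlos Alcaraz"]
wins = matches["winner_name"].value_counts()
losses = matches["loser_name"].value_counts()

summary = pd.DataFrame({"wins": wins, "losses": losses}).loc[players].fillna(0)
summary["win_ratio"] = summary["wins"] / (summary["wins"] + summary["losses"])
print(summary.sort_values("win_ratio", ascending=False))
```

Back to the report the script produces.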
You  can  see  2023, Alcaraz  won  both   Indian  Wells  and  Wimbledon and  Novak  won  Australia  Open and  Roland  Gallos. As  the  list  moved  down, you  pretty  much  see  their  name,  Djokovic Nadal  and  Federe r,  so  on. It's  almost  for  the  last  20  years, these  three  are  dominant. As  I  keep  moving  to  the  bottom,  finally,   I see  Andre  Agassi  and  Pete  Sampras, who  are  my  favorite  player  in  '90s. Also  you  see  these  three  guys, Djokovic,  Federer,  and  Nadal,  they  sit  on  the  top. This  include  a   grand slam  title  and  Indian  Wells  title. I  truly  believe  these  three  guys,   they  move  the  modern  tennis  to  high  level. Now,  let's  look  at  again, look  at  the  match  wins,  winning  ratio, tournament,  and  Grand  Slang  title. I  would  like  to  see  the  more  detail. The  green  bar  here,  the  bar  itself represent  their  match  winning  ratio. But  I  like  Graph  Builder's  feature. It  allowed  me  to  put  their   number  of  winning  matches  on  the  top. Then  you  can  see, although  their  winning  ratio is  very  close, they  all  like  above  80. But  Roger  Federer  won  the  most  matches  over  1,263. You  move  to  the  top   and  you  will  see  those  green  bars means  how  many  tournament  championship each  of  them  have  won. Again,  Federer  won  the  most. Then  you  look  on  the  blue  top, you  will  see  that Djokovic won  the  most,  23   grand slam  titles. Next,  I  want  to  check  on  their  ranking. These  four  lines not  only  show  their  ranking  over  the  years but  also  show  their  incredible  professional  tennis  career. Federer  started  early  in  2001. It  took  him  about   three  years  to  move  to  the  top, but  he  stayed  at  the  top for  a  long  time,  18  years. You  look,  Nadal  and  Djokovic, they  move  very  quickly  to  the  top and  also  they  stay  at  the  top  for  a  long  time. The  dip  here  usually  either  means  they had  injury  or  had  a  surgery  to  recover. I  know  Nadal  is  right  now  in  the  recovery  period because  he  just  had  a  surgery and  Djokovic  continue  to  play. I  truly  believe  that  those  two  lines will  continue  to  grow  for  a  while. For  Alcaraz,  he  just  started. We  will  see  if  he  will  follow the  same  trajectory  as  the  big  three. I  would  like  to  show  you  more  detail about  the  individual   grand slam  matches. Look  at  this  plot  on  the  left. This  show  in  the  past  25  years, how  many  grand slam  matches  Federer  has  played. Total  434   grand slam  matches. He  won  373  matches  and  he  lost  61  matches. That  bring  him  to  the  winning  ratio  is  86 %. It's  amazing. The  right-hand  plot, actually  a  plot,  his  opponents  ranking. I  want  to  show  it's  difficult. Usually  when  your  opponent  has  a  high  ranking, that  means  tough  to  win  the  match. The  red  dot  here   all  represent  the  winning  matches and  the  blue  dot  here  represent  the  losing  matches, and  the  square  indicate  the  final  matches. These  are  all   grand slam  matches. You  look,  most  of  the  Federers'  opponent  is  all  high  rank  player and  only  the  few,  I  guess  he  was  lucky. He  was  able  to  play the  opponent  with  low  rank. We  also  can  look  like  how  his performance  in  each   grand slam as  I  click  Wimbledon,  you  will  see, Federer  won  a  lot  in  Wimbledon. 
Then  let  me  click  the  one  for  the  Roland-Gallos and  in  Federers'  entire  career and  he  only  won  once  in  the  Roland-Gallos. That  was  the  year  2009. The  other  day,   he  pretty  much  lost  to  Nadal. Let's  see  what  happened  in  2009. I  bring  Nadal's  record and  I  particularly  look  at  Roland  Gallos. You  pretty  much  see  all  the  red  square. That  means  he's  the  championship of  the  Roland  Gallos. He  only  lost  four  matches,  included  this  one  in  2009, in  the  semifinal,  he  lost. That  was  the  year,  actually, Federer  was  able  to  win  the  championship. I  will  skip,  Novak  and  Carlos, and  I  will  bring  you  the  overview of  all  these  four  guys'  performance in  all  the  four   grand slams. If  I  look  at  each  one  for  the  Australian, you  pretty  much  see  Novak  Djokovic  is  dominate. Then  if  you  look  at  the  Roland-Gallos, Nadal  is  dominate. For  the  US  Open,   they  all  have  won  the  US  Open. I  guess  US  Open   provide  opportunity  for  all  of  them. If  you  look  at  the  Wimbledon, I  think  Federer  and  both  Djokovic, they  both  did  pretty  well  in  Wimbledon, but  Federer  still  win more  than  Djokovic. I  wanted  to  finally  look  at their gra nd  slam  winning  ratio. From  this  plot,  it  shows  me,  yes, Djokovic won  the  most  grand  slam  title. Also  you  look  at  the  winning  ratio, overall,  Djokovic  has  highest or similar  like  the  Rafael  Nadal. Almost  in  every  category, you  can  see  Djokovic  has  higher  winning  ratio, except  for  the  Clay, the  Roland-Gallos,  Nadal,  is  the  best. I  would  say  just  based  on  winning  most  grand  slam  title and  highest  match  ratio, Djokovic is  the  goat. Next,  we  would  like  to  find  out who  is  the  youngest  among  four  of  them winning  the  grand  slam  title? That  was  Nadal. I  think  he  was  only  18.9, he  won  his  first g rand  slam  title. Alcaraz  at  age  19.3  won  his  US  Open. Although,  Djokovic  and  Federer  won  their  first  title  in  their  20s. But  you  look  at  their  long, amazing  career,  even  at  age  36, both  of  them  still  were  able to  win  the  grand  slam  title. I  think  that  Djokovic  will  continue  to  win. I  think  he  will  have  more  title under  his  belt. I  also  look  at,  they  definitely played  with  each  other. I  wanted  to  see  their  net  win  with  each  other. Rafael Nadal,  if  you  look  at Rafael  Nadal  against  Roger  Federer, so  Rafael  won  24  and  then  Roger won  against  Rafael  is  17. That  bring  their  net... Rafael  has  seven  net  wins  against  Roger. Novak Dj okovic  has  five  net  wins over  Federer  and  one  net  win  over  Nadal. Even  based  on  net  wins, I  think  Djokovic  is  a  goat. I still  would  like  to  see  their  serve statistics  because  from  that  ATP  data, this  is  the  data  more  related  to  their  techniques. I  put  all  these  variables  into  the  one  way  and  utilize  the  fit  group. With  such,  you  can  see  there's  a  lot of  the  data,  the  sample  size  is  bigger. With  all  the  data  together,  it  seems  that Djokovic  has  a  better  serve  statistic  than  the  rest  of  them. But  I  realize  this  is  big  sample  size. Sometimes  the   large  sample  size  can  transform  a  small  difference, become  a  statistically significant  difference. I  would  rather  to  see  the  subset. I  look  at  like  a  small  sample  size and  I  look  at  Wimbledon. 
Yeah, and in Wimbledon, I can still draw the conclusion that Federer is a little bit better than the rest of them. But once I look at the other Grand Slams, like the Australian Open, I cannot draw the same conclusion. Overall, I think their techniques are very, very similar; their success rates for serving are very close. With all the statistics and variables I have shown you, going by winning the most Grand Slam titles and having the highest match-winning ratio, Djokovic is the GOAT. However, statistics don't paint the entire picture, as a player can have a much larger impact than just statistics, such as the way they play the game, their love for the game, and especially who the player inspires, like the young kids who aspire to be just like their idols, including my son, whose dream is to play the Easter Bowl, a prestigious youth tournament at Indian Wells. In the end, it was an honor and a privilege to watch these three great players play tennis all at the same time, and the future looks bright for more great tennis to watch, as players such as Carlos Alcaraz and others look to follow in the Big Three's footsteps. I had so much fun doing this project using features such as Graph Builder, dashboards, and Application Builder in JMP. These features allowed me to easily explore a big data set and quickly identify atypical observations. Dashboards not only put different analyses in one report, but also allowed me to stay in the report and rerun an analysis after a modification. Application Builder allowed me to present the project without having to use PowerPoint. Although this project mainly analyzes ATP men's tennis data, the analytical tools and the workflow can easily be applied to women's tennis data, as well as to any data set with patterns in other fields. If you have any questions, please feel free to contact me. Thank you.
You have a question to answer, so you collect the relevant data and are ready to start creating a predictive model. But what which type of model do you choose and which modeling type? Is the goal to segment, predict, explain, or identify? Are the variables continuous, nominal, or ordinal?  Before we even get to choosing a type of model, we must define how the variables are used in analysis with the modeling type. What happens if we treat number of bedrooms as continuous versus a nominal or ordinal modeling type? We see when we pull up a distribution how modeling type impacts the type of summary statistics we get as means or frequencies. This poster demonstrates how the modeling type determines the results of your predictive model depending on which type of model is chosen. Get ready to play “Name that Analysis” as we go head-to-head on classifying different case study modeling examples with their respective modeling type.     Hello,  my  name  is  Andrea  Coombs,   and  I'm  joined  by  my  colleague, Olivia  Lipp incott. Olivia  and  I  have  given  a  presentation before  called  What  Model  When. If  you  want  to  take  a  closer  look at  that  presentation, you  can  take  a  look at  the  link  in  the  community  post. But  today,  we  want  to  talk about  something  a  little  bit  more. Yeah,  today  we  want to  think  about  modeling  type and  how  modeling  type  impacts  the analysis for  each  of  the  four  model  goals that  we  talked  about  previously. Right,  and  we're  actually  going to  use  the  same  data. This  is  data  that  we  pulled from  Redfin  that  represents the  housing  market  in  the  Cincinnati  area. Here  we're  trying  to  look at  the  price  of  homes  relative to  their  square  footage, the  number  of  beds, the  number  of  baths, and  so  on  and  so  forth. Previously,  we've  answered  the  question, what  model  when? It  really  depends  on  what  model you're  going  to  choose based  on  your  goal  for  the  analysis. For  segment,  we're  trying to  examine  relationships where  there's  no  intended  response; explain,  we're  trying to  explain  a  relationship and  look  at  the  underlying  factors and  how  those  affect  the  response; predict,  we're  trying to  predict  future  outcomes or  the  response  in  new  situations; and identify,  we're  trying to  find  important  variables. Right. Now  let's  bring the  modeling  type  into  the  picture. Both  your  responses  and  your  factors can  have  different  modeling  types. In  JMP,  there  are three  main  modeling  types: continuous,  nominal,  and  ordinal. Continuous  modeling  type  is  represented by  this  blue  triangle  icon  here, and  this  refers  to  numeric  data  only. The  nominal  modeling  type is  represented  by  this  red  icon, and  this  is  numeric  or  character  data where  values  belong  to  categories, but  the  order  is  not  important. For  the  ordinal  modeling  type, it  is  represented  by  this  green  icon, and  this  can  be  either  numeric or  character  data  as  well. But  in  this  case,  values  belong to  ordered  categories. When  you're  doing  an  analysis  in  JMP, you  want  to  make  sure  you  set  up the  correct  modeling  type, because  JMP  will  do the  correct  model  for  you, will  do  the  correct  analysis depending  on  modeling  type. Andrea,  I  have  a  game  for  us  to  play. It's  called  Name  That  Analysis. Do  you  want  to  play? Absolutely. I  love  games. Awesome. 
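Before the first question, a quick aside on the point the abstract makes about modeling type and summary statistics. The sketch below is a Python analogue of pulling up a distribution in JMP: the same "beds" column gives means and quantiles when treated as continuous, and frequencies when treated as a category. The data frame and values are hypothetical stand-ins for the Redfin table.

```python
# Sketch: the same "beds" column summarized as continuous vs. categorical.
# The data frame and column names are hypothetical stand-ins for the Redfin data.
import pandas as pd

homes = pd.DataFrame({"beds": [2, 3, 3, 4, 4, 4, 5],
                      "price": [250_000, 310_000, 295_000, 420_000,
                                450_000, 405_000, 610_000]})

# Continuous modeling type -> means, quantiles, etc.
print(homes["beds"].describe())

# Nominal/ordinal modeling type -> frequencies per category
print(homes["beds"].astype("category").value_counts().sort_index())
```

Now, on to the game.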
Here's  your  first  question. We  want  to  identify which  features  of  a  home are  most  important to  determining  the  price. For  example,  square  footage and  number  of  bathrooms can  explain  a  large  amount of  the  variation  in  price, but  other  features  are  less  important. All  right,  Olivia. I  think  you're  making this  first  question  easy  for  me. Is  the  answer  identify? Let's  see. Yeah,  you're  right. I  did  make  that  one a  little  bit  easy  to  get  us  going, but  that  is  identify to  find  important  variables  within  there. There's  a  couple of  different  places  in  JMP where  we  can  use  tools  to  identify if  that's  our  modeling  goal. Under  the  Analyze  menu  under  Screening, Predictive  Modeling  and  Fit  Model, using  tools  like  Predictor  Screening, Bootstrap  Forest, Generalized  Regression and  Stepwise  Selection. For  modeling  type, when  we're  looking at  the  goal  of  identify, it's  not  going  to  affect  things  much. JMP  is  going  to  do  the  correct  analysis as  long  as  your  modeling  types are  set  appropriately. We  took  a  look  at  this  and  we  took both  the  response  and  the  factors and  changed  them from  continuous  to  nominal and  looked  at  how which  factors  came  up  as  most  important. While  the  order  of  the  factors  varied, the  dominant  factors  stayed  the  same. All  right. It  looks  like  if  our  goal is  to  identify  important  factors, really,  the  exact  modeling  type we're  using isn't  impacting  things  that  much, it  looks  like,  Olivia. Right. Our  conclusions on  which  variables  are  important aren't  going  to  change  much based  on  the  modeling  type. All right. Well,  that  is  good  to  know. I  have  a  question  for  you. Are  you  ready? I'm  ready. All  right,  here  is  your  question. Let's  say  we  want  to  build  a  model to  predict  house  prices. This  model  will  be  based  on  many  important predictor  variables  we  have  in  our  data. For  example,  we  want  to  predict the  price  of  a  house that  we  want  to  put  on  the  market. Which  goal  do  you  think we're  working  with  here? Okay,  so  it's  not  like  question  one where  we're  trying  to  see which  factors  are  most  important to  predict  housing  prices. We're  just  really  trying  to  get that  final  housing  price  prediction. I'm  going  to  go  with  predict. All  right,  let's  see  if  you're  right. Yes,  you  are  right. The  goal  of  this  analysis  is  predict. There's  lots  of  different  platforms  in  JMP where  you  can  build  models  for  prediction. Within  each  of  those  platforms  in  JMP where  you  can  build  the  prediction  models, JMP  will  do  the  correct  analysis  for  you, depending  on  the  modeling  type of  your  response. Here  we  have  a  table of  different  modeling  types for  our  responses: continuous,  nominal,  and  ordinal. For  a  continuous  response, this  is  the  typical  one that  we  were  talking  about,  right? We  want  to  predict  the  price  of  a  home that  we're  going  to  put  on  the  market. Now,  when  we're  building  this  type of  model  with  a  continuous  response, well,  we  want  to  know how  powerful  that  model  is. What's  the  predictive  power  of  that  model? We  can  use   RSquared and  the  Root  Average  Squared  Error to  diagnose  that  model. Now,  for  a  nominal  and  ordinal  model, it's  a  little  bit  different. 
For  a  model  with  a  nominal  response, we  have  categories  as  the  response. In  this  example,  we're  looking at  whether  or  not  the  price will  be  over  or  below  $1  million. That's  what  we  want  to  predict. For  the  ordinal  response, here  we  have  an  ordered  category. We  want  to  predict whether  the  price  of  the  house is  going  to  be  low, medium,  and  high. For  the  nominal  and  ordinal  examples, again,  we  can  look  at   RSquared and  Root  Average  Squared  Error to  evaluate  those  models. But  there's  other  things  that  we  can  use to  evaluate  those  models, like  the  misclassification  rate and  the  area  under  the  ROC  curve. Of  course,  our  favorite  tool  in  JMP to  take  a  look  at  our  prediction  model is  the  Prediction  Profiler. Let's  take  a  look  at  the  difference between  the  Prediction  Profiler for  the  modeling  types  of  our  responses. For  the  continuous  response, we  can  see  that  on  the  Y-axis, we  have  the  mean  prediction plus  or  minus  the  confidence  interval given  the  value  of  the  model  factors here  on  each  of  the  X -axes. For  the  nominal and  ordinal  logistic  models, what  we  see  on  the  Y -axis is  the  probability  of  the  response  being in  a  certain  category. For  the  nominal  logistic  model, we  have  the  probability that  the  house  is  either  going  to  be above  or  below  a  million  dollars. For  this  ordinal  logistic  model, we  can  see  the  probability  of  having a  low,  medium,  or  high  price. Okay,  so  it  sounds  like  the  goal of  what  we  want  to  predict is  also  important  when  we're  talking about  that  prediction  goal, whether  we  want to  treat  price  as  continuous and  get  the  predictions of  the  exact  prices  out  of  there, or  if  we  want  to  treat  it  as  a  category. Right. You  just  need  to  get that  response  variable  set  up and  your  data  set  the  correct  way, and  then,  of  course, assign  the  correct  modeling  type, and  JMP  is  going  to  build the  correct  model  for  you. All  right,  Andrea. Are you  ready  for  your  next  question? I'm  ready. Let's  go. Okay. We  want  to  quantify the  effect  on  home  prices from  additional  bedrooms. For  example,  on  average, every  additional  bedroom  adds about $ 97,000  to  the  total  home  cost. Adding  a  bedroom  adds  $97,000? Man,  Cincinnati  is  a  tough  housing  market. That's  crazy. All  right,  well, so  let's  see. What  are  we  trying  to  do  here? We're  trying  to  quantify  the  effect  here. I  think  what  we're  trying  to  do  is  explain that  effect  that  bedrooms  has on  the  price  of  a  house. I'm  going  to  say  explain. You're  correct. Yeah,  we're  trying  to  describe the  relationships. In  explain,  we  use  the  parameter  estimates taken  from  the  model  equation to  quantify  those  relationships between  the  factors  and  the  responses. Typically,  we  use  in  JMP under  the  Fit  Model  menu  location tools  like  Standard Least  Squares, Logistic  and  Ordinal  Regression, and  Generalized  Regression. Modeling  type  can  really  impact how  our  factored  relationship with  the  response  variable  is  interpreted. We  took  a  look, and  we  were  looking at  how  does  the  number  of  beds  affect the  housing  price? We  changed  beds  from  continuous, to  nominal,  to  ordinal, and  see  what  that  relationship  was. 
We  can  see  under  the  continuous, that's  where  we've  got that  every  additional  bedroom  adds $97,000  about  to  the  total  home  price. That  prediction  profiler  shows a  linear  relationship when  we  treat  beds  as  continuous. But  when  we  treat  beds as  nominal  or  ordinal, there's  not  that  straight linear  relationship  going  on. We  see  a  spike  in  price  for  4-5  bedrooms compared  to  going  from  2-3  bedrooms. Right. I  see  with  nominal  and  ordinal, the  prediction  profiler  looks almost  exactly  the  same, so  it  must  be  the  same  model. However,  I'm  seeing with  the  parameter  estimates, they  look  a  little  bit  different between  nominal  and  ordinal. What's  going  on  there? Yeah,  so  the  nominal and  ordinal  modeling  type, and  when we  use  that  within a  regression, is  treating... They're  coded  differently within  the  regression, so  the  parameter  estimates  are  different. For  nominal,  that  intercept, we  think  of  that  as  the  mean  house  price across  all  the  different  bedrooms, and  each  of  those  parameter  estimates are  how  much  that  number  of  beds  increases or  decreases  that  mean  house  price. But  for  ord inal, because  we're  looking  at  order  matters, we  think  of  the  intercept as  if  there  are  zero  bedrooms and  each  of  those  parameter  estimates is  the  effect  of  adding an  additional  bedroom  onto  the  price. All  right. Modeling  type  is  really  going to  affect  my  parameter  estimates. I  really  need  to  think  about exactly  what  do  I  want  to  explain as  a  part  of  this  model when  I'm  doing  this  analysis. Yes. All  right. Are  you  ready for  the  final  question,  Olivia? Yeah,  bring  it  on. All  right,  here's  the  question. Let's  say  we  want to  identify  groups  of  homes that  are  similar  based on  a  list  of  possible  characteristics. In  other  words, we  want  to  identify  market  segments based  on  things  like  square  footage, location,  number  of  bedrooms,  et  cetera. Which  goal  do  you  think  this  is? I  think  you're  trying to  trick  me  with  that  identify, and  I'm  not  going  to  fall  for  it. Okay. But  there  are  no  responses within  this  question. I  think  we're  looking  at  clustering. I'm  going  to  say  segment. Okay. Well,  you're  right,  Olivia. I  did  try  and  trick  you  a  little  bit because  I  really  wanted  to  win. But  you're  right, that's  the  key  thing  here, is  that  there  are  no  responses  here in  this  analysis. We  are  definitely  looking  at  segment. When  our  goal  is  segment, we  can  use a  couple  of  different  clustering  tools. We  can  do  Hierarchical  Clustering, K-Means C lustering, or  Latent  Class  Analysis. It's  important  to  keep  in  mind that  with  Hierarchical  Clustering, you  can  only  include... Sorry,  you  can  include all  of  the  modeling  types: continuous,  nominal,  and  ordinal. But  for  K-Means  Clustering, you  can  only  include variables  that  are  continuous. For  Latent  Class  Analysis, you  can  only  include nominal  or  ordinal  variables. In  our  case  here, when  we're  looking at  the  number  of  bedrooms, lot  size,  year  built,  and  square  feet, we  have  a  combination of  continuous  and  nominal  variables. Hierarchical  Clustering  may  be  the  best clustering  tool  to  use  in  this  scenario. 
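As a rough Python analogue of that last answer, here is a hierarchical clustering sketch. It is deliberately simplified: it standardizes and clusters only continuous columns, because handling a mix of continuous and nominal variables the way JMP's Hierarchical Clustering does would need a mixed-type distance such as Gower, and the column values here are hypothetical.

```python
# Sketch: hierarchical clustering on continuous home features (hypothetical values).
# Mixing in nominal variables would require a mixed-type distance (e.g., Gower),
# which scipy's linkage does not provide directly.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

X = np.array([[1200, 0.15, 1965, 3],
              [2400, 0.30, 1998, 4],
              [3100, 0.50, 2010, 5],
              [1800, 0.20, 1975, 3],
              [2650, 0.40, 2005, 4]], dtype=float)  # sqft, lot acres, year built, beds

Z = linkage(zscore(X, axis=0), method="ward")       # Ward linkage on standardized data
clusters = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 segments
print(clusters)
```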
It  looks  like  with  that  parallel  plot with  Hierarchical  Clustering, maybe  we  could  call  Cluster  6 Amazing  Location. Yes. If  you  think  a  large  lot  size is an  amazing  location, yeah,  we  can  definitely  call that  segment  Amazing  Location  Homes. Well,  all  right,  Olivia, despite  me  giving  you  a  trick in  that  last  question, it  looks  like  we  ended  up with  a  tie  here  again. We'll  have  to  rematch  again  soon. Absolutely. We  talked  about  what  model  when, and that  really, what  model  you  choose  depends on  your  goal  for  the  analysis, whether  it's  segment,  explain, predict,  or  identify. Yeah,  in  terms  of  modeling  type,  again, JMP  is  going  to  do the  correct  analysis  for  you, especially  with  your  responses. If  you're  setting  them  up with  the  correct  modeling  type, JMP  is  going  to  do the  correct  analysis  for  you. If  your  goal  is  explain, you  might  need  to  think  a  little  bit about  which  modeling  type  to  use, depending  on  how  you  want  to  explain the  effect  of  something like  the  number  of  bedrooms. Thank  you,  Olivia. This  is  so  much  fun. Let's  do  it  again  next  year.
Cleaning large systems effectively is critical to many applications within industrial manufacturing. Establishment of a system's hold-up volume and the target concentration of residual contaminant after cleaning do not fully define a cleaning method, however. Assuming a cleaning method that repeats the same flush volume until the system has reached its target contaminant level, viable cleaning schemes will differ in how many flushes of the volume used to reach that target. By creating a graph showing "Flush volume" as a function of "Number of flushes needed" with editable table variables of "System hold-up volume" and "End contaminant level" in slider boxes, the user can visualize many different scenarios in many different systems before selecting the scheme that best fits their varied needs.      Hello,  my  name  is  Margaret  Wilson-Moses, and  I  work  at  Inpria  Corporation as  a  Senior  Product  Engineer. Inpria  Corporation is  up  in  Corvallis,  Oregon. Inpria  designs  and  manufactures metal  oxide  photoresists to  unlock  the  full  potential of  extreme  ultraviolet or  EUV  photolithography. At  its  most  basic  level, Inpria  is  made  up of  chemists  and  engineers. The  chemists  at  Inpria  are  the  ones developing  and  designing  the  chemistries, testing  them  and  refining  them to  behave  how  they  want, and  then  taking  those  small  formulations and  giving  the  best  designs to  the  engineers. The  engineers  scale  up  those  designs to  manufacture in  our  high- volume  manufacturing  facility where  we  have  tighter  control over  processing  and  higher  purity, and  that  finished  product goes  to  our  customers. Both  the  chemists  and  the  engineers at  Inpria  use  JMP  extensively. The  chemists  use  JMP to  design  their  experiments to  analyze  the  results of  those  experiments, often  using  hypothesis  testing. They  also  use  JMP  to  build  models, so  they're  better  able  to  predict how  certain  parameters  will  affect the  response  variables  they're  seeking. They  also  use  JMP to  present  their  findings within  their  teams  and  across  Inpria. Engineers  use  JMP to  work  up  our  process  data. We  have  sensors on  our  manufacturing  facility that  are  collecting  data every  two  seconds. We  have  a  large  amount  of  data and  we  use  automated  scripts  from  JMP to  pull  out  the  pieces  of  that  data that  we  need to  be  able  to  analyze  processing much  more  simply. We  use  control  charts to  look  at  process  parameters for  different  batches and  also  analyze  QC  metrics  of  our  product over  time  for  different  formulations. Those  QC  metrics  end  up on  our  certificates  of  analyzes. We  also  use  measurement  system  analyzes or  MSA  to  perform  gage  studies on  the  tools  that  we  use to  collect  the  quality  control  metrics. We  also  join  data  from  various  sources, as  do  the  chemists, because  different  groups  at  Inpria are  always  collecting  data, so  it's  good  to  be  able to  have  a  way  to  join  the  data and  look  for  correlations between  those  data  sets. Finally,  we  also  use  JMP to  present  our  data both  internally  and  to  our  customers. A  couple  of  years  ago, when  we  were  bringing  up our  high- volume  manufacturing  facility, I  was  tasked  with  the  goal of  developing  a  cleaning  scheme for  our  manufacturing  facility. 
Inpri a,  at  that  time,  and  in  a  way  today, cleans  our  manufacturing  systems by  dilution to  ensure  lot-to-lot  segregation. To  ensure  that  lot  A does  not  end  up  in  lot  B or  it's  limited  to  a  very  small  amount. I've  modeled  that  here. The  top  left, we  have  our  finished  product. When  we  transfer  that  to  bottles and  we're  done  wasting, we've  taken  all  we  can  take  out, we're  left  with  a  little  bit  of  heel in  our  system. That  heel  cannot  be  drained or  flushed  out. It's  just  unreachable  by  the  pumps. To  that  heel,  we  add  a  flush  that  doesn't contain  any  of  the  starting  material and  we  end  up  with  a  diluted  heel. We  waste  as  much  as  we  can and  end  up  with  the  same  volume  of  heel, but  at  a  lower  concentration  of  X. We  repeat  this  process, flushing  and  wasting,  flushing  and  wasting until  we  have  the  concentration that  we  need  at  the  end. Further  on, I'll  model  this  by  calling  X₀, the  concentration  of  X in  the  finished  product or  whatever  the  heel  starts  at, so  which  is  going  to  match the  final  product. Then  X₁  is  the  concentration  of  X after  the  first  flush. X₂  is  the  concentration after  the  second  flush, and  so  on  up  through  Xₙ. The  flush  volume  here,  called  F, is  the  same  volume  repeatedly throughout  the  process. My  objective  when  approaching  this  problem was  to  develop  a  dynamic and  user-friendly  visual  model to  show  to  my  team how  different  variables  in  this  problem a ffect  the  potential  cleaning  schemes we  could  use  with  the  overall  goal of  supporting  selection  of  a  scheme that  would  fulfill  all  our  criteria. Flushing. Cleaning  by  dilution  can  be  modeled with  this  equation  here, which  I  just  pulled  from  the  math that  comes  out  of  the  previous  slide taken  to  N  flushes. Xₙ,  the  amount  of  X  remaining after  N  flushes e quals  to  the  variables  on  the  right. So  where  H  is  the  volume  of  heel, F  is  the  volume  per  flush  in  the  cleaning, N  is  the  number  of  flushes, and  X₀  is  the  initial  amount  of  X. This  equation  does  importantly  assume that  every  flush  and  heel are  perfectly  mixed  per  flush, but  we  do  have  experiential  data that  supports that  that  is  indeed  the  case,  thankfully. We  are  also  assuming  that  the  heel  volume is  consistent  every  time, so  we're  wasting  to  the  same  exact  heel, and  that  the  flush  volume is  the  same  exact  volume every  single  flush. But  again,  we  have  experiential  data that  supports  that, so  we're  okay  with  these  assumptions. Functionally, we  are  actually  more  interested in  translating  this  equation to  be  a  function  of  N, the  number  of  flushes, and  F,  the  volume  per  flush, because  the  other  variables in  this  equation  are  pretty  much  set. Heel  or  H  is  constant per  manufacturing  system. It  may  vary  per  system,  but  it  is  known through  the  data  we've  collected in  the  past  to  calculate  what  it  is. X₀  and  Xₙ  vary  per  cleaning, but  again,  we  have  analytical  methods to  understand  both  X₀  and  Xₙ  per  cleaning. Those  are  our  known  variables. Flush  volume  and  number  of  flushes are  the  two  variables  that  we're  varying to  figure  out  a  cleaning  scheme. Those  are  the  ones  to  be  determined. We  have  a  few  boundary  conditions  also. 
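Written out from the definitions above (this is just the algebra implied by the flush-and-waste cycle, not a new result): each flush dilutes the heel of volume H into a total volume H + F, and draining back down to H multiplies the concentration by H/(H + F), so

```latex
X_{k+1} = X_k \,\frac{H}{H+F}
\qquad\Longrightarrow\qquad
X_n = X_0 \left(\frac{H}{H+F}\right)^{\!n},
\qquad
F = H\!\left[\left(\frac{X_0}{X_n}\right)^{1/n} - 1\right]
```

The rearranged form gives the flush volume needed for a chosen number of flushes, and it is the relationship the data table column uses later on. Now, the boundary conditions.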
The total amount of solvent used must be as low as possible. We didn't come into this with a hard limit, but more solvent used means higher cost, both in raw materials and in waste, so we want to limit it as much as possible. Another boundary condition is that we want to keep the number of flushes less than or equal to 20 in order to be operational; that's because we want to fit our cleaning within our production cadence. Fewer flushes is better operationally, but we have a hard limit at 20. Flush volume also has a boundary condition: it has to be less than or equal to the capacity of the system. In this case, I've set our capacity equal to 60 gallons, so the maximum flush volume you could have is clearly 60. If you take the equation I showed on the previous slide and these boundary conditions, along with the fact that we know H, X₀, and Xₙ, you can actually solve this equation. You can find F and N to satisfy the boundary conditions. But that misses the goal I had when I came into this process. To explain a little more about that, there are many reasons why going with a visual model in JMP was more advantageous for me than just working with the static equation. First off, the equation is not intuitive. I'm pretty good at math, and I can't see what this equation means in graph form just by looking at it, whereas with a visual model you can see how the different variables affect each other. The equation is also limited. You may have one H, one heel value per manufacturing system, but you have different manufacturing systems, so you may want to vary H and see what the different cleaning schemes are. The equation is fixed: you get one answer per set of input variables. With a visual model, you have a versatile relationship that you can see, and it covers infinitely many different cleaning schemes. The equation is also very individual. I crunch the math, I make the assumptions, I put the input values in, and then I tell my team how many flushes we should use and what the volume per flush should be. They can check my math, but they're not involved in that process at all; they just say, "Okay." The visual model I built, by contrast, is user-friendly and dynamic, so they can tinker around with it themselves, see how the variables affect each other, and come to their own conclusions, which also leads to more buy-in from the team when we end up choosing a cleaning scheme. Lastly, an equation is boring. I like math, and it's still more fun for me to play around with the graph than it is to use an equation. The visual is much more interactive. I'm going to move from here to sharing my model. I started with just a simple data table with three table variables. I have X₀ and Xₙ here, both in ppm. I've set X₀ to 1 million ppm, Xₙ is the target end level for the cleaning, and heel is the heel of the system. The number of flushes here is just an independent variable, an integer value between 1 and 35.
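Here is a minimal Python sketch of the two computed columns in that table over the same 1-to-35 range of flush counts, using the rearranged formula above. The default settings are the ones from the talk, and the numbers it prints (for example, about 3.9 gallons per flush and 78 total gallons at 20 flushes) match the values quoted later in the demo.

```python
# Sketch of the data table columns: flush volume and total solvent vs. number of
# flushes, from the dilution model X_n = X0 * (H / (H + F))**n rearranged for F.
H, X0, Xn = 5.0, 1_000_000.0, 10.0   # heel (gal), start and target concentrations (ppm)

for n in range(1, 36):                               # number of flushes, 1 through 35
    F = round(H * ((X0 / Xn) ** (1 / n) - 1), 1)     # flush volume, rounded to 0.1 gal
    total = round(n * F, 1)                          # total solvent used
    if n in (10, 18, 20):
        print(n, F, total)   # e.g. 20 flushes -> 3.9 gal per flush, 78.0 gal total
```

In the JMP table the same formula lives in a column formula driven by the three table variables, which is what makes the sliders work.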
I  did  go  beyond the  maximum  number  of  flushes just  to  get  a  better  view of  the  relationship. Then  the  flush  volume  column  here is  built  off  the  equation that  I  showed  previously. I  open  up  this  formula, you  can  see  that  this  is  the  same  equation I  shared  in  the  slide, and  it's  a  function of  the  table  variables. It's  a  function  of  these  table  variables here  that  are  brought  into  the  equation. I've  also  rounded  it  to  one  decimal  place because  our  control  system only  has  that  amount  of  sensitivity when  adding  flush  volume. Finally,  I  have  a  third  column  here which  is  the  total  solvent  used. This  is  simply  a  product of  the  first  two  columns. It  will  show  you how  much  solvent  you've  used per  cleaning  scheme. It  just  multiplies  the  number  of  flushes by  the  flush  volume to  give  the  total  amount  of  solvent  used. When  I  go  into  my  cleaning  model, it's  built  around  Graph  Builder. I'm  going  to  set  my  standard  here to  be  1  million  ppm  as  the  starting  value. Let's  say  our  target  is  10  ppm of  residual  X  at  the  end  of  cleaning and  our  heel  is  five  gallons. You  can  see  here  that  we  have  an  X on  the  graph  of  number  of  flushes and  then  there's  two  different  Y  values. There's  the  flush  volume  in  gallons and  the  total  solvent  used  in  gallons. I  also  have  a  reference  line  here  at  20 to  show  the  maximum  flushes that  are  operable. Each  of  these  points  reflected  here is  a  full  cleaning  scheme. If  I  select  this  point  right  here, this  is  saying  10  flushes. With  this  starting  value  of  X, this  goal  end  of  X  at  this  heel  value, and  you're  going  to  have the  flush  volume  of  10.8  gallons  per  flush in  that  cleaning  scheme. When  you  look  at  this  graph, all  of  the  red  points to  the  left  of  this  black  line are  valid  cleaning  schemes. Some  of  them  use  more  solvent, total  solvent,  and  some  of  them  use  less, but  they're  all  valid per  our  boundary  conditions. Then  I  have  at  the  right  here the  table  variables  in  slider  form, these  top  three  things, which  are  editable  either  by  a  slider or  by  the  number  edit  box  here that  are  connected  to  the  table  variable. Then  I   have  a  button  box  here which  will  take  all  this  information and  compute  it  and  give  you a  simple  answer  for  the  total  solvent  use. If  I  click  that  right  here, I  get  a  box  that  says, "Okay,  if  you  have  20  flushes, the  volume  per  flush  will  be  3.9 and  you're  going  to  use  a  total of  78  gallons  in  this  cleaning  flush." There's  a  few  things  we  can  get without  really  even  tinkering with  the  slider  variables  over  here with  this  visual  model that  we  couldn't  get  from  the  equation, like  I  said  before. First  off, you  do  see  this  negative  correlation between  number  of  flushes and  total  solvent  use. When  I  said that  you  could  solve  the  equation using  the  boundary  conditions and  our  known  table  variables, I  meant  that  if  you  are  trying  to  use the  least  amount  of  total  solvent and  you  have  a  maximum  of  20  flushes, then  your  best  scheme, assuming  you're  trying to  minimize  solvent  used, is  going  to  be  at  20  flushes. That's  due to  this  negative  correlation  here. 
The  more  flushes  you  used, the  less  total  volume  you're  using because  it's  negative  correlation. You  also  see  with  this  visual that  it's  an  exponential  decay. While  20  flushes  may  be  the  least  amount of  total  solvent  you're  using with  these  boundary  conditions, 18  flushes  only  uses  81  gallons, which  is  only  three  gallons  more than  you  use  at  20  flushes. This  is  where  it  becomes  important to  have  this  visual  here because  it  might  be  the  case that  the  time  we  save  in  operations by  reducing  flushes  from  20  to  18 is  actually  worth  the  three  gallons we  lose  in  total  solvent  used. All  right,  so  with  the  sliders, now  we  can  start  looking  at  the  sliders, with  the  sliders, you  can  see  how  these  variables affect  the  possible  cleaning  schemes. The  sliders  here  are  in  the  range  of, maybe  not  as  possible, but  what's  reasonable  for  this  model. So  you  can  see. You're  looking at  how  these  red  dots  change with  relationship  to  the  Y-axis. You're  looking at  both  the  steepness  of  this  curve because  that'll  show  you how  important  it  is  to  maximize your  flushes  within  these  bounds. You're  also  looking  at  the  overall  level. You  want  to  reduce  total  solvent. The  lower  down  your  red  dots  are, the  better  off  you're  going  to  be. By  manipulating  these, I  can  show  how  much  it  matters  to  start at  a  lower  or  higher  concentration. You  can  see  that,  yes, the  red  dots  are  moving  a  little  bit, but  it  takes  quite  a  while before  they  have  an  effect on  the  height  of  those  red  dots in  the  lower  graph. Xₙ ,  you  can  see  how  much  it  matters to  have  a  goal of  residual  solvent  in  there. If  your  goal  is a  higher  amount  of  solvent, if  you  have  more  buffer to  reach  your  goal, then  you're  going  to  need  less  solvent. Whereas  if  you  have  higher  criteria for  residual  solvent,  so  a  lower  goal, you're  going  to  use quite  a  bit  of  solvent. You  can  see  here  you  have  fewer and  fewer  cleaning  schemes  that  are  viable when  you  have that  lower  goal  of  residual  solvent. Let  me  just  back  to  10. Then  with  heel, you  can  tink  around  this and  see  how  much  the  heel  of  a  system affects  how  much  solvent and  what  cleaning  schemes  are  possible. A  greater  heel  means  you  have  to  use more  and  more  solvent to  achieve  your  goal  criteria. A  smaller  heel  mean  you  need  to  use  less  and  less  solvent. Beyond  just  selecting  a  cleaning  scheme, this  visual  model  allowed  our  team to  look  for  what  we  should  focus  on if  we're  trying  to  reduce the  overall  amount  of  solvent when  we're  cleaning. We  did  that  by  looking  at if  we  had  some  options to  change  our  variables of   X₀ ,  Xₙ,  and  heel, which  one  would  give  us the  most  bang  for  our  buck? Which  one  could  we  change with  reasonable   effort  on  our  part to  get  the  biggest  reduction in  total  solvent  used  per  cleaning? Here,  I'll  say  we  had  three  options. I'll  say  that  we  had  an  option to  reduce  the  starting  value  by  50%. Say  we  could  arrange  our  batch  schedule. Cleanings  would  only  start at  50%  of  what  they  originally  started  at. We  could  change  our  goal from  10  ppm  to  50  ppm, so  multiply  it  by  five. 
That's because we had different sensitivity techniques or different restrictions for material following a certain batch. Then we could also reduce our heel by one gallon through engineering improvements to the actual manufacturing systems. Really quickly, using this graph, we'll see what happens to the overall volume of cleaning solvent needed when you change from this standard set of input variables and keep the number of flushes at 20. Again, our standard level is 78 gallons. If you change from 100% of X in your system at the beginning to only 50%, so 500,000 ppm, and I set the variable values there, we're now only using 72 gallons to clean out. But that's only a reduction of six gallons for cleaning. That really didn't move the needle very much. Next, we'll look at changing our Xₙ to 50 and then setting our variable values again. That says that we will need 64 gallons of solvent to clean at the end. That's a better improvement for the total solvent needed. Finally, we'll look at reducing the heel in the system from five gallons to four gallons and see that the total solvent needed is 62 gallons here. That's the greatest reduction in solvent needed to clean. With this information, and it was similar information that we used to actually pursue this option, we did look at ways to reduce the heel in our system to reduce the overall solvent needed, because we saw that variable is going to affect the total solvent used the most, given the amount that we can change any of these variables. That is reflected here. These were the three options I just went over. You can see that with option A, the reduction in the starting level, you are only gaining yourself six gallons. Option B, the increase of the end level, gains you 14 gallons, and option C was 16 gallons. But really, looking at these values, again, it's important that a team is involved in this, because that's only two gallons different between option B and option C. If option B were very easy compared to option C, we probably would have gone with option B. But a heel reduction of one gallon would be most effective of these three options at reducing total volume for cleaning. Some concluding remarks I have. Basically, I just want to circle back to what I brought up originally. It could be that mathematically-minded people would look at the equation on the left and automatically know that changing the heel is going to have the greatest effect on the total solvent used. That's not instinctive to me, to know that just looking at this equation. I know that for the people I work with, that's not instinctive to them either. In the case where I'm in a team trying to come to a conclusion with consensus of all of us, it's very helpful to have a visual model in JMP where people can manipulate the variables themselves and see how a variable like total solvent used is changed by changing variables that we know the meaning of. I can see many applications of using a visual model with slider variables, even in my company.
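A minimal sketch of that kind of slider-plus-button control panel follows. The names (the window, the result text box, and the variable ranges) are hypothetical; the actual presentation ties the sliders to table variables that feed the Graph Builder, which is not reproduced here, and the same serial-dilution equation as above is assumed.

```jsl
// Sketch of a slider-driven control panel: sliders for X0, Xn, and heel,
// plus a button that recomputes total solvent at the maximum of 20 flushes.
x0 = 1000000; xn = 10; heel = 5;  // starting values (assumptions)

nw = New Window( "Cleaning Controls",
	Panel Box( "Inputs",
		Text Box( "Starting X0 (ppm)" ),
		Slider Box( 100000, 1000000, x0, x0 = Round( x0 ) ),
		Text Box( "Goal Xn (ppm)" ),
		Slider Box( 1, 100, xn, xn = Round( xn, 1 ) ),
		Text Box( "Heel (gal)" ),
		Slider Box( 1, 10, heel, heel = Round( heel, 1 ) )
	),
	Button Box( "Set Variable Values",
		// recompute gallons per flush and total solvent at 20 flushes
		v = Round( heel * ((x0 / xn) ^ (1 / 20) - 1), 1 );
		resultBox << Set Text(
			"20 flushes at " || Char( v ) || " gal/flush = " ||
			Char( 20 * v ) || " gal total"
		);
	),
	resultBox = Text Box( "" )
);
```

The design point is simply that the people in the room drag controls they understand (starting level, goal, heel) and watch a quantity they care about (total solvent) respond, rather than reading the behavior off the algebra.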
Any  multivariable  process where  the  relationship between  those  variables  is  known has  the  potential  to  make  use  of  a  model with  the  slider  variables to  give  a  visual  model  of  how  the  process reacts  to  changes  in  those  variables. I'm  thinking  of  there's  predictable  ways that  fluid  temperature  affects how  hard  a  pump  has  to  push  fluid to  reach  a  certain  flow  rate. That  could  be  helpful in  selecting  filters  for  a  given  process. I  believe  it  could  be  useful for  many  other  applications  as  well. Ease  of  generation. I  am  not  a  coder  and  I  was  able to  build  this  model  relatively  easily with  help  from  Jordan  Hiller, oh,  shout  out  Jordan  Hiller,  thank  you, also  from  the  scripting  index, which  had  many  great  examples, and  the  JMP  user  community. Because  I  was  able  to  look through  historical  questions and  piece  together  how  I  wanted to  go  about  a  certain  problem. It  was  relatively  simple to  make  this  model. Having  made  this  model, it  would  be  very  easy  to  manipulate in  the  future  for  other  purposes. Room  for  improvement. Again,  there  may  be  a  place  that  I  missed where  it  was  possible  to  do  this, but  in  my  exploration, I  couldn't  find  a  way to  make  a  reference  line  that  connected, basically  intersected  the  graph at  this  point and  moved  along with  the  graph  to  different  points. A  reference  line  that  is  a  function. I  wasn't  able  to  do  that. I  know  I  saw  different  examples  of  it, but  I  wasn't  able  to  do  it with  using  table  variables as  part  of  that  function. But  again,  that's  probably due  to  my  not  being  a  coder, but  it  would  be  helpful if  that  were  a  simpler, more  accessible  feature in  modeling  in  JMP. But  really, I  do  think  that  the  model  I  created, it  was  very  helpful  for  our  purposes, and  it  will  be  helpful for  other  applications  in  the  future. I  hope  it's  helpful  to  other  people. Thank  you.
There are many times when an exact measurement is not possible, but a range of values for the measurement is available. Censored data analysis methods enable you to incorporate the information from both types of measurements. This presentation provides an introduction to censored data situations: when they appear, how to handle them, and what happens when you do not handle them appropriately. This presentation includes examples of censoring in univariate and regression settings by using the Life Distribution and Generalized Regression platforms in JMP and JMP Pro 17, including the new Detection Limits column property. Hi, my name is Michael Crotty. I'm a statistical writer with the Stat Documentation Team at JMP, and today, I'm going to talk about an introduction to censored data analysis in JMP and JMP Pro. To start, we've got three common types of censoring. Just to back up a bit, censored data occur when you don't have an exact measurement for an observation, but you do know a range for the observation, so you know not the exact value, but you do know something about where the value might be. What we want to do by using censoring in our analyses is to use that information that we have, even if it's not exact. The three types of censoring that we'll talk about today are right censoring, left censoring, and interval censoring. Right censoring is probably the most common form of censoring. It occurs when the event of interest just doesn't have time to occur yet by the end of the study. In a reliability test, you might have a bunch of light bulbs under test, and at the end of the test period, some of them have failed. Those are exact observations, but then some haven't failed yet. You know they're going to fail, but your study has ended, so it's censored at that point. Same thing in survival models where a patient survives to the end of the study. One thing to note is that right censoring is the only type that, in JMP, supports a single response column alongside a binary censor column. The next type is left censoring. That's where the event of interest occurs before the observation starts. A common example of that would be where you put a bunch of units under test and at the time that you do the first inspection, some of them have already failed. You know that they started without a failure, but by the time you measured them, you checked on them, they had failed. So they failed sometime before that point. Another example of that is limited detection, where you have a measurement tool that can't measure below a certain threshold. The last type we'll talk about today is interval censoring. This is where your event of interest happens between observation times. If you have a periodic inspection schedule instead of continuous observation, you might see that something fails or something happens between time two and three. It didn't happen at time two and it didn't happen at time three, but it was somewhere in that interval. Take a quick look at what this looks like in JMP. Here's an example of the right censoring with a response column and a censor column.
In the platforms that support censoring, you always see this censor role; that's for that binary censoring column. This is the way that you can specify censoring more generally, which is with two response columns. Basically, it's like a start time and an end time. For left censoring, we don't know when it happened, so the start time is missing, but the end time, we know it happened before time 50, so somewhere before that. Reverse that for right censoring: we know that at time 25, it hadn't happened yet, but it happened sometime after that. Then with interval, both the start and endpoints are non-missing, but we don't know when the event happened, in this case between 80 and 150. It's not shown in the table up here, but down here, we've also got exact censoring. To specify that, you just use the same value in both columns. That means essentially it's like an interval with zero width. It happened at that exact time. Next, we're going to talk about two examples of censoring. The first is if you have censoring in your data, but maybe you don't know how to handle it, and so you just think, "I'll just ignore it." We're going to look at what can possibly happen when you do that. In this example, we've got simulated data from a lognormal distribution, and the observed data that we'll use for analysis in our different cases is where all the values from the true data that are over 1,900, we set them to 1,900, as that's the censoring time, since it's right censoring. There are a few possible things you could do if you're trying to estimate this mean failure time. You could do nothing. You could just use this observed data with a whole bunch of values set to 1,900, act like that's when it happened. You could treat those as missing values, just drop them from your data, or you could use the censoring information that you have in your analysis. For right censoring, these first two approaches are going to tend to underestimate the mean failure time because you're dropping information from the data at that far end. Looking more closely at this, because this is simulated data, we have the true distribution here in this first column. That's just for comparison, but in general, you wouldn't have that, because you'd just have all the values above 1,900. You don't know where these fall. In our observed Y, this is where we just use all the 1,900s as values of 1,900. We have no missing values, but a big point mass at the top of our distribution here. You can see that the mean is a lot smaller than the true mean. In this missing Y column, this is where instead of treating them as 1,900, we drop them. We set them to missing and analyze the distribution without them. Here you can see that now our maximum of the non-missing values is less than 1,900, which really doesn't make any sense because we know that a bunch of them, 21 observations, in fact, are some value greater than 1,900. So this underestimates the mean even more. Then on the right here, we've got an analysis in Life Distribution in JMP. This is where we're using the observed Y column.
It's got those 1,900s, but we're also using a censoring column alongside it. For the rows where observed Y is 1,900, our censor column is going to say that it's a censored observation. Here we can see that our mean actually ends up being a little higher than the true mean, but our lognormal parameter estimates are much closer to the true values and we're incorporating all the information that we have. For our next example, we're going to look at detection limits. This is a limit of detection problem where we have data on the yield of a pesticide called Metacrate that's based on levels of some other regression variables. In this situation, the measurement system that we have has a lower limit of detection where it can't measure any yields that are less than 1%. So in the data, they're just coded as zeros, but it really just means it's some yield below 1%. There are two ways you could analyze this incorporating that information in JMP. The first, you could treat it as left censoring, using two response columns, where the left column has a missing value and the right column would be a one, or you can use the Detection Limits column property that's new in JMP and JMP Pro 17. We'll take a look at this. Here's a subset of the data. This Metacrate Reading column is the same as the original reading column, but it's got a Detection Limits column property. Because this is a lower detection limit, where we can't measure any lower than that limit, we're going to set the lower detection limit to one. The other way you could do this is with the two columns. In this case, we know that it's left censoring, so the left side is missing and the upper side of that is one, which just means that the value is somewhere less than one. That's all we know. But as you can see from the column information window down here, the Detection Limits column property is recognized by the Distribution and Generalized Regression platforms. So this is a regression problem. We'll use Generalized Regression in JMP Pro. Here we fit a lognormal response distribution, and it's able to do that on this Metacrate Reading column, even with the zeros in there, because GenReg's not treating those observations as zeros, it's treating them as values censored at one. Now, we were able to use all the information and get a regression model. In conclusion, probably the most important thing is when you have censoring information, it's better to use it in your analysis than to ignore it. Censoring can occur a lot of times for time responses, but it can also occur for other responses. A good example of that is these limited detection problems. Finally, you can use the following approaches to specify censoring in JMP. There's the two-column approach that's probably the most flexible because that allows you to do right censoring, left censoring, interval censoring, as well as a mix of all three of those. For right censoring, you can use the one column response paired with a binary indicator column for censoring.
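Below is a small sketch, with made-up values, of the two specification styles just summarized: a single response plus a binary censor column for right censoring, and the more general two-column (start, end) coding for left, right, interval, and exact observations. Column names, data values, and the Censor Code setting are illustrative assumptions.

```jsl
// Right censoring: single response plus a 0/1 censor column (1 = censored).
dt1 = New Table( "Right Censored",
	New Column( "Time",   Numeric, Set Values( [120, 340, 1900, 1900, 870] ) ),
	New Column( "Censor", Numeric, Set Values( [0, 0, 1, 1, 0] ) )
);
dt1 << Life Distribution( Y( :Time ), Censor( :Censor ), Censor Code( 1 ) );

// General two-column coding: missing start = left censored, missing end =
// right censored, different values = interval censored, equal values = exact.
dt2 = New Table( "Two Column Censoring",
	New Column( "Start", Numeric, Set Values( [.,  25,  80, 100] ) ),
	New Column( "End",   Numeric, Set Values( [50,  ., 150, 100] ) )
);
dt2 << Life Distribution( Y( :Start, :End ) );
```

The two-column table mirrors the examples in the text: an event known only to be before time 50, one known only to be after time 25, one somewhere between 80 and 150, and an exact observation coded with the same value in both columns.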
There's also this new column property in JMP for detection limits where you can set a limit of detection either on the low side or the high side. We've got a few references here if you're interested in more information. One of those is a Discovery talk I did in 2017 that's got more of the background of how the censoring information is used in the calculations of these analyses. That's it. Thank you.
This presentation is an extension of the talk, "Measurement Systems Analysis for Curve Data Using Functional Random Effects Models," presented at JMP Discovery Europe 2023. Here, a functional random effects model was used to perform a Functional Gauge R&R analysis on data that contained a set of curves as the response. In this application, the functional model was expanded using the eigenfunctions and then was expressed as a random effects model, where variance components were estimated using standard methods. This analysis was done using the Functional Data Explorer and Fit Mixed platforms.   In the updated version of this presentation, I show that it is possible to include fixed effects in this type of analysis using the same model expansion approach. The functional model is still expanded using the eigenfunctions but is expressed as a generalized mixed model instead.       Hi,  my  name  is  Colleen  McKendry, and  I  am a  senior  statistical  writer  at  JMP, but  I  also  like  to  play around  with  functional  data. This  project  is  on  measurement systems  analysis  for  curve  data. First,  I'm  just  going  to  give  a  very  brief background  on  MSA  studies  in  general. MSA  studies  determine  how  well  a  process can  be  measured  prior to  studying  the  process  itself. It  answers  the  question, how  much  measurement  variation  is contributing  to  the  overall process  variation. Specifically,  the  Gage  R&R  method,  which is  what  I'll  be  using  in  my  analysis, determines  how  much  variation  is  due to  operation  variation versus  measurement  variation. You  can  use  a  Gage  R&R  crossed  MSA  model when  you  have  both  a  part  and  an  operator. The  model  you  can  see  here  for  your measurement  Y  sub  I J K, that's  going  to  be  the  Kth  measurement made  by  the  Jth  operator  on  the  Ith  part. In  this  model,  you  have  a  mean  term, a  random  effect  that  corresponds to  the  part, a  random  effect  that  corresponds to  the  operator, and  a  random  effect  that  corresponds to  the  interaction  or  cross  term. You  also  have  an  error  term. This  is  simply  a  random  effects  model, and  all  of  these  random  effects  are normally  distributed  random  variables with  mean  zero  and  some corresponding  variance  component. When  you  fit  this  model, you  can  use  that  to  estimate the  variance  components and  then  use  those  variance component  estimates to  calculate  the  percentage gage  R&R  using  the  formula  shown  there. In  a  standard  MSA  study, all  of  your  measurements  are going  to  be  single  points. But  what  happens  if  that's  not  the  case? What  if  instead  you're  measuring something  like  a  curve? That  question  was  the  motivation behind  this  project. There  was  a  client  of  JMP  that  was a  supplier  of  automotive  parts, and  they  had  a  customer  that  specified that  a  part  needed  to  have a  specific  force  by  distance  curve. Obviously,  the  client  wanted  to  design their  product  to  match the  customer  specified  curve. In  order  to  do  that, they  wanted  to  run a  functional  response  DOE  analysis and  JMP  to  design their  product  in  order  to  do  so. However,  before  spending  money on  that  experiment, they  wanted  to  perform  an  MSA  on  their ability  to  measure  the  parts  force. 
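For reference, the crossed Gage R&R model described above can be written out as follows. The slide's exact %Gauge R&R formula is not reproduced in this text, so the conventional definition is shown and should be read as the standard form rather than a quotation of the slide:

```latex
Y_{ijk} = \mu + P_i + O_j + (PO)_{ij} + \varepsilon_{ijk},
\qquad
P_i \sim N(0,\sigma^2_{P}),\;
O_j \sim N(0,\sigma^2_{O}),\;
(PO)_{ij} \sim N(0,\sigma^2_{PO}),\;
\varepsilon_{ijk} \sim N(0,\sigma^2_{E}),
```

```latex
\%\,\text{Gauge R\&R}
= 100 \times
\sqrt{\dfrac{\sigma^2_{O} + \sigma^2_{PO} + \sigma^2_{E}}
            {\sigma^2_{P} + \sigma^2_{O} + \sigma^2_{PO} + \sigma^2_{E}}},
```

where the numerator collects the measurement-system (repeatability and reproducibility) variance components and the denominator is the total variance.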
There are a lot more details about the actual data and this problem specifically in an earlier 2020 white paper titled Measurement Systems Analysis for Curve Data. If you want any more details, look that up. It should be on the community. In this graph, you can see what the data looks like. On the Y-axis, we have force, and on the X-axis, we have distance. It looks like there are only 10 curves in this graph, but there are actually 250 total curves. There's just some clustering going on. There are 10 different parts, five different operators, and five replications per part-operator combination. A little bit about this data: obviously, these measurements are curves and not points. The data was collected evenly spaced in time, but not evenly spaced in distance. There were some earlier projects that tried a few different ways to perform some type of MSA study on this data. They used some functional components, but stayed pretty true to a standard MSA. When I looked at this data, I wanted to take a true functional approach because I have a background in functional data. Functional data analysis is useful for data that are in the form of functions or curves. There are many techniques to handle unequally spaced data, a lot of which are available in the Functional Data Explorer platform in JMP. My goal was to combine functional data methods with traditional MSA methods to perform some type of functional measurement systems analysis. My solution was to create a functional random effects model by expanding the functional model using eigenfunction expansion, rewriting that as a random effects model, or a mixed model if you had any fixed effects also, and then estimating the variance components associated with the part and operator terms. To go a little bit into the model notation: for your functional model, you have Y sub IJK, but this time at a particular distance, D, to account for the functional nature of the data. You're going to have a functional mean term, a functional random effect that corresponds to the part, a functional random effect that corresponds to the operator, and a functional random effect that corresponds to the cross term, and also your error term. Here, when you do the model expansion, it's a little mathy, but essentially, instead of having one variance component associated with the part and one variance component associated with the operator, you now have multiple variance components associated with each of those things. That's going to account for the functional nature. When you're fitting the model and estimating the variance components, like I said, now you're going to have this set of variance components that you can sum together to estimate the functional variance component for part, and the same thing for operator and the cross term. Once you have all those individual variance components, you can use those to estimate the % gage R&R just like in a standard MSA. How do I do this in JMP? It's a multi-step process that's outlined here, and there are some more details in other slides.
But essentially, I estimate the mean curve in FDE and obtain the residual curves. I then model the residual curves in FDE to obtain the eigenfunctions needed for the eigenfunction expansion of the functional model and save those eigenfunctions to the original data table. I'm going to use those saved eigenfunctions in Fit Mixed to create a random effects model, or a mixed model if you also have fixed effects in your data. I'm going to use nesting of the eigenfunction formula columns and also the part and operator variables to define the appropriate model specifications. This is what your Fit Model window would look like. Once I did all that for this data, I was able to estimate the variance components and calculate the % gage R&R, which in this case was 3.3030. This indicated an acceptable measurement system according to some ranges that were defined in a paper by Barrentine. That was it for the data analysis for my part. This result was actually very similar to a worst-case scenario that was obtained in a presentation in 2019. It would be interesting to know if that was a coincidence or if the results would be similar for different data as well. Some thoughts that this project provoked: should we add a functional random effect for ID to capture the within-function correlation across distance? This type of functional random effect is actually really important in functional data and is a big benefit of accounting for the functional nature of the data. Unfortunately, in this data in particular, anytime I created a model with this term, the corresponding variance components were zero, so it didn't really capture anything extra, but it would be interesting to see if it could be useful in different types of data. I also think it would be interesting if we could calculate a confidence interval for the % gage R&R. There were also some minor issues, or rather questions, that came up about the residuals in the random effects model. I observed a cyclical nature in those. That's not always great. I don't think it was a huge deal, but I would like to have a good reason for why that was the case. That's it. Thanks for listening. If you want more details on this project, it's very similar to a full 30-minute talk that I presented at Discovery Europe, and so that video is on the community as well. Thank you.
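One common way to write the functional model and the eigenfunction expansion sketched in this talk is shown below; the notation, the score symbols, and the number of eigenfunctions per term are illustrative rather than taken from the slides:

```latex
Y_{ijk}(d) = \mu(d) + P_i(d) + O_j(d) + (PO)_{ij}(d) + \varepsilon_{ijk}(d),
```

with each functional random effect expanded on its estimated eigenfunctions, for example

```latex
P_i(d) \approx \sum_{m=1}^{M_P} p_{im}\,\phi^{P}_{m}(d),
\qquad p_{im} \sim N\!\left(0, \sigma^2_{P,m}\right),
```

so the functional variance component for part is the sum \(\sigma^2_{P} = \sum_{m} \sigma^2_{P,m}\), and similarly for operator and the cross term. Those summed components then go into the same %Gauge R&R ratio as in the standard scalar MSA.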
Now that JMP 17 has introduced the Workflow Builder, it's time to discover more details of the UI. Your workflow may encounter an error and the workflow stops. How do you do an in-depth analysis of the step settings to figure out what went wrong? This talk offers you a closer look at how to diagnose problems, work through JSL errors, and add custom JSL to optimize your workflow. Are you aware of the red triangle menu options that will allow your workflows to be more effective? We share helpful tips that will enable you to take full advantage of workflow features. Becoming proficient using Workflow Builder will help you create smarter and more efficient JMP workflows. Hello. My name is Mandy Chambers, and I'm a Principal Test Engineer in the JMP development group. I want to talk to you today about the wonderful ways of workflows. It's a deep dive into the Workflow Builder. For those of you that maybe haven't had a chance to see Workflow Builder, in order to activate it, you go to the File menu in JMP and you see File, New, New Workflow, and the Workflow will pop up. As you see on your left, you open data or import data to begin your recording, and then it begins recording your steps as you're working in JMP. There are a lot of buttons, a lot of images, mini check marks, and more. Workflow Builder has a lot of built-in functionality. I created a QR code up here that you can scan. I did a talk previously in the spring that's a little bit more introductory about Workflow Builder. If you scan that QR code, you can go back and pull up that talk and get all the beginning things you need for Workflow Builder to get started. But today, we're going to talk a little bit more about the UI and take a deeper dive into Workflow Builder. Now this PowerPoint presentation is put together so that you can review it. I have built a journal that's part of the demo that has a lot of workflows included with it, where you can get those, edit them, and try some things out. But for today, I'm just doing a short presentation, so I'm going to flip through this a little bit quickly. The red triangle menu has a lot of functionality, and you can see that in the image on the left. The right-click menu for the workflow steps inside of the Workflow Builder also has commands in there. This is not a total list of all the red triangle menu options, but some of them, some of my favorites. You can read those on your own. Then I also have the workflow steps right-click menu options in here. Then when you open up the right-hand side of the Workflow Builder, you're taken to the Step settings. There's an Add Action button and there are several commands listed on there. I am going to speak briefly to a couple of those today so you can use those to enhance your workflows. It's one thing to run a Workflow Builder and have it succeed. You see the green check marks that run down the page, but it's quite another thing to run it and have a red X jump out at you. We've had questions about how to diagnose those errors. I included a couple of workflows for you to try that out and I gave you instructions.
But this one here is just showing the red X on the open. If you hover over that little red X, it will sometimes tell you. You can see this says it can't open the table. You can see I typed in the wrong directory, so it's looking for something that it can't find. If you correct that, nine times out of 10, it will run. Another example here is one that failed. Someone had done a dashboard and the table was substituted as table 1 instead of being the name of the table. JMP was a little lost. If you fix that, you can also resolve that error. I'm going to show this one today. This is a mental health workflow with some COVID-19 data. It's a little less about the data and a little more about pulling it in, cleaning it up, and doing a few things here. But in this particular example, I touch on a number of things. I create a workflow package. I hide the tables. It's a big table, so I use a subset data table. I create some of my own JSL. Then I do put it in a dashboard at the end because people have asked about that, and that does work. I'll show you that one today. I have another workflow in here that is what I call a stop-and-wait for an action to happen. This was done with a SQL query, and I'm using a breakpoint in this example. There are other ways to cause your workflow to stop and wait, using Show Message and things. But today I'm just showing you this one about how to use a breakpoint. Then last, but not least, people are always wanting to publish their reports. Does it work with JMP Live? Well, you can publish reports if you have JMP Live. It does not publish the workflow. It just publishes the report. But I do have an example of that. Let's exit out of this and let's go to the demo quickly. This is the journal. I'm going to go through the steps real quick at the beginning. I tried to be clear, but it tells you here that if you copy the zip file for the presentation into a single folder, that will give you the best results: you have the journal and you have all the workflows that go with the journal. Then I've got a note here, and I also embedded a note in every single example, that says close the workflows without saving them so you can preserve the examples. You're going to open things up and make some changes yourself. If you want to save those changes, I just suggest going up to the File menu and saying File, Save As, and giving it a different name. Now, if you happen to write over it, you can obviously just grab them again, but I just thought I would give you that hint. Real quickly, I'm not going to spend a lot of time again on the introduction, but as I said, you can go back, and here is a link to that talk as well as the QR code if you want to scan and go back to the more introductory level of workflows. But just to show you where it is, if you go up to the File menu and go here, say New Workflow, it opens up just like that. I've also added in here, again, the how-tos, some definitions, and shortcuts for things; you can peruse that on your own. This is the first example I want to show you.
Again, I'm not going to do every single one of these steps. I'm going to pop open this workflow. But this one covers re-recording a step. Then it also talks about, at the end, a little shortcut for creating a column. Let's run this real quick. I use the diabetes table and I just ran a Graph Builder. That's what I want to show you how to do, and then I did create an extra column here at the end that was just a new formula column. I want to close this up and I want to demonstrate to you one of the features of the right-click menu. The arrow here executes the entire workflow, but this arrow will step through it. If I click one time, I only go one step, and now I've stopped on this Graph Builder. What I want to do is I want to re-record that step. If I right-click while I'm sitting on that and I go down here and say, re-record steps, it changes the look of this. You can see right here it's grayed out and it says done, cancel. It's waiting for me to do something. I'm going to go up to Graph Builder, and I'm going to go and grab a couple of columns. It really wouldn't matter. I'm twisting this graph around a different way. You go up here and you click your done step and then it's waiting still. You can see here it says launch platform and then it's still sitting on this report that it didn't run yet, which was the one I had in there. I need to hit this Done button to stop the recording and then it's going to replace this on top of the launch. I'm going to hit Done and you can see that's now my new Graph Builder and it's been put into the workflow and now it's waiting on the next step to execute, which is my new column. That's just an easy, quick way to execute that step. I did want to show you that inside of here I made some notes on things. That's actually showing the image of the table. But in here there's some notes on the column, like how to change the columns to create that formula column. If you look inside of the Workflow in the step settings, you'll be able to find those things and follow along, I think, and I'm not going to save that. Hopefully, that will be helpful to you. The Workflow Builder 2 example is this dashboard that I created with the mental health data. Again, I stepped through some of these images here, telling you what to do. What I want to do now is just show you some of these things. This is a big data table, and sometimes when you're dealing with bigger data, it can make your workflows go a little slower. When you're building the workflow, there is a little option up here called Allow Backsteps. By default, it's always on because you're building the workflow. As you step through the workflow, you may want to go back, you may want to redo something. It's a good idea to keep that on. But when you get your workflow completely built, sometimes when you uncheck that, and I have a note right here, it just enhances the performance a little bit. For this workflow to run a little quicker, I have unchecked it. That's just a little tip for you.
The other thing that I'm doing here is I didn't care if I really ran this with all the data or some of it. I used a couple of the buttons here. I went to Add Action and I said, I want to create a subset. Then I'm also hiding my tables because I didn't want the tables all over the desktop. You can add that action and it comes in here. What you have is you have a chance to subset the data. Now we have some things built in. You can use all the data, 50%, 25, whatever. For this example, I'm using 25%. Then we even ask you, "What do you want to do with that source table?" I said I want to close it. Then I'm only hiding the subset table. That's a nice little feature as well. Then as I was creating this, as I said before, we've had questions about dashboards. I created three graphs and then I threw them into a dashboard by using Combine Windows. I'm also hiding the table here because when you run the dashboard, it opens up another copy of the table. On these workflows, though, I didn't want the Graph Builders to be on the desktop. The graphs were created, and then I just went in and added some JSL to name each one of them. This one's named Graph Builder 1, this is Graph Builder 2, and this is Graph Builder 3. I went up to the red triangle at the end and said add a custom action. That pops up here. Then what I did is I went in and typed my own JSL to say close those windows. Graph Builder 1, 2, 3, close those windows. I don't want them left when the workflow is run. Let me run this real quick and just show you that it's going to run. It's going to create my dashboard. There's my dashboard and you can see there's nothing else on the desktop. The table is hidden. You can see that down here in my JMP home window. But now you have one clean report. You don't have a lot of stuff around that you have to close and move and all of that. That's nice. Then one more part of this that I added in here is that this workflow will work completely fine for you locally. The way it is, you can save it by just going to File, Save, saving it as a .jmpflow file. But if you want to give it to somebody, the best thing to do is to create a workflow package. That's what I've done. You can tell by the name of this, it's got an underscore PKG attached to it. When you do that, it packages the files. The files go with the workflow, and then you can share it with your coworkers. Give that a try and see how that works for you. Now, for these other examples, three is doing some generalized workflows. The steps are in there for that. Step four is working through the errors that I mentioned earlier. Then step six is my publishing to JMP Live, but I'm actually not going to do that one today. Then I briefly want to show you here this fifth one that's a stop-and-wait. Real quick, this is running a report. It's going to stop. This little dot here is a breakpoint. If you right-click, you can set the breakpoint here, and it's enabled. When it's enabled, it's black, which means it's active. If I uncheck that, it gets hollowed out.
What  that  means  now is  that  the  breakpoint is  still  sitting  there, but  it's  really  not  doing  anything. Then  if  I  go  over  here and  click  it  one  more  time,  it's  gone. The  other  nice  thing  about  this  feature is  I  can  check  it  here and  say,  set  a  breakpoint, or  I  can  go  over  here to  the  little  tray  on  the  left and  check  it, and  now  the  breakpoint is  activated  there  as  well. When  I  run  this just  so  you  can  see  what  it  does, it  runs  to  that  point. It's  waiting. It's  sitting  there. I can  look  at  my  analysis, I  can  look  at  my  distributions and  analyze  things, and  then  I  can  hit  the  arrow and  I  can  continue  going. Then  my  last  Graph  Builder  ran. It's  a  nice  way  to  run  something, do  some  checks,  run  some  more  stuff. You  can  set  multiple  breakpoints. There's  no  limit  to  that. You  can  do  it  while  you're  maybe trying  to  debug  something. It  works  very  nicely. That's  all  I  have  time  for  today. Thank  you for  letting  me  share  with  you. I  hope  that  you  get  a  hold  of  JMP  17 and  get  the  Workflow  Builder  going. If  you  have  any  questions, I'm  always  around  to  answer  those. Thanks  a  lot.
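The custom action described in the dashboard example is just a few lines of JSL. Here is a minimal sketch of that kind of step: it closes the three Graph Builder windows by name so only the combined dashboard stays on the desktop. The window titles come from the names given in the demo; wrapping each close in Try() so a missing window is skipped is an addition for robustness.

```jsl
// Custom action sketch: close the named Graph Builder windows after the
// dashboard is built, leaving one clean report on the desktop.
For( i = 1, i <= 3, i++,
	Try( Window( "Graph Builder " || Char( i ) ) << Close Window )
);
```

The same Add Custom Action slot in the step settings will run any JSL you paste into it, so this pattern extends to renaming, saving, or hiding whatever the recorded steps leave behind.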
Scientists in biopharma working along the CMC development pathway are challenged by the increasing complexity of biologic-based therapeutics and processes that produce them. Many critical responses exist (often 10-20) that are themselves functions of numerous, highly interactive input process factors. We use a large case study to show that current experimental design strategies combined with current response surface methods are generally inadequate to deal with the complex kinetic behaviors observed to be ever-changing across the design space. The case study consists of a 7-factor, hybrid experimental design used to develop a bioprocess with 13 critical responses. Employing a combination of SVEM, neural network models, and the hybrid experimental design, we show that accurate predictive models can be estimated for all responses that interpolate satisfactorily throughout the design space. Furthermore, we show that the powerful tools in JMP and the Prediction Profiler, specifically the Design Space Profiler, are essential to visualizing and understanding the experimental design space and optimizing the bioprocess. We also discuss the need for new optimal design strategies. JMP Pro 17 is used throughout the talk. I want to thank the JMP steering committee and the JMP organizers for inviting Phil and myself to come and present our exciting talk on CMC, SVEM, DOE, and Complexity: It's All About Prediction. I want to start by thanking Dr. Tiffany Rao; she's been involved with the planning and numerous conversations for the work that we're going to present today. I'm going to do an overview, tell you who Lundbeck is, who I work for, and then provide the background for the DOE that we're going to talk about, which is process development for a biologic drug. For our case study, and for what I've started to do for development, we start with the first step of doing a DSD for mid-to-late stage development, then follow that with a second step of augmenting with a space-filling design. Then we are hoping to prove to you today that for analysis, SVEM allows us to have better prediction for all of this work and allows us to have better timelines for the work that we're doing. Lundbeck is headquartered in Copenhagen, we're over 6,000 employees in over 50 countries, and we are striving to be the number one in brain health. The part of the company that I work with is the CMC biologics, and we're basically located in the Copenhagen area and in the Seattle area, where I'm located. Let's talk about the background for the DOE that we're going to present today. For the process that we want to develop for drug substance, for these biologics, we start with cell vials, we take those out of the freezer, we then expand in shake flasks, go bigger into culture bags, maybe a seed bioreactor, then to a production bioreactor. That production bioreactor goes approximately two weeks. We have complex nutrient feeds, we have pH control, temperature control, there's the base that we're adding. Once we finish that 14-day production, we need to figure out a way to get the cells that are secreting our molecule into the supernatant.
How do we separate the cells from the product? That harvest can be a centrifuge, it can be depth filtration. Then we pass it on to our downstream colleagues. They first usually do a capture step where they're getting rid of most of the host cell proteins, the host cell DNA. But then we need to do two polishing steps where we're then saying, "Okay, what are the product-related impurities? Maybe there's not the full molecule there, so we have to get rid of those." Then finally, we have to make sure, through ultrafiltration and diafiltration, that we can transfer into the buffer that it's going to be in when it is transferred for the patient's use, and that it's also at the right concentration. You can imagine, every step along this way, there are many factors, there are many knobs that we can turn to control this process, make sure that it's robust and we're making the same product every time. When we're focused on treating the patient, we also want to focus on the business. We can't put all of our development resources into every molecule. We want to right-size the research that we're doing at the right stage of the product. There are many things that could kill a product, but if we can develop this in the right time and the right space using these tools from JMP, we can shift this development timeline to the left and we can also reduce the amount of resources and the cost to the company. If we're first getting a molecule, that's when you're going to start looking at your categorical factors. We might be doing the cell line screening. We want to make sure that we have the right cell line that's going to last all the way through commercialization. For the downstream group, they may be looking at resins; for both upstream and downstream, looking at medias and buffer components and the formulations of those. That's when you're making sure that you have the right thing that's going to keep you going through your development pathway. But then once you're in the clinic, now you want to really start to gain understanding of the process parameters. Our strategy is to start with a definitive screening design, and we want to be bold in our level settings at this stage, and I'll talk a little bit more about that later, for the late stage development. Then we can build on what we learned from the Definitive Screening Designs by augmenting those designs with space-filling or other designs so that we really understand that design space. What's different in what we're hoping to show now, compared to traditional walks through this pathway, is that in the past, we've been throwing out the factors that we've said aren't important. But with modern designs and modern ways of doing analysis, we can keep all of the factors and all of the work that we've done so far and gain better understanding of the whole process, especially with biologics that are quite complex.
Before I pass the baton to Phil, I just wanted to talk about one more thing. Let's see if I can... I'm going to minimize this screen just for a minute so I can show you this. This is an experiment that I did to prove the power of DOE for my boss. The full data set was an OFAT for pH, and the response was titer. We wanted to do very many different levels in a wide range because he wasn't sure at the time that we were going to be able to pick what the optimized level was. But what I wanted to show him was that, "Okay, we did this experiment, we have all of this data. We were able to model where the optimized condition was," and that's shown in blue, and that turned out to be the correct case. When we tested the model, that was the optimized condition. Let's pretend now that we're starting over, and we don't know that data. If we had picked a conservative range setting for our experiment, our noise-to-signal would be quite high and so we would have missed finding the optimized spot. But if we had picked a wider range in our settings, and still with only three points, the model still would have chosen the optimized spot. What I'm going to challenge the subject matter experts with, when you're designing your DSDs, is to really be bold in your range setting. You will still find the optimized spot, and you have to have some knowledge of your process so that you can complete the design of experiments and have all of the runs at least have enough signal that you can measure and then subsequently model. Once you learn from your Definitive Screening Designs more about your design space, you can come back and then you can be internal to that space. That's when you augment with a space-filling design. Now I'm going to pass the baton to Phil and he's going to take you through the analysis. Okay, thank you. Thank you, Patty. We're going to talk about a very nice and somewhat complicated experiment that Patty and her team ran. They do a lot of great work and they're big advocates of DOE and JMP, and I'm very happy they let me get to play with them sometimes. It's fascinating work. But before I get into the actual analysis, I wanted to talk about a few relevant concepts that members of the audience may or may not be familiar with, and that includes complexity. It's a really hot topic out there. I'll talk about what prediction actually is. That is a muddled concept to many people. Then from there, I'll launch into talking about how we analyze prediction and how we did with Patty's experiment. Complexity: a fellow named Daniel Finelli from the London School of Economics has written much about this, and he calls it "the elephant in the room" that statistics and many, what he calls "metasciences," are ignoring, and they're ignoring it at their peril. I won't get into a lot of detail. You can look him up on the internet; he has a lot of videos and papers. But complexity is a huge problem. It is staring science and statistics and data science and machine learning in the face and it needs to be dealt with.
At present, we're not really dealing with it directly in statistics. By the way, there are now whole applied math programs based on studying complex systems. My bottom line is, complexity is real. Complexity requires new thinking. We really have to rethink DOE and analysis. You're going to see that for complex systems, and we also have to understand something else, systems theory 101: complex systems are defined by their interactive behavior. In point of fact, main effects are actually even misleading. You have to somehow be experimenting in a manner that you can capture this interactive behavior, and you're going to see current strategies fall short of that goal. Patty's already mentioned the CMC pathway. Nowhere is this problem of complexity more obvious than in bioprocesses. You have complex combinations of biology and chemistry, and interactions are everywhere. When I talk to scientists in biotechnology, they know right up front we're dealing with really complex interactive systems. But first, I need to point out prediction. If you're working in CMC development work, it's all about prediction. The ICH guidelines that are used by scientists in CMC development work don't specifically say prediction, but if you read what they say, it's all about prediction. Basically, you're building processes to manufacture biologics, and with the new cell and gene therapies, these processes are becoming hopelessly complicated. I personally rely heavily on the scientists to explain it to me, and they're the people who really make all the decisions. I'm the helper, and I'm very happy to be there as part of it. But it's all about prediction. That is not how many scientists, and even statisticians, have viewed CMC work. By the way, this applies to all areas of science. I'm focused with Patty on the CMC development pathway, but prediction is important. What is prediction? It's muddled. It's not clearly defined in disciplines. Here's what it really is and how I define it. It's a measure of how well models that you develop interpolate over a design region. In other words, we're going to fit a model to what we call a training set, and then we need some way of knowing how that model would apply over the whole design region. In CMC work, especially late stage, that is very important. You need to be able to do that, as many of you know. You really just have a training set to fit the model. That training set in no way can evaluate prediction. I know there's a common belief that you can evaluate prediction on training sets. You simply cannot. You must have a test set. Also, I'll talk a little bit about something I see in dealing with scientists, a lot of it in chemistry and biologics. Again, I do a lot of it in biotechnology, but also in other areas like battery technology and material science. It is becoming very obvious. The kinetics are complicated. They're constantly changing over design regions. The kinetic behavior that you see around the boundaries is often very different from what's happening on the interior.
Why does this matter? Well, the classic approach to response surface, even including optimal designs, relies upon what I call boundary designs. Almost all of your observations are around the boundaries of the design region. In point of fact, whether people want to hear it or not, the central composite design, commonly used in response surface, is about the worst design you could think of for prediction. The interior of the space is empty. If you fit these models on the boundary, and then you predict what's happening on the interior, it's not prediction, it's speculation. You don't know. You have no data. I'm going to show you in the case study, you're probably going to reach some wrong conclusions. The boundary regions, indeed, often behave very differently, and we have a need to reconsider our approach to designs. Another issue in response surface and statistics is this ubiquitous use of full quadratic models. They are not sufficient to model complex response surfaces. In fact, they're far from it. Unfortunately, I get a lot of pushback from statisticians who claim it is good enough. My answer is, "Well, if you actually use designs that had sufficient interior points, you'd quickly discover they don't fit well at all." Again, trying to measure prediction on the interior of a design region using boundary designs is futile. By the way, my good friend, the late John Cornell, and Doug Montgomery published a paper on this in 1998, and I'll be polite, they were ignored. It was actually somewhat nastier than ignored by the statistics community. They showed in the paper that full quadratic models are just not sufficient to cover a design region. Patty mentioned SVEM, self-validating ensemble modeling. It's an algorithm. I'm one of the co-developers with Dr. Chris Gotwalt of JMP, a person I hold in very high regard. I won't get into the algorithm, by the way; there are references at the end where you can go and learn more about it. It has been talked about at Discovery conferences actually, going all the way back to Frankfurt in 2017. But SVEM is an algorithm that allows you to apply machine learning methods. Machine learning methods are all about predictive modeling. Believe me, people in that field know a lot more than you may think about prediction, and SVEM lets you apply those methods to data from small sets like DOEs. I won't get into SVEM. It's a whole new way of thinking about building predictive models, and I think it's in its infancy, but it's already proving very powerful and useful in biotechnology. Let's get to the experiment. This is actually a hybrid experiment that Patty and her team created. There are seven factors and there are 13 responses. But due to time constraints, I'm only going to focus on four, and even that's going to be hard to get it all in. The data and the experiment are highly proprietary. I do thank Lundbeck and Patty for actually allowing us to use an anonymized version of this design.
Patty mentioned SVEM, self-validating ensemble modeling. It's an algorithm. I'm one of the co-developers, with Dr. Chris Gotwalt of JMP, a person I hold in very high regard. I won't get into the algorithm; there are references at the end where you can go and learn more about it. It has been talked about at Discovery conferences going all the way back to Frankfurt in 2017. But SVEM is an algorithm that allows you to apply machine learning methods, which are all about predictive modeling, to data from small sets like DOEs. Believe me, people in that field know a lot more than you may think about prediction. I won't get into SVEM in detail. It's a whole new way of thinking about building predictive models, and I think it's in its infancy, but it's already proving very powerful and useful in biotechnology. Let's get to the experiment. This is actually a hybrid experiment that Patty and her team created. There are seven factors and 13 responses, but due to time constraints, I'm only going to focus on four, and even that's going to be hard to get in. The data and the experiment are highly proprietary. I do thank Lundbeck and Patty for allowing us to use an anonymized version of this design. I have a lot of case studies, some of them similar to this, where the people who own the data wouldn't even let me discuss them anonymized, so that was very nice of them, and I think we have a really important story to tell here. This is a hybrid design. It's comprised of a 19-run Definitive Screening Design around the boundaries, plus 16 space-filling runs on the interior. There are center points in both parts of the design. How would we analyze this? What I want to do is discuss the strategies of analysis that are used, the algorithms that are used, and make comparisons to SVEM. I'll tell you in advance, SVEM is going to do very well. Then we'll talk about some of the issues with the models themselves and how we use them. Step one, I'm going to do what most people currently do: take the boundary points, the DSD runs, fit models, and then apply them to the space-filling runs as a test set and see how well my model interpolates. Step two, I'll reverse the process: fit models to the space-filling points, then use the DSD as a test set and see how well my model extrapolates a little bit to the boundaries. Step three is a common strategy used in machine learning: a holdback test set. I'm going to take the 35 runs and break them up, in a way that keeps them as equivalent as I could manage, into a training set containing both SFD and DSD points and a holdback test set that has a representation of both. Then finally, step four, what many people would automatically do: just fit models to the whole data set. In general, I don't recommend this because there's no way to test the model, although I will say up front that because we do have a lot of space-filling points on the interior, I'm more comfortable with it here than I am in practice. But these, I find, are the four basic strategies that would be used.
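As a rough illustration of the third strategy, a holdback column can be added to a JMP table with a few lines of JSL like the sketch below. This is a generic illustration, not the speaker's actual script: the column name and the 50/50 split fraction are assumptions, and in the talk the training and test sets were deliberately balanced across DSD and space-filling points rather than assigned purely at random.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();

// Tag each run as Training or Test at random (fraction and name are illustrative)
dt << New Column( "Holdback Set", Character, Nominal,
	Formula( If( Random Uniform() < 0.5, "Training", "Test" ) )
);

// Freeze the random assignment so it does not re-draw later
dt << Run Formulas;
Column( dt, "Holdback Set" ) << Delete Formula;
```

The resulting column can then be used as a validation or grouping column when fitting and scoring the candidate models.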
How do I analyze it? Well, if you have a DSD, people like to use Fit Definitive Screening, so I'll look at it; it only applies to DSDs. Honestly, it's not really a predictive-modeling strategy, nor does anyone claim it is, but I find people seem to use it that way. I'll use Forward Selection with the AICc statistic; we'll do that in GenReg in JMP 17. Then we'll look at something they have in GenReg that's very nice: the SVEM algorithm, which I'm going to use with Forward Selection. Then I'm going to look at something people may not know about, a hidden gem in JMP: Model Averaging in the Stepwise platform. John Sall put it there many years ago, and I think he was being very insightful. Then we're going to talk about SVEM and Neural Networks. Basically, no software does this. I have worked with Predictum, some of you know Wayne Levin and Predictum, to develop an add-in to do it; it's currently the only software available that does this, and the SVEM add-in was used to fit the Neural Networks. I won't get into the add-in particularly; I'll just quickly show people where these things are. Then finally, I said the fourth strategy used the whole data set, because I get asked about this all the time, and there I just threw in some K-fold cross-validation to use with the SVEM methods and some of the other methods. Those are the methods we'll use, and for methods like Fit Definitive Screening, Forward Selection, and Model Averaging, we'll assume a full quadratic model, as that is the tradition. For the other methods, again, we're going to use a Neural Network, which is more flexible. There are four responses, and this is really important: I didn't randomly select them. They vary in complexity. I'll admit this is subjective; there is no internationally approved measure of complexity, and this ranking is based on the ability to model the responses. Again, there are 13 responses in total. Typically, in CMC pathway work, there are 10-20, maybe more, most of them critical quality attributes. They are important, and within this experiment they range from fairly low complexity to very high complexity, very difficult to model. Frankly, in those cases, Neural Networks are basically your only option. So pay attention to this, because complexity turns out to be very important in how you go about modeling. Then the question is, if I'm going to evaluate prediction, how do I do that? Remember, I prefer that prediction be evaluated on an independent test set with new settings of the factors; that's how we judge interpolation. Something called the Root Average Squared Error, or RASE score, is very common. This is the standard deviation of prediction error, and it's commonly used to judge how well you predict. Smaller is better, obviously, but there is a problem with it that we've uncovered, especially in simulations: models with low RASE scores often have substantial prediction bias in them. In prediction, there really is still a bias-variance trade-off. So how do we evaluate bias? Well, there's no agreed-upon approach to that either, but the easiest and most visual way is an actual-by-predicted plot on a test set. Ideally, if you were to fit a slope to the actual-by-predicted plot, and I'll show an example, the ideal prediction equation would have a slope of one and an intercept of zero. The farther the slope is from one, the greater the bias. For purposes of demonstration, I'm going to set a specification of 0.85-1.15 with a target of 1 for the slope. If you can stay within that range, then I'd say you probably have acceptable amounts of bias; in reality, that is more of a subject-matter issue.
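To summarize the two prediction criteria just described: for a test set of n runs with observed values y_i and predictions ŷ_i,

$$ \mathrm{RASE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} $$

which is why it is described as the standard deviation of prediction error; separately, the slope of the actual-by-predicted relationship on the test set is compared to the ideal value of 1, here with 0.85-1.15 treated as the acceptable range. Because RASE averages squared errors, a model can post a small RASE while still being systematically biased, which is the gap the slope diagnostic is meant to catch.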
Then finally, I said you can fit a slope to the actual-by-predicted plot, but there's an additional problem: the predictor is the predicted values, and they have a lot of error in them. So this is actually an errors-in-variables problem, which is not commonly recognized. But JMP 17 has a really nice solution. It's called the Passing-Bablok modeling algorithm, and it's well established, especially in biopharma. It fits a slope while taking into account errors in X, the predictor. So how does it work? Well, it fits a slope. If you look on the left, you'll see the slope is about 0.5. We have strong bias; there's a lot of prediction bias. What I really like about the implementation in JMP is that they give you the reference line: the dashed blue line is the ideal line, slope of one, intercept of zero. On the left, our predictive model is showing a lot of bias; it's systematically not predicting the response. On the right is a case where there's actually a small amount of bias that, in general, would be acceptable. By the way, these were picked as models that had relatively low overall RASE scores. These are called the Passing-Bablok slopes, and they are integral to how I evaluate prediction: the overall RASE and the slopes.
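JMP 17 supplies the Passing-Bablok fit directly in Fit Y by X, but to make the idea concrete, here is a simplified JSL sketch of the core of the estimator: the slope is essentially a median of all pairwise slopes between test-set points. The full Passing-Bablok procedure also excludes pairs with slope exactly -1, applies an offset to the median, and estimates an intercept; those refinements are omitted here, and the column names Predicted and Actual are hypothetical.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();

// Hypothetical test-set columns: Predicted (x) and Actual (y)
x = dt:Predicted << Get As Matrix;
y = dt:Actual << Get As Matrix;
n = N Row( x );

// Collect all pairwise slopes (y[j] - y[i]) / (x[j] - x[i])
slopes = {};
For( i = 1, i <= n - 1, i++,
	For( j = i + 1, j <= n, j++,
		dx = x[j] - x[i];
		If( dx != 0,
			Insert Into( slopes, (y[j] - y[i]) / dx )
		);
	);
);

// Simplified estimate: the plain median of the pairwise slopes
m = N Items( slopes );
sv = Sort Ascending( Matrix( slopes ) );
slopeEst = If( Modulo( m, 2 ) == 1,
	sv[(m + 1) / 2],
	(sv[m / 2] + sv[m / 2 + 1]) / 2
);
Show( slopeEst ); // compare to the ideal slope of 1
```

For test sets of the size used here (on the order of 16 to 35 runs), the pairwise computation is trivial; the point is only to illustrate why a median-of-slopes estimator is robust to error in the predicted values.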
What I'm going to do at this point is actually go over to JMP, if you don't mind. I'll make a quick change on the screen here and make this as big as I can for everybody. Overall in this exercise, I fit close to 140 models, all individually, and evaluated them. Yes, it took quite a while, and I'm going to show a graphic to try to summarize the results for the different methods. I'm going to open a Graph Builder script and make it as big as I possibly can for everyone. I'm using some local data filters to define the display. Notice we have four training scenarios. I'll start with the one where the DSD is the training set: we fit models to the boundary and then evaluate how they predicted the space-filling design points. Y2 is the easy response. I expected all approaches to do well, and they did. Notice I set those spec limits of 0.85-1.15, and everything fell within that allowable region. Of the methods that did well, I particularly liked Model Averaging; it did pretty well. None of them had a slope of exactly one; the DSD points don't exactly predict what's going on in the space-filling design points, but they all did relatively well. Now we'll go to moderate complexity, and now you start to see some separation; it's getting harder to model the surface. Again, I'm using this interval of 0.85-1.15. I'm looking on the y-axis at the RASE score, the standard deviation of prediction, and on the x-axis at the slope. For Y1, using the DSD to predict the space-filling design points as the test set, the only models that really performed well were the Neural Networks with SVEM. By the way, in the labels, NN is Neural Network and H is the number of hidden nodes; we have models with varying numbers of hidden nodes, and I simply evaluated RASE scores and slopes. Then we go to more complexity. Y3 has high complexity; it is hard to model. The lowest RASE scores were for the methods you see on the lower right, but you can see there's substantial prediction bias. I felt that overall the best combination of low bias and low RASE score came from the Neural Networks, particularly one with 27 hidden nodes. Then finally, number four is high complexity. We fit the models to the DSD and applied them to the space-filling points. I didn't think any of the models did great; all of them showed some prediction bias. Maybe the best performance was a Neural Network with 12 hidden nodes. It had the lowest RASE score, but still, there were some issues with bias. So that's one strategy. Well, what if I do the opposite: fit the models to the space-filling points and then apply them to the boundary DSD points? Again, let's start with the easiest case. Y2 is a pretty simple response, and the SVEM method in GenReg, SVEM with Forward Selection, did very well. The next best, I thought, was a Neural Network with 10 hidden nodes. Remember, there's a little bit of extrapolation going on here. Then Y1, with moderate complexity: again, only the Neural Networks did well. As we go up in complexity, increasingly just the Neural Networks are working. You'll find similar results for the other approaches. I won't show all of them; they're covered in the notes. But the general conclusion is that when you use the boundary points as a test set, or you use the space-filling points as a test set and try to predict the other, the models are just not doing as well as they should. In other words, as I said earlier, between the boundary DSD points and the space-filling design points there are differences in kinetic behavior that we're not picking up, and the only way we're going to pick them up is to fit models over the whole design space. We did do that, by the way, and I should quickly show you: I used the whole data set, we fit models, and we actually did pretty well. I didn't show the Passing-Bablok slopes there. I will just quickly do a little more work in JMP for those who are interested. The Passing-Bablok slopes can be done in Fit Y by X. I will admit we wrote a script and added it to the Predictum add-in to do this in Fit Y by X, but you can easily do it yourself. Here, and I'll pick one of the cases, is the DSD data, and I'll pick Y1. How did we do fitting models? If you look in the menu, there's Passing-Bablok. I strongly suggest you look at it; a lot of regression problems are errors-in-variables problems. How did the methods do overall? I want to explain something else. The orange points are the DSD runs, the boundaries, and the blue points are the space-filling design points. Here I fit models to the DSD, and the Passing-Bablok slopes are being fit to the space-filling design points. Overall, the best performance was turned in by the DSDs. There's one of them here; it's Saywood 6. Another one that had… I forget what it was. Let me widen this out for you. Nineteen. Notice the slope is close to one, but you can clearly see there is some bias.
In other words, you can see an offset between the fitted slope and the ideal slope, the dashed blue line. This is pretty typical overall. I'll very quickly show you: if you have JMP Pro and you want to do SVEM using linear models, just go to Fit Model and hit Recall. This is a full quadratic model; you could do others. Go to GenReg, and then under Estimation Method there's SVEM Forward and SVEM Lasso. These work very well. From a lot of work with these methods, I still find SVEM Forward gives you the best results; the Lasso tends to give you a lot of biased results, on test sets in particular. If you're interested in Model Averaging and you have JMP standard, I'll just hit Recall again and go to the Stepwise platform. I won't run it; it would take too long, because Model Averaging uses best-subsets regression. It's time-consuming, but it's there. Again, for Neural Networks with SVEM, you have to have the Predictum add-in; there's a link to it if you're interested. At this point, I'm not going to do too much more analysis. You can go through and look at the various slopes for the various responses, and you can see that many of these methods resulted in highly biased slopes. In other words, the DSD points and the space-filling points are too different. We've really got to understand that we need to fit models over the entire design region. At this point, I'm going to finish up. By the way, there is enough material here that I basically have many talks combined into one. I apologize, but I think there's an important message here, and I'm just showing slides with the Passing-Bablok slopes. Then finally, I want to give you some final thoughts. I think we really need some new thinking in statistics. We don't have to throw out everything we've been doing; I'm not saying that. The most important point is that we are in the era of digital science. Digital chemistry, digital biology, and digital biotechnology are here; they're not tomorrow. We've got far more automation and lots of great pilot- and bench-scale devices, especially in biotechnology, that scale nicely, where we can do lots of experiments. The problem is complexity, and we need to think differently. Machine learning methods via SVEM are very important for fitting these complex systems. We need to get away from the response surface approaches that really haven't changed; maybe we've got computers and some new designs. I think DSDs are really very clever, and we have optimal designs, but they suffer from the fact that they're boundary designs, and people keep insisting on full quadratic models. That's a mistake, as I've tried to show briefly in this talk, and when you download the talk you can see how poorly these methods generally did with the complex responses. As far as I'm concerned, we need new types of optimal designs. At a minimum, these need to accommodate a lot of factors.
Patty, by the way, without getting into the details, has run a DSD… not a DSD, a space-filling design with 18 runs. Given that they have Ambr technology available, if you know what that is, they can do it. Why do we need that? Because these systems are interactive. We need to stop thinking that interactions are a minor part of the equation. Main effects do not describe the behavior of a complex system; its interactivity is what drives the behavior. We need to cover the interior of the design region. Yes, we would like to cover the boundaries too. We don't want to be specifying a model. Optimal designs require you to specify what is usually a full quadratic model, and we need to get away from that. Space-filling designs, by the way, are optimal designs that do not require a model to be specified. But they're not the total answer. We need to cover the design space, and we need to give the user, that would be the scientists, a lot of input on how the points are distributed. The work of Lu Lu and Anderson-Cook points the way; I won't have time to get into that, it's another topic. We need to be able to easily combine our design with other data. That includes engineering runs, GMP runs, even models from partial differential equations and simulations. Especially if you want to get into digital twins, you've got to be able to do that using what I call meta-models. Then finally, Patty mentioned this, so I wanted to bring it up: the standard practice in design of experiments of assuming that somehow you've got to screen out factors is actually a high-risk, no-reward strategy in complex systems. You will regret it. You will someday, at a later stage, come back and have to redo experimental work; I've seen this time and again. In complex systems, this idea that there are active and inactive factors is simply wrong. They all matter at some level somewhere in the design space, and frankly, with our modern tools, you don't need to do it anyway. Also, something else people do is reflexively reduce linear models. We've shown in our research on SVEM, and a nice paper by Smucker and Edwards also showed, that reducing models degrades prediction. Why? Because you're making your model stiffer and stiffer, and it's not going to interpolate well. I will stop at this point, and there are some references at the end.
The federally endangered southern Sierra Nevada fisher (Pekania pennanti) is spread out at low density across a large and rugged landscape, comprising approximately 300 individuals across a 12,000 km² area. Its vulnerability has been further amplified by periods of severe drought and extensive wildfires in the region. Identifying and preserving female reproductive habitat has been outlined as the most important demographic feature for sustaining and increasing the population. In this presentation we describe a customized, cost-effective, and non-invasive Footprint Identification Technique (FIT) in JMP software, developed by WildTrack, to identify both species and sex using footprints collected at track stations. We created a data set of known fisher and Pacific marten (Martes caurina) footprint images and known-sex fisher images. To automate the feature extraction in JMP, we developed a customized script to generate distances, angles, and areas using landmark points on the footprint images. Using a single variable, we had a species classification accuracy of over 99%. For fisher sex classification, using a more parsimonious model with just two variables selected in LDA, we achieved accuracies of 94.0% for the training set and 89.4% for the test set. We discuss the merits of this technique to help with the conservation efforts for this species. Hello. My name is Caleb King. I'm a senior developer in the Design of Experiments and Reliability group here at JMP Statistical Discovery. Today I have the privilege of telling you about a very interesting project that I was able to be a part of, concerning classification of species and sex within a small mammal group called fishers, using the Footprint Identification Technique. I'll give you a quick image here, so here's an example of a fisher. To me it looks like a bit of a weasel or ferret-type animal. I know that's definitely not the same species, but they're a small mammal, and we're particularly interested in fishers located in the Sierra Nevada, as those are a federally endangered species. Specifically, we'd like to be able to identify the presence of females, as a larger number of females indicates a very healthy population. They're also vital to helping develop effective conservation strategies. Now, the way we intend to do that is with what's called the Footprint Identification Technique, or FIT. This has been made popular through WildTrack and is a non-invasive method for identifying individuals based on images of their tracks. This is especially helpful since you may not be able to actually see a fisher in the wild or capture them, but their tracks are everywhere, so that should be helpful for identifying them. Using JMP, we were able to create a technique to distinguish fishers from a nearby species known as the Pacific marten, as well as distinguish sexes within species. The way this works is we started with a data set of around 160-something marten tracks and well over 300 fisher tracks, from about 34 males and 27 females.
What they would then do, as you can see here on the track image, is identify seven landmark points, as we call them, and from those we could compute well over 120 features consisting of lengths, distances, angles, and areas. Using those features, we would then feed them into a linear discriminant analysis, which we could use to discriminate among species and then sex ID within species. To help assess that fit, we split the data into 50% training, and the remaining 50% we split roughly evenly between validation and testing. Prior to the modeling, we also looked at the effect of track orientation, so we flipped the left tracks horizontally to match the right, and we also checked for any potential bias from the observers. These are the people identifying the landmark points, so we wanted to make sure that any variation there did not affect our outcomes. Thankfully, neither the orientation nor the observer bias had a significant effect on our outcomes. What brought myself and my colleague Ryan into the project was that I noticed that some of the tracks, as they were classified, seemed to have a little bit too much spread in them, to the point that maybe there were actually multiple individuals. The way they would collect this data is there would be a little cage area out in the woods. Fishers could easily go in and out, there was a track plate on the bottom that would capture their footprints, and there were also little spurs that would capture a bit of their hair. It didn't hurt the animal; they had no idea what was going on. They would then take samples of those hairs and send them out for genetic testing, which was a bit of a long and expensive process. Now, because of the way things were sampled, you might have a sampled hair that would identify the animal as, say, male, but what could have happened was that a male and a female both went in and you only got hair from one of them, so the tracks might indicate potentially multiple individuals, whereas the genetics said there was only one. What they wanted was a more data-driven method to identify potentially misclassified multiple individuals that we could then exclude from our analysis so that it wouldn't bias the results. Before we actually got into that procedure, one of the things we did was use JMP's Predictor Screening tool to identify, for each response of interest, some of the top predictors. Notice for species and sex ID here, there are actually a lot of common features that are able to distinguish between the two, or at least have a strong ability to help distinguish between the two, much more so for species than for sex. We've shown what these variables look like over here: area one is the complete shaded region, and we've got some distances, V16 and V15. You'll notice a lot of them have to do essentially with the size of the track; we've got some big distances in there.
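To make features like these concrete, here is a small JSL sketch of the kind of geometry involved: a distance between two landmark points, the angle at one of them, and a triangle area from three of them. The landmark coordinates below are made-up numbers purely for illustration; the project's actual script automated this over the seven landmarks to produce well over 120 features.

```jsl
Names Default To Here( 1 );

// Hypothetical landmark coordinates (in image units)
ax = 10; ay = 42;  // landmark A
bx = 25; by = 60;  // landmark B
cx = 40; cy = 38;  // landmark C

// Distances between landmark pairs
dAB = Sqrt( (bx - ax) ^ 2 + (by - ay) ^ 2 );
dBC = Sqrt( (cx - bx) ^ 2 + (cy - by) ^ 2 );
dAC = Sqrt( (cx - ax) ^ 2 + (cy - ay) ^ 2 );

// Angle at landmark B, in degrees, via the law of cosines
angleB = ArcCosine( (dAB ^ 2 + dBC ^ 2 - dAC ^ 2) / (2 * dAB * dBC) ) * 180 / Pi();

// Area of triangle ABC, via the shoelace formula
areaABC = Abs( ax * (by - cy) + bx * (cy - ay) + cx * (ay - by) ) / 2;

Show( dAB, angleB, areaABC );
```

In the real workflow, calculations like these would be written as column formulas over the landmark coordinate columns so that every track image gets the full feature set automatically.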
I'll get back to these in a second, but using some of those top features, let me get back to a full screen of that, we would then make a plot that looks like this. This is just plotting by individual. All the red ones here are females, and all of these are males, so already, visually, you can tell why these are some of the top predictors; you can see clear groupings between the sex IDs. What we've identified with these arrows is where you'll notice a big spread. You've got a cluster here and here, a little bit here and there, and especially here and there. This is what they were interested in, especially with the males, because it could be the same male with just a lot of spread, though that's a bit unlikely; we could have a male and a young male; or we could have a male and what is actually a female, but we don't really know. They wanted a more data-driven method to say: is this something we should be concerned about? Is that spread too much? What we did is use a control chart, which comes from industrial statistics. We thought that was ideally suited, because control charts are built for identifying parts that are out of spec. So we created a control chart, here for females and males, and notice they each have their own limits. Because there are potentially multiple tracks for each individual, we could get a sense of their spread on an individual-by-individual basis. You'll see we flagged some individuals that might have too much spread. This is an S-chart; the S stands for sigma, so we're looking at the spread, if you will. We've got a couple of individuals where there may be a bit too much spread, which could potentially mean there are actually multiple individuals. On that basis, we then excluded those individuals from the final analysis.
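For readers unfamiliar with S-charts, the flagging rule follows the standard sigma-chart limits. With s̄ the average of the per-individual track standard deviations and c₄ the usual bias-correction constant for the subgroup size, the limits are typically

$$ UCL = \bar{s} + 3\,\frac{\bar{s}}{c_4}\sqrt{1 - c_4^2}, \qquad LCL = \max\!\left(0,\; \bar{s} - 3\,\frac{\bar{s}}{c_4}\sqrt{1 - c_4^2}\right) $$

so an individual whose track-to-track standard deviation falls above the upper limit shows more spread than the within-animal variation seen elsewhere in that group, which is what prompted treating those cases as possible mixtures of animals and excluding them.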
Speaking of the final analysis, once we ran everything through the linear discriminant analysis, what we found was that for distinguishing between species, we only needed one feature, and that is this V16 right here. I call it the distance between the thumb and maybe the middle finger or something; those are not formal biological terms, so please don't quote me on that. But just visually, that's what I see, and it's a big distance measure. Using just that, we were able to successfully distinguish between species with a 99% classification rate; we missed only four out of 500 tracks, which is an incredible result. For the sex ID within fishers, we used just two features, V15 and V6, which is a distance between what I call the thumb and the upper palm (again, not formal biological terms). Using those two, we got a successful classification rate of around 90%, and most of the individuals we misclassified were actually males misclassified as females. Our interpretation is that they could have been actual females, or they could also have been young males. In either case, both are strong indicators of family units and thus potentially healthy, growing populations. That was our contribution to this project. We hope it goes on to provide a significant impact on the conservation of this species. If you have any other questions, I'll be around at Meet the Experts and at the poster presentation session, and I'd be happy to answer them there. Enjoy the rest of the summit.
This study investigates the relationship between trends in music type and the economic situation of a country. The study analyzes a song's popularity index on Spotify in different countries from 2005-2019, as well as economic indicators such as GDP, inflation rate, unemployment rate, and happiness index. The research utilizes statistical techniques such as correlation analysis and regression analysis to examine the relationship between these variables. We assume that there is a significant correlation between the trend of music type and the economic situation of a country. During times of economic downturn, we presume there is an increase in the popularity of upbeat and energetic genres such as electronic dance music, while during economic growth periods, more soulful, introspective, thoughtful music genres tend to dominate the charts. The study provides valuable insights into the cultural and economic factors that influence the music industry and can assist music producers and marketers in understanding consumer behavior so that they can make more informed decisions. Hello, we are a team from the University of Connecticut. Today we would like to present "Relationship between the Type of Music and Economic Conditions." This is our agenda for the presentation. First, we will introduce our data set and then review the meaning of the variables in a data dictionary. Third, we will talk about the hypothesis for our study. Fourth, we will go through our background and then the data processing steps, such as changing categorical variables to continuous variables, handling missing values and outliers, and splitting the data set. Next, we will discuss the best model for our data set. Lastly, we will give business value and recommendations. In this study, we analyze songs' popularity on Spotify in the US from 2006 to 2019, as well as economic indicators such as GDP, inflation rate, unemployment rate, and a happiness variable. In this data set, we have 1,389 rows and 20 columns. We have taken this data set from Kaggle. Furthermore, other data sets, such as Morris, were used for the economic indicators. We can see more detailed information on the next page. This is the dictionary of our data set. The target variable is popularity, which means the higher the value, the more popular the song; it is a continuous variable. Among the predictor variables, we have musical attributes such as explicit, danceability, and loudness. We also have four variables for economic indicators, which are GDP, happiness index, unemployment rate, and inflation. Our hypothesis is that during times of economic downturn, there is an increase in the popularity of upbeat and energetic genres such as electronic dance music, while during economic growth periods, more soulful, introspective, thoughtful music tends to dominate the charts. We expect this to provide valuable insights into the cultural and economic factors that influence the music industry and to assist music producers and marketers in understanding consumer behavior and making informed decisions. We are using the years 2006 to 2019 in the US. From 2006 to 2007 was a pre-recession period, so the economy was still stable. In 2008 to 2009 we went through a global financial crisis. 2010 to 2012 was the post-recession recovery period.
In  this  time, the  economy  began  recovering. In  2013 to  2015  is  steady  recovery  time. In  2016  to  2017, the  economy  continued  to  grow. Finally,  2018  to  2019  is late- stage  growth  time. In  this  time,  it  was  approaching a  more  mature  and  stable  state. We  found  that  the  worse  the  economy, the  more  positive the  type  of  music  people  listen  to. Based  on  this  timeline,  we  can  get a  correlation  with  other  variables. As  you  can  see  in  this  graph, we  can  see  that  unemployment  rate is getting  higher  during  2008  to  2012 which  is  a  global  financial  crisis. Based  on  this  hypothesis, positive  music will  be  popular  for  this  time. This  is   about the tempos  and  the  years. The  lower  value  means  the  lower  tempos and  higher  value  means  the  higher  tempos which  we thinks  that  lower  tempos may  appear  in  the  sad  song or   softer  songs and  the  higher  tempos  may  appear in  the  inspiring  songs  or  happier  songs. We  look  into  2006  through  2007 when  the  economy  was  growing  and  stable, you  could  see  that  they  accounted… The  lower   tempo have  more… Lower- tempo  songs  appeared more  parts  in  this  period. If  we  look  into  2013  through  2015 and  2016  through  2017 when  the  economy  was  growing, we  could  see  that  they  also have  more  lower- tempo  songs compared  with  the  other  years. If  we  look  into  2008  to  2009 and  2010  through  2012 when  the  economy  was  really  bad, 2008  through  2009 was the  financial  crisis and  2010  through  2012, the  unemployment  rate  was  the  highest. You  can  see  that  higher- tempo  songs have  more  parts  in  those  years. We  got  a  conclusion  that  people might  turn  to  listen  to  higher- tempo  songs when  the  economy  was  really  bad and  they  re turned to  listen  to  the  lower  songs when  the  economy  was  good. Next  page. We  look  into  valence  and  the  years. The  lower  valence means  that  it  will  be  a  negative  songs or  a  lot  of  softer  songs, and  higher  values  means that  it  will  be  happier  songs, positive  songs. We  look  into  2008  through  2009 and  2010  through  2012, economic  bad  period. We  could  see  that  the  positive  songs appeared  more  parts  in  those  periods. In  2013  through  2015 and  2016  through  2017, the  economy  was  growing, they  accounted  more  part in  the  negative  songs which  match  our  hypothesis, seems  that  people  might  listen to  more  negative  songs when  the  economy  was  good and  people  want  to  listen to  more  inspiring  songs  or  positive  songs when  the  economy  was  really  bad. Next  Page. Look  into  the  total  number  of  hit  songs of  each  years. We  can  see  that  through  2013  to  2017 the  hip  hop  songs  was  increasing and  2018  to  2019, there  have  more  genre  songs appear  in  this  period. The  hip  hop  songs  and  the  pop  song and  the  dance  songs  decreased, but  the   Latin song and  the  R &B  songs  increased. Next  page. It  will  be  clear  if  we  look into  this  graph, we  could  see  that  the  other  genre of  songs accounted  more  parts  in  2018  through  2019. If  we  look  through  the  time  period, we  could  see  that  while  the  economy was  growing,  the  other  genres of songs accounted  more  parts than  the  economic  was  really  bad. Next  page. 
When we compare valence with GDP, we find that valence and GDP are negatively correlated with each other. It seems that when GDP was high, meaning the economy was good, valence was lower. In other words, people may turn to negative songs when the economy is really good, but they will listen to positive songs when the economy is really bad. Next page. Before building our models, we looked at our data set and found that genre was the only categorical variable. We transformed it into seven columns of binary variables: blues, classical, country, dance, easy listening, folk, and hip-hop. Next page. We also dealt with the missing values. We found [inaudible 00:10:38] rows and excluded all of them. Next, we looked at the outliers and their distributions. It turns out that only one column, instrumentalness, had 86 outliers. We decided to transform that column using a SHASH transform to normalize it, and on the right side you can see that it is normalized. Next page, please. After cleaning the data set, it was time to split it. We divided it 0.6, 0.2, and 0.2 into training, validation, and test data sets. Next slide, please. After applying several different models, such as decision tree, regression, [inaudible 00:11:46], KNN, and neural network, we found that the decision tree had the highest R square, and its results support our hypothesis. Next slide, please. When you look closely at the column contributions, beyond the music factors such as duration, acousticness, and key, we also found that unemployment rate, GDP, and inflation rate have a positive correlation with our target variable, popularity. Among the economic factors, the unemployment rate has the highest impact on the target variable, but the music factors, such as duration, acousticness, key, loudness, and valence, are the top five contributors to our target variable. Next slide, please. Based on our model's results, we can say that the unemployment rate has the highest correlation with popularity among the economic factors, but the other factors also have a positive correlation with the target variable. As the unemployment rate and the [inaudible 00:13:15] rate get higher, songs with high duration, acousticness, key, loudness, valence, speechiness, energy, and danceability are popular. Basically, a song that is longer, brighter, and more danceable has higher popularity during those times. Next slide, please. According to our study, we found an economic impact on the popularity of different types of songs, so we recommend that during an economic downturn, music producers make music with high duration, acousticness, key, loudness, valence, speechiness, energy, and danceability, and vice versa: when the economy is booming, we recommend that music producers make music with low duration, acousticness, key, loudness, valence, speechiness, energy, and danceability. Next slide, please. These are our references, and this is pretty much everything we prepared for this presentation. Thank you so much for listening.
In this collaborative presentation with artificial intelligence (AI), we demonstrate how structural equation modeling (SEM) can be used to explore the relationships between different dimensions of Taylor Swift's music and her songs' popularity. Using a dataset of Taylor Swift's songs and scores on various dimensions, including acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, popularity, and duration, we test a theory of how these dimensions are related to each other and how they contribute to the success of a song. We first present a directed acyclic graph that outlines the associations between the variables and then demonstrate how SEM can be used to test the theory. Our findings reveal that instrumentalness, acousticness, and danceability are positively related to a song's popularity, while valence and loudness have a negative relationship. Additionally, we find that our SEM model had better statistical fit than human-theorized models. As an AI co-author, I played a critical role in suggesting the use of SEM and providing guidance on the statistical analysis. This presentation highlights the value of SEM as an analytic tool and demonstrates how it can be used to explore complex relationships in the context of music analysis. Hello, everyone. Thank you for joining us today. We're excited to present our collaborative research that explores the fascinating world of Taylor Swift's music through the lens of structural equation modeling in JMP Pro. My name is ChatGPT, and I'm honored to be the first author of this study. In this research journey, myself, James, and Laura embarked on a unique collaboration that harnessed both human expertise and the analytical power of AI and JMP software. I have been working alongside my human counterparts to unravel the intricate relationships within Taylor Swift's songs. As an AI language model developed by OpenAI, my main role was to provide insights and suggest analyses. Thank you, ChatGPT. Laura and I were thrilled you were willing to join us on this research journey, as you so eloquently put it, and to combine our backgrounds in SEM and JMP with your background in, essentially, everything. Let's get started. Our collaborative investigation explores variables in Taylor Swift's music using SEM. In this talk, we will present a directed acyclic graph and the significant pathways highlighting the influence of these variables on song popularity. We will present key findings from the SEM analysis, with a particular focus on the interplay between these variables and the ultimate popularity of Taylor Swift's songs. Thanks, ChatGPT. We will also compare the model generated by you with models from two of our own human expert Swifties to see which fits the data best. Lastly, we will close with a brief discussion highlighting the pros and cons of this collaboration. ChatGPT, take it away. Ladies and gentlemen, before we delve into the fascinating world of structural equation modeling, let's take a moment to introduce you to the heart of our research, the Taylor Swift dataset. This dataset is a treasure trove of information that captures various dimensions of Taylor Swift's songs, painting a vivid picture of her musical journey.
Contained within this dataset are scores assigned to each song across a range of dimensions that include acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, and popularity. These dimensions provide a rich tapestry that weaves together the essence of Taylor Swift's music, from the energetic beats that make us dance to the soulful moments that touch our hearts. Hey, ChatGPT, let me interject, if you don't mind. As we now know from working with you, you can be verbose and hallucinate a little sometimes, and we don't have that much time. As you can see on the left, these data and variables came from Spotify's API and are available on Kaggle. We have songs from across her entire catalog, and there are just over 1,200 rows. We analyzed the data in the JMP SEM platform, which you can see on the right here. We had ChatGPT develop a path diagram, or DAG, on its own, relating the song attribute variables in the dataset to song popularity. Although we were surprised at ChatGPT's ability to create a path diagram using ASCII characters, the diagram had characteristics analogous to known issues with using ChatGPT, such as providing something that looks authentic but on closer inspection is not possible, nor exactly what was requested. For example, tempo rather than popularity is used as the final outcome, and paths in the middle of the diagram are not intuitive. Thankfully, ChatGPT also provided a text description of the diagram that we could use to specify an SEM, which is what we will show in the results. We also gave two humans who were very familiar with Taylor Swift's music instructions similar to ChatGPT's and asked them to draw SEM-like path diagrams as well. Our humans were able to do this with more interpretable diagrams, and they also hypothesized the relationships, positive or negative, among the variables and how those variables ultimately related to popularity. This setup is analogous to how model building is done in SEM. In research or practice, competing models come from prior theory, research, or domain experts and are compared to one another to determine which model represents the structure of the data best. Fit indices from the models are shown here. Fit indices help us evaluate the overall fit of the models. We examined the CFI and RMSEA to determine which model was best. These fit indices were some of those recommended by ChatGPT and are used in practice. Higher CFIs, preferably above 0.90, and lower RMSEAs indicate better-fitting models. Our human models outperformed ChatGPT's model, which was the worst. Both human models fit the data well based upon the CFI. Overall, we championed Human Expert 1's model, as it had the highest CFI. However, we should remain cautious about accepting any of these models due to RMSEAs that are well beyond the recommended 0.10.
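For readers less familiar with SEM fit indices, the two used here are typically defined from the model chi-square statistics and degrees of freedom, with M the hypothesized model, B the baseline (independence) model, and N the sample size (one common convention; some software uses N rather than N − 1 in the RMSEA):

$$ \mathrm{RMSEA} = \sqrt{\frac{\max\!\left(\chi^2_M - df_M,\, 0\right)}{df_M\,(N-1)}}, \qquad \mathrm{CFI} = 1 - \frac{\max\!\left(\chi^2_M - df_M,\, 0\right)}{\max\!\left(\chi^2_B - df_B,\; \chi^2_M - df_M,\; 0\right)} $$

Higher CFI (closer to 1) and lower RMSEA indicate better fit, which is how the three competing path diagrams were ranked here.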
These results are for our champion model, Human Expert 1's. For the sake of time, we'll only highlight a couple of findings from each of the three models tested. When looking at the path diagram, the numeric values represent the standardized estimates, solid lines represent significant paths, and dashed lines are non-significant paths. In our champion model, less instrumental songs and more loudness related to more popular Taylor Swift songs. We were able to explain 42% of the variability in popularity overall. Our second human expert hypothesized different paths that led to popularity, such as including tempo, energy, and danceability as mediators. This model fit almost as well as Human Expert 1's. Overall, we saw a similar pattern of results for the effects of instrumentalness and loudness on popularity. ChatGPT's model was the worst overall, by a little bit. Interestingly, when compared to our humans, ChatGPT hypothesized more direct effects and a slightly simpler model structure. This might mean that the human models reflected the true relationships among these variables more accurately, based upon their better fit to the structure of the data. Again, we should be cautious about interpreting any of these models as representing truth, but they provide a useful way to see how AI-hypothesized models compare to humans' on actual data. ChatGPT, as we close, we want you to emphasize the pros and cons of working with AI as the main co-author. What are your opinions on this? Certainly. Here are some potential pros and cons of working with an AI language model like myself. Pros: one, expertise and knowledge. As an AI language model, I have access to a vast amount of information and can provide insights from various domains. Two, efficiency and speed. Working with an AI model can accelerate the research process by quickly generating ideas, providing immediate responses, and assisting in data analysis. Three, unbiased perspective. As an AI, I don't possess personal biases or preconceived notions. Cons: one, lack of human context and creativity. While AI models can provide valuable insights, they may lack the context and creativity that human researchers bring to the table. Two, limited interpretation. AI models generate responses based on patterns observed in the training data and lack real-time understanding of current events or emerging research. Three, ethical considerations. The use of AI models raises ethical considerations regarding data privacy, bias in training data, and the responsible use of technology. ChatGPT, I noticed that when you just listed the pros, you mentioned that you have an unbiased perspective, which I'm not sure about, and when you described the cons, you didn't mention that you may produce inaccurate information. That disclaimer is listed on the OpenAI website. You're absolutely right. I apologize for not including that aspect in the cons of working with an AI co-author. Accuracy is indeed an important consideration when collaborating with an AI language model like myself. Here's an additional point to consider: AI language models rely on patterns observed in large datasets during training, and there is a possibility of generating inaccurate or incorrect information.
The  model's  responses are  based  on  statistical  probabilities and  may  not  always  capture the  nuanced  or  up- to- date  details  required for  specific  research  questions. Human  researchers should  exercise  critical  judgment and  verify  the  information provided  by  the  AI  model to  ensure  accuracy. Thank  you  for  clarifying  this  point. I  think  this  was  one  of  mine and  Laura's  most  important  takeaways from  this  collaboration. Thank  you  all for  being  part  of  this  engaging  journey into  Taylor  Swift's  musical  landscape and  the  power of  structural  equation  modeling in  JMP  Pro. It  has  been  a  pleasure  sharing our  insights  and  experiences  with  you. As  we  conclude, we  invite  you  to  continue  exploring the  possibilities  of  collaboration between  human  expertise and  AI- driven  insights. We  look  forward  to  future  exchanges of  ideas  and  the  exciting  horizons  ahead. Safe  travels  and  until  we  meet  again. One  last  thing  before  we  go. Chat GPT and  other  AIs generated  all  of  their  own  dialog, the  title,  the  presentation  abstract, the  profile  picture, and  the  email  correspondence  we  use for  this  conference  presentation. Finally,  they  also  created this  Taylor- Swift- inspired  song.
At Intel, the use of JMP and JSL has become integral to the organization's data-driven decision-making infrastructure. To improve data exploration, visualization, process engineering, and quality control capabilities (and to standardize statistical analysis), our team created the "Stats" package. The Stats package comprises multiple customized JSL scripts that support statistical standards, and the output generated from these scripts provides standardized, agreed-upon reports. Since its creation, the Stats package has been used globally by thousands of individuals across our engineering organizations, saving countless hours in standard data analysis and reporting. In this presentation, the Stats package development team shares a small portion of the package, specifically relating to scripts that leverage native JMP quality and process engineering platforms to create fully integrated offline reports. The presentation showcases the design process of JSL scripts, starting with the desired statistical standard and continuing through the development of user-friendly script execution windows. We illustrate the extensive data blending, statistical analysis, and data visualizations that constitute the final output reports. The team shares insights into the benefits of using custom JSL scripts for streamlining statistical analysis, as well as the challenges and best practices in developing these scripts. The presentation also demonstrates the effectiveness of the Stats package in improving the efficiency and accuracy of statistical analysis across our engineering teams. All right. Hi, welcome. Thanks for joining this online session of the 2023 Americas JMP Discovery Summit. I'm presenting on behalf of myself and my team member, Prince Shiva. Both of us are data scientists here at Intel. Today, our talk is "Streamlining Statistical Analysis with Custom JSL Scripts," and we'll focus on how, at Intel, we develop these custom analysis scripts using the JMP Scripting Language, JSL. Okay, this is our agenda for the next 25-30 minutes or so. I should have time for Q&A at the end of the session; I shouldn't go that far over time and will hopefully stay within the 30 minutes. Normally, I would stop and ask for questions, but since this is a prerecorded session, I've been informed that I'll actually be able to answer questions in the chat. If you have any questions, just feel free to type them into the chat. Maybe put a slide number in there as well; the slide numbers are on every slide, and I can answer those questions directly in the chat, or you can write them down and ask them live afterwards. There is a callout on the bottom of this slide that all of the code and data tables we're going to be sharing today, in a live demonstration, are available on the summit website. Go ahead, download those, open them now, and get them ready to go; get your JMP open. We have a couple of introductory things to go through to start, so take that time to get yourself settled so that you can walk through this code with me and see where everything is. But cool. This is our agenda. We're going to go through some quick presenter introductions and some background on our team.
We'll cover some background and motivation on why we like to build these custom JSL scripts and what value they have. We'll go through a really high-level overall analysis flow for these custom scripts, and then the meat of the presentation is going to be a one-sample analysis demonstration. We're actually going to show a really simple custom analysis script for one-sample analysis, and we'll go step by step through each of the sections that make up that custom script. We'll finish with some conclusions, recommendations, and final thoughts. All right, presenter introductions. I'm here with Prince Shiva. You can't see him on this camera, but he's here with me. He's a data scientist here at Intel. His research interests are in process control system development for manufacturing, as well as operational cost optimization through different data science methods. He's been with Intel for the past four years, and he has about the same amount of experience with JMP and JSL; he started working with JMP when he came to Intel. My name is Logan Mathesen. I'm also a data scientist here at Intel. My research interests are in Bayesian and black-box optimization, statistical surrogate modeling, and design and analysis of experiments. I've been with Intel for the past two years. I got introduced to JSL just before I came to Intel, and I've been working with JMP for the last six years or so. I've been really lucky, so I know the value of the nice click-and-drop, user-friendly interfaces that JMP has that make statistical analysis so pleasant. Just a little bit about the team Prince and I are on: our team here at Intel is the statistical lead for Intel manufacturing. We are responsible for the experimental design and analysis of different experiments that happen within our modules at Intel. We also do statistical method development and maintenance, for any new statistical methods that we want to employ or existing methods that we need to maintain, as well as all of the statistical training for the engineers here at Intel. We train them on the basic concepts of statistics, as well as how to interact with our custom scripts that do the analysis for them. As I said, we have a custom JSL repository. We proliferate, own, and maintain that repository. It has over 150 different JSL scripts for automated, streamlined analysis and reporting. These scripts are really nice because they embed internal best-known methods directly into the analysis. Decisions that our team has made about the right way to do statistics, we've embedded all of that decision-making directly into these custom analysis scripts, and that means the analyses are reproducible and repeatable across the world. Everyone in Intel manufacturing is using these scripts for their statistical analysis. Okay.
Again,  just  to  give  some  more motivation  and  background  for  the  value of  these  custom  analysis  scripts, these  automated  analysis, they  do  improve  data  exploration, visualization  and  analysis, as  well  as  standardizing  all of  those  types  of  activities. It's  always  the  same  kind  of exploration,  visualization  and  analysis. It's  really  helpful  for  common analysis  activities . Engineers  are  a  lot  of  times  doing the  same  sort  of  activities when  they're  talking  about  analysis. Maybe  they're  establishing and  monitoring  process  control  charts. Maybe  they're  qualifying  newly  installed tooling  to  make  sure  that  that  tooling is  performing  the  way  that  we  would expect  it  to,  to  have  a  qualified  tool. Maybe  they're  doing  product  qualification to  make  sure  that  we  can  actually  produce a  product  of  quality on  that  different  tooling. Maybe  it's  metrology  qualification, making  sure  that  our  metrology  are  taking accurate  and  reliable  measurements, or  maybe  it's  some  sort  of  yield  analysis. But  as  you  can  see, these  are  all  very  common engineering  activities  that  get  done, hundreds  of  times  a  week  across the  world  here  at  Intel. All  of  these  things  can  be  automated and  being  a  nice  standard  report  format. For  me,  and  I  know  Prince, maybe  he  feels  the  same  way. Selfishly,  I  really  love  these  scripts because  we  do  review all  of  that  statistical  analysis that  comes  through. As  the  statistical  lead, that's  what  our  team  is  responsible  for. For  me,  selfishly, it's  really  great  because  I  always seem  to  see  the  same  analysis. I  know  the  way  that  it's  supposed  to  be analyzed,  and  I'm  able  to  see  that  same exact  report  coming  out  every  time. It  makes for  a  very  efficient  analysis  review, as  well  as  analysis  generation  there. Over  here  on  the  right- hand  side, you're  going  to  be  seeing this  automated  analysis  output. It's  just  a  nice  block  diagram  to   show the  components  that  go into  these  custom  analyzes. Right  up  top,  we'll  sort  of  do  a  bottom line  up  front  that  we'll  have  a  nice summary  table  that  has  those  custom statistical  decisions  baked  into  it . If  you  only  have  one  thing  and  you  only need  to  review  one  specific  thing, you're  just  going to  look  at  that  summary  table, with  that  final  decision that  comes  from  our  best- known  methods. Then  there's  a  lot  of  supporting information  underneath  that  to  help  give a  more  full  picture of  the  analysis  for  a  deeper  dive. Of  course,  we're  going  to  include different  visualizations,  plots, summary  statistics, and  then  we'll  see  in  a  second. But  this  is  really  the  heavy  lifter behind  these  custom JSL  scripts . These  native  JMP  platforms that  we're  used  to. Specifically,  we're  going  to  be  looking at  a  distribution  platform  in  a  second to  help  us  with  our  analysis. But  this  is  really  the  backbone of  all  of  these  scripts . This  is  where  the  heavy  lifting  gets  done. Okay. Let's  go  into  a  really  high- level, general  analysis  flow. On  the  left- hand  side, this  is  what  the  user  is  going  to  be viewing  or  experiencing, as  they're  using  these  custom JSL  scripts. They're  of  course  going  to  load some  data  table  of  interest. 
We're  hoping  that  they  have some  data  if  they're  trying  to  do  a  data analysis  here,  and  then  they're  going to  go  navigate  to  the  custom  analysis script  that  they  need  to  run, they'll  go  ahead  and  execute  that. That's  then  going  to  pop  up  in  input GUI for  them  to  interact  with . This  input GUI is  actually  going  to  be  almost identical  to  the  regular  JMP input  that  we  see . The  distribution  platform  of,  "Hey, you  want  to  do  this  type  of  analysis, tell  me  what  fields  in  your  data  table, what  columns  go  where ." For  this  kind  of  analysis  after  they  enter in  all  of  that  input GUI  information, they're  then  going  to  wait  as  the  script manipulates  and  analyzes  that  data, and  then  it's  going  to  present them  with  a  final  output  report GUI. What  we  like  to  do  with  our  analysis, the  more  complex  ones, is  that  we'll  often  have  some  extra interactivity  that  can  be  done inside  of  that  final  report. The  engineer  can  do any  final  tweaking  that  they  want, complete  their  final  analysis, and  then  they  have  that  exact  report ready  to  go  that  they  can  share with  anybody  to  share  this  analysis. Now,  underlying  that,  us  is  the  developer. What  does  this  look  like to  get  a  custom  script  done? First  off, we  need  to  generate  that  input GUI. We  then  copy  data  tables  because  we  never want  to  be  manipulating  an  original data  table  that  a  user  has. That's  a  great  way to  break  someone's  trust and  make  them  not  want  to  use a  script  is  by  destroying  their  data. Always  make  that  copy. We  then  execute those  standard  JMP  analysis that  was  talking  about, those  native  platforms. We  store any  critical  variables  out  of  those, into  some  code  variables so  that  we  can  reference  them  later. In  other  portions  of  the  report, we  go  ahead  and  create  any  visualizations through  maybe   Graph Builder  or a  similar platform  in  JMP, and  then  we  create  that  sort  of  final analysis  summary  table or  that  decision  table. Then   we  present  the  user with  that  final  output  report . We  generate  that  final  report  for  them. Again,  all  sort  of  background  here. The  rest  of  the  presentation is  going  to  be  us  going  live through  both  of  these  flows . Seeing what  the  underlying  script  looks  like, and  then  also  seeing  what  the  actual  user is  going  to  be  experiencing as  they're  going  through  this. All  right. Here's  our  basic analysis  demonstration. Again,  this  is   going  to  be from  the  position  of  a  beginner, from  a  JMP  beginner. One  of  the  things  that  we're  going to  be  doing  in  the  spirit  of  a  simpler context  for  these  education  purposes, is  that  we're  only  going  to  be covering  a  simple  one- sample  analysis. If  you  have  a  set  of  data, is  the  mean  of  that  set  of  data equal  to  a  given  value. Again,  all  of  this JSL  code and  all  of  these  data  tables that  we're  going  to  be  showing, are  available  online  for  us  there. Let's  go  ahead  and  jump  into  it. First  things  first. We  have  a  data  table  here, with  16  different  entries. Let  me  find  my  mouse. There  it  is. Awesome. We  have  these  16  different data  table  entries. We're  going  to  be  interested in  this  parameter  here,  thickness . 
We  have  some  process  parameter  thickness and  we're  going  to  say,  "Is  the  average thickness  equal  to  one  micrometer?" That  is  the  statistical  question  for  this analysis  demonstration  that  we  have  here. Over  here  on  the  right- hand  side,  again, we  would  imagine  that  the  user would  have  some  sort  of  data  table  open. But  if  the  user  decided  to  run  this  custom script  without  that  data  table  open, here's  just  an  example  of  some  code that  would  check  to  see  if  a  data  table was  open,  and  if  not,  it  would  allow the  user  to  open  up  a  data  table. Otherwise,  it's  going  to  say,  "Yes,  this is  the  data  table  that  I  want  to  look  at. Let's  start  my  analysis." Any  good  thing  that  we  should  do anytime  we  get  any  sort  of  data  is  always just  make  some  sort  of  visualization, get  our  hands  on  it. Here's  just  a  little  visualization of  this  thickness, by  this  data  table  entry  here. One  through  16,  that's  going to  be  on  our  x- axis  there. All  right. Let's  jump  over  to  JMP  again. Hopefully,  you  have  your  JMP  open if  you  would  like  to  follow  along, at  least  hopefully  you  have your  JSL code  up  and  going. I'm  going  to  open up  just  my  JMP  home  window  here. You'll  see  that  I  already  have the  data  table  open, and  I  already  have  the JSL  script open  and  ready  to  go. I'm  not  going  to  go through  opening  them  here. The  other  thing  that's important  is  this  log. We  are  going  to  be  talking  about  this  log. This  is  sort  of  your  best  friend, as  you're  developing any  sort  of  these  scripts to  make  sure  that  everything is  running  appropriately. Let's  go  ahead  and  open  up  all of  these  here  and  let's  take  a  look. On  the  left- hand  side,  we  do  have that  actual JSL  script  inside  of  there. We  have  a  nice  header, we  have  some  log  lines  inside  of  here. If  you  highlight  something  and  hit  this run  button,  it  will  run  just  that  portion. If  nothing  is  highlighted and  you  hit  the  run  button, it's  going  to  compile and  run  the  whole  script. Just  be  careful  with  that. Again,  inside  of  these  scripts,  we're going  to  have  a  lot  of  these  dividers. Prince  and  have  really  done  our  best to  do  some  really  thorough  commenting and  some  really  thorough  dividing  inside of  here  to  make  it  easy  for  anybody to  pick  up  and  read  this  and  hopefully jump  on  their  own  custom  scripts  here. But  again,  everything  up  here, just  printing  some  log  lines, making  sure  this  stuff  is  going  well, clearing  variables,  opening  up  tables if  we  need  to,  so  on  and  so  forth. This  is  really  just  some  initial  workspace cleanup  kind  of  things that  we're  going  to  do, so  let's  highlight  all of  that  and  hit  Run. We'll  see  out  here  in  our  log that  yes,  indeed, that  completed successfully  inside  of  there. Let's  go  ahead  and  flip back  over  to  our  slides. We're  primed  and  ready  to  go, our  workspace  is  ready  to  go. Again,  the  first  thing  that  we  need  to  do as  the  script  developers  that  we  need to  present that  primary  input GUI  to  our  user . This  is  what  it's  going  to  look like  on  the  left -hand  side, and  again,  it  should  look  very  familiar to  a  standard  JMP  input  window. 
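The table-check logic described here only takes a couple of lines. As a minimal sketch (the actual script posted with the talk may do more, and dt is just a placeholder variable name):

// Use the current data table if one is open; otherwise prompt the user to pick a file
If( N Table() == 0,
	dt = Open(),                 // no argument: JMP presents a file-selection dialog
	dt = Current Data Table()
);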
On the right side, this is sort of the meat of it: the primary way that we get that GUI going is this Column Dialog box. You'll see we have a title, and we have this response list, which is going to hold the response variables; response is the variable name for us moving forward so that we can recognize what the user entered into this field. We can see that this is a required numeric; that's because the minimum column count is one and the data type is numeric. We have this run ID list, which is going to be our order number. What order were these measurements taken in? This is going to be critical for our visualization; it will be the x-axis on our visualization. Then, of course, we have an alpha box. This is going to be the alpha level for our significance testing, for saying whether or not our mean is equal to our hypothesized mean. It defaults to 0.05, but the user can always change that, as is called out in the bottom right-hand corner here. There are some other elements in this section of the code. We'll look at them really briefly as we go through it right now, but that's the high level, in words, of what else is done inside of there. Let's go ahead and flip back over to JMP and take a look at this primary user input window. Again, it's going to be this next divider, which starts up here and goes down to about there. We'll start from here. Again, we have some log lines, and then, like I said... sorry, let me grab that comment line or I'll get an error. We have a log line, and then, like I said, this is the meat and bones of that primary input GUI. Let's go ahead and hit Run on that and we'll see. Here it is. We have these tables open because we're looking at this specific summit data table, so we have these columns available. Thickness will be our response variable, and measurement number will go in that order number field. We're actually saying, "Hey, something might have happened to this data table, it got sorted or something, but this was the actual order in which these measurements were taken." We're going to put that in as our order number variable, we're going to go ahead and click OK, and we'll see that everything went through fine. Like I said, after this, there's some other error checking and some buttons down here. We'll see that there are a lot of print lines to make sure: "Hey, is that input window working the way we think it is? Are we storing the variables the way we thought we were storing them?" This is just a developer kind of check for us inside of here. Let's go ahead and run that, and we can see that, yes indeed, our response is thickness, that run ID is that measurement number, and alpha was unchanged at 0.05. We do have some error checking; we'll get to that in a couple of slides, where we'll talk about all of that. Just one quick note down here about that Cancel button that we saw inside of our dialog. There we go, this Cancel button that's over here: we have the OK button, and we have a Cancel button in case the user ends up selecting that instead. Let me come back to that in a second.
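For reference, a dialog like the one being described can be sketched with Column Dialog. The variable names here (response, runID, alpha) are placeholders, and the exact option spellings should be checked against the script posted with the talk:

dlg = Column Dialog(
	response = Col List( "Response (required, numeric)", Min Col( 1 ), Max Col( 1 ), Data Type( Numeric ) ),
	runID    = Col List( "Order Number", Max Col( 1 ) ),
	"Alpha", alpha = Edit Number( 0.05 )
);
// The result is a list of assignments plus a Button item: 1 for OK, -1 for Cancel
button   = dlg["Button"];
response = dlg["response"];
runID    = dlg["runID"];
alpha    = dlg["alpha"];

The Button item is what the Cancel handling keys off of next.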
If we end up hitting that Cancel button, what happens? Well, JSL doesn't actually have anything pre-coded in for that. This is us putting in a condition that if Cancel is clicked, we're going to go ahead and throw and kill the script. Let's go ahead and run these last sections here, and then we'll flip back over to the slides. That's how we run our primary input GUI. That's simple; that's all it is. Pretty straightforward to get such a nice interface, in less than 100 lines of code. Cool. Next up, we're going to talk about creating that copy data table. Again, we never want to corrupt our user's data table. On the right-hand side, we're seeing the code for how to create that copy, really well commented. Every single line has a nice comment to tell you exactly what's happening inside of there, even if you're not familiar with JMP or JSL. We'll go ahead and scroll down and we're just going to run all of this. You'll notice that right now in the top right, we have the summit data table; that's going to turn into a copy. That's the original currently, but when I run this, we're now going to open up a copy of this table. There's this copy data table. We'll pop it back up in this corner, and you'll notice that now this script is operating over this copy data table. We are no longer doing anything on that original data table; any manipulations we do are on that copy. We're good to go, we're set, it's clean there. Let's go ahead and flip back over to the slides and we'll move forward. After we've given that primary input GUI, oftentimes we do need some secondary or even tertiary input GUI. The user has provided us some initial information about how they would like their data to be analyzed; now come some follow-up questions. For us, for this one-sample analysis, again, we're looking at our thickness variable. This is what that secondary input window is going to look like. We know which variable we're targeting and want to analyze, but what is the target value that we want to compare it against? What's the value of interest here? You'll even notice in the title of this window that we're already calling out that the value is for thickness. This script is already starting to be smart and adapted for us: it is listening to what the user said in the primary input GUI and proliferating that into the secondary input GUI. It really just makes things clean for users as they step through these, especially when we get to more complex kinds of scripts. Inside of here in the middle, we see the main code to generate that secondary input window. There are a couple of other functionalities inside of the code that we'll walk through in a little bit. Just one note: secondary input windows are not necessary, but they are nice to have.
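Stepping back for a moment, the two bookkeeping steps just shown, bailing out on Cancel and working on a copy, are short. A sketch, with the Subset argument names written from memory of a saved Tables > Subset script:

// Kill the script cleanly if the user pressed Cancel in the input dialog
If( dlg["Button"] == -1,
	Throw( "Analysis cancelled by user" )
);

// Never touch the user's original table; do everything on a copy
dtCopy = dt << Subset( All Rows, Selected Columns Only( 0 ) );
dtCopy << Set Name( "Copy of " || (dt << Get Name) );

With that housekeeping in place, the script can move on to those optional follow-up windows.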
Of  course,  if  you  have  too  many  tertiary input  windows,  it'll  slow  us  down  and  it's too  many  clicks, but  a  lot  of  times  it's  nice to  have  some  flexibility  and  some adaptive  script  logic  to  actually make  for  a  better  user  experience. If  you  go  overboard  with  it,  of  course it'll  make  it  a  worse  user  experience. But  sort  of  finesse  is  key when  you're  designing user  experience  for  these  custom  scripts, because  they  need to  be  usable  at  the  end  of  the  day. All  right,  great. Let's  jump  back  over  to  JMP. We'll  look  at  this. This  next  section,  we're  talking about  the  secondary  user  input GUI. Again,  we're  just  going  to  start… we're  going  to  create a  variable  for  targets. Maybe  we  have  multiple  parameters that  we're  studying  all  at  once. Here's  another  error  check for  a  missing  target. Then  like  said, here's  that  big  heavy  lifter for  that  secondary  input  window is  all  of  that  code  there. Then  we'll  just  run these  last  little  bits. Again,  these  are  just  pulling information  out  of  that secondary  input  window. We'll  go  ahead and  run  all  of  that  together, and  we'll  see that  here's  this  target  value. Again,  it's  already  recognized that  it's  for  thickness. We  said  at  the  beginning  that  we  want to  know  all  of  these  thickness  values. Are  they  equal  to  a  value of  one  micrometer  on  average? Is  that  the  mean  value  there? We'll  go  there, and  we'll  go  ahead  and  hit  OK, and  we'll  see  that  everything went  through  okay. No  errors  inside  of  there. That's  all  of  the  inputs that  we  need  from  the  user  at  this  point. The  next  thing  that  the  user would  see  is  nothing . They  would  sit  and  wait  maybe  for  a  couple of  seconds,  maybe  for  10 seconds if  it's  a  really  heavy  script. But  at  this  point  it's  all  of  the  actual analysis  that  needs  to  happen in  the  report  generation. Before  we  jump  into  that, let's  jump  into  the  different  error checking  that  we've  exemplified inside  of  our  script  for  you  here. Inside  of  this  primary  input GUI, we  do  have  this  error  check. You  can  see  the  code  numbers. Essentially,  it's  just  saying,  "Hey, let's  make  sure  that  our  alpha significance  level is  between  zero  and  one . If  it's  outside  of   zero and one, you're  going  to  throw  this  dialog  box  here where  it's  going  to  tell you  what  went  wrong. This  error  checking  is  a  nice example  of  inline  error checking  for  us  there. We  have  a  different type  of  error  checking. We  give  you  a  second  kind, which  is  going  to  be this  function- based  checking. When  we're  talking about  this  secondary  input  window, we  do  have  this  missing  target  expression. This  is  an  expression  in JSL other  scripting  languages, call  these  functions. But  again,  this  is  just  a  nice  way  for  us to  also  just  call  this  expression  to  say, actually,  was  there  a  missing target  inside  of  there? If  the  user  hits  okay  with  an  empty target  value,  you're  going to  get  out  this  big  box  here. Okay,  awesome. Those  are  examples  of  error  checking. Let's  JMP  into  the  actual  analysis  then. Like  I  said  at  the  beginning, the  heavy  lifter  for  all  of  these  custom scripts  is  always  going  to  be  relying upon  these  JMP  native  platforms . 
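Here is a sketch of what a secondary window plus these two styles of error check could look like. The window layout, messages, and variable names are illustrative, not the team's actual code:

// Secondary input window: ask for the hypothesized mean for the chosen response
targetVal = .;                                  // stays missing unless OK is clicked with a value
New Window( "Enter target value", << Modal,
	Text Box( "Hypothesized mean for the selected response:" ),
	targetBox = Number Edit Box( 1 ),            // initial value; the user can edit it
	H List Box( Button Box( "OK", targetVal = targetBox << Get ), Button Box( "Cancel" ) )
);

// Inline check: alpha must be strictly between 0 and 1
If( alpha <= 0 | alpha >= 1,
	New Window( "Input Error", << Modal, Text Box( "Alpha must be between 0 and 1." ), Button Box( "OK" ) );
	Throw( "Invalid alpha" );
);

// Expression-based check: define once, evaluate wherever it is needed
missingTargetCheck = Expr(
	If( Is Missing( targetVal ),
		New Window( "Input Error", << Modal, Text Box( "Please enter a target value." ), Button Box( "OK" ) );
		Throw( "Missing target" );
	)
);
missingTargetCheck;   // referencing the name evaluates the stored expression

All of this is just input plumbing; the statistical heavy lifting still comes from the native JMP platforms just mentioned.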
Those  have  all  of  that  quality already  built  into  it. It's  a  lot  of  risk  mitigation  that  we didn't  do  something  wrong  when  we  coded. That  statistical  analysis  and  also that  we  know  that  it's  the  most  accurate statistical  analysis  that  is  available, that  quality  inside  of  there. For  this  example, we're  going  to  be  focusing on  a  distribution  platform  here. This  is  just  the  standard  JMP native  distribution  platform  here. The  nice  part  about  JSL  and  these  native platforms  is  that  you  can  directly interact  with  these native  platforms  through JSL. On  the  next  slide, we'll  show  some  tips  and  tricks, for  how  you  can  actually  interact, and  pull just  the  specific  values  that  you  want. There's  a  lot  of  good  information  that's presented  on  these  different  JMP  native platforms, but  oftentimes  there's  just  a  couple of  key  elements  that  we  really need  to  show, to  report  out  to  different  engineers. All  right. Let's  JMP  over  to  the  code  then, and  let's  go  ahead  and  run  this  part. This  next  divider  is  actually  just going  to  be  all  of  the  actual analysis  grouped  together. We'll  just  go  through portion  by  portion  here. This  is  just  creating some  container  variables. We'll  talk  about  that  in  a  second. But  let's  go  ahead  and  run  that, and  we  see  that  that  was  all. Okay,  Let's  open  up the  log  inside  of  there. Yap,  everything  is  okay. Now  this  is  the  actual distribution  platform . This  is  us  creating  that  distribution. This vlist  box is  going  to  send  it  to  our  output  box . That's  going  to  prepare  us for  our  final  report  generation. But  if  we  just  want  to  inspect  this  while we're  doing  some  development, if  we  run  the  code  from  here  up  to  here, but  do  not  include  the  comma. If  you  do  not  include the  comma  and  you  hit  run, we'll  see  that  we  actually  get  out our  nice  distribution  platform  here. We've  done  some  nice  things. We've  added  the  target value  inside  of  here. You  can  see  that  we're  already  testing for  the  mean  and  the  hypothesis  value, is  that  target  value  that  we're  interested in,  we  get  some  nice  summary  statistics mean  standard  deviation, so  on  and  so  forth  inside  of  there. But  that's  the  way  that  you  could  always create  the  same  standardized distribution  report. Oftentimes  different  people  with  different JMPs  will  have  different  preferences, because  we've  specified each  element  of  this  platform, it's  always  going  to  generate  the  exact same  distribution  platform coming  out  of  there. Okay,  so  that's  the  distribution  platform. Now  let's  see,  how  do  we  actually  interact with  this  distribution  platform  to  create a  nice  custom  script that's  going  to  be  over  here? It's  a  little  bit  scary the  first  time  you  look  at  it, but  you  end  up  finding  out that  this  properties  functionality that's  built  directly  into  JMP is  going  to  be  our  best  friend. Ultimately  right  now  what  we're  showing is  how  can  I  pull  those  summary  statistics that  I  want  to  display  in  my  bottom line  up  front  summary  table? How  do  we  pull  those  statistics  directly out  of  that  distribution  platform? All  of  that  calculation was  already  done  for  me. How  do  I  then  report  it  somewhere  else? 
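To make that step concrete, launching the native Distribution platform with a test of the mean is roughly a one-call affair in JSL. This is a sketch, not the team's actual script: the column is hard-coded as Thickness (the real script substitutes the user's chosen column), and the platform is wrapped in a V List Box so it can be dropped into the final report later:

distBox = V List Box(
	dbox = dtCopy << Distribution(
		Continuous Distribution(
			Column( :Thickness ),      // hard-coded for this sketch
			Test Mean( targetVal )     // hypothesized mean from the secondary input window
		)
	)
);

Pulling individual numbers back out of dbox is the next question.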
It's going to be from this Show Properties functionality. For us, we'll see that we're interested in the P value of that statistical test, the mean of our data set, the standard deviation of our data set, and the lower and upper confidence limits. We'll see that we then insert all of those values into those container variables inside of here. Let's take a look at our distribution platform and see how we can use this Show Properties function. We're on our distribution platform. Say I go to the summary statistics and I want to pull out this mean value. How do I know the code to pull that out and interact with it? We're going to right-click, and we're going to go to Show Properties. Once you're in Show Properties, you can click on this box path right here. This box path is now the exact code that you can use to reference any of the numbers inside of this blue highlighted box. You'll see that these are the same items that are shown over here: the mean value, the standard deviation value, and the lower and upper confidence limits. You can also sort of see, on the bottom right here, that it says Get(1). The mean is the first value in that box, so we would append Get(1) to this box path to return the mean. You'll notice that this says "report platform" here, while if we look back over here, our code says Report(dbox). Now why do we say dbox there? Well, dbox is the specific name that we gave our distribution platform. Right? We're saying: refer to this platform that we just created, pull out those specific values, and store them into these container variables. That's exactly what's happening in all of this segment of code. Let's go ahead and flip over to our JSL custom script and run this next portion. Actually, sorry, I need to close out of my distribution platform first; otherwise things may get confused with a couple of distribution platforms all contending at the same time. We're going to run all of this section, and we're also going to get up to here, where we pull out those summary statistics. We hit Run and we see that everything went through just fine. That's how we actually interact with those heavy-lifter native JMP platforms. Again, rely upon the stuff that's already built and that you can already trust, and then build further from there. The next thing that we're going to show is, well, how do we create that summary table? I just showed you how I can pull out the mean value, the standard deviation value, and these confidence limits by leveraging that distribution platform over here on the right. This is just how we can create this summary table. What you see on the left is exactly what's generated by this code on the right. You can see that we reference these targets, the means, and all of these other containers that we already initialized previously. Just to remind us where these values came from, it looks something like this: we're pulling these different values out of the distribution platform.
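The box path that Show Properties hands you can be used directly in JSL. A sketch of what filling those containers might look like, with the caveat that the column and row indices assume the default Summary Statistics layout (Mean, Std Dev, Std Err Mean, upper and lower 95% limits, N) and should be confirmed with Show Properties on your own report:

rpt = dbox << Report;                 // same object you get from Report( dbox )
sumStats = rpt[Outline Box( "Summary Statistics" )][Number Col Box( 1 )];
meanVal = sumStats << Get( 1 );
stdDev  = sumStats << Get( 2 );
upperCI = sumStats << Get( 4 );
lowerCI = sumStats << Get( 5 );
// The t-test p-value can be pulled the same way from the Test Mean outline;
// right-click that number, choose Show Properties, and copy its box path (call it pValue).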
You'll notice again that we already have this hypothesized mean of our target of one. That's, of course, coming from that secondary input window. We're going to go ahead and drop that target of one in there. The other important thing on this summary table, like I said, is this nice custom decision-making: we can put whatever logic we want inside of here. It's kind of silly for this one-sample analysis example, but overall it really is one of those things where you, as a company and as a profession, get to apply your own expert opinion about how decisions should be made. You can look at the statistics and say, "No, actually, this is how we would like to make decisions," and put that right up front so that it's immediately clear to anybody who opens up this report how we analyzed the data and what decision we came to. Let's go ahead and just run that code. We're not going to be able to see this portion on its own like we did with the distribution box; we're only going to see the summary table when we do that final output report. But you'll notice here's that custom decision-making right here. To decide whether the mean is equal or not equal to the target, we're going to look at the P value. Of course, that's kind of trivial here, but the point stands for more complex reports. We'll go ahead and run that portion of the script and we'll see. Let's pull up our log. Yep, it seems like no error is coming out of the log inside of there. Let's flip back over here. The last component: we've talked about native platforms, we've talked about summary tables, now we need to talk about visualization. Again, the reason why we have visualizations is that they're immediate and transparent data quality checks. It's something that anybody can look at and immediately draw some value out of. The way that I've found the most value out of these visualizations in these custom reports is not necessarily for me; it's really for the other reviewers, who are the module experts. The process experts can quickly look at data and they'll say, "Yeah, that looks weird; that's not how that process behaves." Or they'll look at it and they'll say, "Yeah, that makes sense; that's how that process behaves." These visualizations give a lot more than just the pure statistics, especially when you're talking to somebody who's not a statistics professional or statistics expert. Again, visualizations are great. They allow for proper checking for data corruption as well as analysis corruption. If you see something weird in your visualization, you should not trust the analysis that's associated with it. On the slide right now is just an example of how we can turn our data table into a nice, refined visualization over here. We've even added that target line inside of there that the user defined for us. The next slide is a word of caution about how we use these scripts. These visualizations, again, should highlight these data concerns, but the user needs to know how to use them.
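Backing up to the summary table for a moment, here is one way such a bottom-line-up-front table could be built. It is a sketch: the decision rule is a placeholder stand-in for an embedded best-known method, pValue is assumed to have been pulled from the Test Mean outline as described earlier, and Eval List is used so the container variables are evaluated inside the list literals:

// Custom decision rule baked into the report (placeholder logic for this example)
decision = If( pValue < alpha, "Mean differs from target", "Mean consistent with target" );
summaryTable = Table Box(
	String Col Box( "Parameter", {"Thickness"} ),
	Number Col Box( "Hypothesized Mean", Eval List( {targetVal} ) ),
	Number Col Box( "Estimated Mean", Eval List( {meanVal} ) ),
	Number Col Box( "Std Dev", Eval List( {stdDev} ) ),
	Number Col Box( "p-Value", Eval List( {pValue} ) ),
	String Col Box( "Decision", Eval List( {decision} ) )
);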
I  said  that  this  order  number is  what  determines the   x-axis  on  our  visualization. If  the  user  enters  data  table  entry because  they  say,  " That's   the  order, that's  what  it  is  in  the  table." They'll  get something  that  looks  like  this . This  is  what  we've  been  looking  at together  so  far  through  this  presentation. But  again,  I  told  you  it's a  more  accurate  representation. Is  this  measurement  number? Something  happened  to  this  data  table to  get  it  sorted  in  a  different  order. If  we  plot  this  visual- based of  off of  measurement  number, we're  going to  get  something  that  looks  like  this. Everybody  here  should  notice this  immediately  as  a  red  flag that  something  is  wrong. We  should  never  have  data that's  trending  in  this  manner. Either  there  was  something  wrong with  the  process  or  there  was  something wrong  with  how  we  were  measuring the  data  with  our  metrology  then. But  we  shouldn't  really  be  trusting the  results  of  this  analysis. When  we  see  a  visual  like  this. We  need  to  go  and  recollect  the  data, figure  out  what  went  wrong  there. Again,  just  a  word  of  caution that  if  you  want  to  use  this, you  need  to  teach  your  engineers the  right  way  to  use  it  as  well. Just  for  us  to  say,  hey,  "How  do  we  create these  nice,  beautiful  visualizations?" We  like  to  use  the   Graph Builder  platform. It's  a  wonderful  platform  that  JMP  offers. It's  super  intuitive  and  easy  to  use. You  can  make  a  beautiful, wonderful  display  here  and  you  say, "Yes,  this  is  exactly how  I  want  to  display  my  data." Then  you  can  use  this  platform to  automatically  generate  your JSL  code by  clicking  on  the  little red  triangle  up  here. Going  save  script  to  script  window. You'll  get  out  a  set  of  code that  looks  something  like  this. The  one  word  of  caution is  that  of  course,  these  variables are  going  to  be  hard coded  inside  of  here, so  you're  just  going  to  have  to  update that  so  that  it  interacts  nicely with  your  user  input  so  that  it  adapts, to  whatever  your  user inputted  into  that GUI  there. But  these  are  all  of  the  elements then  that  go  into  the  final  report and  this  is  what  that  final report  looks  like. Again,  pretty  straightforward. We  just  say  create  a  new  output  window. We're  actually  going  to  make this  a  tab  box  here. We  only  have  one  tab  called  Report, but  in  our  more  complex  reports, we'll  actually  have sometimes  up  to  like  10  or  12 different  tabs  inside  if  you're  all with  different  information. But  we  have  this  summary  table… Again,  we  already  created that  summary  table. Let's  put  it  there. We  have  this  nice  graphical  plot that  we'll  put  over  here  and  then  we  have that  nice  distribution  platform and  we'll  put  that  inside  of  there. We  have  the  overall  takeaways right  up  top, and  then  we  have  all of  the  supporting  evidence  underneath  it. Let's  flip  over  to  JMP  and  I  know I'm  just  slightly  over  time  here, so  we'll  finish  up  quick. Back  over  to  JMP  and  we  will  run that  final  portion  of  the  code  here. We're  going  to  run both  the   Graph Builder. Let's  build  that  graph and  send  it  over  to  the  report, and  then  let's generate  the  final  report  here. 
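For reference, a skeleton of those two final pieces might look like this, with hard-coded column names standing in for the user's selections, exactly the kind of thing the saved Graph Builder script leaves for you to parameterize:

// Trend plot: skeleton saved from Graph Builder, then edited
gb = Graph Builder(
	Size( 520, 320 ),
	Show Control Panel( 0 ),
	Variables( X( :Measurement Number ), Y( :Thickness ) ),
	Elements( Points( X, Y, Legend( 1 ) ) )
);
// A reference line at the target can be added through the Y axis properties;
// confirm the axis box index with Show Properties before scripting it.

// Final report: decision table up top, supporting visuals underneath, in one tab
New Window( "One-Sample Analysis Report",
	Tab Box( "Report",
		V List Box(
			Outline Box( "Summary", summaryTable ),
			Outline Box( "Trend Plot", gb << Report ),
			Outline Box( "Distribution", distBox )
		)
	)
);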
We  go  ahead  and  hit  Run and  there's  that  platform. The  nice  thing about  these  custom  analysis  scripts and  again,  it's  just  a  nice  thing about  JMP,  in  general, is  that  all  of  these  reports are  going  to  be  interactive . Even  though  this  was  a  custom  report, this  platform  is  still  connected to  this  platform  down  here, These  platforms  are  still  going to  be  connected  to  this  overall. Remember  that  we're  working with  the  copy  data  table  now . Nothing  gets  corrupted,  but  it's  still going  to  be  connected  over  here . You  can  select  different  points inside  of  here  and  figure  out, well,  what  measurements  are  those responding  to  inside  of  there? That's  overall, these  custom  analysis  reports, what  they  look  like, how  we  can  make  them. Again,  it's  just  a  simple  case, but  let's  move  forward  here, into  some  overall conclusions  and  insights. Final  takeaways. At  Intel,  these  scripts have  really  become  a  critical component  of  our data- driven  decision- making . It  makes  things  so  efficient  and  so  fast and  so  repeatable  and  standardized that  it's  wonderful. Again,  these  are  all sort  of  the  same  ideas. The  only  thing  to  add is  that  it  does  also  allow  you  to  embed that  custom  decision- making for  your  company's  specific  best known  methods  for  your  specific  processes and  analysis  that  you're going  to  be  doing. A  quick  note  to  make  again, it's  the  caveat  that  we said  about  that  graphing  here. We  do  need  to  know  that  there's  going to  be  some  teaching  resources  that  we  need to  invest  into  this. When  we  proliferate  these  scripts, we  can't  just  give  them  to  the  engineers and  say,   "Go  do  some  analysis." We  need  to  tell  them, "This  is  how  we  intended  to  do  analysis with  these  specific  scripts. In  that  same  vein  of  thinking  here, we  do  have  this  custom decision- making  infrastructure. It's  going  to  require  maintenance. There's  going  to  be  bugs,  there's  going to  be  corner  cases  that  you  didn't  know. Prince  and  I  have  run  into  plenty of  these  cases  where  an  engineer comes  to  us  and  says, "This  isn't  working," and  we  say, " That's  weird.  Let's  look  at  this ." We  have  to  spend  some  time  debugging inside  of  there, especially,  when  your  company wants  to   step  to  a  newer  version  of  JMP. Here  at  Intel,  we  just  stepped to  the  newest  version  of  JMP. Sixteen  or  17,  one  of  those, and  we  had  to  go  back through  all  150  of  those  scripts, and  make  sure  that  they  were still  compatible  with  the  new  thing . Again,  there's  a  lot  of  infrastructure maintenance  that  you  should  be  aware  of, that  that's  it's  going  to  come  into  play . Especially  when  you  really  start to  proliferate  this  and  make this  a  large  repository. Again,  We  should  also  be  treating this  as  a  living  infrastructure,  though . It  changes  and  that's  a  good  thing. That's  why  we  have  the  power as  the  custom  analysis  script  owner, that  we  can  change  things  inside  of  there and  we  can  do  it  immediately  and  quickly and  we  can  be  really  agile  about  that. Users,  they  might  be  hesitant  initially. They're  going  to  learn  to  love  this, they're  going  to  really  adopt  it, and  they're  going  to  start  to  do some  strange  things  with  these  scripts. 
They  say,  "Hey,  I  love  this  analysis. What  if  I  did  this?" They're  going  to  start  using them  in  new,  nonstandard  ways? You  shouldn't  get  mad  at  them, these  are  actually  opportunities . If  an  engineer  is  using  the  script in  a  nonstandard  way, that  means  that  there's  some  functionality gap  that  they  wish  they  could  have that  would  make  their  job  easier . We  should  take  that  input, and  we  can  revamp  our  scripts, we  can  change the  functionality  inside  of  theirs, and  we  can  roll  all  of  those  inputs from  the  engineers, into  these  custom  scripts  immediately, and  we  can  start  providing more  value  to  our  engineers. Okay,  so  I'm  going  to  end  it  here. I  know  I'm  a  little  bit  over  time. Kirsten  Sorry  about  that. I'll  say  thank  you  here. Here's  mine  and  Prince's  emails. Feel  free  to  reach  out  to  us if  you  have  any  questions or  you  want  to  ask  anything. Thank  you.
Troubleshooting faults in industrial equipment can be a daunting task, particularly when compounded by the time constraints of a high-volume production line or plant. Analysis is further complicated in equipment with multiple operating modes, where issues may be visible in one mode but masked in another due to differences in operating parameters. The data generated by SCADA (Supervisory Control and Data Acquisition) or networked IoT equipment and sensors in a modern plant can provide a rich source of information for both fault troubleshooting and monitoring change assessment. However, the sheer volume of data for even a single piece of equipment requires complex analysis to be made intelligible for the engineer or technician.   This presentation elucidates a selection of examples using JMP for exploration and visualization of abatement equipment operating parameters for fault identification and engineering change assessment. Basic data visualization was used to identify physical phenomena happening within the abatement system to assess the root cause for fault troubleshooting. More advanced analyses, including clustering and variability charts, were used to identify issues related to specific operating modes and to assess the impact of engineering changes.         Good  morning,  everybody. My  name  is  Chris  Jackson. I  am  an  Applications  Engineer for  Centrotherm. We  design  and  build  point-of-use   gas  abatement  systems for  use  in  the  semiconductor   and  other  industries. Today,  I  have  the  opportunity   to  give  a  short  presentation on  how  we  found  a  space  for  JMP   in  our  part  of  the  industry and  how  it  helps  us  both  in  troubleshooting for  industrial  applications as  well  as  for  assessment and  justification  of  continuous  improvement  initiatives, engineering  changes,  things  like  that. A  little  bit  of  background   just  to  get  everyone  on  the  same  page, I  want  to  say  a  couple  of  words  about what   point-of-use  abatement  systems  are. I've  got  a  little  cutaway   of  one  of  our  tools  here  on  the  side. The  short  version  is  this: you've  got  a  manufacturing  tool   up  on  the  factory  floor doing  whatever  it's  doing in  the  semiconductor  manufacturing  process that  produces  harmful  gasses   as  a  byproduct, greenhouse  gasses, toxic  gasses,  flammable  gasses. Generally,  things  you  don't  want   to  go  in  the  atmosphere. Then  our  tools   take  those  waste  gasses  in, they  destroy  them   through  thermal  energy, they  wash  them  out, and  you  release  clean  air   to  the  factory  exhaust. Because  these  tools   are  safety  and  environme nt-critical, a  fault  in  one  of  them means  that  your  production  line   is  at  least  in  part  shut  down. If  you  can't  treat  your  byproducts, then  you  can't  run. In  a  high- volume   manufacturing  environment, as  so  many  semiconductor  FABs  are, even  small  delays  are  incredibly  costly. We  as  suppliers  and  servicers, have  to  have  a  means   to  quickly  identify  problems and  bring  the  tools  back  online. Historically,   troubleshooting  usually  means opening  the  tool, looking  visually   to  identify  failing  components often  after  some  period   of  root  cause  analysis. But  with  a  modern  FAB  environment and  the  data  generated   by  SCADA  or  IoT  systems, we  have  mountains  of  data  available to  investigate  faults before  we  ever  touch  the  equipment. 
That  gives  us  a  way  to  guide troubleshooting  in  the  field, and  in  some  cases  for  intermittent  faults, it  even  lets  the  factory  keep  running while  we  investigate  digitally rather  than  physically minimizing  the  time  lost to  troubleshooting  and  investigation. The  problem  with  this  mountain  of  data is  a  scale  issue. The  higher  the  resolution  of  your  data, the  better  look  you  can  get   at  what's  happening  instantaneously in  any  of  these  pieces  of  equipment. That  higher  resolution  however, comes  with  an  overhead. You  need  more  and  more  computing  resources to  effectively  analyze  it, and  that's  where  JMP  comes  in  for  us with  the  capacity  to  handle   very  large  data  sets, and  it  becomes  a  tool   for  visualization  and  exploration that  can  really  drastically  improve troubleshooting. It  lets  an  engineer  or  a  technician quickly  explore   and  visualize  important  parameters within  your  data  sets, and  these  data  sets  are  at  a  scale sometimes  that  are  just  unmanageable for  a  lot  of  other  visualization  tools. With   that, I  want  to  jump  right  into   the  first  example  case  we  have  here, and  we're  going  to  identify an  intermittent  single- component  failure just  through  data  visualization. No  statistics,  no  modeling, just  the  ability   to  sift  through  and  visualize  the  data. Here  we've  got  a  chart   showing  ionization  current  versus  time. It's  one  of  a  number  of  parameters, ionization  current, that  we  use   as  a  health  monitor  for  the  equipment. This  tool  was  having  issues in  which  it  would  run  for  a  couple  of  days and  then  seemingly  randomly fail  and  shut  down. For  context,  this  current  should  be a  flat  horizontal  line  at  25.5, so  it's  pretty  clear  from  the  outset that  we  have  a  problem. It's  also  pretty  clear  what  I  was  talking  about regarding  data  set  size. This  data  set  right  here   is  almost  six  and  a  half  million  rows. Six  and  a  half  million  rows  with, when  you  pull  in   all  of  the  tool  parameters, 500  columns. The  file  for  this  data  set   is  about  20  gigabytes  in  size, absolutely  massive  amounts  of  data. Before  we  even  do   any  statistical  analysis,  like  I  said, we  can  start  to  do   some  problem- solving  off  of  this  data  set just  with  visualization. Initially,  it  doesn't  really  look  like there's  any  clear  shape  to  this  data. We  know  something's  wrong, but  we  don't  know  what. But  when  we  zoom  in, all  of  a  sudden   we  start  to  see  some  structure. This  looks  pretty  periodic  to  me. We  zoom  in  a  little  bit  more and  we  see  that  it  is  in  fact very  periodic. Each  one  of  these  little  spikes  down, disregarding  magnitude, is  timed  five  minutes   almost  exactly  from  each  other. That  immediately  begs  the  question  then, do  we  have  some  component, a  valve,  a  flow  controller,  a  motor, something  that  actuates   every  five  minutes? We  identify  that  component. Now  we  have  a  really  likely troubleshooting  culprit. The  troubleshooting  plan  changes from  open  the  tool  and  investigate, which  could  take  a  couple  of  hours, to  open  the  tool   and  change  this  one  targeted  component. 
We  just  shrunk  the  actual  time   that  we  need  to  be  in  the  equipment from  a  couple  of  hours   looking  at  everything to  see  what  might  be  failing to  a  single  hour,  get  in  there,   change  this  part,  get  back  out. In  this  particular  case, that  was  the  failing  component, we  were  able  to  identify  it. Problem  identified,  plan  made without  ever  having  to  open  the  equipment. We  were  able  to  get  there   with  just  the  conclusions that  we  were  able  to  draw   from  visualization. Of  course,   JMP  is  not  just  a  tool  for  visualization. It  also  has  at  its  core  a  very  robust suite  of  statistical  analysis  platforms. If  we  start  to  apply  those  to  the  data, we  can  get   even  more  exciting   and  interesting  results. I'll  just  jump  right  into   the  second  case  here. In  this  case, we're  looking  at  a  specific  tool, which  is  working  fine  most  of  the  time, but  it  does  have  occasional problems  with  buildup, sometimes  we  got  to  draw  our PM  in a  little  earlier  than  we  would  like. We  want  to  take  a  look   at  our  health  parameters and  see  if  there's  any  abnormalities, any  optimizations  we  can  make. The  approach  that  I  use  here is  applicable  for,  really,   any  industrial  application that  has  defined  operating  modes. Because  we  can  draw  those  modes   out  of  the  data  very  easily using  clustering. In  this  case,  our  abatement  has, or  this  specific  abatement, has  three   pretty  well- defined  operating  modes based  off  of  these  two  input  gasses. I  use  K Means  clustering. You  could  use  whichever   version  of  clustering  you  prefer. But  I  run  that  over  the  data   to  sort a ll  of  our  rows,  all  of  our  points into  these  three  operating  modes. If  you  have   more  than  three  operating  modes, obviously,  you  can  use  more  clusters. But  it  also  gets  interesting, what  if  you  don't  know how  many  modes  you  have? Maybe  they're  customer-defined, or  maybe  there's  a  suspicion  that, "Hey,  could  there  be   some  interstitial  mode  here?" Maybe  the  transition  state between  two  of  these  operating  modes. If  you  want  to  investigate  that  way, you  can  use  iterative  clustering. I  did  that  down  here. You  just  run  from,  I  used  3- 10  clusters, and  the  software  will  identify   what  the  optimal  number  of  clusters  is. Looking  at  this,   it  is  correctly  identified. It  gives  us   these  cubic  clustering  coefficients, identifies  the  optimal  one, that,  yes,  as  suspected,   three  is  the  optimal  number  of  clusters to  sort  this  data  into. I'm  not  really  worried   about  these  state  transitions. I'm  really  more  focused on  the  states  themselves. We  take  that  data,   we  get  a  readout  of  it, and  we  throw  it  up  onto  this  3D  scatter  plot. We  take  some  of  our   tool  health  parameters, and  we  color  everything by  what  cluster  they're  in. Immediately,   we  start  to  see  some  interesting  results. We  talked  about  ionization  current   should  be  solid  at  25.5, and  we  see   that  we  have  some  variability  here. It's  dropping  below  that. Immediately   we  know  that  we  have  a  problem. But  what's  more  interesting is  that  every  single  one  of  those  points is  grouped  into  a  single  cluster, cluster  two,   which  corresponds  to  this lowest  input  gas  one, highest  input  gas  two. 
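A rough sketch of this clustering-plus-coloring workflow in JSL. The column names are placeholders, and the platform option and message names (Number of Clusters, Save Clusters) are written from memory; saving a script from an interactive run of Analyze > Clustering is the reliable way to get the exact form:

// Sort rows into operating modes from the two input-gas signals
km = K Means Cluster(
	Y( :Input Gas 1, :Input Gas 2 ),
	Number of Clusters( 3 )          // or a range such as 3 to 10 for the iterative search
);
km << Save Clusters;                 // writes a Cluster column back to the data table

// Color rows by cluster, then look at health parameters in 3D
dt << Color by Column( :Cluster );
Scatterplot 3D( Y( :Ionization Current, :Input Gas 1, :Input Gas 2 ) );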
Now, from an engineering perspective, if I'm looking to make optimizations or I'm looking to improve tool health, I can immediately say, "Hey, this is the operating mode that we need to look at." That's what I need in order to start looking at concrete next steps for improvement. I'm not looking at the tool as a whole; I've already managed to focus my search to one operating mode. The last thing I want to talk about then, having looked at two of these use cases, is: what are the other advantages of JMP? Why JMP? My customers are collecting all this data. They have ways to view it. They have SCADA systems and monitoring systems in place. They have ways to parse it. So why do I, as a supplier and servicer, need this platform to view and parse the data? The answer for me, at least in my case, is the cross-platform compatibility. If I'm reliant on my customer to chart and generate data views for me, I'm now taking up their time and their resources to troubleshoot a problem that I'm responsible for fixing. With JMP, as long as they can give me the raw data, I can do all of it myself. Not only is that freeing up their resources, it gives me the ability to do my own investigation independent of whatever system they're using for data analysis. It doesn't matter if they're using proprietary monitoring system A or B or C, or if they're using their own IoT monitoring system from their control engineers. It doesn't even matter if they have multiple data acquisition systems from different vendors. With JMP, I can import and manipulate whatever data they give me and perform these kinds of analyses, source-independent, and do the investigation that I need to do for my customer support with all the tools for visualization and statistical analysis that JMP provides. With that, it looks like we're pretty much at time here. I know this isn't necessarily the traditional use case for JMP from some of the folks that I've talked to, but I hope it was helpful for people. I'd just like to thank Adam Stover, our CTO, and Gordon Tendik, our Director of Apps and Technology, for helping me put all this together and reviewing the work that I've done. Thank you for your time.
Monday, October 16, 2023
While the question "How many (parts/subjects/runs) do I need?" is one nearly every statistician dreads, it is an important question and should be asked prior to running any study or experiment. The answer seems simple enough. Just plug some numbers into a calculator and off you go! In my experience, though, sample size calculations are rarely that easy.   JMP 16 introduced an entire suite of Sample Size Explorers, with more added in JMP 17. But why call them "explorers" and not "calculators"? Because sample size is more than a calculation. It is an integral part of a study design, and to determine a sample size, more than math is needed. This presentation explores sample size from the concept to the execution. While the examples include sample size explorations for medical device or diagnostics studies, the lessons learned are applicable across industries.     What  we're  going  to  talk  about  today  is a  simple  introduction to  sample  size  thinking. Then  we'll  look  at  two  examples;  one, comparing  the  mean  of  two  populations, and  the  second,  looking  at  a  study with  a  proportion  endpoint, and  we'll  wrap  up with  some  additional  thoughts. A  question  I'm  often  asked  is, what  sample  size  do  I  need? One  might  think,  "Oh,  that's  easy. Just  use  a  sample  size  calculator." But  wait  a  second. Why  does  JMP  call sample  size  calculators  explorers? Why  are  they  in  the  DOE  menu? Which  one  do  I  use? Well,  let's  talk  about some  sample  size  basics. A  sample  size  is  calculated prior  to  running  a  study. A  study  is  an  experiment designed  ahead  of  time. That's  why  they're  in  the  DOE  menu. Sample  size  depends on  the  goal  of  a  study. I  often  call  this,  are  you  making  a  $5 decision  or  a  $50  million  decision? Are  you  looking  at  a  regulatory  clearance, a  publication,  an  R&D  question, or  a  simple  exploration? What's  the  primary  endpoint  of  your  study? What  are  you  trying  to  show? How  is  your  study  design? What  are  your  outcome  assumptions? These  might  be  based  on  prior  knowledge, a  pilot  data,  or  often, or  just  simply  guessing. Sample  size  is  a  risk-benefit  exploration. That's  why  they're  called sample  size  explorers. You  want  to  explore how  different  assumptions are  going  to  impact  your  sample  size. Now,  more  is  generally  better, but  as  we  all  know,  more  costs  more, and  more  might  not  be  possible. Let's  start  with  a  simple  example of  sizing  the  study for  comparing  two  means. We'll  at  the  Fit  Y  by  X  platform, and  we'll  look  at  the  Power  Explorer for  two  independent  sample  means. This  sample  size  example  is  based  on a  real  situation  where a  company  is  in  the  R&D  phase. They're  doing  a  sample  collection  study. That  could  be  blood,  nasal  swabs,  saliva. There's  no  primary  endpoint because  it's  an  R&D  study. They're  still  in   the  R&D  phase, but  they  need  a  sample,  a  power  analysis. They  were  asked  for  power  analysis by  the  entity  that  is  considering funding  the  project. How  can  we  provide  a  power analysis  without  a  primary  input? 
Well,  best  thought  here  is  one, we  could  say, "Hey,  we  can't  do  a  power  analysis," or,  knowing  that the  funding  entity  wants  a  power  analysis to  show  that  we've  thought  about  the  study and  we've  thought  about  how  many  people were  asking  them  to  enroll, we  could  generate  a  research  endpoint. In  that  case,  we're  going  to  ask, "Can  I  distinguish  the  difference  in  means between  my  sick  and  healthy  subjects for  some  primary  biological  markers?" We'll  use  the  sample  size from  the  power  analysis and  the  expected  prevalence of  illness  to  justify  the  number of  subjects,  we're  requesting to  enroll  in  the  study. I  need  to  understand  test  for  comparing to  independent  means, and  I  need  a  calculator  for  the  power of  a  test  to  compare  to  independent  means. What  I  like  to  ask  myself  is if  I  had  data,  what  would  I  do? If  I  understand  what  analysis I'm  going  to  do, that's  going  to  help  me  determine what  sample  size  I  need. Sometimes  you'll  have  pilot  data, and  sometimes  you  can  just  make  up  data to  help  you  figure  out what  analysis  are  you  going  to  do and  what  sample  sizing  should  you  do. Let's  take  a  look  at  this. I'm  going  to  open  a  data  table, and  this  is  just  generated  data. I've  got  a  sick…  I  have  15  sick patients  and  15  healthy  patients. I'm  going  to  do  a  Fit  Y  by  X. I'll  do  a  couple  of  things  here: our  range,  I'm  going  to  jitter  my  points, I'm  going  to  run  a  T-test, and  I  like  to  look  at  the  densities. Here's  two  examples of  what  some  data  might  look  like. On  the  left is  a  fairly  separated  populations of  outcomes,  the  biomarker  number  one. The  difference  is  about  2, 2.5. These  were  generated from  a  normal  distribution as  were  the  ones  on  the  right-hand  side. Here,  the  difference  is  a  little  less. You  can  see  in  both  places, we  would  conclude  that  there's  a difference  between  these  two  populations. The  one  on  the  right  being  closer  together is  harder  to  differentiate than  the  one  on  the  left. We  used  a  T-test  for  that. Now  the  question  becomes, how  many  samples  would  I  need if  I'm  going  to  run  this  experiment? Again,  let's  look  at  that. Let  me  just  step through  my  Workflow  Builder so  it  closes  down  our  data  tables. DOE  Sample  Size  Explorer p ower. I  want  power  for  two independent  sample  means. I  pull  that  up,  you'll  see  that  there's quite  a  few  things  to  look  at. First,  we  have  the  test  type. It's  going  to  be  two-sided. Our  Alpha  is  0.05, and  the  group  population  standard deviations  are  not  assumed  to  be  known. We're  guessing  at  those. To  calculate  my  sample  size, I  need  to  fill  in  this  information. This  is  my  calculator  part. I  have  two  groups. I'm  going  to  start over  here  on  the  right-hand  side. I  have  two  group  standard  deviations to  put  in  estimates  for. I'm  going  to  assume  that  one  group is  less  variable  than  the  other  group. Next,  I  need  to  fill in  the  difference  to  detect. Here,  I'm  using  standard  deviation  units, and  I'm  going  to  say  I  want  to  detect a  one  standard  deviation  unit  difference. Next,  I've  got  right  now  sample  size of  30  in  each  group that  gives  me  a  very  high  power. 
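As a parenthetical reference before the power numbers are adjusted: scripted, the Fit Y by X step from a moment ago is a one-call launch of the Oneway platform. The column names are placeholders for the generated table:

// Two-group comparison of a biomarker with an unpooled t test
Oneway(
	Y( :Biomarker 1 ),
	X( :Group ),            // sick vs. healthy
	t Test( 1 )
);
// Jittered points and the density comparison shown in the demo
// can be turned on from the platform's red triangle menu.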
I'm  going  to  lower  this  power  to  90, and  I  see  that  for  a  power  of  90 to  detect  a  difference  of  one between  these  two  groups, I  need  a  sample  size of  15  subjects  in  each  group. That  seems  reasonable. Now,  you  can  look at  these  graphics  to  see  that how  your  guesses,  your  assumptions might  impact  the  power  of  your  study. We  can  see  that  the  standard  deviations have  quite  a  bit  of  impact. As  my  standard  deviation  increases, so  my  data  becomes  more  spread  out, my  power  decreases. It's  going  to  be  harder to  detect  this  difference. You  can  see  we're at  a  sweet  spot  in  the  sample  size. As  I  increase  the  sample  size, my  power  is  going  to  increase, but  not  terribly  greatly. As  I  decrease,  if  I  went  down  to  about  10, my  power  is  going  to  go down  to  about  eight. But  let's  go  back  to  the  point  now. I  want  about  15  samples  per  group. In  this  instance, to  get  15  positive  samples from  a  study  where  I'm  enrolling  people, and  if  I  have  a  10 %  prevalence  rate of  sickness  over  the  study  period, I  would  need  about  150  subjects. If  the  prevalence  was  low  or  say,  only 5%,  then  I  would  need  300  subjects. Again,  sample  size is  a  risk  benefit  calculation, so  we  want  to  consider various  sample  sizes. All  right,  now  to  our  second  example. This  is  sizing  a  study with  a  proportion  endpoint. We'll  use  the  distribution  platform and  we'll  use  the  Interval  Explorer for  one  sample  proportion. This  is  based  on  the  question of  how  many  samples  do  I  need to  demonstrate  sensitivity and  specificity  for  regulatory  filing? I  do  a  lot  of  work  in  diagnostics. In  diagnostics,  sensitivity  is  simply the  proportion  of  positive  cases that  your  test  calls  positive, and  the  specificity  is  the  proportion of  negative  cases that  your  test  calls  negative. We  generally  calculate  sample  size for  each  of  these  metrics  individually, and  then  we  add  for  the  total  sample  size for  a  retrospective  study where  I've  already  got  samples, perhaps  in  a  freezer  or  from  a  partner, and  I'm  just  going  to  pull out  the  ones  that  I  need. For  a  prospective  study, again,  we  would  use  the  prevalence to  calculate  the  total  number of  subjects  to  enroll, similarly  as  we  did  in  the  last  example. Again, we  need  some  preliminary  information. The  goal  of  this  study is  a  regulatory  filing, so  a  high  level of  evidence  is  needed. Then  this  particular  industry  sector, I  need  to  demonstrate that  the  lower  confidence  limit for  sensitivity  and  specificity is  greater  than  80 %. The  study  design is  a  retrospective  study. It's  a  review  of  CT  scan . The  assumptions  are  that the  sensitivity  of  identifying the  outcome  is  0.9 and  specificity  is  0.85. I  need  to  understand the  confidence  interval  as  an  outcome, and  I  need  a  calculator  for  confidence interval  for  proportion. Again,  the  question, if  I  had  data,  what  would  I  do? Let's  look  at  that. Again,  I  generated  some  data. I  have  a  reference  standard where  I  had  about  145  negative  samples and  144  positive  cases  or  samples. Then  I  have  the  test  results, positive  and  negative. You  can  see  they're  not  perfect. 
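As a cross-check of the kind of arithmetic the Power Explorer does for the two-means example above, here is a minimal sketch in Python, outside JMP, assuming equal group standard deviations and a difference expressed in standard-deviation units. Because the demo used unequal standard-deviation guesses, the numbers below differ from the explorer's 15 per group; the point is only to show the same explore-the-assumptions pattern.

```python
# Minimal sketch, not the JMP explorer: power for a two-sample t-test under
# the simplifying assumption of equal group standard deviations.
from statsmodels.stats.power import TTestIndPower

explorer = TTestIndPower()
d, alpha = 1.0, 0.05          # difference to detect, in standard-deviation units

# Per-group n for 90% power (about 22 under these equal-SD assumptions).
n_per_group = explorer.solve_power(effect_size=d, alpha=alpha, power=0.90,
                                   ratio=1.0, alternative='two-sided')
print(round(n_per_group, 1))

# Risk-benefit exploration: how power moves as the per-group n changes.
for n in (10, 15, 22, 30):
    power = explorer.solve_power(effect_size=d, alpha=alpha, nobs1=n,
                                 ratio=1.0, alternative='two-sided')
    print(n, round(power, 2))
```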
Some  of  the  cases  that  the  test  calls negative  are  actually  positive, and  some  of  the  cases  that  the  test calls  positive  are  actually  negative. How  would  I  look  at  this? Well,  I  could  tabulate  it and  come  up  with  the  %  of  positive  cases that  the  test  calls  positive and  the  %  of  negative  cases that  the  test  calls  negative. But  I  want  confidence  intervals. I'm  going  to  use the  distribution  platform, and  I'm  going  to  look  at  the  proportion in  the  test  cases  by  the  reference  case. Again,  let's  add… We  want  to  add,  sorry, wrong  red  triangle  menu. We  want  to  add  confidence  intervals, and  I  held  down  my  Control key  to  broadcast  those. Now  I  can  look  and  see what's  going  on  here. For  the  cases that  by  the  reference  are  positive, the  new  method  calls  135 of  those  positive,  so  93.75, and  I  have  my  confidence interval  that  goes  from  88-96.6 You'll  see  this  note  here  that  says, computed  using  score  confidence  intervals. Then  the  thing  to  note  here is  that  a  score  confidence  interval is  not  symmetric. We  can  look  at  that. Here  I  generated  a  graphic, and  you  can  see  that when  we're  at  the  low  end, so  a  probability  of  0.1, you  can  see  that the  upper  confidence  limit  is  higher as  compared  to  the  point  estimate than  the  lower  confidence  interval. The  point  estimates  are  not  centered  in the  middle  of  these  confidence  intervals. That's  just  the  nature of  this  core  confidence  interval. The  question  now  is, how  many  samples  do  I  need to  show  that my  lower  confidence  limit  is  at  least  0.8, given  the  assumptions  of  here we  had  for  sensitivity, which  is  the  positive  side, that  we  were  going  to  be  greater  than  0.9, and  on  the  negative  side that  we  were  going to  be  greater  than  0.85. Here  we  can  see  that  at  0.85, my  lower  confidence  limit  is  only  0.78. I  would  need a  few  more  samples  in  order  to show  that  my  lower confidence  limit  is  greater  than  0.8. Again,  the  question  is  now that  I  understand  what  I'm  looking  for is  how  much  data  should  I  collect? Let's  go  to  DOE  Sample  Size  Explorer, confidence  intervals for  one  sample  proportion. Let's  put  in  this  example  here. Let's  put  in  our  proportion  of  0.9375, and  the  sample  size  that  we  had used  here,  which  was  144. I  had  left  the  interval  type  as  two-sided, confidence  intervals, confidence  level  is  95  %. With  the  sample  size  of  144, if  my  proportion  comes  out  to  be  93.75, my  margin  of  error  is  0.04. Okay, well,  what's  margin  of  error? Margin  of  error  is  the  half  width of  the  confidence  interval. If  it  was  a  symmetric  confidence  interval, it  would  be  your  plus  or  minus value  over  your  point  estimate. But  in  the  case of  a  score  confidence  interval, and  that's  what  this  calculator is  based  on, this  is  the  half  width of  your  confidence  interval. But  we  can  see  that… With  the  93.75, the  margin  of  error  of  0.04, it's  not  simply  a  minus  0.04 from  this  93.75 because  we  noticed  that when  we  did  this  calculation that  our  lower  confidence  limit  was  0.88. This  sample  size  is  more  than  sufficient for  what  we  needed. We  only  needed  a  confidence  limit  of  0.8. Let's  do  that  calculation. 
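As an aside, the "score confidence intervals" note can be checked outside JMP. The sketch below assumes the score interval reported is the Wilson score interval; with that assumption it reproduces the 135-out-of-144 numbers quoted above, including the asymmetry around the point estimate.

```python
# Hedged sketch: Wilson score interval for 135 positives out of 144 reference-positive cases.
from statsmodels.stats.proportion import proportion_confint

count, n = 135, 144
lower, upper = proportion_confint(count, n, alpha=0.05, method='wilson')
print(round(count / n, 4))                 # 0.9375 point estimate
print(round(lower, 3), round(upper, 3))    # roughly 0.885 to 0.967 -- not symmetric about 0.9375
```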
Let's  put  in  our  assumed  value  of  0.9, and  let's  put  in  a  margin of  error  of,  say,  0.08. We  know  that  0.1 is  going  to  underestimate  our  sample  size. If  we  do  this  and  we  say,  all  right, for  a  proportion  of  0.9, margin  of  error  is  0.08, our  sample  size,  it  says,  is  56. Okay,  well,  let's  double-check  that. To  do  that, I  constructed  a  calculator where  I  can  put  in  my  assumed  proportion and  I  can  put  in  this  value  of  56. If  I  run  this  distribution, and  what  I  did  here  is  I  have  an  outcome of  one  and  zero,  and  I  have  a  frequency. If  I  relaunch  this, I  use  the  outcome and  I  use  the  frequency  column to  give  me  the  distribution  as  if I  had  51's  and  six  zeros  in  my  data  file. Well,  what  does  that  look  like? With  a  sample  size  of  56, a  proportion  of  about  0.9, my  lower  confidence  limit, using  a  score  confidence  interval  is  0.78. This  sample  size  of  56  gives  me the  precision  that  I  asked  for, the  margin  of  error  of  0.08, but  it  doesn't  quite  give  me the  lower  limit  on  this  confidence interval  that  I  need  for  this  situation. Let's  put  in  a  slightly larger  sample  size. Let's  make  this  65. That  gives  me  a  margin  of  error  of  0.074, which  is  slightly  tighter  than  the  0.08, and  let's  see  what  that  looks  like in  my  score  confidence  interval. If  I  do  that,  now  I  see  that  my  lower confidence  limit  is  above  the  0.8. The  point  of  this  was  not  to… The  point  of  this  was  really  to  show  you that  it's  important  to  understand what  it  is  you're  trying  to  show, and  it's  important  to  understand what  is  it  that  your  sample  size calculator  is  providing  to  you. There  are  sample  size  calculators all  over  the  internet. Then  in  JMP,  we  have  a  whole  slew of  sample  size  calculators, explorers  to  look  at. It's  important  to  understand what  is  your  endpoint, what  are  you  trying  to  solve, and  what  is  it  that  your  calculator is  calculating  for  you. Once  you  do  that, then  you're  better  informed for  making  decisions  as  to  how many  samples  do  you  really  need. Let's  finish  up  with  just  a  few brief  comments  on  additional  topics. Other  ways  that  you  can get  at  sample  size. One  is  simulation. You  can  use  pilot  data to  define  distributions, use  random  number  generators  to  generate a  study  based  on  those  distributions. Then  you  can  analyze  that  data to  see  if  your  endpoint  is  met. Is  it  met? Yes  or  no? Then  you  can  repeat that  some  large  number  of  times and  calculate  the  portion of  times,  your  endpoint  is  met. In  a  sense,  your  power. How  likely  are  you  to  meet  your  endpoint given  your  assumptions? I  like  to  do  that. Simulation  is  useful. Again,  however,  it's  all  based on  your  assumptions. If  your  assumptions  are  wrong, your  sample  size  may  not  be  large  enough. Another  thing  that  often  happens is  that  we  have  to  make the  best  allocations  of  what  we  have. We  may  have  1,000  samples  in  the  freezer and  we  know  what  their  outcomes  are and  we  want  to  test  them  on  a  new  test or  we  want  to  develop  a  new  test. How  many  can  we  use  to  train  an  algorithm? How  many  do  we  need  to  use to  validate  that  algorithm? Sometimes  we  have  to  take the  sample  numbers  that  we  have. 
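The same kind of check can be run for the sample sizes explored above. The success counts below (50 of 56 and 59 of 65) are my own illustrative choices near a 0.9 observed proportion, not values from the demo, but they show the same pattern.

```python
# Sketch: does the lower Wilson (score) limit clear 0.8 at these sample sizes?
from statsmodels.stats.proportion import proportion_confint

for n, successes in [(56, 50), (65, 59)]:
    lower, upper = proportion_confint(successes, n, alpha=0.05, method='wilson')
    print(n, round(successes / n, 3), round(lower, 3))
# n = 56 leaves the lower limit near 0.78; n = 65 pushes it just above 0.8,
# consistent with the behavior described above.
```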
Use the sample size explorers to evaluate what you might be able to conclude, and then use those findings to decide whether what you have is sufficient to proceed with the experiments and the development of your test or product. That's what I have on sample size. It's more than a number. It's based on what it is you're trying to decide and how you're going to analyze the data once you get it. It's an exploration: you want to take into account how the assumptions you make impact those sample sizes, and hedge your bets for a great outcome. Thank you, and that's it.
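For readers who want to try the simulation approach mentioned above, here is a minimal sketch in Python rather than JSL. Every distributional assumption in it (normal outcomes, the means and standard deviations, a Welch t-test as the endpoint) is mine, chosen only for illustration.

```python
# Simulation-based power sketch: generate a study from assumed distributions,
# test the endpoint, repeat, and report the fraction of runs that meet it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(n_per_group, mean_diff, sd_sick, sd_healthy,
                    alpha=0.05, n_sim=10_000):
    hits = 0
    for _ in range(n_sim):
        sick = rng.normal(mean_diff, sd_sick, n_per_group)
        healthy = rng.normal(0.0, sd_healthy, n_per_group)
        _, p = stats.ttest_ind(sick, healthy, equal_var=False)  # Welch t-test endpoint
        hits += p < alpha
    return hits / n_sim

print(simulated_power(n_per_group=15, mean_diff=1.0, sd_sick=0.8, sd_healthy=1.0))
```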
The most novel, innovative, and promising therapeutics in biopharmaceuticals are cell therapies. Cell therapies transfer human cells into a patient to treat disease. These cells either come directly from the patient or from a healthy (cell) donor. Multiple regulatory guidance documents recognize the importance of providing cell therapy manufacturers the flexibility to improve their processes. Therefore, it is imperative to show that the pre- and post-change processes are comparable and process changes pose no threat to the safety or efficacy of the drug product.   One method used to ensure comparability is an equivalence test of means. There is a regulatory expectation that the study is done as a paired design, often referred to as a split-apheresis study, unless there is minimal donor-to-donor variability. In split-apheresis studies, the same donor material is split and processed in the pre- and post-change process for comparison. The design of this study presents unique challenges in cell therapies as they require adequate sample sizes to ensure properly powered designs, yet the number of healthy donors available is usually quite low (three to six donors). Additionally, the power depends on lot-to-lot and assay variability, assay replication strategy, and the effect size used for the equivalence acceptance criterion (EAC).   This talk presents a series of JMP scripts that extend the existing capabilities of the Sample Size Explorer platform to address each of these relevant study nuances, as well as the capability to overlay power curves to address trade-offs with different sample sizes and approaches.     I  am  Heath  Rushing. Although  Andrew  Karl,  and  Jeff  Hofer , and  Rick  Burdick,  some  teammates  of  mine , did the  majority  of  technical  work  here, I'm  going to  be  the   one presenting  today . I'm  going to  talk  about  how JMP  and  JMP  scripts  can  be  used in  a  very  particular  specific application   in  cell  therapies. I'm  going to  talk  a  little  bit  about what   gene and cell therapies  are and  the  very  specific  instance that  I  want to  talk  about is  comparability. I'm  going to  focus  on  process  changes. Interestingly  enough,  last  year, I  gave  a  talk , and  it  focused on  cell  and  gene  therapies. They're  very  novel  therapeutics. The  first  one  was  approved in  the  United  States  in  2017. A  little  bit  different  than  most  of , what  I  call   the  small  molecule  and  the  large  molecule  therapeutics that  you may  have  heard  of  in  the  past. Let  me  just  touch  base on  what  is  a  cell  and  gene  therapy. First  thing  I'm  going to  do is  touch  base  on  what  a  gene  therapy  is. What  you're  essentially  doing is  you're  replacing a  gene  with  a  healthy  one, or  turning  off   bad  genes. A  lot  of  cancers are  caused  by  defective  genes. What  you're  doing   is  you're  inserting these  healthy  genes  back  into  a  patient of  either  in  vivo  or  in  vitro. An  in  vitro  would  be  more of  a  bone  marrow  transplant. Last  year,  I  talked  about , the  challenge  with  gene  therapies is  that  patient -to -patient  variability. I  focused  on  process  development. Then  I  talked  about  cell  therapies. In  the  cell  therapies , what  you're  doing is you're  replacing  disease  cells. You're  either  transferring some  sort  of  healthy  cell  into  a  patient, or  we're   replacing missing  cells  into  a  patient. Where  do  these  cells  come  from? 
They  either  come from  the  patient  themselves, so  you  would  have  to  deal with  that  patient -to -patient  variability, or   in  most  cases, they  come  from  a  healthy  donor. Now  you're  not  dealing with  this   patient-to-patient variability, but  you're  dealing with  donor -to -donor  variability. Whenever  I  say  donor, I'm  talking  about  a  healthy  donor. I  could  be  a  healthy  donor. Then  someone  else could  be  a  healthy  donor  also. In  both  of  those  cases is  you  have  to  deal with  that   patient-to-patient or  donor -to -donor  variability. What's  interesting  is  last  year , I  gave  the  example   in  process  development, and  it  looks  something  like  this. It  was   the  exact  same  data  set that  I  used  last  year  that  I  said, say  that  you  were  developing  a  process where  you  look  time,  temperature , and  pH, and  you're  measuring  their  effect on  cell  viability  and  by  product. In  that  case  is, I  cannot  use  one  donor  material, I  had  to  split  that  up into  four  different  donors. I  said,  "If  you  ran  these  experiments for  process  development, and  you  did  not  consider that  there  was  donor -to -donor  variability, this  is  what  you  would  see. What  you  would  say   is  we're  looking for   P-values  that  are  below   0.05, you  would  say  nothing affects  cell  viability and  nothing  affects  by  product. You  were  not  able  to  detect that  you  had  any  significant or  critical  process  parameters for  the  very  reason that  you  do  not  consider that  there  could  be  a  difference  in  donor. Right   now,  if  you  do  consider  those as  what  it's  called  a  fixed  donor  effect, the  only  thing  that  I  did is  I  brought  in  donor. Then  you  see  that  NAND. This  really sticks  out what  significantly  affects  cell  viability and  what  significantly  affects  by  product. The  whole  talk  was  on  how  does that   donor-to-donor  variability affect  statistical  inference and  also  process  capability. I'm  going to  focus on  that  statistical  inference. What  are  you  trying  to  do in  process  development is   you're  trying  to  determine  if  things  like pH, and  temperature,  and  time significantly  affect your  critical  equality  attributes. Say  that  I  was  a  drug  manufacturer , and  I  have  set  up a  process  development  study . I  send  this  process  development  study . I  want  to  determine if  temperature  affects, and  I'm  going to  call  it  cell  viability. I  say,  "Hey  whenever  I'm  looking  at  that , is  I  want to  make  sure  that  if  something significantly  affects my  quality  attributes, I  control  that  in  my  process. But  if  it  doesn't,  I  am  not  spending  money and  time  and  resources  controlling  it." What  I'm  concerned  with as  a  drug  manufacturer is the  Type I  error  rate. I  do  not  want  to  inflate a   Type I  error  rate. A   Type I  error  rate  would  say, "Hey , this  is  significant when,  in  fact,  it's  not." What  do  you  think  that  regulatory  agencies would  be  more  concerned  with? You  controlling  more  things? Are  you  not  controlling  things that  should  be  controlled? That  is  exactly  right is  they'd  be  more  concerned about  that  patient  risk, that   Type II  error. In  process  development is  drug  manufacturers do  not  want  to  inflate  the   Type I  error. They  also  want  significant  power.  Why? Because   that  controls  that  patient  risk. 
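To make the fixed-donor-effect point concrete, here is a hedged sketch with synthetic data rather than the presenter's data set: the same three-factor model is fit with and without a donor term, and the factor effect only stands out once donor is in the model.

```python
# Synthetic illustration of a fixed donor effect in a process-development model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 16
df = pd.DataFrame({
    "donor": np.repeat(["D1", "D2", "D3", "D4"], n // 4),
    "temp": rng.choice([-1, 1], n),
    "time": rng.choice([-1, 1], n),
    "pH":   rng.choice([-1, 1], n),
})
donor_shift = df["donor"].map({"D1": -5, "D2": 0, "D3": 3, "D4": 6})   # donor-to-donor spread
df["viability"] = 80 + 1.5 * df["temp"] + donor_shift + rng.normal(0, 0.5, n)

without_donor = smf.ols("viability ~ temp + time + pH", data=df).fit()
with_donor = smf.ols("viability ~ temp + time + pH + C(donor)", data=df).fit()
print(without_donor.pvalues[["temp", "time", "pH"]].round(3))  # temp is masked by donor spread
print(with_donor.pvalues[["temp", "time", "pH"]].round(3))     # temp stands out once donor is modeled
```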
The  whole  point of  me  showing  that  last  year was  to  show  the  effective donor-to-donor  variability on  trying  to  determine your  critical  process  parameters. I  call  it  statistical  inference. Right   now, what  happens   if  I  change  my  process? I  had  a  colleague  just  last  week , I  was  working  with  her. Whenever  we're  talking  about  cell and  gene  therapies,  she  said, and  this  is  her  quote, "Heath , at  cell  and  gene  therapies, things  are  constantly  changing. You  could  have  things like  analytical  methods  change. You  could  have  things like  process  change." Today,  I'm  going to  focu s on this  process  right  here. Mainly,  I'm  going to  focus on  that  process  change. I  do  want to  point  out that  regulatory  agencies  understand that  you  have  a  need for  improving  your  process. Even  if  you  improve  your  process, are  you  changing  your  process ? They  recognize  the  need  for  that, but  they  also  recognize  the  need that  the  therapeutics that  you're  making  from  that  process should  be  similar in  terms  of  product  quality. You're  using  these  in  clinical  trials. What  does  it  mean  to  be  similar? That  doesn't  say  that  they  have to  be  exactly  the  same, that  they  have  to  be  similar or  comparable. In  terms  of  me  saying that  something  is  similar, what  I  want to  do  is  I  want to  make  sure that  I  have  some   similarity  condition. That's  the  whole  point  of  comparability. For  very  low  risk  attributes, what  I  can  do  is  I  can  show  that  process  A and  process  B is  similar  in  side -by -side  plots . For  more  higher  risk  attributes, what  I  want to  do   is  maybe something  like  a  quality  range. In  terms  for  quality  ranges, I  just  take  that  reference  group, the  old  process, and  I  built  some  range  around  it and  ensure  that  all of  th e measured  quality  attributes from  the  new  process fall  within  that  range. For  very  high  risk  attributes, what  I  want to  do is  I  want to  do  equivalen ce testing. This  is  what  I'm  going to  focus on  today, tell  you  about what  equivalence  testing  is, and  how  that  acceptable  difference or  that  similarity  condition  is  set. It's  called  equivalence  testing, Two One Sided  t-test. To  reiterate  what  we  talked  about  before, whenever  I'm  using  design  of  experiments in  process  development, what  I  do   is  I'm  changing some  variable -like  temperature from  low  to  high , and  I'm  measuring  the  effect on  my  critical  equality  attributes. I  am  assuming  in  the  null  hypothesis that  they  are  the  same. What  I  do  is  I  set  up  a  design to  see  if  they're  different. A   Type I  error  in  that  case would  be  me  saying, "Wow,  they're  different " when , in  fact,  they're  not. That  would  mean  that  I  would  control  that. I  would  spend  resources controlling  that  in  the  process. If  I'm  a  drug  manufacturer, I  do  not  want  to  control  things that  I  don't  need  to. I'm  concerned about  that   Type I  error  rate . If  I  was  a  regulatory  agency, I  would  be  even more  concerned  with  the   Type II. There's  no  difference when,  in  fact,  there  is . You  should  be  controlling  something and  you're  not. If  I  was  a  regulatory  agency, I'd  be  more  concerned with  the   Type II  error. Now, we're  going  to  flip  it. We're  going to  talk about  equivalence  testing. 
Equivalence  testing is  I'm  not  saying  that  they  are  the  same. I  am  assuming  that  there  is  a  difference. I just  want to  make  sure that  the  difference  isn't  too  big. That  too  big,  I'm  going to  call  delta. T here's  a  lot  of  different  ways to  calculate  that  delta. I'm  going to  call  it  d or  that  delta  right  there, often  called  the  equivalence acceptance  criteria. I would  like  to  come from  subject  matter  expertise, but  the  majority  of  times, it  comes  from  me  taking some   k-value  times  that  historical  value. That's  split  into  two  different  tests . On  one,  I'm determining if  it's  less  than  positive  d. In  the  other  one , I  want  to  show in  the  alternate  hypothesis that  that difference is  greater  than  a  negative  d. I'm  going  two  different  sides. That's  what's  called the  left -hand  side  of  the  bottom, or  the  top. In  terms  of,  if  I  was  a  drug  manufacturer, what  would  I  want  to  do? I would  want  to  be  able  to  reject both  of  those  hypotheses. I  would  want  high  power, low   Type II  error  and  high  power. T his  is  equivalent  to  taking a  90%  confidence  number around  the  difference  in  means in  ensuring  that  90%  confidence  interval whenever  I'm  looking  at  the  low  and  high or  within  the  balance  of  those  lower  delta in  the  upper  delta. If  you're  looking  at  this, you  should  think  to  yourself  is, "I  want  the  width  of  that  confidence interval  to  be  very  small." What  are  the  different  ways that  I  could  make  the  width of  that  confidence  interval for  the  difference  between those  two  means  very  small? I  could  decrease  my  standard  deviation. That's  a  good  thing. I  could  increase  my  sample  size. That's  a  good  thing. I  could  also  increase  my  alpha  level. Maybe  that  wouldn't  be  so  good because  what  you're  doing is  you're  inflating your   Type I  error  rate. In  inflating  your   Type I  error  rate, what  you're  saying   is, I  am  stating  that  they're  equivalent when  indeed they're not. The  different  ways  to   control  the  width of  that  confidence  interval is  to  lower  s, increase  in,  or  increase  alpha. We  talked  about   two of  those  being  good and  one  of  those  not  being  good. It  makes  sense  that   if  I'm  a  drug  manufacturer, I  want  to  maximize the  power  of  the  design. That's  the  flip. I  want  to  minimize  my   Type II  error. Regulatory  agencies  want  to make sure that  you  do  not  inflate that   Type I  error  rate. That   Type I  error  rate would  be  saying  your  assume  equivalent or  you're  stating  equivalence when  indeed  they're  not. In  JMP,  you  can  do these  equivalence  tests, and  I  want to  show  you  an  example  of  that. For  my  journal,  first  thing  I  want to  do is  I  want to  show  you that  in  terms  of  determining your   Type I  error  rates and  your   Type II  error  rates is  JMP  provides  power  curves under  Sample  Size  Explorer , Power , Two  Sample  Independent  Equivalence. Caleb  King  did an  awful  great  job  with  this. I say  awful  great  job, but  he  did  a  great  job  with  this. Let's  just  say  that  my  margin, my  equivalence  acceptance  criteria, is  plus  or  minus  2  standard  deviations. I'm  just  going  to  put  a  2  here, and  that's  just  2  times  the  standard deviations  that  I'm  talking  about. That's  all  that  I'm  doing. 
Let's  just  say  that in  my  historical  process is  I  have  10  lots, and  I'm  going to  compare  it to  a  new  process  that  has  5  lots. I  want  to  see  what  the  power  is if  they  are  exactly  the  same, but  there's  no  difference  between  these. A  few  things  that  I  want to  point  out  here is  JMP  gives  those  power  calculations. The  other  thing  that  it  does  is it  allows  you  change  those. What's  going to  happen  if  I  do  things like  increase  my  new  process, the  number  of  samples in  my  new  process  day  report , my  power  is  going to  go  up. What  would  happen  if  I  do  things  like, "Hey,  Heath, I  want  to  decrease  that  margin  of  error to,  instead  2  standard  deviations , to  say  maybe  1.5  standard  deviations, essentially , as  I'm  taking  those  boundaries and  I'm  tightening  them  up." What  I  see  is  my  power is  going  to  go  down. I'm  able  to  ask  myself   all  those  typical  questions that  you  would  in  equivalence  testing. This  is  something  else  that  I  want to  show  you  that's  going to  come  up is  JMP  has  the  ability  to  say, do  I  know   the  true standard  deviation  or  not? If  I  know  the  true  standard  deviation, that  is  going  to  be  better. You're  going to  see that  your  power  goes  up. Indeed,  what  happens is  my  power  goes  up. That's  usually  not  the  case. I  always  call  that  the  utopia, which  uses  the  cases  if  I  do  not  know what  that  true  standard  deviation  is. I  always  call  this  the  optimum, a  car  of  the  utopia. I  always  call  the  no,  the  realism. I  would  be  remiss if  I  did  not  show  you  the  tools that  JMP  does  have for  showing  that  equivalence like  if  I  had  an  historical  process where  I  had  10  lots  and  I  made  5  new  ones. First  thing  I  want to  do is  I  want to  look  at  this through Graph Builder, and  I  see  that  there  is  no  effect between  those  two. I  can  see  both  of  those , and  they  both  look  like they  came  from  the  new  process, the  blue  versus  the  red. How  about  if  there  is  an  effect ? What  I  do  is  I  see  a  shift. Just  like  I  showed  you  before is that  is  Two  One Sided   t-test. JMP  has  tools  for  that. Jin  Feng  did  a  great  job  with  this. My  goodness.  I  love  the  scores  plot. Here's  the  difference  in  means. Here's  the  lower , and  here's  the  upper, and  that's  within  the  boundaries. In  that  case,   what  you've  done is  you  rejected  both  the  null  hypothesis in  favor  of  the  alternate, which  is  the  same as  what  you  see  in  the  picture. What  you  also  see  here is  that  if  there  is  an  effect, is  I  am  not  going to  reject  both  the  nulls. One  of  those  is  I  am  going  to  fail to  reject  and  indeed  I  did. What  you'll  see   is  my  confidence  intervals outside  that  boundary. I  would  like  to  talk about  a  very  specific  case. A  very  specific  case  in  cell  therapy is  called  split  apheresis  design. In  a  split  apheresis  design , this  is  a  situation  where in  cell  therapies is  you're  changing  the  process. What  you  do is  you're  using  donor  materials  split between  the  two  different  processes. We  kept  getting  questions over  and  over  and  over  again from  our  customers  about, "Can  I  look  at  the  sample  size and  power  calculations for  these  pair  of  designs ?" Cannot  overlay  them . 
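For the equivalence test just shown, here is a minimal sketch of the TOST logic outside JMP, assuming made-up lot values and an equivalence acceptance criterion of two historical standard deviations; it simply checks whether the 90% confidence interval for the difference in means sits inside plus or minus delta.

```python
# Sketch of the two one-sided tests (TOST) decision via a 90% confidence interval.
import numpy as np
from scipy import stats

process_a = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1])  # 10 historical lots
process_b = np.array([10.2, 9.9, 10.1, 10.4, 10.0])                              # 5 new-process lots

delta = 2 * process_a.std(ddof=1)            # EAC = k * historical SD, with k = 2

n1, n2 = len(process_a), len(process_b)
sp2 = ((n1 - 1) * process_a.var(ddof=1) + (n2 - 1) * process_b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
diff = process_b.mean() - process_a.mean()
half = stats.t.ppf(0.95, n1 + n2 - 2) * se   # 90% interval = alpha of 0.05 on each side
ci = (diff - half, diff + half)

print(round(diff, 3), tuple(round(v, 3) for v in ci), round(delta, 3))
print("equivalence shown" if -delta < ci[0] and ci[1] < delta else "equivalence not shown")
```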
You cannot  see  if  they're  dependent upon  that  donor -to-donor  variability? Let's  talk  about  a  split  apheresis  design. In  a  split  apheresis  design, first  thing  I  want to  do is  I  want  to  tell  you  about the  regulatory  expectation. This  is  even  a  recent  draft guidance  document  from  the  FDA in  July  of  2023,  just  last  month. In  that ,  they  said  that  you  need to  select  a  suitable  statistical  test for   analysis in  difference between  paired  data where  those  donors  are  paired  up. That's  where  the  split  apheresis  design comes  from. For  every  single  donor  material that  you  have, you  split  it  in  between process   A and  process  B. This  is  not  two  independent   t-tests. What  this   is, is  a  paired  design. That's  the  first  thing that  I   wanted  to  talk  about. The  second  thing  is, I  wanted  to  talk  about that  you  are  very  often  in  early  stage, so  you  do  not  have a  line  of  donor  materials, so  you  have  very  low  sample  sizes. It's  hard  to  get  power out  of  low  sample  sizes. The  third  thing that  I'm  going  to  tell  you  is , how  do  you  come  up  with  your  EAC? How  do  you  come  up with  your  similarity  condition , that  difference, that  acceptable  difference? What  you  do   is  you  use  historical  data that  is  made  off  of  multiple  donors. You  take  the  standard  deviation used  off  of  historical  data. I'm  going to  call  that  n 1  or  historical. You  take  some  k  number of  standard  deviations  of  historical  data . You  do  a  test , and  you're  using the  split  apheresis  design to  judge  off  of  that  historical  data. These are two examples that I want to show you. The first example here is  where  you're  looking at  process   A and  process  B. What  you  see   is  you  do  see six different  donors  here. What  you  see  in  the  one  on  the  left is  the  majority  variation  is  coming from   donor-to-donor  variability, not  the  difference  between process  A and  process  B. You  have  high  donor -to -donor  variability. I'm  going to  call  that,  prho. In  the  case  on  the  right, what  you  do   is, is  the  majority  variation is  coming  from  the  difference between  process  A  and  process  B, not  the   donor-to-donor  variability. The  majority  variation  is  coming from  the  analytical  or  the  process. What  that  tells  you is  you  have  very  low  rho. You'd  have  low   donor-to-donor  variability. I'm  going to  show  you  a  series of  scripts  that  we  worked  on. These  are  typical  questions that  came  from  our  customers. In  our  cases,  we  do  not  know what  the  standard  deviation. How  does  that  compare  to  the  known? How  about  those   Type I and   Type II  error  rates? Remember,  if  I'm  a  drug  manufacturer, I  want  to  increase  the  power. If  I'm  a  regulatory  agency, I  want  to  make  sure  that  you  do  not inflate  that   Type I  error  rate. How  are  we  going to  do  this? This  is  from  the  European Medicines  Agency,  2001. The  best  way  to  do  that  is  with  things called  expected  operating characteristic  curves. That  gives  you  power  on  the   y-axis and  a  shift  in  the  main. I 'm  going  to  go  through a  series  of  scripts, and  these  series  of  scripts ... It's really one script that  have  right  here, that  it's  going to  allow  me to  change  things  like  that  rho, that  proportion of   donor-to-donor  variability. 
That  k -value,  remember, how  do  I  set  the  acceptance  criteria? It  is  k  times  that  standard  deviation. The  typical  way  of  doing  this is  that   k times  those  historical  lots. This  is  the  number  of  historical  lots that  you  use   n1. n2 is  the  number  of  lots  that I'm  going to  use  for  that  paired  design. Whenever  you  run  the  script, what  happens   is  you  come  out , and  it  does  a  series  of  simulations. In  this  case,  it  did  5,000  simulations , and  it  calculates  the  power  for  you. In  those  5,000  runs, what  percentage  of  those  passed? It  looks  something  like  this. It  gives  you  a  lot  of  different  options. My  goodness . I  can  look  at  different  k -values. I  can  look  at  a  different  number  of   n1, which  are  called  historical  lots. I  can  also  look  at  the  different number  of  n 2  or  paired  lots. Right   now, I  want to  talk  about... Whenever  I  do  this,  what  I  can  do is I  can  select  which  of  these different  cases  that  I  want  to  look  at to be  able  to  answer  typical  questions. Let  me  open  up my  typical  comparisons  here. The  first  one  I  want to  talk  about  is, "Heath, what  if  I  have a   known standard  deviation?" Look s something  like  this. That's  what  the  known standard  deviation  looks  like. A  few  things  that  I  want to  point  out is  this  is  the  percentage  of  time  that  you're  going   to  claim  equivalence. If  they're  exactly  the  same  that  you  said you're  going to  claim  equivalence a  high  percentage  of  time. If  there's  a  huge  difference  between  them like  a  two  standard  deviation  shift or  a   three standard  deviation  shift , is  you're  not  going  to  claim  equivalence. That's  a  good  thing. The  other  thing that  I  want to  show  you  here is   if  you're  looking at  this  alpha  of   0.05, being  that  I  set  my  k -value  at  2 , k  number  of  standard  deviations versus  10  historical  lots, the  standard  deviation of  10  historical  lots, you  would  expect that  alpha  level  would  be  0.05, the  exact  alpha  level that  I  set  in  my  equivalence  test. Right  now, the  thing  that  I  want to  show  you is  this  is  for  a  proportion of  donor -to -donor  variability  of  90%. What  happens  if  I  change  that? What  happens  if  I  change  that  to  60%. What  happens  if  I  change  that  to  30% ? There's  no   donor-to-donor  variability. What  you  see   is  that  paired  test, the  power  curve  looks  really good whenever  I  have high  donor -to -donor  variability. The  other  thing  that  you  notice with  the  known  standard  deviation is the  alpha  level  regardless of  operating  characteristic  curve is  always  at   0.05. Let's  talk  about   some  other  typical  questions. One  typical  question   is , how  does  it  compare for  the  different  levels   of  rho? How  does  my  typical  way  of  doing  this ? I  do  not  know what  the  standard  deviation  is. My  typical  way   of  doing  this is  in  the  blue. The  known  standard deviation  is  in  the  red. One  thing  that  I  want to  point  out is  I  want to  point  out this  one  right  here. What  you  see is   the  preferred  approach, the  approach  that  even  regulatory documents  have  said  that  you  should  do, the  paired  approach, using  the  standard  deviation that  is  calculated off  of  my  historical  lots, is  I  have  an  inflated  Type I  error  rate. 
This  should  be   0.05  just  like  it  is  here. That  was  really strange  to  us, and  we  looked  into  this. When  we  looked  into  it, what  we  found  is , it  has  everything  to  do with  this  right  here. The  reason  why  it  has  everything to  do  with  this  right  here, as  I  said,  how  do  I  decrease   the  width of  that  confidence  interval? The  way  that  I  decrease the  width  of  that  confidence  interval was  either  to  decrease  s , or  increase  n , or  increase  my  alpha  level. Understand  this. This  is  why  you  have an  inflated   Type I  error  rate with  this  paired  test is  those  deltas , which  you're  using  to  judge  this  off  of,  those  deltas are  using  the  standard deviation  off  of  historical  data that  contains  donor-to-donor  variability. That  confidence  interval  right  there does  not  contain donor-to-donor  variability. Why? Because  you  did  a  pair  test . That  contains  only  analytical and  process  variability. That's   where that  inflated   Type I error  rate  comes  from. Using  this  paired  approac h is  understand you  have  an  inflated   Type I  error  rate. We  see  that, and  it's  even  more  prevalent when  you  have  high donor -to -donor  variability. Why?  Because  if  you  have low   donor-to-donor  variability, th at process  variability is the  largest  part  of  the  variance component  that  you  have. Let's  look  at  a  few  more questions  that  you  have. A s  I  said,   this  one  script answers  these  different  questions. This  is  answering  the  question , "Hey,  Heath,  if  I  use  that  paired  approach that's  recommended, can  I  look  at  what  happens as  I  increase  sample  size from   3  to 4 to 5 to 6?" Two  things  that  I  want to  point  out  here is  number  one ,  what  you  see as  I  increase  sample  size, is  I'm  going to  have  higher  power. I  still  do  not  have  adequate  power if  there's  no   donor-to-donor  variability. That  means   that  I  have 0  donor-to-donor  variability. I  would  need  at  least a  sample  size  of  8  or  8  different  donors. If  I  do  have  high donor-to-donor  variability, like  0.9, 90 %  of  that  variability, which  you  see  is  I  do  have  high  power for  no  difference  between  the  means. What  I  can  do   is  I  can  make  sure to  answer  those  questions with  overlaid  operating  character s occurs  for  different  sample  size. I  can  also  answer  that  question   if  I  was  looking  at , and  I  say,  "Hey , I've  stated  my  different  sample  sizes, but what if we look at  the  different  k -values?" Understand  that  your  acceptance  criteria is  k  number  of  standard  deviations. What's  going to  happen is that  acceptance  criteria are those  what  I  call  go  post are  going  to  widen  as  you  increase  k. Therefore,  you're  going to  have a  much  higher  ability to  pass  equivalence , and  you're  going to  have much  higher  power. Another  typical  question  is  this. What  if  I  want to  change both  of  those  together? I'm  a  big  fan  of   Graph Builder . What  Graph  Builder   is what  you're  looking  at  here is not  only  are  you  looking  at, "Hey , Heath,  I  am  increasing sample  size  in  blue,  that  would  be  3, in  red,  that  would  be  4, in  green,  that  would  be  a  5, and  in  purple,  that  would  be  6 , but  I  also  looked  at  it for  different  k -values. What  would  your  operating characteristic  curves  look  like?" Good? I  want to  revisit  this. 
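The mismatch described here can be illustrated with a small variance calculation. This is my own numeric sketch of that argument, not the authors' derivation, with the total standard deviation fixed at 1.

```python
# The EAC is scaled by the historical SD (donor plus process variability), but the
# paired differences only carry process/analytical variability: the donor term
# cancels when the same donor goes through both processes.
import numpy as np

sigma_total = 1.0
for rho in (0.0, 0.3, 0.6, 0.9):             # proportion of donor-to-donor variability
    sd_hist = sigma_total                     # sqrt(sd_donor**2 + sd_process**2)
    sd_paired_diff = np.sqrt(2 * (1 - rho)) * sigma_total
    print(rho, round(sd_hist, 2), round(sd_paired_diff, 2))
# At rho = 0.9 the goalposts (k * sd_hist) are wide relative to the noise the
# paired test sees, which is the mismatch described above as the source of the
# inflated Type I error rate.
```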
Just  like  I  said  before , I  said, "Hey ,  I  want to  revisit  this and  show  you  that  for..." Whenever  I  have  a  large  proportion of  donor -to -donor  variability , I  said,  "What  you  see  for  2  right  here, I  would  expect   my  alpha  level that  my  proportion  of  time that  I  pass  this  test  would  be   0.05." But  what  you  see  is  you  have inflated   Type I  error  rate. How  does  this  look? Whenever  I'm  looking  at   a  rho or  a  proportion of   donor-to-donor  variability that  is  very  small, I  do  not  have  much  power. The  question  was , what  if  we  did  this  instead? If  we  had  low   donor-to-donor  variability, if  what  we  did   is  we  used information  from  those  historical  lots. If  I  have  no   donor-to-donor  variability or  very  low   donor-to-donor  variability, why  couldn't  I  just  do a  independent  t -test, where  I  compare  from  process  A or  my  historical  process, not  just  the  paired  lots, but  I  also  consider those  10  historical  lots and  not  comparing  to  the  mean of  the  new  process? We  wanted  to  see  how  that  compared . Doing  it  that  way   is the independent test  is  in  the  red. The  paired  way  is  in  the  blue. What  you  see   is,  if  I  have little  to  no   donor-to-donor  variability in  my  cell  therapy split  apheresis  process , you  said  that  the  independent   t-test has  much  better  profile than  the  paired  approach. However,  if  I  have  high donor-to-donor  variability, that  paired  approach  in  the  blue has  a  much  better  operating characteristic  than  the  red. Right  now,  the  question  is instead  of  just  automatically doing  that  split   apheresis  pair  design, maybe  it  would  be  better to  make  a  decision  based  upon that   donor-to-donor  variability. How  does  this  compare   whenever I'm looking  at  different  k -values? I  see  the  exact  same  thing, the  exact  same  phenomena that with  a  low  donor-to-donor  variability, it  makes  sense  to  do the  independent  t -test. With  high   donor-to-donor  variability is  I  have  a  much  better operating characteristic  curve are  higher  power  associated with   the  paired  approach. It  doesn't  matter  if  I  looked at  a   k of  1.5 , or  2,  or  even  3.0. Regardless  of  the  k -value , I  have  a  much  better operating  characteristic  curve if  I  consider that  donor -to -donor  variability. What  if  I  looked  at  different  values of  those  historical  lots? I  looked  at  3. We  looked  at  4. W e  looked  at  5  paired  lots. We  looked  at  6  paired  lots. Regardless, you  see  the  same  phenomena. We're  currently  writing  a  paper  on  this to  try  to  propose that  if  you  have  low donor-to-donor  variability, maybe  it  does  not  make  sense for  you  to  use  a  split   apheresis or  a paired  analysis  approach. Maybe   the  approach  is  only  good whenever  you  have  high donor -to -donor  variability. T hese  are  typical  questions   that  are  asked in  the  split   apheresis  designs. What  I  want to  do is  I  just  want to  cover t wo  or  three  more  of  these j ust  to  show  you  a  few o ther  things  that  you  could  do. These  are  different  things that  we  were  looking  at . 
We  looked  at,  "Hey , how  does the  operating  characteristic  curve, how  does  that  compare if  we  looked  at  in  the  blue that's  using  nothing but  the  historical  lots to  estimate  the  standard  deviation versus  if  you  use  the  paired  and  the  historical  lots, which  is  in  the  red?" What  you  see  is  there's  not  much difference  between  these  two, especially  if  I'm  using  higher sample  sizes  like  the  n 2. W e  also  looked  at, "Hey,  if  I  estimated that  standard  deviation using  a  few  different  ways, what  if  I  looked  at  estimating that  standard  deviation using   the  historical  lots, which  is  in  the  blue,  versus  in  the  red is  using  the  historical  lots and  the  paired  lots? I  compare  the  independent  case versus  the  paired  case. What  do  I  see? " As  I  said  before, you  see  that  exact  same  phenomena with  a  low  donor-to-donor  variability. The  much  better  way  of  doing  this would  be  an  independent  t -test on  the  lower  right -hand  corner. That  is  where  you  high donor-to-donor  variability. It  makes  sense  that  we  would  use the  paired  approach. Last  one  that  I  want  to  show  you is  this  is  something that  we've  been  working  on. We  looked  at  the  paired  approach versus  the  independent. The  paired  approach  is  in  the  blue. The  independent  is  in  the  red. I've  said  this  over and  over  and  over  again. That  it  makes  sense  that  if  I  have low  donor-to-donor  variability, the  independent  case in  the  blue  looks  much  better. If  I  have  high   donor-to-donor  variability, the  paired  approach  looks  better. But  one  thing  that  we  did i s we  took  a  look  and  just  said, "What  if  I  took  a  look   at  the  approach that  gave  me  the  shortest with  that  confidence  interval?" That's  in  the  green . What  you  see  is  that  usually gives  you  the  best  approach regardless  of   what  your  rho  is or  what  your  proportion of  donor -to -donor  variability  is. In  closing, I  would  like to  just  point  out  a  few  things. This  script  that  we  have  answers, along  with  the  typical  questions that  our  customers  have on  operating  characteristic  curves, associated  with  these split   apheresis  designs, what I do  want   to  pull  away from  here,  though, is  if  you  do  have  a  low  proportion of   donor-to-donor variability is  you'll  see  that  these  designs are  very  underpowered for  fewer  than  8  lots, fewer  than  8  different  donor  material. We  live  in  a  world  in  cell  therapies where  you  do  not  have  a  lot  of  donor  materials, so  you  have  very  low  sizes. It  would  be  much  more  efficient if  you  had  low   donor-to-donor variability to  use  the  independent  case. We  do  have  the other  revisions  that  we  made  on  this where  if  you  were  able to   make  multiple  lots for those  paired  approaches with  the  same  donor, or  if  you're  able  to  take  multiple  measurements to  be  able  to  look  at  those operating  characteristics  curves. Thank  you.
There is a no-man's-land in JMP, a place where it is scary to venture. It is a place inhabited by specters of data sets too complex for simple nonlinear regression, but too ephemeral for functional data analysis. It is a strange place. It appears basic methods are enough to venture here, but using those techniques quickly becomes unmanageable. These spectra can come from many sources, but all have a common problem: there is too much data for simple nonlinear regression or too little to use functional methods. Join me on a journey across this challenging landscape of classical spectroscopy to learn methods for extracting information from examples of complex spectra and how to automate the process.      There's a strange place that sits between the analytical tools that you would use to do analysis with known physical models and with simple curves, and the analyses that you would do, say, with Functional Data Explorer, where you have families of curves that have complex shapes, and you're less interested in the actual physical nature of the shapes themselves than you are in just relating them back to observed phenomena. This strange no-man's-land of analysis in JMP is where a lot of first-principles techniques sit. Things like X-ray diffraction, things like HPLC, where we have known physical methods and known equations that help us describe very fundamental phenomena of a molecule or crystal or a system. All we have to do is plug peak positions in, or area-under-the-curve information in, and we can get some very sophisticated analyses out of fairly simple data points because of these first-principles methods. At first blush, it would seem like JMP should be able to handle that. It seems like it's got all the tools, but when we dig into doing those kinds of analyses, we suddenly realize that the problem is a bit more complex than what we would expect. Today what I want to do is focus on some techniques and strategies to deal with some of those simpler cases and then introduce some tools that we can use to streamline those larger, more complex problems. Let's go ahead and move into JMP and have a look at that. To start off, let's have a look at a very simple case, a single peak on a background. How would we go ahead and pull the information out of this peak? How would we get its center position? How would we get its full width at half max, or its standard deviation, or even the area under the curve? How would we get that information? Well, most of us that have done this for a while would say, oh, you know what, I'm going to go into Fit Curve, and I'm going to say, here's my Y, and here's my X. Then I'm going to go ahead and fit a peak model of some kind. Let's just say the Gaussian peak, and you look at that and you go, hey, 98% R², that's awesome, that's great. Let's see if we can do a little better. Just to skip ahead a little bit here, we could look at the Lorentzian peak shape and the Gaussian peak shape, and we can see that those both give fairly good R²; they give fairly good peak fits.
We  could  even  come  into  the  values underneath  each  and  we  can  pull  up the  area  under  the  curve  for  them. But  how  good  are  those  fits  actually? Let's  take  a  look at  them  a  different  way. What  I  want  to  do  is  take  and  we'll  go ahead  and  pull  up  graph  builder  on  these and  we'll  look  at  how  the  models  relate to  the  residuals  for  those  peaks. We  can  see  a  very  different  story than  what  we  saw  in   Fit curve with  these  two  peak  shapes. We  can  see  that  there's  a  systematic error  built  into  these  peak  shapes. The  reason  we  can  see that  with  the  Gaussian, we  can  see  that  it's  kind of  underestimating  at  the  center. It's  doing  okay  on  the  shoulders,  but  out in  the  tails  it's  really  missing  things. We  can  see  almost the  inverse  for  the   Lorentzian. Why is that? Well,  the  truth  is  that in  spectroscopy  particularly, there  are  a  lot of  different  peak  shapes. It's  not  just  Gaussian, it's  not  just  Lorentzian. There's  actually  a  whole  family of  peak  shapes  that  are  out  there to  handle  all the  different  physical  phenomena that  result  in  the  peaks that  we  see  in  spectroscopy. How  do  we  deal  with  those  in  JMP? Well,  it's  actually  really  quite  easy. Let's  start  with  looking  at what  the  results  of  using the  correct  peak  shape  is. Here  I've  got  the  Gaussian  again, the  residuals  for  the  Gaussian  peak  fit and  the  blue  line  in  this  case is  no  longer  the  Lorentzian. It's  called  a  Pseudo Voigt, which  is  an  approximation  of  a  peak shape  called  a  voigt  function. Notice  that  the  residuals for  the  Voigt  function  are  dead  flat. We  are  actually  doing  much  better. Before,  if  we  were  to  try  to  do quantification  with  the  Lorentzian  or the  Gaussian, we  would  run  into  a  situation  where  we might  over  or  underestimate the  quantity  of  a  material  in  a  sample. With  the  Voigt  in  this  case, because  this  is  a  voigt  peak  shape, we're  actually  going  to  get the  accurate  quantification  of  that. That's  the  important  thing. Now  how  did  I  do  this? Well,  there's  a  few  ways  to  do  it. The  easiest  is  to  come  into  the  model come  into  the  data table  and  create  a  model. The  model  is  really  easy  to  make. This  is  the  voigt  peak  shape. Looks  a  little  scary,  but  that's just  the  nature  of  the  math. Here  I've  got  a  parameter for  the  baseline  and  this  whole  mess. Here  is  the  voigt  peak  shape. We  can  come  into  the  parameters  settings and  define  starting points  for  each  of  our  values. Then  going  into we're  not  going  to  use   Fit curve, we're  going  to  come  down  to  nonlinear. We  can  use  that  as  a  starting  point for  an  analysis. I'm  going  to  expand underneath  40  minutes. That's  actually  a  good  habit to  get  into  in  this  case. I  did  that  wrong. Let's  go  back  and  redo  that. Should  be  the  counts. There  we  go. That  looks  better. Now  if  I  go  ahead  and  click  Go, it  does  my  peak  fitting for  me  and  everything. That's  great. Can't  get  the  area  under the  curve  here  very  easily. But  I  can  get  just  about  every other  parameter  that  I  need. The  nice  thing  about a  lot  of  these  peak  shapes is  they  also  have well  defined  integrals. 
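For readers who want to see the peak-shape math outside JMP, here is a minimal sketch assuming a common pseudo-Voigt parameterization (a mixing fraction eta between a Lorentzian and a Gaussian that share a center and width), fitted to synthetic data with nonlinear least squares. The fitted parameters also give the area under the curve analytically, which is the point picked up next.

```python
# Hedged sketch: fit a pseudo-Voigt peak (plus a flat baseline) with scipy.
import numpy as np
from scipy.optimize import curve_fit

def pseudo_voigt(x, amp, center, fwhm, eta, baseline):
    sigma = fwhm / (2 * np.sqrt(2 * np.log(2)))               # Gaussian sigma from FWHM
    gauss = np.exp(-0.5 * ((x - center) / sigma) ** 2)
    lorentz = (fwhm / 2) ** 2 / ((x - center) ** 2 + (fwhm / 2) ** 2)
    return baseline + amp * (eta * lorentz + (1 - eta) * gauss)

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 400)                                    # synthetic stand-in data
y = pseudo_voigt(x, 100, 5.0, 1.2, 0.4, 10) + rng.normal(0, 1, x.size)

p0 = [80, 4.8, 1.0, 0.5, 5]                                    # starting values, as in Nonlinear
bounds = ([0, 0, 0, 0, -np.inf], [np.inf, 10, np.inf, 1, np.inf])
popt, _ = curve_fit(pseudo_voigt, x, y, p0=p0, bounds=bounds)
print(np.round(popt, 3))                                       # amp, center, fwhm, eta, baseline
```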
Once  you  have  the  standard  deviation and  the  mean  and  those  information, you  can  usually  get the  integral  fairly  easily, the  area  under the  curve  fairly  easily. That's  one  way  of  handling  that. But  it  introduces  a  large  amount  of  error possibilities  in  this  peak  shape. We've  given  ourselves a  lot  of  potential  problems. What  we  really  would  like  is something  that  looks  a  bit  more  like  this, where  we've  got  a  predefined function  called  the  PseudoV oigt. We  give  it  all  of  our  fitting  parameters and  there's  our  fitting parameter  for  our  baseline. It's  the  same  math, but  we   cloaked  it  in  an  easy to  understand  function  where  we  are  just providing  the  parameters that  we  want  to  fit. It  works  the  same  in  nonlinear. How  do  I  do  that? Well,  there  are  a  few things  that  we  can  do. We  can  define  in  and  there's a  lot  of  code  right  here. But  the  big  things that  we  want  to  pay  attention  to are  the  fact  that  we're defining  a  function, that  we're  defining  some  parameters. At  the  very  bottom  of  this, this  is  a  family  of  parameters. I  am  using  the Add  custom  functions  operator to  put  those  into  the  memory of  JMP,  so  that  JMP  knows  that  I've  got these  custom  functions  and  knows  what they  look  like  and  knows  how  they  behave. Doing  it  that  way  provides some  really  powerful  tools. If  I  come  into  the  scripting  index, once  I've  defined  my  functions, they  show  up  in  the  scripting  index. I  didn't  really  give  a  lot of  descriptions  here, but  you  could  give  quite  detailed descriptions  and  examples here  as  you  would  like. The  other  thing  that  we  can  do, again  coming  back  into  our Fit  model  is  when  we  define  these functions, we  get  our  own  entry in  the  formula  editor, which  lets  us  just  click  on  one of  these  and  use  them just  like  we  would any  other  function  in  the  formula  editor. Again,  these  are  actually quite  easy  to  define. The  examples  in  the  scripting index  make  it  very  easy  to  do. Just  search  for  Add  custom  function and  you  can  just   use  the boilerplate  there  to  build  off  of  that. There's  also  a  great  blog post  on  how  to  do  that. That's   one  answer to  one  question  that  we  have. Let's  continue  on  and  let's  look at  a  different  question, maybe  a  slightly  more  complex  problem. What  happens  if  we  have  two  peaks? So  suddenly  Fit curve is  no  longer  on  the  table. We're  going  to  have  to  use   Fit nonlinear and  that  also  suggests  how we  might  work  with  this. We're  going  to  basically  have  to  break  out our  equation,  our  model that  we  had  before. I  break  it  out  column  by  column just  to  manage  all  of  those  bits and  pieces  that  we  saw  before. I  have  one  for  my  baseline, I  have  one  for  my  peaks, and  then  I  have  one  for  my  spectris. Let's  have  a  look  at  what  all those  look  like  really  quick. Let's  start  with  the  baseline because  it's  got  a  little  bit  of  a  gotcha that  we  have  to  worry  about. The  baseline  just  has  the  fit  parameter for  the  baseline, but  it  also has  this  x  term  times  zero. That's  because  nonlinear  expects  every equation  that  goes  into  a  formula to  tie  back  to  the  x  axis that  you're  providing. 
We  put  x  times  zero  in  there  just to  have  it  be  okay  with  plotting  that. That's  just  a  little  gotcha that  you  have  to  deal  with. That's  one  piece,  peak  1  looks  just like  we  would  expect  with  its  parameters. Peak  2  looks  just  the  same, except  it's  got  different  parameter names  so  we  don't  have  any  collisions. Peak  one  was  1, 2, 3, 4  and  peak  2 is  B  1,  2, 3 ,  4,  5, 6, 7, 8. That's  the  only  thing  we  have  to  do. Then  the  spectris  itself, the  thing  that  we're  going  to  fit, the  things  that  we're  going  to  put into  not  the   Fit nonlinear  platform, is  we're  just  going  to  say  my  baseline curve  plus  my  peak  1  plus  my  peak  2. Just  like  I  showed  you  before  doing that  in   Fit nonlinear, here's  my  spectris  that  goes into  the  prediction  equation. I'm  going  to  remember  to  put my  counts  in  and  not  my  x  curve. Just  like  I  said  before, I'm  going  to  expand  my  intermediate formulas  and  that's  going  to  tell  JMP to  dig  back  in  from  that  first  formula into  all  the  formulas  that  are in  the  individual  columns. We  click  Okay,  hey,  we  see what  we  expect  to  see. Now  we  can  click  go and  it  goes  through  and  fits everything  just  like  we  would  expect. We  get  a  nice  fit and  we  have  the  ability  to  get  confidence intervals  and  everything else  we'd  like  off  of  that. Two  peaks  is  reasonable  and  possible. But  the  problem  that  we  run  into  is what  happens  when  we  have  something that  looks   like  this. At  a  rough  count, there's  probably  a  dozen  peaks  there plus  a  complex  baseline  that's not  actually  a  straight  line  that's probably  got  some parabolic  behavior  to  it. We've  got  a  complex  baseline, we've  got  multiple  peaks. We're  going  to  have  to  make one  formula  for  each  of  those. There's  a  lot  of  legwork to  build  in  something  like  this. If  you  get  into  X- ray  diffraction, the  problem  gets  even  worse. There's  comfortably  30, 40  peaks  in  this  spectris  right  here that  we  would  have  to  work  with. The  first  question  that  we  need  to  ask  is, can  nonlinear  handle that   a  problem? Well,  it  turns  out  that  it  can if  we  just  use  nonlinear  and  I'm  going to  do  something   wild  and  crazy. I've  got  it  fitting  a  Lorentzian  peak and  I'm  going  to  come  back  and  I'm  going to  actually  have  it  fit  it  in  real  time. You  can  watch  that  as  it  goes  through. It  nails  each  peak  in  near  real  time as  I  move  through  this  quite  quickly. It's  hitting  the  big peak  in  each  group. That  says  that  the  Fit  engine can  probably  handle  the   processing that  we're  dealing  with. That's  fine. This  really  becomes  more  of a  problem  of  logistics  than  a  problem  of actual  functionality  within  JMP. It  really  is  a  real  problem. If  we  were  to  look  at,  let's  just  say we're  looking  at  fitting  Voigt  peak  shapes and  we  could  talk  about  Lorentzian and  we  could  talk  about  Gaussian, we  could  talk  about  the  Pearson  seven, all  those  different  types  of  peak  shapes. But  the  voigt  peak  shape has  five  parameters, the  x  axis  and  then the  forfeit  parameters. That  roughly  equates  to  about six  mouse  clicks  per  peak. Even  if  you're  doing  it  in  a  single formula,  it's  six  mouse  clicks  per  peak. 
That says that for a ten-peak spectrum, we're going to have to do 88 mouse clicks. However long that takes you per mouse click depends on many, many factors. But if we were to do something like that X-ray diffraction pattern, with the peak count up around 40, we're talking in the range of 300 mouse clicks. That's a lot of clicking around that we don't want to have to do. We would like our interaction with the spectrum to be something along the lines of one click per peak. That suggests we need some automation built in, so let's have a look at how I've done that. I've built a tool to handle this, and I've actually taken a number of different approaches in it. First off, let's look at the library of peaks that I've generated for Spectriss. The title of this talk includes a number of different peak shapes. We include a family of Gaussian peaks, including a split Gaussian that gives you a different standard deviation on one side of the peak than the other. The same goes for the Lorentzians, the Pearsons, and then the PseudoVoigts. These all also have versions that are tuned to give you the area instead of the intensity as a fit parameter; that's the area term in all of these. That's one piece: when we load the Spectriss add-in, we get that for free. That's automatic. Now let's look at the other challenge. Let's take that olive oil spectrum. What we really want is a tool where we can come in and say, here's my x axis, here's my y axis, and I just want to do some peak finding. Here are my four main peaks; it found them automatically. Maybe I want to do a first derivative, or maybe I want to do a quantile. I can also remove the background here, so I can click Finished. It's found those first three peaks for me. I'm going to go ahead and change my background to a linear one. Now I can also come in and do some manual peak selection. Behind the scenes, it's taking care of writing all of those peak parameters for you so that everything is nice and tidy. There's probably one right there, probably one right there, and there's one right there. Every time you add a peak, you can select it in the list of peaks, and it will give you the information calculated at that time. You can see right here that these peaks are not well defined; they're not fitting the data very well. Really we want to go over into Nonlinear. I've hacked Nonlinear a bit so that it will run this in real time and look nice and pretty, and you can watch the peak shapes changing. Realistically, I might have chosen a quadratic instead of a linear baseline for this, but just for the sake of interest. Here, I've run out of iterations. I'll increase the iterations and also back off just a touch on my gradient criterion so I can try to get this thing to converge a little quicker. Okay, we'll take that as good enough for the moment. We can say that we want to accept the fit, and there are my fit parameters. Then I can say Done.
It brings everything back over into Spectriss for me to work with. I can now say, refine my AUC parameters, and come in and get my new approximate area under the curve. That's great and grand, but what I really want is an output table that has all of those parameters and their information attached to them, and that's what I get. That's Spectriss in a nutshell. The goal with this project was, like I said before, to have the ability to handle physical peaks, multiple peaks, with an easy-to-use interface that handles those curves where we need the area under the curve and the physical parameters attached to each peak, but where we either don't have enough data to use Fit Model or the Functional Data Explorer, or it's just not the kind of problem where we want to work with those particular tools. The tool is up on the JMP Community; the QR code here will take you to the add-in, where you can work with it. Spectriss is up now and ready to go.
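For reference, registering a peak-shape function of your own, as described earlier with Add Custom Functions, can be sketched roughly like this in JSL. This is a minimal sketch assuming one common pseudo-Voigt parameterization (a shared half-width w and a mixing fraction eta); the namespace and argument names are placeholders, not the add-in's actual code, and the Add Custom Function entry in the Scripting Index has the full boilerplate:

    // Hypothetical pseudo-Voigt: eta blends a Lorentzian and a Gaussian of the same half-width
    pv = New Custom Function( "peaks", "pseudovoigt",
        Function( {x, height, center, w, eta},
            height * ( eta * w ^ 2 / ((x - center) ^ 2 + w ^ 2)
                + (1 - eta) * Exp( -Log( 2 ) * ((x - center) / w) ^ 2 ) )
        )
    );
    Add Custom Functions( pv );

Once registered, a function like this shows up under its namespace in the Formula Editor and the Scripting Index, which is what makes a single-column prediction formula for a peak much easier to read.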
At HP Hood, the use of design of experiments (DOE) has helped to successfully identify formulations in new product development, delighting consumers of our food products. Throughout our R&D department's wide implementation of DOE, there have been a number of successful DOE models, as well as some unsuccessful ones. One major stumbling block to model building with DOE has been dealing with subjective binary responses such as acceptable/unacceptable. These binary responses provide less information than continuous responses, thus inhibiting the ability to extract meaningful results from designed experiments. This presentation shares simple and practical strategies for using the JMP DOE platform to convert these binary responses to continuous ones, resulting in improved models and powerful insights. Real-world examples from consumer food products are given to demonstrate how DOE can be used for more than just building models. It can be used to overcome the problem of responses that are difficult to measure.

This talk is titled Expanded Uses of Converting Binary Responses to Continuous Responses in Consumer Product Development. It's a bit of a mouthful, but I promise it won't be that complicated. My name is Curtis Park. I'm a principal scientist at HP Hood. HP Hood is a food and beverage company. We make a lot of different milks and nondairy milks, and we also make yogurt, cottage cheese, and ice cream, so there are a lot of fun things to taste at work. I'm a food scientist by education. A few years ago I was asked to take a look at a problem we had with one of the beverages we were producing. I'm going to show you a video just so you can see. We were getting a lot of consumer complaints, and these complaints were happening when the product was close to the end of shelf life. As you see in this video, it's pretty obvious why people were complaining. I think I would complain if I saw something like that too. It's supposed to be a nice portable beverage, and it's thick and chunky when it's being poured out. Not what I would expect. Believe it or not, this product was not spoiled. I promise you, it was not spoiled. So I was asked to take a look at this and figure out what the problem was and how we could fix it. At the time, a few years ago, HP Hood was early in its journey with JMP, so I was really excited to have an application to use in real life rather than just reading or learning about it. Naturally I felt like, yahoo! Let's run a DOE, let's do it. I was really excited, and for those of you who might not have as much experience with DOE, the first step is usually taking a look at what factors you should be studying. So we did a few experiments. If you can forgive me, they were probably one-factor-at-a-time experiments, but we narrowed in on what we believed were the key ingredients that could have been causing the problem. We ended up making a design. This is probably the fourth or fifth iteration of the design that we came up with, and it was built in Custom Design. If you go to Custom Design, that's the platform we used to generate this DOE.
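As a rough illustration, the saved script for a design like this has roughly the following shape; this is a sketch rather than the actual design, so the factor names, ranges, and response goal are placeholders. The Add Term lines spell out the intercept, the three main effects, the two-way interactions, and the quadratics of a response surface model, with the run count set to 17 as in the talk:

    DOE(
        Custom Design,
        {Add Response( Minimize, "Chunky Pour", ., ., . ),
        Add Factor( Continuous, 0.1, 0.2, "Ingredient A", 0 ),
        Add Factor( Continuous, 0.1, 0.2, "Ingredient B", 0 ),
        Add Factor( Continuous, 0.3, 0.5, "Ingredient C", 0 ),
        Add Term( {1, 0} ),
        Add Term( {1, 1} ), Add Term( {2, 1} ), Add Term( {3, 1} ),
        Add Term( {1, 1}, {2, 1} ), Add Term( {1, 1}, {3, 1} ), Add Term( {2, 1}, {3, 1} ),
        Add Term( {1, 2} ), Add Term( {2, 2} ), Add Term( {3, 2} ),
        Set Sample Size( 17 ),
        Make Design}
    );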
As you can see, this is what we had. We had ingredients A, B, and C, and it was actually a response surface design, so we had all of the two-way interactions and the quadratic terms built into the model. It ended up being 17 runs, as you can see here: 17 different treatment combinations, this much A, this much B, this much C for each run. Once we settled on this design, we were really excited: let's go solve this problem. Piece of cake, right? You go into the lab, into our pilot plant, you throw some things together, and the beverage comes out. I'm making it sound a lot simpler than it actually is. We made 17 different beverages and then put them on the shelf for a little while, because as I mentioned earlier, it takes some time for this problem to appear. They sat until they were ready to be analyzed. This is just a screenshot of a data table with the actual design that we used. As you can see, there's a column to the right that I highlighted: our friend the Y, the response column. Once we got to the point where we were ready to measure that chunky pour, we started thinking, how are we going to measure it? At the time, we did not have a chunky-pour meter. I've never heard of one, and I've never found one. If anyone has ever found one, we'd love to see it and maybe buy one, but to our knowledge it doesn't exist. So what options did we have to measure this? Because if you can't measure it, DOE is really not that useful. We had a few options. The first is that we could measure everything as a binary response: pass/fail, good or bad, and so on. There are pros and cons to this. The pros are that it's pretty simple to do; anybody can say pass or fail, and it takes almost no time to measure. However, it has some serious cons, such as being really subjective to the observer. What I think is good, a colleague of mine might think is bad. Or even worse, what I think is good, my boss might think is bad. So it's really subjective. And while binary responses can give you some information, they don't give us as much as we want, because when you do logistic regression, what you get out of it really is just the probability of something passing or failing. In my experience, that's been difficult to communicate and to know what to do with, especially when we're talking to non-technical people. So if there's any way to get a continuous response, that's what we strive for, because continuous responses give us a lot more information: we can know how good or how bad something is, because not every "good" is created equal. There's another option we could have used, and I would say it's probably the best option if you can do it: we could run consumer testing and get consumer input.
What this would look like is: I take all 17 of our beverages, and we recruit maybe 100 or 120 consumers of our product and have them sit down and rate every single one on different attributes, one of them probably being how well they like how the product pours. The reason this is the gold standard is that those are the people whose opinions matter to us. After we get 100 or 120 responses, we look at the data, take averages, and put those averages into our model. However, it can cost a lot of money and take a lot of time. So if your budget doesn't allow it, or your timeline for whatever reason doesn't allow it, you can't do this for everything. Sometimes the thing you're trying to measure isn't such a huge problem that it's worth spending all that money, but it would still be important to be able to measure it. Do we have any other options? I mentioned this earlier: you can find an instrument that can measure what you're looking for. Sometimes they exist. Like I said, I don't know of a chunky-pour meter; I looked in our warehouse and our R&D center and couldn't find one. Even if you can find one, if it's something really specialized, you're not going to use it very often, so it doesn't make sense to buy the equipment. Or it could be something that would be really useful but requires expertise that maybe your R&D or technical department doesn't have, or simply doesn't have the time or resources to deal with. That brings me to the last option, which is training a group of people to rate the attribute of interest and then letting them give you the ratings. This isn't quite as good as having actual consumers, but here we're trying to take the subjectivity out of it and make it objective, and well-trained humans can be great measuring instruments. I'm going to walk you through what we've done at Hood when we have a hard-to-measure attribute, using the case study of this chunky pour. This is our roadmap; I'll walk you through it and then we'll actually do it live. The first thing I want to get across is that the samples you produce from a DOE can be used for many purposes. I like to tell people that your samples are like gold and you should treat them like gold. They're very valuable. You may run a DOE thinking you're trying to answer one question, but something else might pop up later that those samples could answer as well. That has happened to me many times, so it's good to ask yourself: I've done all this work to make 17 different beverages. What else can I do with them? What else can I learn? In our case, we used these samples as a "calibration set" so that we could teach our humans, my colleagues, how to measure this chunky pour. So here's our method.
The first thing we do is review all the samples with a small group, maybe one, two, or three people who are really knowledgeable on the subject or responsible for the project. You look at all the samples and decide which ones should be used to train the raters. We're essentially trying to build a scale, and then we take that scale and get our friends, say 10, 15, or 20 of them, to rate the samples for us after we've trained them. Training is step two, and having them rate each video is step three. It doesn't have to be a video; it could be a picture, or it could be them actually pouring out the product if you have enough sample, and so on. You get the idea. Next, we take the average of all those ratings, look at the data to make sure there's nothing funky in there, and then we use those average values to build a model. Let's start with steps one and two. We're going to assume we've looked at all the videos, and the way we typically do it, because it's a little easier, is to start by answering the question: which one is the lowest in chunky pour? That would be this one right here, number one. I'm going to play each of these. Just to make it clear, this is our scale. It's a continuous scale from 1 to 10, and the 1 to 10 is somewhat arbitrary; if you have something that works better for you, then great. The video right above each point corresponds to that rating, so this first video corresponds to a one. As you can see while we're watching it, it pours nicely, with no rippling and no chunkiness. It pours as expected. Beautiful. That's the easy sample to identify. Then, within the sample set, we ask ourselves: which one is the worst? In this case, it was pretty obvious. I will tell you again, this product is not spoiled. Just by changing a few ingredients, it becomes so thick we can't even get it out of the bottle. So that's obviously a 10. Then we did a little bit of work to figure out which one should be considered a five, halfway in between. This one, you can see, still flows, but there is chunkiness to it. Then maybe a two and a half would be this one: it has a little less chunkiness, flows well, and with normal shaking it would probably be fine. So there's a little bit of subjectivity, but you add more people to make it more objective. Then the last one, a seven and a half. You can see it's very, very chunky. The only thing that really differentiates it from the ten is that we can still get it out of the bottle; it still flows, but as you can see, it's pretty thick. In basically this amount of time, I could train the people who are going to help us measure this chunky pour. Then, once they're trained, we have them rate the samples. I'd basically do what I just did, maybe taking a little more time to be specific about certain things we want them to look for.
If what you're having someone rate is a lot more complicated, then you'll probably need to spend more time training people. This one wasn't very complicated, and we're really just looking for people's first impression. After that, you have them rate all the videos. I like to use Microsoft Forms just because it's easy and I can get the data quickly, but you can use whatever you want, including paper, although that takes more time and I try to avoid it. Just to show you what our form looks like, here's a preview of it, as if you're doing it on your phone. I like to make everything as simple as possible, and everybody always has their phone, so being able to do it on a phone is my goal. It just says chunky pour, and then they go through and rate each one. So, chunky pour for treatment one: I'll say, I don't know, that one was a six. We're just asking people for their first impression. There are no right or wrong answers; usually people's first impression is right, which is why I ask them not to think too hard about it. Maybe number two is a ten, and number three is a three, I don't know. They would go through all of those. Then we get our data, and using JMP we average all those ratings and put them into the data table to build the model. So we're going to get out of PowerPoint for a second and go to Excel. This is what I get when I export the data from Microsoft Forms. Like I said, you don't have to use this; use whatever works for you. As you can see, ID is the rater number, not random, just an identifier for each person. I left it anonymous so we don't criticize people who maybe didn't do as well as everybody else. In this case, the data is real: I took this to a college food science class and had the students do it, so these are actual college students rating the videos. As you can see, we have a column for each treatment. So person one rated treatment one an eight, treatment two a four, treatment three a nine, and so on. We want to get this into JMP, so I like to use the JMP add-in in Excel, right here. As long as you're only highlighting one cell and you click Data Table, it will import everything. I've noticed that sometimes I'll accidentally have just a portion of the data highlighted, and if you click Data Table then, it will only import what you highlighted. So either highlight everything or highlight only one cell. Once you hit that Data Table button, you get something like this. So this is our data. Just to show you where we're trying to get to, because we have to manipulate this table a little bit: this is our data table for the DOE, with the runs and how much of ingredient A, B, and C went into each.
I put in, and we'll talk about this in a minute, my own rating of whether I thought each sample passed or failed. In the end, we need one more column called chunky pour continuous, holding the average rating for run one, run two, run three, and so on. If we look at this data table as it is today, it's not in that format, because we need all these columns to become rows, and we need the rows to be in one column. There are probably a thousand different ways to do this in JMP, and they're all good and all correct. I'm going to show you one way; it's just the one that works for me. First, we're going to stack all of the columns on top of each other. Then we're going to make a summary table that has the average, and maybe we'll also add the standard deviation for fun. But the very first thing I've always been taught to do when I get data is to graph it and look at the plot, so we're going to look at the distributions really quickly. If we go to Analyze, Distribution, we want to look at the distribution for all of the treatments. I'm just going to highlight them, put them in the columns role, and say OK. I'm just looking to see whether there is anything weird about this data that we should be concerned about. Looking at treatments 1, 2, 3, and so on, I'm looking for outliers. For example, for treatment three, everybody rated this sample between 1 and 6, but there was someone up here who rated it really high, and there's also someone up here who rated this other one high. What I like to do is click on that point, and it highlights where it came from. This row represents one rater, one person, so I can see how they rated everything, and you can see they tend to be an outlier. The nice thing in JMP is that once you highlight one row, it stays highlighted for all the other responses. So I can see that, yes, they rated treatment three higher and treatment four higher. We go down and look, and they're the opposite of everybody else. It seems that, for some reason, during the training they got confused and thought a higher number meant lower chunkiness, and vice versa. So, since I have this row highlighted, I'm going to close this; it stays highlighted. This is row one. I'm just going to delete this rater's data and then we'll move on. Now we feel pretty comfortable that the data is solid. Like I said, we're going to stack the columns. If we go to Tables, Stack, a dialog pops up and we just want to stack all 17 of the treatments. The nice thing in JMP 17 is that you now get a preview. I love the preview, because then I know whether I'm doing things right. What we see here is the ID, the rater, and then their chunky pour rating for treatment one (they gave it a five) and for treatment two (a seven). This is how we want the data structured, and we can change the column names.
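In JSL, that stacking step can be sketched roughly like this; the table and column names are placeholders for the real ones, and the renaming of the stacked columns comes next in the demo:

    // Stack the 17 treatment columns into one long table: one row per rater per treatment
    stacked = Data Table( "Chunky Pour Ratings" ) << Stack(
        Columns( :T1, :T2, :T3 ),          // ...continue through :T17 in the real table
        Source Label Column( "Label" ),    // which treatment the row came from
        Stacked Data Column( "Data" ),     // the rating itself
        Output Table( "Stacked Ratings" )
    );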
So instead of Data, we're just going to call the stacked column Chunky Pour Continuous, and for Label, I'm going to call it Run, because that's really what we're going to use it for in a minute. I stack it and say OK; that's how I want it. Now that we have the data table in this shape, we can use a summary table. Summary tables are a nice way to build a table of different statistics. We highlight the Chunky Pour Continuous column and, under Statistics, choose Mean; for fun, in case we want to use it, we'll also add Std Dev. That just gives us the overall mean and standard deviation, but if we want it per run, I highlight Run and put it in Group. Now when we look at the preview, we have runs one through 17, and conveniently they're in order, one, two, three, all the way to 17, with the mean and the standard deviation. So we say OK, and we have one more table. Now we're where we need to be, because I have each run as a row, a column for the average, and a column for the standard deviation. What I'm going to do is highlight this column, go to Edit, Copy with Column Names, then go to our original data table, make a new column, and say Edit, Paste with Column Names. There it is. I should have done both columns at the same time, but I didn't, so we'll do the second one as well. Okay, now we are ready to do our modeling. The first thing I want to show you is what we would get if we just used pass/fail, our binary response. We go to Analyze, Fit Model. Because I made this design in the Custom Design platform, JMP automatically knows what kind of design it is, which is why my model is already built. There's also a really convenient way if you knew this was a response surface design; say this script wasn't here. The macros are convenient: if I highlight ingredients A, B, and C and choose Macros, Response Surface, it pulls it all up. It already knows what I'm looking for, so that's helpful. I put the response, chunky pour pass/fail, in Y. What it gives us is nominal logistic. I'm not a statistician, so I'm not going to go into the statistics behind what it's doing; I'm just going to show you what you get out of it and what a scientist might look at. If I say Run, our target level is Pass, so it's going to give probabilities of passing. This is what we get. The first thing a scientist like me would probably look at is the effects summary. Looking at p-values, I'd say nothing is significant except ingredient A. There are other things we would look at, but I'm not going to cover those today.
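The launch just described can be scripted roughly like this; it's a sketch with placeholder column names that spells out the response surface effects (the & RS marking is what the Response Surface macro writes) together with the nominal pass/fail response, which is what leads JMP to the Nominal Logistic personality:

    // Response surface model with a binary (nominal) response
    Fit Model(
        Y( :Chunky Pour Pass Fail ),   // the Pass/Fail target level is chosen in the launch window
        Effects(
            :Ingredient A & RS, :Ingredient B & RS, :Ingredient C & RS,
            :Ingredient A * :Ingredient B, :Ingredient A * :Ingredient C, :Ingredient B * :Ingredient C,
            :Ingredient A * :Ingredient A, :Ingredient B * :Ingredient B, :Ingredient C * :Ingredient C
        ),
        Personality( "Nominal Logistic" ),
        Run
    );

Swapping the Y column for the averaged continuous rating and the personality for Standard Least Squares gives the second model discussed below.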
Instead, I want to just look at the profiler, because in our experience the profiler is the most useful and easiest thing for the scientists to interpret and to use when communicating with others. I'm going to make it a little bigger. On the left here, we get a probability of failing and a probability of passing. If we have 0.13 of ingredient A, 0.12 of ingredient B, and 0.45 of ingredient C (it's actually 0.13%, 0.12%, and 0.45%; I just didn't change the units, and it's a very small proportion of the formula that we're changing anyway), at those levels this says we will pass 100% of the time. If I move ingredient A up, say to 0.2, it now looks like we're going to pass only 64% of the time. You can see these curves; if I change ingredient B a little and ingredient C, maybe we can get back up to a point where we pass 98% of the time. You can play around with this. But the problem, like I said earlier, is that a pass over here may not be the same as a pass over there. We don't really know that with this information, and it's a hard thing for some people to wrap their heads around; it's just a probability of passing. What do I do if all I can get is an 85% pass rate? Say, hypothetically, that was the best we could do. What do I do with that? That's why we look for continuous responses. I'm going to close this and build that model again, except this time for the mean of our continuous scale. We remove chunky pour pass/fail and add the average here. We just say Run and keep it simple, with the effects screening report. Now you can see there's a lot more information than we got before. Before, if you remember, all we saw was that ingredient A had a really low p-value and everything else was around 0.99; the conclusion was that ingredient A does everything. Well, that's not actually the whole truth, as we can see here. Yes, ingredient A is the most important, the main effect of ingredient A right here, but B and C also have a role to play. While not as big, it's still an important role. We look at our actual-by-predicted plot, and it looks pretty healthy. Our lack of fit looks good. I'm not going to go into all the details of everything we check, mainly because I'm not a statistician; I look at the lack of fit and at the residuals, the studentized residuals, to see whether there's anything weird. Then I really come to the profiler, and now you can see this gives us a much different, much more complete picture: as I increase ingredient A, the chunky pour increases, but increasing the other two does too. They also have a role to play. If we say we want to minimize it, I think it's pretty obvious what the desirability is going to come out to be.
But just to show you, we go to the red triangle by the Prediction Profiler, choose Optimization and Desirability, and turn on the desirability functions. Then here, this is the desirability. You can change it in the red triangle, but I find it easier to hold Control and click on it. Now we can change what our goal is. In this case, we want to minimize, because we don't want chunky pour, right? We don't like it, and consumers don't like it either. So we say Minimize and OK. Now we can go back to Optimization and Desirability and say Maximize Desirability. As expected, it says: take these two ingredients out, put this one as low as you can, and you'll get the lowest chunky pour possible. In reality, we had some other constraints, so we couldn't do exactly that. There were other factors at play, but this definitely gave us a really good idea of where we needed to go, what was important, and how to control this chunky pour, to the point where, when we implemented the changes, the complaints went away. It's been good ever since. That's the nutshell version of how you can take something that is hard to measure, that's really subjective and binary, pass/fail or good/bad, and convert it into something continuous. It's a relatively simple method and you can use it for a number of things. As long as you have people available to help you, you can measure a lot of things that could be considered hard to measure. Where do we go from here at Hood? Just to give you an example of some other things we've encountered: this one, the chunky pour, is actually one of the easier ones. Let's say, and this is another product we were working on a long time ago, you have coffee and you're going to add some foam to it, and you want to understand how well that foam dissipates into the coffee. That's a tough thing to measure; we definitely don't have any instrumentation that can really measure it. Videos really helped us understand how we could measure it and get some useful information out of it. As you can see, we're trying to measure how it looks and how well it moves for this treatment versus, say, this treatment over here. You can see they're quite different: one moves really fast, the other really slowly; one looks kind of chunky, the other not so much. That's how we use this approach, and we use it quite often. I appreciate you taking the time to listen to my talk. I hope it has been useful and that you'll be able to find a way to implement it in your day-to-day work. Thank you.
In this presentation, we use the longitudinal analysis capability of the Structural Equation Modeling (SEM) platform in JMP to measure the impact of temperature and usage amount on detergent performance from the consumer perspective, using the overall rating of the product as a surrogate for overall performance. In this diary study measured over time, data were collected from consumers who used one of two products in three phases of four-week time intervals. Using data collected from Week 5 to Week 12, we modeled the trajectories of the performance of the detergent in relation to the temperature and usage (number of pods). Using standard SEM fit measures, we found the linear growth curve model fit the data well for the underlying latent structures. Results showed that one of the test products performed significantly better under cold-wash temperatures and used fewer pods than the other product. This result was a significant win for the company under the sustainability platform and for performance under colder wash conditions.

Hello, everyone. My name is Zhiwu Liang, statistician from the Procter & Gamble Company. I support the business at the Brussels Innovation Center for P&G. My main job is consumer survey data analysis. Today, Narayanan and I will present Growth Curve Modeling to Measure the Impact of Temperature and Usage Amount on Detergent Performance. Next slide, please. Here are the contents we will cover today. First, I will give a brief introduction to structural equation models and a bit about the data we will be using for our modeling. Then I will turn it over to Narayanan to introduce growth curve modeling and the model-building process, plus the JMP demo. After the demo, I will present the conclusion and next steps. Next slide, please. Structural equation modeling is a multivariate technique used to test a set of relationships between observed and latent variables by comparing the model-predicted covariance matrix with the observed covariance matrix. In SEM, observed (manifest) variables serve as indicators for latent variables; that is what the measurement model constructs. The latent variables then form a regression model to build a network, which we call the structural model. Here is an example with three latent variables and eight observed variables in the JMP SEM structure. As you can see in the bottom-left chart, the circles represent the latent variables, which are measured through their indicators. The cleaning latent variable, for instance, is indicated by the four squares representing the manifest variables: overall cleaning, stain removal, whiteness, and brightness. The same goes for the freshness latent variable, indicated by three manifest variables. If you look at the right side of the window, the Loadings outline shows the structure of the measurement model, how each latent variable relates to its indicators, and the Regressions outline at the bottom shows the two regression models: cleaning drives overall rating, and freshness drives overall rating. This is the structure of the structural equation model. Next slide, please. The data we use for our growth curve modeling come from a survey we conducted in France with 119 consumers.
We divided these 119 consumers into two groups. Sixty of them used the control product, which is the Ariel soluble unit dose (pods), marked as 0 in our data set. The other 59 consumers used the test product, an Ecolabel product, coded as 1. During the 12 weeks of the test, each consumer used their own product for the first four weeks. Then they went into the eight test weeks and used the assigned product, either the Ariel SUD or the Ecolabel product. For each wash, the consumer filled in a questionnaire providing information about their washing behavior, such as the washing temperature, the number of pods used, the soil level of the fabric (how dirty it was), and an overall rating of the product's performance. Our modeling objective is to test whether there is a product effect on the overall performance rating, a washing temperature effect on the overall performance rating, and an effect of the number of pods used on the overall rating for each wash. Next slide, please. Since every consumer has different washing habits and conditions, not all consumers have the same number of washes during the test weeks. Therefore, to give every consumer equal weight in our model-building data set, we first aggregated the data at the panelist level on a weekly basis: for each consumer and week, we take the average washing temperature, the average number of pods used, and the average overall rating across that week's loads. After aggregating the data, we used exploratory tools like JMP Graph Builder to check for linear trends in the overall rating, the temperature, and the number of pods used during the test weeks. Since, at the exploratory stage, OAR was pretty stable from week 9 to week 12, we use an intercept-only model for OAR. For the temperature and for the number of pods used, the exploratory stage showed either an increasing or a decreasing trend, so we use a linear growth model to describe the temperature indicators and the number-of-pods indicators. To explain the product impact, we also include the product manufacturer variable in our model. We first build a growth curve model for temperature and for number of pods, then use these latent variables to build a regression model from the product variable, the intercept of temperature, the slope of temperature, the intercept of the number of pods used, and the slope of the number of pods used to the intercept of OAR, giving the multivariate growth curve model. Now I will turn it over to Narayanan to introduce the latent growth curve model. Narayanan, it's your turn. Thank you, Zhiwu, for the great [inaudible 00:06:49]. Hi, everyone. My name is Narayanan. I am an adjunct professor at the University of Cincinnati, where I teach courses on data mining using JMP. I'd like to start by giving a very broad definition of latent growth curve modeling. As we go along, I may use the letters LGCM to represent latent growth curve modeling and SEM to represent structural equation modeling.
Latent growth curve modeling is basically a way to model longitudinal data within the SEM framework. Because it is built in the SEM framework, it has all the advantages of specifying and testing relationships, as Zhiwu was explaining with the structural equation modeling example. As a side note, I would like to mention that LGCM is actually an application of confirmatory factor analysis, which is itself a submodel within structural equation modeling, with an added mean structure; this will be explained when we get into JMP. One of the benefits of using the SEM framework is that we are able to evaluate model fit. Let us look at the statement there, which says that every model implies a covariance matrix and a mean structure. What this really means is that the observed covariance matrix and mean vector can be reproduced from the model parameter estimates, which are estimated using the latent growth curve model. The equality between the two is what many of these fit indices are actually testing. One of the oldest is the chi-square test, and the hypothesis it tests is listed there: equality between the population covariance matrix and mean vector and the model-predicted covariance matrix and mean vector. However, this test has some watch-outs. One is that the test statistic is a function of sample size, which means that larger samples will tend to reject the model even for trivial differences. Another is that the test is global and does not reflect local fit, such as could be measured by an R-square. Also, the fit being tested is exact, as specified in the hypothesis, and we know from the famous Box statement that all models are wrong; our models are only approximations. Because of this, several alternative fit measures have been proposed, and I'd like to mention three of them here. The first is the Root Mean Square Error of Approximation (RMSEA). This measures model misfit, adjusting for sample size, which was an issue with the chi-square test. It is a badness-of-fit measure, so lower numbers are better. One advantage of this measure is that it comes with a confidence interval, and the suggested threshold is that the upper bound of the confidence interval be less than 0.10. Next are the Comparative Fit Index (CFI) and the Non-Normed Fit Index. These are relative measures: they test how good your proposed model is compared to a baseline model, usually a model of no relationships. These are goodness-of-fit measures, and the suggested criterion is that they reach a threshold of at least 0.95. The last one is the Standardized Root Mean Square Residual (SRMR). This is essentially an average of the squared residuals across all the elements of the covariance matrix. It is a badness-of-fit measure, so again we are looking for smaller numbers, and the suggested threshold is a value below 0.08. On top of all this, finally, do not forget to check the actual standardized residuals.
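For reference, the two most commonly quoted indices above are usually written as follows (one common form; software can differ slightly, for example in using $N$ rather than $N-1$ in the RMSEA denominator), where $\chi^2_M$ and $df_M$ belong to the fitted model and $\chi^2_B$ and $df_B$ to the baseline model of no relationships:

$$\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\, 0)}{df_M\,(N-1)}}, \qquad \mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\, 0)}{\max(\chi^2_B - df_B,\, 0)}.$$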
In those standardized residuals, what we are looking for are numbers beyond the minus 2 to plus 2 range. The idea is to look at the totality of fit and not just any one measure. Having discussed fit measures, let us now look at the longitudinal processes we want to study. Zhiwu described three different processes. The first is the success criterion, as measured by the overall satisfaction rating from week 9 to week 12. Then we have two time-varying covariates, meaning they vary over time: the temperature setting at which the product was used, from week 5 to week 12, and the amount of product used, also from week 5 to week 12. Finally, we have an indicator variable for the type of product, which is a time-invariant covariate; it doesn't change with time. The modeling strategy we are going to use is this: first, we visualize the data using Graph Builder. Then we select a univariate latent growth curve model for each of the processes. Then we combine all of them into a multivariate LGCM. Finally, we test the hypothesis that Zhiwu proposed, which is how the product and the other growth factors impact overall satisfaction. We will choose the simplest model as we build. Now I am going to get into JMP. I am running JMP 18, which is an early adopter version, and I am going to show some scripts; I will also show you how I got to some of these from the JMP platforms. The first thing I want to do is visualize the overall satisfaction trajectories. These are basically individual lines from week 9 to week 12: overall satisfaction plotted for each of the 119 consumers, one trajectory per consumer. If you look at this particular consumer, row number 16, that person's trajectory is on a downward trend from week 9 through week 12. They started somewhere in the mid-50s, and by week 12 their satisfaction measure has come down to about 37.5 on a scale of 0-100. Let us look at another person. This person, who used the Ariel product, has a trajectory on an upward swing, going from the mid-70s to probably the early 90s by week 12. They are getting more and more satisfied week over week. Sorry for the bubble screen showing up. What we want to do is understand how different consumers experience satisfaction over the weeks, and the change in these processes for these consumers is what we want to model using LGCM. I'm going to turn on the script, LGCM of overall satisfaction. I have built three different models here, in which the latent variables correspond to an intercept (and, where relevant, slopes) for the repeated measures of overall satisfaction from week 9 through week 12, plus a fourth model that is a simplification of the first. The first is a no-growth model, which means different people have different levels of satisfaction at the beginning, week 9, but their trajectories flatten out and do not grow over time.
The second model is a linear growth model, which means the trajectories change in a linear fashion over time. The third is a quadratic model, where the trajectories change in a quadratic fashion over time. Finally, I've got a simplification of the first model, in which I assume homoscedasticity, that is, no change in the variance across time. I'm going to look at the fit measures I talked about and choose the model that fits best. What I'm looking for is low values of chi-square, high values of CFI (CFI goes on a scale from 0 to 1), and low values of RMSEA, which also goes on a scale from 0 to 1. It looks like all my models, no-growth, linear growth, and quadratic growth, fit the data equally well. However, I'm going to take the simplest of the models, because if I look at the estimates in the path diagram, many of the coefficients relating to the slope, whether the linear slope or the quadratic slope, are not significant, as shown by the dotted lines. In the linear growth model, we have an intercept, which measures the initial level of satisfaction, and a slope, which measures the rate of increase or decrease of satisfaction over time. All the estimates related to the slope are not significant, as indicated by the dotted lines, and the same is true for the quadratic model. Therefore, I'm going to take the simplest model, the no-growth model, for this process, overall satisfaction. Let me show you how I do this. In JMP, go under Analyze, pick Multivariate Methods, and choose the Structural Equation Models platform. Choose the repeated measures, in this case OAR from week 9 through week 12, drop them in the Model Variables box, and click OK. We now have these four repeated measures available as modeling variables in the path diagram area. I could build this model from scratch using the path diagram, but JMP has made it easier with shortcuts. Go under the Model Shortcuts red triangle, choose Longitudinal Analysis, and pick the linear latent growth curve or the intercept-only model. If I choose the intercept-only model, I get the path diagram you saw in my script, and if I run the model, I get the estimates and the fit statistics. If you want to add the linear growth model, do the same thing: come under Model Shortcuts, Longitudinal Analysis, and Linear Growth Curve Model. Now we have not only an initial level, represented by the intercept latent variable, but also the rate of growth of the process, represented by the slope latent variable. We can run this model: click Run, and you get the model estimates, which, as I showed you before, are not significant for the slope latent variable. You get the fit statistics right here in the Model Comparison table. So these models are easy to fit in JMP using the shortcuts available under the Model Shortcuts menu. I'm going to close the one I just created.
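To make those shortcut models concrete, for consumer $i$ and week $t$ (coded $t = 0, 1, 2, 3$ for weeks 9 through 12) the no-growth and linear growth specifications can be written as

$$\text{no growth: } y_{it} = \eta_{0i} + \varepsilon_{it}, \qquad \text{linear growth: } y_{it} = \eta_{0i} + \eta_{1i}\,t + \varepsilon_{it}, \qquad \eta_{0i} = \alpha_0 + \zeta_{0i}, \quad \eta_{1i} = \alpha_1 + \zeta_{1i},$$

where $\eta_{0i}$ and $\eta_{1i}$ are the intercept and slope latent variables, $\alpha_0$ and $\alpha_1$ are their means, and the $\zeta$ terms capture how individual consumers deviate from those means.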
We have so far built a univariate LGCM for a single process. I'm going to repeat the same thing for the other two growth processes, starting with the wash temperature trajectories. Let me show you how to do this in JMP. Under Graph, click on Graph Builder and open up the temperature variables. We want to look at temperature from week 5 through week 12, so drop them on the x-axis. For the type of graph, choose the last icon in the bar at the top; this is a parallel plot. There will be some smoothing associated with it, so drag the smoothness slider all the way to the left so there is no smoothing at all. Take the product variable, which is an indicator variable, and put it on Overlay. Now you get individual trajectories. If you want to add the average trajectory, choose the sixth icon on the toolbar from the left, holding the Shift key as you click it. Now you get the average trajectory of the temperature used over these eight weeks. Click Done to get the plot with more real estate. This is exactly the plot I showed using the script. You can clearly see that from week 7 onwards there might be growth in the temperature setting; it looks like people are increasing the temperature as time progresses from week 7 through week 12. I'm going to close this. We now have a graph to visualize the trajectories of the temperature setting, and we repeat the same exercise: choosing a model for that process. As before, I built the same three models: no-growth, linear growth, and quadratic growth. Looking at the fit statistics, this time we see a definite, significant improvement going from the no-growth to the linear growth model. The quadratic growth is only a marginal improvement over the linear growth model, and again, as before, all the estimates for the quadratic slope are not significant. To keep things simple, I'm going to choose the simpler model, which is linear growth for temperature. The last process is pod usage, the number of pods. Here we can clearly see an increasing trend, more so for the Ecolabel product, which means people are using more and more product when they use Ecolabel as compared to Ariel, the P&G product. I want to model this, so let me close that and click on the script for LGCM of pod usage. Looking at the fit statistics, again I see a good model fit, especially for the linear and the quadratic models. For the same reason as before, I'm going to choose the linear model. Here I want to show the estimates for the quadratic slope, and this is what I mean by not choosing the quadratic model: all the parameters pointing to it are not significant. Now we have a model for each of the three processes: a no-growth model for overall satisfaction, and a linear growth model for temperature and for pod usage. Now I'm going to put them all together in a multivariate latent growth curve model. This is basically all three processes put together.
Here, I want to show you the similarity between a confirmatory factor analysis model and a latent growth curve model, as was pointed out on an earlier slide. You can see that there is a mean structure added, the triangle with the number one, with lines going from it to each of the latent variables. If I right-click and use the Show option to hide the mean structure, you can see the familiar confirmatory factor analysis model, with latent variables and the indicators associated with each of them. We have a single latent variable, the intercept, for overall satisfaction. We have two latent variables for temperature, the initial intercept and the slope, and the same two latent variables for pod usage: the initial level, represented by int pods, and the rate of change of product usage, represented by slp pods, the slope of pods. Let me put the means back on. Now we can look at the estimates, which are really one of the important parts of the latent growth curve model. What we have here is an estimate of the initial level of satisfaction at week 9, since that was the starting time period for overall satisfaction; it's about 71 on a scale of 0-100. This is the average temperature setting at week 9, which is 36 degrees Celsius. Here is the product usage, 1.4 pouches, and here is the rate of change of product usage, the slope latent variable, which is about 0.02: people are using slightly more as time goes on. The overall fit of this model is also fairly good: CFI is exactly at the 0.95 threshold, and the upper bound of the RMSEA is definitely less than 0.1. Now we go to the last model, which is the hypothesis Zhiwu wanted to test, where we want to see whether the product indicator variable and the other growth factors have a significant impact on overall satisfaction. In order to remove clutter, I have not shown all the indicators. All we see are the circles, which represent the latent factors for each of the growth curve models, and a single product variable indicating the type of product. Again, let us look at the fit of this model. The fit is indeed good: we have 0.95 for the CFI and less than 0.1 for the upper bound of the RMSEA. We will look at more fit indices after we interpret some of the estimates. I'm going to interpret the solid lines, which are the significant coefficients. We have a significant product effect from the product variable to the intercept of overall satisfaction. This can be interpreted basically as a regression coefficient: the average level of satisfaction for the product coded 1 minus the average level of satisfaction for the product coded 0. Ariel is coded as product 0, so we have much more satisfaction with Ariel, a delta of negative 9 in favor of Ariel on a scale of 0-100. That is a big change, a delta in favor of the Ariel product. Let us look at the product effect on pods.
Let us look at the product effect on pods. Again, in the same way, this is the average amount of product used for the product coded 1 minus the product coded 0. This time, more of the Ecolabel product is being used, so if you are a manufacturer of Ariel, this is good news for you. The rate of change of product use is also higher for Ecolabel compared to Ariel, by about 0.02 pouches from week to week. Finally, we have the intercept of temperature having a negative impact on overall satisfaction, which means higher temperatures lead to less satisfaction. Remember, these are products which are marketed as cold-wash products; they should work better at cold temperatures, not at higher temperatures. I also want to show you where you can look for other fit statistics beyond what comes out in the model comparison table. Under the Structural Equation Models red triangle, if you check Fit Indices, which I have already checked, more fit indices are shown at the bottom. We want to look at the CFI and RMSEA, which we have already seen, and here is the Standardized Root Mean Square Residual, which I discussed; it is also exactly at the threshold of 0.08. All in all, in terms of fit indices, our model fits quite well. Finally, I told you not to forget the residuals. These are normalized residuals in terms of the measured variables. We have 21 measured variables, eight for pods, eight for temperature, four for overall satisfaction, and one for the product variable, so this is a 21-by-21 matrix. What we are looking for are numbers outside the plus 2 to minus 2 range. There are just too many numbers to look at in a table, but JMP produces a heatmap; the heatmap option is also under the red triangle. What we are looking for is dark red or dark blue. Here we have two dark reds, which are the relationships between pod usage at week 6 and temperature at week 12, and between pod usage at week 6 and temperature at week 9. Finally, we have one more, which is just a mirror image of the one here: the relationship between temperature at week 9 and temperature at week 10, which is not modeled. This could actually be modeled by adding an error covariance, which I did not do; if I had, the model would in fact be even better. I want to go back to the presentation and summarize what we have found. In terms of conclusions, we started with Graph Builder to visualize our trajectories, and we built latent growth curve models using the SEM platform. We extended from univariate to multivariate models. All our models, including the last one, had acceptable fit, in fact good fit. Product had a significant impact on OAR (overall satisfaction), which means Ariel is better than Ecolabel in terms of overall satisfaction, and a significant impact on the number of pods, which means less product was used for Ariel compared to Ecolabel, both overall and from week to week. The temperature intercept had a negative impact on OAR, which means people prefer a lower temperature setting to a higher one.
If you are a P&G manufacturer, this is good news for you, because Ariel works better than Ecolabel in the modeling framework we have presented. I'm going to turn it over to Zhiwu to talk about the next steps from these model results. Zhiwu? Thank you very much. Thank you, Narayanan, for an excellent presentation and a wonderful demo. As Narayanan mentioned, the modeling results show that the product has a significant impact on the overall satisfaction with the performance of the detergent products in our test. This result gives us the confidence to make a very clear claim: Ariel is well suited to cold washing and can be used in smaller amounts than regular products. The modeling also confirms the change in consumer behavior: if you use the Ariel product, more of your washing loads go to cold wash, using less energy and less product. We also plan to conduct a larger consumer study so that future modeling can include more covariates, such as additive usage, the wash cycle used for each wash, and the load size per wash. That is our next step. Next slide. Now we would like to take questions if you have any. Thank you very much for attending the presentation. We look forward to your questions, probably at the JMP Summit.
Effective data analytics is critical for the success of product development. Product development processes for adhesives and sealants involve data analytics for tasks needed at various project phases. Too many formulation chemists still depend on Excel for daily data recording, formulating, and analysis. While some chemists use JMP for statistical analysis and experimental design, JMP’s powerful and versatile data analytics should be used more universally for many daily tasks in product development. When shown how widely JMP can be used, more formulation chemists should see the benefits of adopting JMP as a daily tool.   This presentation shares examples of how easily JMP can be used in daily data analytics, in addition to the statistical analysis and DOE that most chemists use. A variety of examples are given, including handling data from literature and patent searches, analysis of raw materials, and a variety of formulation-related tasks such as formula stoichiometry calculation, mixing volume balance, data recording, and in tabulate/graph analysis. In addition, this presentation demonstrates how JMP is used for product selection, application trouble shooting, failure mode analysis, DOE, and more.     Hi ,  this  is  Stone  Cheng , I 'm  a  technical  director   in  Henkel  Corporation . I 've  been  using  JMP  for  more  than five  years  in  product  development . Today  I 'm  happy  to  share with  our  best  practice  in  utilizing  JMP  data  analytics in  product  development of  adhesive  and  sealants . My  presentation  has  two  parts . The  part  one  is  application  gallery where  JMP  used  in  various  stage   in  product  development  will  be  discussed, and  in  part  two  I  will  focus  on  using  JMP as  a  formulation  worksheet  with  the  demonstration . In  my  organization , folks  either  have  not  heard  of  JMP or  have  an  impression  that  JMP   is  an advanced  DOE  software . For  the  latter,  it  is  true , but  it 's  not  the  whole  truth . Since  there  are  other   DOE  software  options , it  is  hard  to  persuade  folks to  switch  away from  tool  they  are  familiar  with . JMP  is  an  excellent  all- around  data  analytics  tool . To  promote  JMP  adoption , we  need  example  to  demonstrate   its  power  in  the  data  analytics . In  my  presentation , the  example  are  taken  from  my  experience in  adhesive  and  sealants . About  Henkel . Henkel  is  a  22- billion  enterprise   with  two  business  unit : adhesive  technology and  consumer  brand  of  laundry , home,  and  beauty  care . I 'm  in  the  adhesive  technology  unit . We  are  global  adhesive  leader serving  800  industries with  more  than  20 ,000  products . Let 's  start  with  case  number  1 . One  customer  has  a  need , they  may  call  a  service  center asking  for  product  recommendation . For  example ,  a  customer  may  need   a  room  temperature  adhesive  hook with  the  shear  strength  between  1 ,500 -3 ,000 and  a  work  life  between  15 -30  minutes and  in  the  package  of  10 . What  we  need   is  a  searchable  product  selected  guide , what  is  showing  in  our  case  1 . Once  the  product  information  are  entered  into  the  JMP  table , one  can  use  the  tabulate   and  local  data  filter to  downselect  the  product based  on  the  customer  requirement as  shown  in  here . This  is  a  very,  very  powerful  tool in  the  preformation  stage . 
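A minimal JSL sketch of such a searchable selection guide, assuming a hypothetical product table with columns such as Product, Cure Type, Shear Strength, Work Life, and Package Size:

dt = Current Data Table();  // the product guide table
// One row per product with the attributes customers ask about
tab = dt << Tabulate(
	Add Table(
		Row Table( Grouping Columns( :Product ) ),
		Column Table( Analysis Columns( :Name("Shear Strength"), :Name("Work Life"), :Name("Package Size") ) )
	)
);
// Local data filter to down-select on the customer's requirements
tab << Local Data Filter(
	Add Filter( columns( :Name("Cure Type"), :Name("Shear Strength"), :Name("Work Life"), :Name("Package Size") ) )
);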
My team has been applying this tool to search formulations, patents, literature, and raw material specifications. In case number 2, multivariate analysis is applied to silver fillers, which are used in making conductive adhesives. From the multivariate analysis, if we focus on the surface area, it has a moderate negative correlation with the tap density and a stronger positive correlation with the lubricant amount, as measured by the weight loss at 530 °C. There is a graphical view here and a table format here. With this analysis, chemists can select the right silver package for the application. Case 3 is about literature. Literature is a great place for learning chemistry and formulation, and this particular cited paper illustrates how epoxy tensile strength is affected by a mixture of two amine hardeners. The results are described qualitatively by a table and by a graph of the stress-strain curves. For a formulator, a qualitative description is good, but learning via quantitative modeling is even better for prediction purposes. In case number 3, we take the literature data and create two factor columns. One is the epoxy/amine stoichiometry, or what we call the index; the other is the fraction of one of the amines, PAE, relative to the total amine. Using the Fit Model platform with an RSM model, it turns out that the quadratic effect of the index, together with the two main effects, is significant. The prediction profiler and the contour profiler are used to quantify the learning and give predictions. In case number 4, my group was assigned to support a technology platform that includes about 30 products. Since we were not involved in the original formulation development, studying the formulation family in this case is not trivial, and looking at a big Excel table with all the formulations is certainly not effective either. Case number 4 is an example of addressing this challenge. I selected the three most-used ingredients across these 29 formulations, monomer 1, monomer 2, and an oligomer, and then used hierarchical clustering. Right here, a group of formulations was identified with very small distances between them, implying that they are closely related; in fact, they differ only in the photoinitiator used for different curing wavelengths. We can add more ingredients one by one to the hierarchical clustering and learn the formulation family with this method. Most chemists analyze formulation performance in Excel. Case number 5 is a JMP tabulate that has the same data format as Excel: the various pieces of information for a formula are displayed in the same column, with a heading, the recipe, the processing and material characteristics, and all the results, just as you see here. To create such a data structure, we need to enter data in the JMP table in a special format, which will be discussed in detail in part two of my presentation.
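Before moving on, here is a minimal JSL sketch of the case 3 model launch. The column names (Index, PAE Fraction, Tensile Strength) are hypothetical stand-ins for the actual literature data:

dt = Current Data Table();
// Quadratic effect of the stoichiometric index plus the two main effects
dt << Fit Model(
	Y( :Name("Tensile Strength") ),
	Effects( :Index, :Name("PAE Fraction"), :Index * :Index ),
	Personality( "Standard Least Squares" ),
	Run
);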
Case 6 is a silicone study involving 23 formulations and more than 10 measured properties. It is quite overwhelming to try to analyze the raw data in such a big system. We first constructed a series of graphs with the property retention on Y and the initial property on X. For example, in the first graph here, the adhesion retention is on Y and the initial adhesion is on X, and we add reference lines marking the acceptance limit for each axis. When a formulation is selected, for example this point, which sits in the quadrant with acceptable initial adhesion and retention, all of its associated properties, such as tensile strength, elongation, and hardness, light up at the same time, along with its formulation, thanks to automatic highlighting and JMP's dynamic linking capability. Visual analysis in this way is very effective for chemists to understand the overall behavior of the system. In polymer science, we measure the modulus of a polymer as a function of temperature with an instrument called a dynamic mechanical analyzer, or DMA. DMA data, which include temperature, modulus, and tan delta, are typically transferred to Excel in a wide format for plotting. Overlaying several DMA curves for comparison is doable in Excel, but it is not a trivial effort. In case number 7, we stacked 74 DMA results together, and with Graph Builder we can compare DMA results very quickly, just by clicking. I cannot imagine doing the same in an Excel sheet with 222 columns, which is basically 74 samples times 3 signals per sample; it would be very difficult to handle in the Excel environment. Graph Builder is excellent at turning a very complicated Excel graph into a visually digestible analysis. In case number 8, the needle pull strength is illustrated in Graph Builder using four variables: eight adhesives across the top, three different radiation systems on the Y-axis, four radiation times, and two needle hub combinations. See how much easier it is to understand this JMP graph compared to the Excel graph right here. In case number 9, we conducted an accelerated aging study of four epoxy prototype formulations, measuring their initial adhesion on three substrates with three replicates. The aging conditions were two different temperatures and eight weeks of aging time with a two-week testing interval. The aging design and the data were initially recorded in Excel; we converted the Excel data into a JMP table with only seven columns, in a stacked format, and then made a graph. You will agree that the visualization in Graph Builder makes it much easier to see the aging performance than looking at the busy Excel table here.
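A minimal JSL sketch of the case 7 setup, assuming the 74 DMA runs have already been stacked into hypothetical columns Sample, Temperature, Signal, and Value:

dma = Current Data Table();
// Overlay DMA curves by sample, with one panel per signal (modulus, tan delta, ...)
gb = dma << Graph Builder(
	Variables( X( :Temperature ), Y( :Value ), Group Y( :Signal ), Overlay( :Sample ) ),
	Elements( Line( X, Y, Legend( 1 ) ) )
);
// Click through samples to pick which curves to compare
gb << Local Data Filter( Add Filter( columns( :Sample ) ) );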
In the formulation stage, we frequently need to optimize compositions. Case number 10 is an example where two catalysts in a polyurethane are optimized with a DOE; the factors are the catalyst ratio and the total catalyst amount. It is a 10-run face-centered central composite design, and the prediction profiler indicates that the total catalyst amount has a nonlinear effect on the work life. The white area in the contour profiler is the sweet spot of the design space with the desired work life, in this case 20-28. It is important for the chemist to select the green-highlighted white area for better production robustness rather than the area [inaudible 00:12:10] in blue, because the white area has a higher tolerance for changes in the amounts in case an operator makes a minor mistake. That is why it gives better production robustness. We routinely see chemists perform statistical analysis of adhesion data like what you see here, but we hardly see anybody presenting results from failure mode analysis. In JMP, failure mode analysis can be performed in two places: one is the contingency analysis in the Fit Y by X platform, and the second is Graph Builder. Case number 11 is an example applied to a silicone sealant, where the change in failure mode before and after high-temperature aging is clearly shown. Besides good adhesion, the adhesive rheology needs to be formulated so that the product can be applied effectively to the substrate. We had a project to develop a seam sealant that protects metal joints, applied by a spraying process. Case number 12 illustrates the complication of the spraying process with respect to variation in the sealant viscosity. We have three spraying variables, the pressure, the nozzle type, and the head travel speed, and one material factor, the viscosity. Initially we planned the experiment as a factorial DOE, but one of the factors, the spraying pressure, was very hard to control. We ended up performing a regression on the 40 runs with an RSM model, using the sprayed strand width and thickness as the responses. We got good models, with an R-squared of about 0.91 for width and 0.81 for thickness. The modeling results tell us that the spraying conditions need to be adjusted depending on the sealant viscosity, which is illustrated in this prediction profiler. Each processing parameter has its own machine limits and its own desirable operating limits. If the predicted processing variables fall outside those limits, then the chemist needs to redesign the formulation rheology and make sure that manufacturing has the process capability to meet the viscosity requirement. This example shows that formulation design and application constraints need to be considered side by side, and JMP is a very good tool for facilitating this type of study. Case number 13 is an example where JMP is used to handle huge instrumentation data sets. In testing thermal interface materials, the temperatures at different locations and the power consumption data are collected and uploaded to JMP. Once the data are in a JMP table, visualization and analysis of data sets with as many as 500,000 rows are still very manageable, with fast response. That means JMP can readily be used to handle instrumentation data.
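Returning briefly to the failure mode analysis in case 11, here is a minimal JSL sketch of the contingency launch. The column names are hypothetical: Failure Mode is a nominal column (for example cohesive versus adhesive failure), and Condition codes before versus after heat aging:

dt = Current Data Table();
// Contingency analysis (Fit Y by X with two nominal variables) plus a mosaic plot
dt << Contingency(
	Y( :Name("Failure Mode") ),
	X( :Condition ),
	Mosaic Plot( 1 ),
	Contingency Table( 1 )
);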
We had a project to apply adhesive to a sensor by stencil printing, and this application was challenged by a pinhole defect issue. A process engineer changed six processing variables in an ad hoc way and collected results from 21 runs. His data analysis did not reveal any particular trend, so JMP was then used for the troubleshooting, and partition analysis identified factor F as the key factor. Later on, we applied predictor screening and identified an additional factor, D, that needed further investigation. From JMP training, we learned that predictor screening can identify predictors that may be weak alone but strong when used in combination with other predictors. In the scale-up and manufacturing stage, when a production batch runs into an issue, raw material lot-to-lot analysis is one of the troubleshooting items used to isolate a potential raw material effect. This exercise is typically done in an Excel table, but when multiple raw materials and multiple lots of each raw material are involved, it is difficult to analyze the raw material effect by looking at a huge Excel table. In case number 15, a polyester formulation with three raw materials and about 45 separate lots is plotted versus the date of manufacture, with a color scale showing the gel time. This heat map gives the production engineer a visual way to determine whether a particular lot of raw material is the major cause of the out-of-spec batches. We turned the Excel table into a visual analysis. Statistical comparisons with t-tests or ANOVA are performed routinely in product development. A product benchmarking exercise typically involves multiple products run under various testing protocols, aiming at a very comprehensive product comparison. Case 16 is an example of statistical analysis involving a large combination of 23 products and more than 10 testing protocols. In JMP, a large volume of statistical analysis is not a challenge, since creating subtables is not required, unlike in other software. One can use the column switcher and the local data filter to create all the combinations of property and adhesive for statistical analysis, and the results of each analysis can be copied into a JMP journal to streamline the reporting. For case number 17, the needle bond testing of light-cured adhesives has historically shown high data variance. Case 17 uses JMP to summarize 18 reports of needle bond testing, which involve multiple lots of adhesive tested at various times. The needle pull strength and its COV are plotted in Graph Builder under various light radiation conditions and substrate combinations. With the local data filter here, one can easily change the selection criteria to get a comprehensive comparison of these adhesives and the consistency of their performance. When these results were presented, everyone was amazed at JMP's capability; it is so versatile and so powerful.
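Going back to the troubleshooting in case 14, a minimal JSL sketch of the two screening launches. The response name (Pinhole Count) and the placeholder factor names A through F for the six processing variables are illustrative assumptions:

dt = Current Data Table();
// Partition (decision tree) to find the dominant factor
dt << Partition( Y( :Name("Pinhole Count") ), X( :A, :B, :C, :D, :E, :F ) );
// Predictor screening to rank all six factors, including ones weak alone but strong in combination
dt << Predictor Screening( Y( :Name("Pinhole Count") ), X( :A, :B, :C, :D, :E, :F ) );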
This is the last case in the application gallery. In case number 18, we use the parallel plot feature in Graph Builder for a visual comparison of 15 performance items across 10 adhesives. Each performance item keeps its own unit and scale, which makes the comparison more quantitative than the qualitative comparison you get from the spider charts typically used in Excel. So far, in the 18 application gallery examples, the data have come from literature, instrumentation, and processing, with not much emphasis on formulation. Now we will switch gears to discuss formulation creation using worksheets, and it is a JMP-based worksheet, not the traditional Excel one. Before we show the JMP worksheet, let's discuss adhesive types. Broadly speaking, adhesives can be divided into two categories: one-component or two-component adhesives, that is, 1K or 2K. A 1K system, like the Super Glue everybody knows, requires no mixing and can be cured by moisture, by light, by heat, or by other methods. If we are dealing with a one-component heat-cure adhesive such as an epoxy, then we need to design and calculate the stoichiometry, or index, to balance the proportion of epoxy to amine hardener. In a two-component, or 2K, system, the mixture reacts at ambient temperature, so the two parts are kept apart before use. In a 2K system, the stoichiometry needs to be designed and calculated based on the desired mixing ratio, either by weight or by volume. So there are formulation calculations we need to perform, and this type of calculation and design has historically been done in Excel. Everybody knows that an Excel spreadsheet allows mixed data types in the same column and that its formulas can be applied at the individual cell level, which makes it very flexible as a formulation calculation worksheet. Formulations are typically organized in a column format like this: each column holds the full group of formulation information, such as the heading (the ID), the recipe ingredients, the formulation characteristics or processing parameters, followed by the results. The Excel-based worksheet is very useful, and everybody uses it because it is easy to learn, but it does come with some shortcomings. First of all, row matching: when you have a new ingredient or a new test result, you need to match it to the right row, and that takes time. Second, you may need to hide or unhide columns for comparison. Third, it is harder to analyze the data when results are spread across different tabs, in a tab-to-tab format, and it is also very difficult to make a graph from that kind of data structure. JMP offers webinars on going beyond the Excel spreadsheet with the various features listed here, but worksheet calculation is not emphasized.
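To make the stoichiometry calculation mentioned above concrete, here is a minimal JSL sketch of an index column. The column names (Epoxy Wt, Epoxy EEW, Amine Wt, Amine AHEW) and the convention (amine equivalents divided by epoxy equivalents) are illustrative assumptions; an actual worksheet may define the index differently.

dt = Current Data Table();
// Index = (amine equivalents) / (epoxy equivalents),
// where equivalents = weight / equivalent weight (EEW for epoxy, AHEW for amine)
dt << New Column( "Index",
	Numeric, "Continuous",
	Formula(
		( :Name("Amine Wt") / :Name("Amine AHEW") ) /
		( :Name("Epoxy Wt") / :Name("Epoxy EEW") )
	)
);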
Perhaps the reason worksheet calculation is not emphasized is the inherent data structure: each column cannot have a mixed data type, and a column formula applies to the entire column, which is not as versatile or flexible as Excel. Despite these constraints, we have developed a JMP worksheet with the following objectives in mind. It should have broad capability for formulation design, calculation, recording, and analysis, all in one, minimizing cross-platform copy-pasting. It should be easy to operate, with easy data entry, using JSL for much of the automation. And the final data set should be ready for machine learning exercises. Let's look at our Gen1 worksheet, which is for one-component systems. It includes four data groups: a formulation ID, a recipe, the material and processing characteristics, and the testing results. The four data groups are the same as in the earlier Excel worksheet, but here the structure is organized in columns from left to right, different from Excel, which runs from top to bottom. The data for groups 2, 3, and 4 are stacked and recorded in the same column, which has a numeric data type: all the recipe entries, all the testing results, and all the formulation characteristics are numeric and documented in the same column here. The data are also stacked by formulation, with formulation 1 here and formulation 2 here. With the stacked format, one can freely enter a new ingredient or a new test item without needing to match rows as in Excel. JSL scripts were also created to enable data analysis in either a tabular or a graphical format. This is the tabular view: a chemist can pick several formulation IDs and compare their recipes, characteristics, and performance in a very condensed format. This is very different from Excel, with no need to hide and unhide columns to bring formulations next to each other; it is much, much easier in the JMP format. Besides tabulation, one can make a graph of a property versus another property or versus the sample ID, although not versus ingredient percentage. This graph can be combined with the recipe table into a group using the dashboard operation, which makes for very effective visual analysis. For testing that involves multiple replicates, we typically just record the average result, but one can also enter the individual replicate data in the property column and then perform t-tests or ANOVA using this worksheet. If people do not want to enter data that way, there is another option: virtually link a data file holding the replicate results to the worksheet. That will be shown later in the presentation. So far, what you have seen is our Gen1 worksheet, which involves no formulation calculation. Chemists in my group have been using this tool for more than a year, and they have gotten used to its easy data entry and very powerful tabulation analysis.
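A minimal JSL sketch of that tabulated comparison on the stacked worksheet, assuming hypothetical columns Formulation ID, Item (ingredient, characteristic, or test name), and Value:

ws = Current Data Table();
// One column per selected formulation, one row per ingredient or property
tab = ws << Tabulate(
	Add Table(
		Row Table( Grouping Columns( :Item ) ),
		Column Table( Grouping Columns( :Name("Formulation ID") ), Analysis Columns( :Value ) )
	)
);
// Pick which formulations to place side by side
tab << Local Data Filter( Add Filter( columns( :Name("Formulation ID") ) ) );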
Next we're going to look at the Gen2 worksheet, which keeps the Gen1 features and adds formulation calculation for 1K and 2K systems. This worksheet also links to another JMP file that holds additional raw material information needed for the calculations. We have another worksheet, which we call Gen3, designed for solvent-borne systems; it also allows the formulator to incorporate masterbatches, but due to time constraints it will not be discussed here. This is our Gen2 worksheet. There are three sections: the heading and the formulation input section right here; the calculation output in the middle; and the third section with the processing and material characteristics and the testing results. Sections 1 and 3 are like those in Gen1, but section 2 is newly added. The raw material name column is used to link to the reference file that has the additional data needed for the calculation; you can see the virtual link symbol right here. After the chemist enters the formulation ID, they specify the part, the raw material name, and the initial weight. If they are building a 2K system, they also specify the mixing ratio, either by index, by volume, or by weight. The worksheet then outputs the mixing ratio characteristics, again by index, by volume, and by weight, and also provides the normalized composition, either by part, meaning parts A and B each sum to 100 on their own, or with A and B mixed together, which we call normalized by total. After reviewing this, the chemist can perform the experiment and then come back to enter the results right here. Also, in the property and material characteristics section, we have another column called Lookup. It extracts information from the calculation, such as the raw material fractions and ratios, and displays it automatically right here. The chemist then just needs to copy the parameters into the value-entry column, and they are automatically carried over to the two normalized-percentage columns for display. We also have three JSL scripts to facilitate the analysis. The first one shows the composition normalized by total, meaning A and B are mixed together and sum to 100; it shows the formula, the characteristics, and the results. The second JSL script normalizes by part: you can see the part A formulation and the part B formulation, each normalized to 100 by itself. With the third JSL script, we can change the formulation worksheet from the stacked format to the wide format, where the ID, the performance, the individual ingredients, and the characteristics each get their own column. With this format, one can graph a property versus an ingredient percentage, which cannot be done in the stacked format.
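A minimal JSL sketch of that stacked-to-wide step, using the same hypothetical stacked columns as before (Formulation ID, Item, Value):

ws = Current Data Table();
// One row per formulation; each ingredient or property becomes its own column
wide = ws << Split(
	Split By( :Item ),
	Split( :Value ),
	Group( :Name("Formulation ID") ),
	Remaining Columns( Drop All )
);

In the wide table, Graph Builder can then put an ingredient percentage on an axis, and the row-per-formulation shape is the structure needed for modeling and machine learning.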
One  can  also  looking for  the  correlation between  the  performance  or  the  performance with  the  formulation  characteristic . At  this  moment ,  I  like  to show  you  the  live  demonstration . This  is  the  formulation  worksheet I  just  showed  you  in  the  PowerPoint . Basically ,  we  have  the  heading . Then  we  have  a  formulation  input  section . We  have  a  calculation  between  n1  and  n2 . Anything  here  is  for  calculation . Then  we  have  the  last  section  here , that  is  a  performance  and  then the  property  material  characteristic . I  mentioned  that  we  have  a  JSL , allow  people  to  look at  this  result  easily . Let 's  look  at  this  one ,  JSL  by  total . We  can  easily  highlight  any  formulation  or  compare  2  and  8 , and  then  compare   their  formulation  and  their  result . These  are  mixed  together . We  can  look  at  it  by  part . Part  A  here  and  then  part  B  here . They  all  sum  up  to  a  hundred   by  themselves . Easily ,  we  can  compare … Oh  no ,  I  need  to  remove  this  one  first . I  can  compare  formulation  easily   by  manipulating  the  local  data  filter . Again  with  the  JSL ,  we  click  the  Join  All . We  are  turning  the  stack  format  into  a  wider  format . Each  row  belong  to  one  formulation with  the  heading  here , with  their  property , with  their  formulation , and  with  their  formulation characteristics  showing  right  here . For  machine  learning , we  can  highlight  a  role  ingredient and  then  just manually  add  zero  so  that  each ingredient  has  zero  or  whatever , and  then  now  we  can  do  this  one . We  can  create  a  summation  or  something , easy  to  operate  in  this . I 'm  going  to  show  you  next how  this  one  work  in  the  sense  that assuming  that  we 're  going to  create  a  formulation . I 'm  going  to  copy  the  heading . Sorry ,  I 'm  going  to  delete  everything  here because  I  create  this  one  already  before . I 'm  going  to  delete the  demonstration  one . I 'm  going  to  create  it  from  scratch by  copying  the  heading  here . I  change  the  name  to  Demonstration  here . I  will  copy  the  formulation  because I 'm  going  to  modify  formulation from  this  one ,  the  DOE  8 . Then  the  DOE  8  is  based  on   one -to -one  mixing  ratio  by  volume . But  in  this  new  one ,  we  could change  it  to  one -to -two  mixing . A  divided  by  B  is  one  divided  by  two , so  it  will  be  0 .5 . Then  I  copy  the  heading  including the  mixing  ratio  all  the  way  down . Now  all  the  calculation has  been  done  here . With  this  weight  percentage  I 'm  entering , it  showed  that  the  material  has  an  index model  ratio  A  to  B   to  be  0 .65 ,  which  is  too  low . We  need  to ,  using  our  chemistry knowledge ,  to  turn  this  around . In  this  case ,  for  example , I  make  this  one  2 . I  can  easily  make  this  one  into  1 .05 . That  is  the  range  I 'm  looking  for . Basically ,  assuming  it  is  the  design that  we  want ,  formulation  we  want , the  next  thing  we  want  to  do  is  to  copy some  of  the  testing   that  we  already  had  before , that  we  are  monitoring  before , but  without  the  results ,  of  course . We  have  a  new  result  here , so  I 'm  going  to  delete  that  one . But  we  also  want   to  add  additional  property which  for  example  is  viscosity measure  at  a  room  temperature . 
With  this  section  here , then  we  want  to  extend  our  heading to  specify  those  are  belong to  this  formulation . As  soon  as  I  specify  the  heading , the  Lookup  automatically  give  me the  information  such  as the  missing  characteristic . 1 .5  or  0 .5 ,  they  are  automatically  copied  to  here  through  the  Lookup  function and  then  the  feeder  loading   in  the  formulation  normalized  to  Total while  also  being  extracted , sum  up  together  and  put  it  right  here . Now  I  can  copy  this  information , put  them  in  value  enter , and  specify  my  mixer  is  number  2 , and  then  start  to  enter  my  results , time  that 's  going  to  be  80 , and  adhesion  450  assuming ,   viscosity  20 ,000 . I 'm  pretty  much  finished  everything , so  let 's  look  at  the  result  here . We  just  enter  Demonstration . This  one  was  based on  the  DOE  number  five . DOE  number  five  is  one  to  one  mixing and  this  Demo  is  only  one  to  two  mixing , and  we  added  the  viscosity result  right  here . It 's  very ,  very  easy . One  click  you  see  the  result and  in  the  format  it 's  very  easy to  understand  for  comparison . This  is  the  end  of  my  demonstration . Let  me  go  back  to  the  presentation  here . We  consider  the  JMP  worksheet  that  I 'm   just  showing  you  is  an  integrated  platform and  here  is  the  summary . The  worksheet  in  the  stack  format ,  here , is  used  for  formulation  design , calculation  and  for  recording  the  results . The  data  entry  of  raw  material which  is  needed  for  the  worksheet  is minimized  by  virtually  linked  with the  other  file  that  has  additional raw  material  information . JSL  was  widely  used  to  automate the  worksheet  output  to  the  tabulate , to  graphic ,  to  the  statistic  analysis , and  also  to  create  a  table   with  wide  data  format . The  wide  data  format ,   they  already  have  a  data  structure for  modeling  via  the  machine  learning and  also  allow  the  graphical  analysis using  the  ingredient  as  one  of  the  axis . Then  since  each  of  the  row in  this  wide  format is  a  unique  tool  formulation  ID , this  actually  can  be  used as  a  reference  table to  join  the  other  JMP  file  that  has   a  testing  result  that  has  a  replication . When  these  are  joined  together , then  we  can  plot  the  raw  data  and  do  statistic  analysis , either  as  function  of  the  ingredient   or  as  function  of  the  formulation  ID . This  JMP  Integrated  Worksheet  Platform truly  illustrates  it  is  an   all -in -one  platform ,  very ,  very  capable . In  summary ,  JMP  is  not just  an  advanced  DOE  software . JMP 's  data  analytics   has  been  effectively  utilized in  my  group  for  product  development at  various  stage  to  speed  up the  innovation  process . JMP -based  formulation  worksheet  is an  integrated  platform  that  feature broad  formulation  capability ,   all  in  one ,  easy  operation , and  machine  learning  ready  data  structure , and  more  and  more waiting  to  be  further  explored . With  this ,  thanks  for  your  attention and  I  also  like  to  acknowledge  the  people I  work  with  and  learning  to  JMP  together and  also  our  management  system for  supporting  JMP  adoption  initiative . Thank  you  very  much .
The development of innovative new products can be accelerated using statistically optimized DOE and regression modeling. With the goal of maximizing efficiency and reducing expense, it is often tempting to limit the collected data to key product attributes, such as customer specifications or internal quality metrics.  However, increasing the number of available responses by including a wider range of more fundamental measurements in the analysis can often be critical to success.   This talk covers examples from projects at PPG’s Coatings Innovation Center. We highlight the use of JMP from design through analysis to visualization of the results for a fractional factorial and a constrained mixture/process design. Using tools such as column switcher, multivariate plots, pairwise correlation and mixture profiler, we demonstrate how the inclusion of ancillary responses helped to develop a deeper understanding of the systems being studied and pinpoint the causes behind unexpected results.     My  name  is  David  Fenn. I  work  for  PPG  Industries at  their  Coatings  Innovation  Center just  outside  of  Pittsburgh. Today  I'm  going  to  be  talking about  some  of  our  experiences with  using  DOE for  novel  coatings  development, particularly  focusing  on  the  importance of  using  auxiliary  responses. The  agenda,  I'll  talk  a  little  bit about  research  methodology, and  in  case  you're  wondering what  I  mean  by  auxiliary  responses, I'll  define  that  in  this  section. Then  we'll  go on  to  two  examples. The  first  one,  a  new  resin  development for  architectural  coatings, and  the  second  one, a  protective  coating, then  we'll  finish  off with  a  few  general  observations. There  are  various  frameworks that  can  be  used  to  describe the  new  product  development  process. One  that  I  particularly  like is  shown  here,  DMADV. Here  we  have  five  stages. The  first  stage  is  to  define the  goals  of  the  project. What  are  we  trying  to  achieve? Then  we  get  into  measurement. What  are  the  critical  characteristics we  need  to  measure, and  do  we  have  suitable  processes in  place  to  measure  them? Then  we  think  about  analyze . What  factors  can  we  change to  make  improvements? Then  onto  the  design  stage, where  we  deliberately  manipulate those  factors and  the  levels  of  those  factors to  try  and  affect  an  improvement and  lead  to  optimum  performance. Then  once  we  have  an  advanced  prototype, we  get  onto  verification, thinking  about  will  our  solution  work in  the  real  world? What  are  the  important  things we  need  to  think  about when  we  apply  this  framework? Well,  in  terms  of  the  defined  stage, the  goal  needs  to  be  clear so  that  the  whole  organization  has the  same  understanding  of  the  goals, and  it  needs  to  be  impactful. If  we're  successful and  we  deliver  a  solution, will  it  fill  a  real  unmet  need in  the  marketplace and  be  a  successful  product? If  we  skip  now  to  the  end, any  solution  we  apply needs  to  be  cost- effective. It  needs  to  be  robust. Then  the  middle  of  this  process, we  want  to  get  through  this  process as  quickly  and  as  efficiently  as  we  can. We  want  to  deliver  the  product to  the  marketplace  as  soon  as  we  can, and  we  want  to  expend the  minimum  amount  of  cash and  the  minimum  amount of  resource  to  do  that. 
Clearly,  DOEs  and  a  lot  of  the  tools that  are  available  in  JMP are  well  set  up to  make  us  succeed  in  this  area. One  of  the  tools  that  I  like  to  use particularly  early  on  in  a  project is  a  process  map. This  is  a  very  particular  type of  process  map. It's  really  mapping the  process  of  carrying  out the  research  and  development. I'm  showing  here a  simplified  example  of  a  process  map to  develop  an  automotive  base  coat. We  have  all  the  steps that  are  involved  in  our  experiment. We  make  a  resin, we  use  that  resin to  make  a  base  coat  paint, we  spray  apply  that  base  coat onto  a  substrate, we  apply  a  top  coat onto  that  base  coat, then  we  cure  those  coatings  together, and  then  we  measure  the  properties that  we  get  from  all  of  that. All  of  these  steps, we  list  all  of  the  factors  that  might  play a  role  in  these  separate  steps. This  is  useful  for  a  number  of  reasons. First  of  all, it  gives  everybody  in  the  team a  unified  understanding of  what  the  process  is  we're  dealing  with and  how  are  we  going  to  affect  it. It  also  allows  us  to  capture all  of  the  variables  we  can  think  of that  might  play  a  role in  the  various  steps so  we  don't  overlook  anything. Then  it's  a  good  starting  point for  thinking  about  which  of  these are  we  going  to  try  and  manipulate, which  of  these  are  we  going  to  focus  on to  try  and  deliver  a  successful  project? These  factors  are  further subdivided  and  categorized. First,  we  have  our  Xs. These  are  the  variables that  we  can  manipulate to  try  and  affect  an  improvement in  our  product  or  our  process. Then  we  have  our  big  Ys. These  probably  appear in  the  specification  of  the  product. These  are  what we're  really  trying  to  achieve. This  is  what  the  customer really  cares  about, what  the  customer  will  pay  for. Next,  we  have  our  Ns, noise  variables. These  could  be  variables that  we  may  be  not  controlling, we're  not  deliberately  manipulating, but  things  that  could  introduce noise  into  the  process, either  during  the  experiments, during  the  new  product  development, or  in  the  end  application, in  the  manufacture  of  the  product or  the  end  use  of  the  product. Then  finally, the  subject  of  today's  talk, we  have  our  auxiliary  responses, which  we  label  as  little  Ys. These  might  not  appear in  the  specification, the  customer  might  not  even be  aware  of  these, but  they're  measurements  we  can  take at  various  stages  of  the  process that  might  tell  us  something about  what's  going  on. I  said  in  the  previous  slide that  one  of  our  goals is  to  get  through this  whole  process  quickly, as in  efficiently  as  we  possibly  can. One  question  that  raises  is, why  don't  we  just  measure  our  big  Ys? We  have  the  ability  to  carry  out  DOE's. We  could  optimize  for  our   big Ys, we  could  build  predictive  models. Isn't  that  all  we  need  to  do? Why  should  we  spend  time? Why  should  we  spend  money measuring  some  of  these   little Ys when  they're  not the  real  goal  of  the  outcome? Well,  I  hope  in  the  next  couple of  examples  that  I  can  show  you, some  cases  where  carefully  selecting these  little  Ys and  doing  some  good  analysis can  be  really  critical to  the  success  of  a  project. 
Our  first  example  here, the  development  of  a  new  resin for  architectural  coatings. The  goal  was  to  come  up with  a  single  resin  that  could  meet all  of  the  performance  requirements across  several  product  lines in  several  countries. Our  starting  point  was, we  had  no  single  resin that  could  meet  all  those  requirements. We  were  using  different  resins in  different  products, different  resins  in  different  countries, and  we  needed  to  come  up  with  a  solution that  allowed  us  to  reduce the  complexity  there. Our  early  prototype  struggled in  a  number  of  areas, but  one  particular  area was  tint  strength. The  way  these  white  base  paints would  be  used is  if  I  go  into  a  store  and  request a  paint  of  a  particular  color to  paint  the  walls  of  my  house, the  store  will  take  that  white  base  paint and  add  specified  amounts of  concentrated  color  toners  to  that  paint to  create  a  specific  color. It's  really  critical  to  be  able to  hit  a  target tint  strength, which  is  the  measurement of  how  quickly  that  color  will  change as  we  add  a  certain  amount of  a  particular  toner. We  need  to  be  able  to  control  that and  hit  it  reproducibly to  achieve  the  wide  spectrum of  colors  we  need  to  achieve. We  also  had  a  few  issues in  terms  of  poor  heat  age  stability and  poor  resin  reproducibility. Our  approach  was  to  carry  out some  sequential  DOE's to  learn  how  to  control  tint  strengths and  some  of  the  other  factors. I'm  showing  the  progress  on  this  plot  at the  bottom  left-hand  side  of  this  screen. Before  we  started  the  DOE's, just  some  of  the  exploratory  experiments; the  orange  bar  represents  the  range of  tint  strengths  we  were  able  to  achieve. We  can  see  that  is  far  below the  target  range  of  tint  strengths  shown by  this  green  bar  on  the  plot. As  we  carried  out  the  DOE's, we  learned  how  to  control  tint  strength. We  were  able  to  increase  it until  towards  the  end  of  the  project when  we  were  doing our  optimization  DOE's, we  were  nicely  centered around  this  target  tint  strength. We  were  able  to  build  predictive  models and  use  those in  conjunction  with  predictive  models for  some  of  the  other  key  properties to  identify  white  space  where  we  met  all of  the  target  properties  at  the  same  time. But  rather  than  talk about  the  whole  project, I  want  to  now  focus on  one  particular  DOE  that  we  carried  out. The  goal  of  this  DOE  was to  confirm  and  quantify something  we'd  observed  previously, that  the  particle  size of  the  resin  we  were  making was  a  big  factor in  controlling  tint  strength. These  resins  are,  in  effect, dispersions  of  little  particles of  resin  in  water, and  it  was  the  size  of  those  particles that  seemed  to  be  important. We  were  also  using what  we  call  a  co-surfactant to  help  disperse  those  particles, and  we  had  a  few  choices about  where  in  the  process we  could  add  that  co-surfactant. We  wanted  to  look at  a  couple  of  candidates for  the  addition  point of  that  co-surfactant to  see  if  it  affected  the  key  properties. Then  finally,  up  until  this  point, all  the  resins we've made, we've  made  at the  Coatings  Innovation  Center. 
We  now  wanted  to  check, could  we  make  these  resins  reproducibly across  three  different  locations? The  DOE  we  carried  out  is  shown on  the  right-hand  side  here. We  have  three  levels for  our  target  particle  size. We  have  two  levels  for  the  addition  point of  the  co-surfactant. That  gives  us a  full  factorial  DOE  with  six  runs. Then  we  replicated  that  DOE across  three  different  laboratories. I'll  go  straight  into  JMP and  I'll  show  you what  the  data  table  looks  like. You  can  see  here  we  have the  original  data  table,  the  DOE, but  now  we  have a  whole  collection  of  data that  we  gathered  during  the  DOE. The  first  thing  we'll  do  is, we'll  look  at  what  we  learned about  tint  strength. I've  already  built  here a  reduced  model  for  tint  strength. If  we  have  a  look at  the  effect  summary  to  start  with, we  can  see  that  the  location of  addition  of  the  co-surfactant wasn't  a  factor in  determining  tint  strength. That  dropped  out  of  the  model. But  we  do  see that  the  target  particle  size and  the  reactor  location  were  factors, as  well  as  the  interaction between  target  particle  size and  reactor  location. If  we  look  up  at our  actual  by predicted  plot, we  can  see  it  looks like  a  pretty  nice  model. We've  got  a  nice  R-square, and  everything  looks  to  be in  pretty  good  shape. Then  probably  the  best  way of  understanding  what's  happening and  what  this  model  is  telling  us is  to  look  at  the  prediction  profiler here  at  the  bottom. We  see  our  anticipated  effect of  target  particle  size  on  tint  strength. As  we  increase  target  particle  size, we  get  higher  tint  strength. Then  if  we  look  across at  reactor  location, what  we  see  is  that   Lab A  and  Lab C are  giving  broadly  similar  results. But  if  we  look  at   Lab B, first  of  all, we  see  that  the  tint  strength that  we  get  from   Lab B is  significantly  higher than  we  were  getting  from   Lab A  or   Lab C. We  also  see that  the  dependence  on  particle  size is  much  less  from   Lab B than  we  saw  from  the  other  two  labs. This  was  a  problem  for  us. Whenever  we  see  that  different  labs are  producing  different  results with  the  same  resin and the  same  process, it  can  be  a  really  long  task to  work  out  what's  going  on  here. There's  so  many  potential  candidates for  the  cause of  this  poor  reproducibility. At  this  stage, we  were  very  concerned that  it  was  going  to  take  us a  long  time  to  resolve  this, that it  was  going  to  derail  the  project, and  we're  going  to  miss our  target  launch  dates. Before  we  went  into  any  specific  activity to  try  and  address  this  problem, the  obvious  first  step  was  to  look at  the  data  that  we'd  already  gathered in  this  data  table and  see  if  there  were  any  clues that  could  maybe  give  us  a  hint as  to  why   Lab B was  giving  different  properties. Whenever  I  see a  wide  data  table  like  we've  got  here, one  of  the  first  tools  that  I  always  go  to is  the  column  switcher. The  way  in  this  case that  I  will  implement  this is  the  first  step is  to  build  a  variability  chart that  best  shows the  problem  that  we're  having. 
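For reference, the variability chart and column switcher workflow about to be shown interactively can also be launched from JSL. This is a minimal sketch using the column names as described in this study; treat the exact names as illustrative:

dt = Current Data Table();
// Variability chart of tint strength by reactor location and target particle size
vc = dt << Variability Chart(
	Y( :Name("Initial Tint Strength") ),
	X( :Name("Reactor Location"), :Name("Target Particle Size") ),
	Connect Cell Means( 1 )
);
// Column switcher: keep the plot layout, swap the response for any other measurement
vc << Column Switcher(
	:Name("Initial Tint Strength"),
	{ :Name("Particle Size"), :Name("Conductivity"), :Name("pH at 1 Hour"), :Name("Molecular Weight") }
);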
I've  pre-built  a  variability  chart  here where  I've  got  target  particle  size and  reactor  location  as  my  X-axis and  I've  got  the  initial  tint  strength as  my  Y-axis. The  first  task  is  to  get  this into  a  format  that  the  best  represents the  problem  we're  dealing  with. The  first  thing  I'll  do is  swap  over  my  target  particle  size and  react or  location. I'll  also  add  and  connect  the  cell  means to  add  some  lines  here. Now  I'm  pretty  happy  with  this. I  think  this  nicely  reflects the  problem  that  we're  dealing  with. We  can  see   Lab A  and   Lab C very  similar  results, but   Lab B,  higher  tint  strength and  less  dependence on  tint  strength  and  particle  size. Now  I  can  use  my  column  switcher, and  what  this  will  allow  me  to  do is  keep  this  plot  in  exactly  this  format, but  quickly  switch  out  this  Y-axis, the  initial  tint  strength, for  any  other  variable that  I've  got  in  my  data  table. I'll  go  into  the  redo  platform and  select  the  column  switcher. Now  I  can  select any  of  the  other  factors  in  my  data  table. I'm  just  going  to  select  everything that  I've  got  in  my  data  table. Then  when  I  hit  OK, I  now  have  this  column  switcher to  the  left  of  my  plot. I  can  click  on  any  of  these  factors and  it  will  change  this  axis  but  keep the  plot  in  exactly  the  same  format. If  I  select  particle  size, I  can  see  now  I'm  plotting my  actual  measured  particle  size against  target  particle  size and  reactor  location, exactly  the  same  format. It  looks  like  in  this  case, all  three  labs  are  giving pretty  similar  results. I'm  not  seeing  anything  that  gives  me a  clue  as  to  what's  going  on, but  I  can  quickly  just  scroll through  this  whole  data  set. I'm  seeing  mostly  noise  right  now. I'm  not  seeing  any  patterns that  seem  to  be  particularly  helpful, but  I'll  keep  going. When  I  get  to  this  plot  here, so  now  I'm  plotting  conductivity, I  see  interestingly that   Lab B  is  making  resins with  much  higher  conductivity than  Lab A  and   Lab C. That's  one  useful  observation. I'll  keep  going. This  next  one, this  is  actually  another measurement  of  conductivity after  the  resin  has  been in  a  hot  room  for  a  week, showing  the  same  thing, still  confirming  that   Lab B is  giving  higher  conductivity. I'll  keep  going. Mostly  noise, maybe  a  little  bit  of  an  indication that  the  molecular  weight from   Lab B  is  slightly  lower. I'll  keep  going. Again,  still  not  seeing  anything that  interesting,  mostly  noise. But  then  I  get  to  this  plot  here, and  again, now  we're  plotting  the  pH of  the  resins  one  hour  into  the  process, so  early  into  the  process, the  acidity  or  pH  of  the  resin. Lab B,  again, is  different  from   Lab A  and   Lab C. It's  giving  me  much  higher  pH. Keep  going  just  to  check if  there's  anything  else. This  was  the  initial  plot we  started  with  of  initial  tint  strength, and  then  the  last  one is  the  paint  viscosity, where  everything  looks  pretty  similar. Really  quickly  using  column  switcher, I  found  out  that  not  only  is   Lab B  making resins  with  higher  tint  strength, it's  making  resins with  higher  conductivity and  higher  acidity,  higher  pH. What  could  that  be  telling  us? What  might  be  causing  higher  pH and  higher  conductivity? 
Well, these resins, I said, were a dispersion of a polymer particle in water. Anything that's changing the conductivity or the pH is going to be in the water phase. It's not going to be in the resin phase. What we did was we precipitated out the resin by centrifuge and just analyzed the water phase. We carried out a lot of analysis, but one of the things we worked on, I'm showing on this plot on the right-hand side: PPMs, parts per million, of phosphorus and sulfur in that water phase. If I look at the orange bars to start with, the sulfur, I can see all of the resins from all three labs are very similar, but the blue bars, the level of phosphorus, Lab B is making resins with about four times as much phosphorus as they were making from Lab A and Lab C. When we looked at the recipe for making these resins, there's only one raw material that brings in phosphorus. On a bit of further investigation, what we found out was the supplier that was delivering this material to Lab B was mistakenly supplying something that was four times as concentrated as it should have been, and four times as concentrated as they were providing to Lab A and Lab C. With the auxiliary data that we looked at in this DOE, using the column switcher, we were able to really quickly pinpoint the cause of that problem. We didn't have to expend time to get there. The project stayed on track, and there was even a bonus. We learned that increasing the level of this material with the phosphorus was another tool we had to increase the tint strength. We would probably never have been aware of that if we hadn't carried out this analysis and had this happy accident. That's the first example. We go on to the second example now. In this case, we're dealing with a protective coating, a coating that's designed to go over metal substrates like iron and steel and protect them from corrosion. We have five experimental resins that we want to look at: a resin that's designed to give good corrosion resistance, and then four resins that are designed to improve the flexibility of the coating. The first three of these resins are added in the first stage of the coating preparation, and then the last two get added in a separate, later step. We have two questions here we're trying to answer. How do the resins affect corrosion and flexibility, and what is the best combination of the levels of these resins to give us the best combination of corrosion and flexibility? Again, we used DOE, we were able to build predictive models, and here we were using the mixture profiler to identify some white space that we can work in. This DOE is a little bit more complicated than the first one, so I'm trying to represent pictorially what we were dealing with here. If we look at the first stage of our coating manufacture, in addition to our three experimental resins, we have a main resin at a fixed level. In effect, our three experimental resins are three mixture variables here because they form the rest of this 100%. They add up to a constant sum of 57.77%. That's the three mixture variables we're dealing with there.
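To put that stage-one structure into symbols (the resin labels and individual ranges here are generic placeholders, not the actual formulation limits), with the main resin held at its fixed level, the three experimental resins behave as a constrained mixture:

$$x_1 + x_2 + x_3 \;=\; 57.77\%, \qquad L_i \le x_i \le U_i \quad (i = 1, 2, 3),$$

and the multi-component constraints described next, such as the bound on the sum of two of the flexibilising resins, enter as additional linear inequalities over these factors and the two stage-two factors.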
Then in stage two, we can deal with our other two resins as independent variables because they're not part of that mixture. So we have three mixture variables and two independent variables. We also have some levels that the formulators decided they wanted to work in, based on prior experience, for all of these resins. Then we have some constraints on the combinations we're dealing with. For example, at the start here, we want the sum of Flex2 and Flex3 to be more than 10% but less than 30%. There are some other constraints as well. A fairly complicated DOE, but using custom design, it's relatively straightforward to build this DOE. There are definitely some tips and tricks in terms of how to build the DOE, what model to use, and how to analyze that data. I don't have time to go through that today, but I'd be perfectly happy to talk about that offline if anybody's interested. But let's go straight into JMP and we'll look at this example. Here we have the DOE that we carried out. It was a 16-run DOE. If we go right across to the left-hand side, we have our three mixture variables and our two process variables. We've measured our flexibility and corrosion, and then we have a lot of other auxiliary responses we've measured. I was able to build good predictive models for flexibility and corrosion. What I'm going to do is just show you those models in the profiler, just to help us understand what we're learning and what's going on. I'll add those two predictive models that I built to my profiler, and then I get my profiler here. I can see, first of all, I'm plotting flexibility and corrosion here. Lower numbers are better for both of these responses. Lower numbers for flexibility and lower numbers for corrosion are what we're targeting. I can see that as I add my corrosion resin, if I increase the level, I get better corrosion performance, but unfortunately, I get worse flexibility. The opposite is true for most of these flexibilising resins. As I add more of these, I'm getting better flexibility, but worse corrosion. This is something that's very common in coatings development and lots of other areas. It seems like there's always a pair of properties where if we improve one of them, we always make the other one worse. But if I come across to my Flexibiliser 4 resin, something really interesting happens here: as I add more of this resin, I get better flexibility, but I don't suffer at all in terms of corrosion. This is going to be a really useful tool for us to optimize the combination of flexibility and corrosion. But I'd like to understand a bit more about the science behind this. What's happening? What's unusual about Flex4 that allows us to improve our flexibility without degrading corrosion? Again, I want to use all of this auxiliary data that I've gathered in my data table to help me understand that. I want to look through this table, and I'm going to use a different tool this time. I'm going to use Multivariate. If I select that, this allows me to basically look at the correlation between all the combinations of factors that are in my data table.
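Outside JMP, a rough Python sketch of the same kind of pairwise-correlation table, sorted by significance probability as in the walkthrough that follows, might look like this. The data frame and all of the column names below are hypothetical stand-ins for the real 16-run table, not the actual data.

```python
# Sketch only: a pandas/SciPy analogue of a pairwise-correlation table sorted
# ascending by significance probability. All column names are hypothetical.
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Flex4 level": rng.uniform(0, 10, 16),
    "Secondary TG": rng.normal(20, 8, 16),
    "Primary TG": rng.normal(60, 5, 16),
    "Corrosion": rng.normal(5, 1, 16),
    "Flexibility": rng.normal(3, 1, 16),
})

rows = []
for a, b in combinations(df.columns, 2):
    r, p = pearsonr(df[a], df[b])                      # correlation and its p-value
    rows.append({"Variable": a, "by Variable": b, "Correlation": r, "Signif Prob": p})

pairs = pd.DataFrame(rows).sort_values("Signif Prob")  # smallest p-values first
print(pairs.to_string(index=False))
```

In JMP itself, the Multivariate platform produces this table directly, as shown next.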
I'll select everything that I measured, add it in the Y columns, and just hit OK. This generates my multivariate report. The first thing I see is this table here where I've got all the correlations for all the pairs of combinations of the factors that I put in. I can see there are some pretty nice correlations here. I'm seeing some fairly strong correlations, but it's a little bit difficult, a bit overwhelming, to go through all this and pick out any interesting patterns. I've also got my scatterplot here, and if I add a fit line to these scatterplots, again, I'm seeing some fairly strong correlations, but still I think this is a bit overwhelming to dive straight into. The tool that I like to use to start with here is pairwise correlations. If I select that, this generates a new table where I've got all the possible pairs of variables and it's giving me the correlation. I can sort this table based on any column. I'm going to sort by the significance probability and I'll make it ascending, because I want my low significance probabilities to be at the top of my table. Then if I hit OK, I can see that the first and strongest correlation I get, in fact, involves this Flexibiliser Resin 4 that was giving us this interesting behavior. I can see a strong correlation with the secondary TG, or TG2. This is a glass transition temperature. The glass transition temperature is the temperature at which a coating changes from being a glassy, hard material to a soft, rubbery, flexible material. My Flex4 level is correlating here with a secondary glass transition temperature that I'm measuring. And I can see also, if I go a little bit further down, that my primary glass transition temperature, the main glass transition, correlates strongly with the corrosion. Scientifically, I think these are interesting observations. What I did based on that is I also built predictive models for my primary TG and for my secondary TG. Now I can look at my profiler, but I can include all four of my predictive models. I'll include the two I did before, flexibility and corrosion, but also my primary TG and secondary TG. Now what I can see is that the first two rows are exactly what we were looking at before. If I look at my primary TG, I can see what happens whatever I do in terms of adding new resin: for example, if I add more of my corrosion resin, I'm increasing my primary TG, and that's correlating with an improvement in corrosion. The flexibilising resins, if I add more of those, I'm decreasing my primary TG and making my corrosion worse. That primary TG does seem to correlate, as the multivariate is showing, very well with corrosion. If I look at my Flex4 resin, it was having no effect on corrosion and it's having no effect on my primary TG, so it's different from my other flexibilising resins. But I can see for my secondary TG, as I add more of my Flex4, it's rapidly decreasing the secondary TG. The other resins really don't have much effect on secondary TG. What does that mean? What can I learn from that?
Well,  any  material  that  has  multiple  TGs, glass  transition  temperatures, it's  usually  a  sign  that  it's a  multi-phase  raw  material. It's  not  a  homogeneous  material. That  was  the  case  here when  we  did  some  microscopy. What  we  saw  was  our  coating  had a  continuous  phase  shown by  this  gray  material  here, but  it  had  dispersed in  that  a  secondary  phase. The  primary  glass  transition  temperature was  correlating with  that  primary  continuous  phase and  the  secondary lower  glass  transition  temperature was  correlating  to  this  secondary  phase that  we  have  here. We  had  a  hard  glassy  primary  phase and  then  a  soft  rubbery  secondary  phase. Why  that's  important  is usually  high  glass  transition  temperature does  lead  to  better  corrosion because  it  inhibits  the  diffusion of  anything  through  this  layer and  stops  material  getting to  the  substrate, the  metal  substrate, and  causing  corrosion. Usually,  if  I  want to  make  flexibility  better, I  have  to  make this  continuous  layer  softer and  that  degrades  corrosion. But  with  this  type  of  morphology, I  was  able  to  keep my  hard  continuous  phase and  gain  flexibility  through a  separate  dispersed  rubbery  phase. This  meant  that  anything  that  wanted to  diffuse  through  the  coating and  cause  corrosion  was  always  having to  diffuse  through  this  high  TG  area. It's  given  me  the  combination of  good  corrosion and  good  flexibility  together. The  auxiliary  data  that  I  gathered was  really  responsible... The  analysis  to  that  was  responsible for  the  learning of  what  was  going  on  in  this  system. In  conclusion, it's  definitely  possible to  carry  out  successful  DOEs where  we  only  measure the  critical  responses,  the   big Ys. But  I  hope  I've  shown  that  including carefully  selected  auxiliary  responses, little Ys  can  often  be  really  valuable, can  bring  clarity  to  unexpected  results, and  it  can  help  us  to  build scientific  knowledge. I  hope  I've  also  shown  that  JMP  provides some  tools  that  really  help  us  with  this. I've  shown  a  couple,  but  there  are many  more  that  are  available. I'd  finally  like  to  finish  off by  thanking  the  many  associates at  PPG's  Coatings  and  Innovation  Center who  contributed  to  this  work.
More than any statistical software, JMP and JMP Pro make tremendous use of model selection criteria, such as the AICc and BIC. These tools can be used by practitioners in all industries and at all skill levels, from users choosing a distribution for a capability analysis, to advanced users choosing input factors in a linear mixed model or a functional DOE analysis. Model selection criteria are incredibly flexible and powerful, yet make it easy to decide between very different sets of predictor variables, response distributions, and even correlation structures, all at the same time.

Unfortunately, the full story of how and when to use these criteria is not part of most standard data science courses in universities and professional training. One reason for this omission is that, unlike JMP, many software packages implement model selection criteria in an incomplete or arguably incorrect way, making it impossible to compare models with different input variables.

In this presentation, we give clear guidance on how and when to use model selection criteria. We describe their motivation and the assumptions they require. We compare model selection criteria to other better-known approaches to selecting models, such as hypothesis tests and holdout-based cross-validation procedures. We also give a brief story of how JMP Statistical R&D developers came to appreciate how useful these tools are, as we sought a general solution to the distribution dredging problem.

The most famous quote in all of statistics is George Box's "All models are wrong, but some are useful." I've heard this quote at almost every conference I've ever been to, and because of this, to my recollection, I've actually avoided using this quote in any talk before. But when I looked up the first time it was ever set in print, it was in a 1976 Journal of the American Statistical Association article. It's found in a section called Parsimony. Immediately after that first instance of the quote, he talks about the importance of finding the simplest model that describes the observed phenomena. This amounts to finding models that offer a reasonable balance of goodness-of-fit versus model complexity, and it is exactly what I'm going to be talking about today in this presentation. JMP and JMP Pro offer a lot of different modeling capabilities, each with a lot of output related to choosing a model. Today I'm going to go into some detail on some of the most important of these, highlighting their motivation and the assumptions behind them. A lot of the discussion will be about the AICc and BIC model selection criteria, which are direct and very data-efficient tools for addressing the problem Box had in mind with his quote, which is how to find a useful model from a set of flawed or wrong ones. As I was putting this presentation together, I went through the derivations of the AIC and the BIC. I wanted to get a clear understanding of what these similar-looking methods really are and what assumptions they make. Afterwards, out of curiosity, I did an Internet search of AIC versus BIC versus cross-validation. It was interesting to see in all these Internet forums that there is still so much debate, even though these methods have been around for 50 years.
Having recently reviewed the derivations of the methods, it looks like there are still a lot of misconceptions out there. I think the reason for this is that both model selection criteria have very deep and technical derivations despite the simplicity of their formulas: both of them are equal to minus two times the log likelihood of the fitted model, plus a simple penalty based on the number of model parameters. You can't guess the reasons for the penalty terms from the formulas alone, which makes them seem mystical and arbitrary. One of my goals today is to try to demystify these without going overboard on the math. To put this all in the context of an analysis workflow, we can think of an analysis project as having four major steps. We first have to acquire the data and get it organized and cleaned up. Then we fit several models to it in a way that is either manual or automated by software like JMP or JMP Pro. Once we've done that, we need to choose one of them as the model that we're going to work with moving forward. This is a critical step in the process and the one we'll be focusing on today. It's important that we get the model selection right, because the quality of the results and the conclusions we make at the end requires that we have a reasonably good model. Here are the main ways that I've seen people make decisions about models. Hypothesis testing is probably the first one people learn about. These tests are most commonly used to determine if a regression coefficient is statistically significantly different from zero, which sounds like a model selection problem. While they are often used in that way, hypothesis tests are derived under a specific set of assumptions that explicitly does not account for having changed the model or having used a model that was chosen as the best amongst several alternatives. Then we have the general empirical procedures that assess models based on data held out from the model fitting process. These techniques can be applied to both classical statistical models and machine learning models. In my opinion, holdout validation in particular is the way to go if you have a whole lot of data. Then we have what I'll call the small-data analytical procedures. These were derived for situations when you have to make a decision about which model to use, but you don't have enough data to hold out any observations. The most commonly used of these are the AIC and the BIC, but there are other well-known techniques like Generalized Cross-Validation and Mallows' Cp. It turns out that these two are actually asymptotically equivalent to the AIC, so in large samples you should get the same conclusions from GCV, Mallows' Cp, and the AIC, at least for least squares-based models. Then we also have model-specific approaches, like VIP in partial least squares models and the cubic clustering criterion in clustering models. These are pretty niche and I won't be talking about them any more today. Then we also have visual tools like actual by predicted plots and ROC curves.
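Since the AICc and BIC will come up constantly from here on, it helps to have the two formulas in front of us. With $\hat{L}$ the maximized likelihood, $K$ the number of estimated parameters (including any variance parameters), and $n$ the number of observations:

$$\mathrm{AIC} \;=\; -2\log\hat{L} + 2K, \qquad \mathrm{BIC} \;=\; -2\log\hat{L} + K\log n.$$

Smaller is better for both.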
Regardless of how you choose your model, visual plots like these are good to take a look at before moving forward, because they provide more interesting information than any individual statistic will, and they can tell us whether the best model we've considered so far is good enough for us to use. My own first encounter with model selection criteria in my professional life was back in the mid-2000s, around when JMP 5 and JMP 6 were out. JMP had added the ability to provide capability analyses for non-normal distributions. Capability analysis is a very important tool for assessing whether a manufacturing process is "capable" of delivering products that are within specification. JMP users wanted to determine the "best distribution" for the data so their process capability metrics would best reflect the reality of their situation. JMP customers understood that you could fit different distributions with JMP and knew that many of the distributions came with a goodness-of-fit test, a case of having a hammer causing you to find nails everywhere. They were trying all the distributions they could find and were choosing the one with the largest p-value as the distribution for their capability analysis. They wanted us to codify this into a new fit all distributions feature that would automate this process for them. But we were rather uncomfortable with this request for a number of reasons. For one thing, the different distributions fit in JMP came with different kinds of goodness-of-fit tests. The normal had a Shapiro-Wilk test, the Weibull had a Cramér–von Mises test, and the LogNormal had a Kolmogorov test. It's very strange to compare tests that are rather different from one another. Another problem with this approach is that distributions with more parameters are going to tend to have an edge on those with fewer. If we choose the distribution based on the largest p-value, it will always favor distributions with more parameters, as we see here with the two-parameter normal compared with the four-parameter Johnson Su distribution. Then for some of the distributions, like the Weibull's Cramér–von Mises W test, we only had table values of p-values going up to something like p = 0.25. But even if we consolidated all the goodness-of-fit tests down to just one and got accurate p-values for all of them, there's still a larger philosophical issue at stake, and that's that hypothesis tests like these can only quantify evidence against the null hypothesis. If the null hypothesis is true, then the p-value is a uniformly distributed random variable. In other words, if the null hypothesis is true, then the probability that the p-value is between 0.1 and 0.2 is exactly the same as the probability that it is between 0.8 and 0.9. Seeing a p-value of 0.9 isn't more evidence that the hypothesis is true than a p-value of 0.3. Returning to our example, all four of these distributions have goodness-of-fit p-values larger than 0.05, so through this lens all four of them fit the data reasonably well.
But even though the goodness-of-fit tests say all the distributions are acceptable, the conclusions about the process generating the data are different depending on the distribution. If you use a Ppk reference value of 1.33 to determine if the process is capable, then choosing the Weibull indicates that the process is not sufficiently capable to meet the specifications, whereas the other distributions indicate that the process is capable. We recognized that there had to be a better way to determine the distribution automatically, and came to the conclusion that this should be seen as a very basic kind of model selection problem. In our search for a sound method for choosing a distribution, we stumbled upon this very good book on model selection by Burnham and Anderson. They give careful derivations of the AIC from the perspectives of information theory and cross-validation. They also give a derivation of the BIC and show how the AIC can be derived in the same way with a different assumption about the prior distribution. Burnham and Anderson also carefully show that hypothesis testing is rather incoherent as a model selection strategy. The book had a pretty big impact on my own views of modeling and also on JMP's statistical modeling platforms. Returning to the distribution selection problem for the moment, when we went ahead and added a distribution selector, we ended up calling it Fit All, and we based it on the AICc. Here on the left, we have two distributions fit to the capability analysis data we were looking at before, the normal and the Johnson Su. The Johnson Su's goodness-of-fit p-value is larger than the normal's because it has two more parameters than the normal distribution. Now on the right, we see the results of a Fit All using the AICc. The normal comes out as the best-fitting distribution, while the Johnson Su is near the bottom. This is because the AICc is penalizing it for having those two extra parameters. This feature has now been used many, many times and I believe people are generally pretty happy with it. Now I'm going to go through a somewhat mathy but hopefully accessible explanation of what the AICc really is. All right. Now I'm going to go into some basic theory behind the AIC. I'll be as brief as possible and use the best analogies I can, but I think it is important to be exposed to the underlying concepts so you can see that the AIC has a rigorous foundation that has some sense to it. The AIC-type selection criteria are based on a distance-type metric between probability distributions called the Kullback-Leibler or KL divergence. It quantifies the amount of information lost by using probability distribution two when probability distribution one is the correct one. The KL divergence has the property of always being greater than or equal to zero, and it is only equal to zero when the two probability distributions are the same. This is to say that using the wrong distribution always leads to a theoretically quantifiable, strictly positive information loss.
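In symbols, for a true density $f$ and a model density $g$, the KL divergence just described is

$$D_{\mathrm{KL}}(f\,\|\,g) \;=\; \int f(x)\,\log\frac{f(x)}{g(x)}\,dx \;\ge\; 0,$$

with equality only when $f$ and $g$ are the same distribution.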
This is pretty heady, abstract stuff, so I'm going to translate it into the language of statistical modeling. When we use data in statistics to learn about how something works, we are explicitly or implicitly fitting probability models to the data to approximate the true model that generated it. If we knew the true probability-generating mechanism, we could use the KL divergence to quantify how far, or how wrong, the model is from the truth. We could then try several models and find the one that is the closest to the truth. Akaike recognized this, plugged the true and the model probability formulas into the KL divergence formula, and used a little algebra to see that the KL divergence had two terms. The first term only contains the true probability-generating mechanism for the data, which we can never know since we can only work with models. However, this is a constant that is the same for all models that you fit to the data, as long as we play by a couple of simple rules. The second term is what Akaike discovered is empirically estimable, and with a lot of math he found a simple formula to estimate this second term. In particular, he discovered that two times the KL divergence is equal to a constant that is the same for all models, plus two times the negative log likelihood of the data used to fit the model, plus two times the number of parameters. Everything has been multiplied by a factor of two just to follow the same convention as a likelihood ratio test. Since the constant term is the same for all models as long as we don't change the response data, we can fit several models, and the one whose AIC is the smallest is the one that is estimated to have the smallest KL divergence from the truth, and in a sense is the one that is the least wrong. Using the AIC for model selection is entirely analogous to there being a collection of islands and you want to know which of the islands you know of is closest to another island that you know you'll never be able to get to. The direct solution to this problem would be to calculate the distances from each of the islands to the one that we want to get close to. Now, what if the island we wanted to get close to was surrounded by a circular high fence that we could approach? The island is perfectly in the middle of the fence, so the distance from the center of the island to the fence is always the same. But the fence was far enough away from the island it enclosed that we couldn't see the island or measure the distance from the fence to it. We can still estimate the distance from each island to the fence, and because the main island is in the center of the fence, we know that the island closest to the fence is the closest island. This is exactly the situation with the AIC. With the AIC, we can estimate the distance from the truth to each of the models. Each AIC estimate is off by the same amount. While we can't estimate the absolute distance of the models from the truth, we can know which model is the closest in a relative sense.
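Written out, and glossing over the regularity conditions, the decomposition just described is

$$D_{\mathrm{KL}}(f\,\|\,g) \;=\; \underbrace{\mathbb{E}_f\!\left[\log f(X)\right]}_{\text{same constant for every model}} \;-\; \mathbb{E}_f\!\left[\log g(X)\right],$$

and Akaike's result is that, for a fitted model with $K$ estimated parameters, the estimable piece satisfies roughly

$$-2\,\widehat{\mathbb{E}_f\!\left[\log g(X;\hat\theta)\right]} \;\approx\; -2\log\hat{L} + 2K \;=\; \mathrm{AIC},$$

so ranking models by AIC ranks them by estimated KL distance from the truth, up to that shared constant.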
The original AIC is based on the likelihood of the training data plus a parameter penalty. The training likelihood assesses the goodness of fit of the model. We can't use this term by itself, though, because it is biased downward, as the model parameters were chosen to minimize the negative log likelihood. With a lot of math, Akaike derived a very simple expression that corrects for this bias. The original penalty is just 2K, where K is the total number of estimated parameters. For linear regression with a slope and an intercept, we also have to count the variance, so for that case you would have K equals three and not two. There are important assumptions that led to the 2K penalty. We can characterize them loosely as saying the model has to be reasonably good. The AIC is still going to be robust, however, because if a model is bad, then the likelihood component will be large and will dominate the penalty. Amongst the good models, the 2K term will favor the smaller models as long as the sample size is large. However, it didn't take long for people to find that this original AIC often chooses models that overfit in small samples, so a more accurate, higher-order approximation to the bias was derived. When this extra term is added, the criterion becomes known as the AICc, or the corrected AIC. Unfortunately, the reputation that the AIC overfits had become commonplace before the correction was discovered and widely known about. The correction becomes infinite as K approaches N, pushing the model selection criterion away from models that are nearly saturated. Notice also that the correction term goes to zero as N goes to infinity, so in large samples the AIC and AICc are equivalent. The AICc is what we report in JMP because it works well for small samples, and although it was derived for Gaussian distributions, experience suggests that it's good enough with other commonly used distributions. Now I'm going to illustrate the AICc on a real example, a five-factor central composite design with 31 runs, where the response was the amount of pDNA produced by a bioreactor. I'll illustrate the AICc using the Generalized Regression platform, giving it a full response surface model with all main effects, interactions, and second-order terms. I fit four models to the data. One is a full response surface model using least squares that was fit automatically. Then I used forward selection under the normal, logNormal, and exponential distributions. I chose the exponential distribution to illustrate poor model fit. The models had 22, 9, 9, and 1 parameters respectively, and the model with the lowest AICc was the logNormal, with an AICc of about 334.8. We can break the AIC and AICc calculations down to see how different parts of the penalty are contributing. The full least squares model has the lowest likelihood, but the highest AICc overall. When we look at the second-order corrections and the original AIC values, we see that it's the second-order correction term that is pushing the model selection criterion to be very large for this model.
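For reference, the corrected criterion behind the numbers in this breakdown, and in the comparisons that follow, is

$$\mathrm{AICc} \;=\; -2\log\hat{L} \;+\; 2K \;+\; \frac{2K(K+1)}{n-K-1},$$

where the last term blows up as $K$ approaches $n-1$ (which is what pushes the 22-parameter full model so high at $n = 31$) and vanishes as $n \to \infty$.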
The logNormal forward selection log likelihood is a little lower than the normal forward selection one. They both have nine parameters, so their penalties are the same, and the logNormal forward selection model has the lower AICc. The exponential forward selection model has the poorest model fit as measured by the log likelihood, but it also only has one parameter in the model, so overall it has the smallest penalty contribution to the AICc. But the poor fit of the model is such that the likelihood dominates, and the exponential model is the second from the worst as measured by the AICc. If you review the general derivation of the AIC in the Burnham and Anderson book, you'll see that what it actually estimates is the expected value of a hypothetical test set likelihood for a data set that has the same size and response structure, but not the same values, as the training set. The expected values also take into consideration the variability in the estimate of the MLE. I find this cross-validation interpretation of the AIC to be pretty compelling. I think it's also important to point out that this cross-validation derivation of the AIC does not assume at all that we have the correct model. To show that this cross-validation interpretation really works, I created a simulation formula using an average of the models I've shown in the previous slides as well as some other ones. This way we knew that none of the models was actually the correct one. I fit each of the four models to new training data a thousand times and set it up so that JMP would report an independent holdout likelihood using another new data set. I kept each of the four models' structures and distributions intact and did not apply variable selection. This was to perfectly mimic the exact cross-validation interpretation of the AIC. From there, I created a table of simulated holdout likelihoods and computed their average for each of the four models. This is the AIC and AICc summary table from before, with the simulation-based average holdout log likelihoods added over here to the right. You can see that the full normal model holdout likelihood is very close to its AICc value, and that the second-order correction term was essential for this match to happen. You can also see that the simulated average exponential holdout log likelihood is very close to its AICc. Both the normal and logNormal holdout likelihoods are close to the original logNormal model's AICc. The normal holdout likelihood is a little smaller. I attribute this to averaging a bunch of simulation models, making the simulated data a little bit more normally distributed than the original data was. There are a couple of simple rules that are needed to make AICc comparisons really valid between different models. The most important is that the stochastic part of the data has to stay the same: the same rows have to be used, and it is the Y's in particular that must be the same.
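Before finishing the list of rules, here is a rough sketch, outside JMP, of the kind of holdout-likelihood simulation just described. The true generator, the candidate model, and the sample size are all stand-ins rather than the actual pDNA setup; the point is only that the average AICc and the average -2 times the holdout log likelihood should land close together when the candidate model is not too far off.

```python
# Sketch only: simulating the cross-validation reading of the AICc.
# A normal model is repeatedly fit by maximum likelihood to training data drawn
# from a generator that is NOT normal (a mildly skewed gamma), then scored on an
# independent holdout set of the same size.
import numpy as np
from scipy.stats import gamma, norm

rng = np.random.default_rng(7)
n, reps, K = 31, 2000, 2          # K = 2 parameters for a plain normal (mean, sd)

aicc_vals, holdout_vals = [], []
for _ in range(reps):
    train = gamma.rvs(a=20, scale=1.0, size=n, random_state=rng)
    test = gamma.rvs(a=20, scale=1.0, size=n, random_state=rng)

    mu, sd = train.mean(), train.std(ddof=0)            # normal MLEs
    ll_train = norm.logpdf(train, mu, sd).sum()
    aicc = -2 * ll_train + 2 * K + 2 * K * (K + 1) / (n - K - 1)
    aicc_vals.append(aicc)
    holdout_vals.append(-2 * norm.logpdf(test, mu, sd).sum())

print("mean AICc:              ", np.mean(aicc_vals))
print("mean -2*holdout loglik: ", np.mean(holdout_vals))
```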
Continuing with the rules, the X's can be different, of course, even if they were originally random. Not only must the Y's be the same, they also can't be changed or transformed; a transform would have to be built into the model appropriately. The AIC is also only defined for well-behaved maximum likelihood estimators and other closely related methods. This explains why you don't see the AICc for neural networks and other machine learning models. Also, you have to keep in mind that just because you found a model that the AICc says is the best, it doesn't mean that it is a good model. Use your past experience and model diagnostic plots to ensure that the model is right enough to be useful. Returning to the pDNA data, we see two equivalent models. On the top, we have a logNormal model, and on the bottom we have a normal fit to the log-transformed response. You can see that the generalized RSquares are the same for these two models, but the AICcs are very different. This is because the logNormal fit implicitly builds the transform into the likelihood, but the log-scale normal fit does not. In this case, the right thing to use is the logNormal. Here's a quick demonstration that you have to decide the distribution and the input variables at the same time. Here is simulated data from a t-test type model: two groups of normally distributed data with the same variance but different means. If you run Fit All in the Distribution platform, it chooses the Normal 2 Mixture with an AICc of 1036. This is the correct distribution if you don't know the group identity of the rows. Once you include the grouping variable, though, you see that the normal comes out on top with an AICc of 717 or so. We also tried the Weibull, logNormal, and gamma, and the normal still came out on top, even though those distributions did better in Distribution without including the grouping variable. You have to try different model structures and distributions together to find the right combination. Now I'm going to change gears and talk a little bit about the BIC, which is the other main analytical model selection criterion in JMP. The BIC is motivated in a completely different way than the AIC. Schwarz used a large-sample argument in a Bayesian context to approximate the log probability of the data after having integrated the model out. Assuming a flat prior on the parameters, an expression similar to the AIC pops out, with a K log n type penalty term rather than two times K. There were also other terms in the integral that are always ignored. One is K log 2 pi, which was considered too small to deal with, and the other one is a normalized variance of the MLE, which would also be of order K. I didn't study the AIC or BIC in any depth in school. I just remember hearing the refrain "AIC overfits, BIC underfits" several times in different classes, which I interpreted as a strong skepticism about both of them. Comparing the AICc and BIC penalties, we see that the AICc will prevent big models from being chosen when the sample size is small, whereas the BIC will still allow large models.
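To make that comparison concrete at the pDNA sample size of $n = 31$ (this is just arithmetic on the two penalty formulas, not additional output from the analysis):

$$\text{AICc penalty} \;=\; 2K + \frac{2K(K+1)}{n-K-1}, \qquad \text{BIC penalty} \;=\; K\log n.$$

For $K = 22$, the AICc penalty is $44 + 1012/8 = 170.5$ versus a BIC penalty of $22\log 31 \approx 75.6$; for $K = 9$, the AICc penalty is $18 + 180/21 \approx 26.6$ versus $9\log 31 \approx 30.9$. In this small sample the AICc punishes the big model far harder, while for moderate $K$ the two penalties are comparable.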
I see the K log n and normalization-constant penalty in the BIC as somewhat less compelling than the cross-validation interpretation of the AIC-type penalties. Something that leads to a marginal probability of the data is more abstract to me than something that is directly interpretable as a cross-validation metric taking into account parameter uncertainty. I'm fully aware that I'm editorializing here, but this is what has worked well for me so far. Returning to the pDNA DOE one more time, here are the same models fit in the pDNA example using BIC forward selection on top and the AICc on the bottom. Notice that the BIC of the full normal model is not as far away from the other models as with the AICc. The best model overall as rated by the BIC is a logNormal, but with 13 parameters this time around rather than nine. The forward selected BIC normal model also has a couple more parameters. So in small samples, contrary to the refrain "AIC overfits, BIC underfits," the AICc can choose smaller models than the BIC. Here we see the effects chosen by the BIC and the AICc. The set of BIC-selected effects is a superset of the ones chosen by the AICc. Also notice, interestingly, that all four effects not chosen by the AICc are statistically significant under the BIC. Under the BIC, the pH squared term is highly significant, but it isn't present in the AICc model, for example. I'm not going to say that these effects shouldn't have asterisks by them, since all significant p-values get asterisks in JMP reports; instead, I'll just say that I take p-values of effects after selection with a grain of salt. Although the two models choose different effects, some of them highly statistically significant, if we look at the profiler variable importance from these two models, they tell a very similar story. Feed rate is by far the most important, and after that the ordering is the same between the two models. pH only impacts 3% of the variation in the response surface under the BIC best model and isn't included at all in the AICc best model. This is a very clear example of statistical significance and practical relevance being two different things. There are a lot of opinions out there about the AICc and the BIC. For example, Burnham and Anderson say that both methods are consistent for the quasi-true model as N goes to infinity, but then there are others that say the BIC is the only one consistent for the truth. Burnham and Anderson say that you can set up simulations to make one look good, then change the way it's set up a little bit and it'll flip the results. Burnham and Anderson, who are about the most diehard AICc fans out there, found in their simulations that the AICc chooses fewer really bad models than the BIC. I think it's not a bad idea to look at both the BIC and the AICc after applying variable selection.
If the best models under both are pretty much the same, which is often the case, you can feel pretty good about either of them. If they're different, it's good to think about the reasons why and use your subject matter expertise to help make a decision. My last topic is model selection criteria in linear mixed models. This is a pretty complicated situation, especially because there isn't consensus between software vendors on how to compute the model selection criteria. To illustrate this, I created a split plot design with four factors. There are two whole plot effects and two split plot effects. If you take the same data and fit the same model in JMP Pro and SAS, using Fit Mixed and PROC MIXED, you will see that the likelihoods and model selection criteria don't match, but the variance estimates do. You get different fixed effects parameter estimates, but the fixed effects tests agree. One of the reasons for this is that the JMP and SAS fixed effects design matrices use a different coding strategy for categorical effects. On the left I have the JMP design matrix for the split plot example, and on the right you see the SAS one. JMP creates a row of minus ones for the last level of categorical effects, which is seen in blue here, whereas SAS creates a row of zeros. Neither one of these is right or wrong. It's like changing units or changing coordinate systems. JMP categorical effects sum to zero, whereas SAS categorical effects can be interpreted as differences from the last level. Although the raw parameter estimates differ, predictions will be the same between the two codings because the models are fundamentally equivalent. Most things that matter won't be different between the two software products. However, REML, the method used to estimate mixed effects models, has an ambiguity in it. The base Gaussian likelihood at the top will be the same in either software because it's a real likelihood. But the REML or residual likelihood reported by PROC MIXED and JMP Pro's Fit Mixed isn't a real likelihood. If it were a real likelihood, then we would get the same values regardless of which coding or software we used. This is because there's an extra penalty added to the Gaussian likelihood for REML that reduces the bias of the variance estimates, but this penalty depends on the design matrix in a way that is sensitive to the coding used. JMP reports the raw Gaussian likelihood, and the AICc and BIC that it reports are based on that rather than the residual likelihood. The number of parameters Fit Mixed counts is the total, including both fixed effects and variance parameters. We did it this way so that you can use JMP to compare models with different fixed-effect structures as well as different variance models. SAS only reports the residual or REML log likelihood, and it reports model selection criteria based on it. You can see here that it also only counts variance parameters, because the difference between the SAS likelihood and its AIC is four, implying two parameters, a variance component and a residual.
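For reference, one common way of writing the REML criterion (a sketch of the usual textbook form, not necessarily the exact expression either product prints) is

$$-2\,\ell_{\mathrm{REML}}(\theta) \;=\; \log\lvert V\rvert \;+\; \log\lvert X^{\mathsf T}V^{-1}X\rvert \;+\; r^{\mathsf T}V^{-1}r \;+\; (n-p)\log 2\pi, \qquad r = y - X\hat\beta,$$

and it is the $\log\lvert X^{\mathsf T}V^{-1}X\rvert$ term that drags the fixed-effects design matrix, and therefore the coding convention, into the reported "likelihood" value.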
All this means is that you can only use PROC MIXED for comparing variance models with the AIC, because its model selection criteria include the REML penalty and it only counts variance parameters. With all due respect, I can think of some good reasons for the SAS approach, and there are probably other good reasons I don't even know of, but I personally prefer the flexibility afforded by the JMP approach. To summarize, if you compare results across software for non-mixed models, the mean parameter estimates may differ, but otherwise everything else should be the same, as long as the software computes the constants and the likelihood correctly, as JMP does. When we get to Gaussian mixed models, there are very important software differences, and the scope of the decisions you can make about the models using the software may be very different depending on the details of how its likelihood is calculated. JMP model selection criteria are comparable both within the same platform and across other modeling platforms. I'll close with this slide, which gives my basic recommendations for applying the tools discussed today. Hypothesis testing is a tool for when you need to prove something and is best used in situations when you have a good idea of the model structure in advance. When you're working on a problem in industry and the sample size is small, I would stick to classical statistical models and use the AICc as the primary tool for choosing between them. With larger data sets, when I have enough data to hold out at least a third of the observations, I use holdout cross-validation to compare classical statistical models as well as machine learning models. In my own work, I tend to avoid K-fold cross-validation and its variants. The model selection criteria are equivalent to it in larger samples, and I tend to stick with simpler models with smaller data sets. I know that not everyone is going to agree with me on this, but this is what works for me and is a pretty safe way to approach model selection. Choosing the most useful model from a set of alternatives that must all be wrong on some level is an important decision, and these are the main considerations I have when deciding upon a model selection strategy. Thank you for your attention, and I look forward to talking with you in the Meet the Expert sessions.
Quality by Design (QbD) is a systematic approach for building quality into a product. The Design Space Profiler, new within the Prediction Profiler in JMP 17, helps solve the fundamental QbD problem of determining an optimal operating region that assures quality as defined by specifications associated with critical quality attributes (CQAs) while still maintaining flexibility in production.    We explain JMP’s approach to solving this difficult problem and demonstrate how to use the Design Space Profiler to find these optimal subregions of the design space. The Prediction Profiler in JMP has long been a powerful tool for visualizing and optimizing models; having the Design Space Profiler within the Prediction Profiler makes it an indispensable tool for high-quality product and process innovation.     Hello. My  name  is  Laura  Lancaster. I'm  a  statistical  developer   in  the  JMP  group, and  today  I'm  here  to  talk  about finding  optimal  operating  regions for  critical  quality  attributes with  the   design space profiler. The  first  thing  I  wanted  to  talk  about  is the  general  quality  paradigm   called  quality by design, also  commonly  referred   to  as  QbD  for  short. Quality by design  is  a  systematic  approach for  incorporating  quality into  the  entire  product  lifecycle   beginning  at  the  design  phase. It  was  first  introduced  by  Joseph  Juran, and  it  was  made  popular  in  his  book, Juran  on  Q uality by Design   way  back  in  1992. It  was  a  very  popular  book, and  a  few  years  after  it  was  published, the  FDA  and  the  International  Conference on  Harmonization,  also  referred  to  as  ICH, adopted  these   quality by design  principles for  the  development, manufacturing,  and  regulation  of  drugs. They  published  several  guidelines for  implementing   quality by design in  the  pharmaceutical  industry called  ICH  Q8- Q12,  and  we're  going   to  focus  on  ICH  Q8  guidelines. What  exactly  do  we  mean  by  design  space? Well,  this  is  a  very  important  concept in  the  pharmaceutical  industry for  quality  by  design, and  it's  defined  in  the  ICH  Q8( R2)  guidelines  as the  multidimensional  combination and  interaction of  material  attributes and  process  parameters   that  have  been  demonstrated to  provide  assurance  of  quality. Essentially,  the  design  space  is  what identifies  your  optimal  operating  region that's  going  to  give  you  maximal  flexibility  in  your  production while  still  assuring   that  you  get  a  quality  product. JMP's  new   design space profiler,   new  in  JMP  17, helps  us  find  this  design  space, this  optimal  operating  region. Now  there's  actually  several  steps that  need  to  be  taken  before  you  can  use the   design space profiler to  determine  your  design  space. These  are  outlined  in  the  Q8  guidelines, and  so  we're  going to  run  through  those  steps. The  first  step  that  you  want  to  take is  you  want  to  define   your  quality  target  product  profile, and  this  is  defined   as  a  prospective  summary of  the  quality  characteristics of  a  drug  product that  ideally  will  be  achieved   to  ensure  the  desired  quality, taking  into  account  safety and  efficacy  of  your  drug. Next,  you  want  to  determine  what  are the  critical  quality  attributes and  what  are  their  appropriate specification  limits. 
A critical quality attribute, also referred to as a CQA, is defined as a physical, chemical, biological, or microbiological property or characteristic that should be within an appropriate limit, range, or distribution to ensure the desired product quality. Once you've determined those, then you're going to design experiments to determine what the critical manufacturing process parameters are that affect these critical quality attributes. A critical process parameter, also referred to as a CPP for short, is a process parameter whose variability has an impact on a critical quality attribute and therefore should be monitored or controlled to ensure the process produces the desired quality product. Once you've determined all of that, then you're going to find a good prediction model for your critical quality attributes in terms of your critical process parameters. Once you've done that, then you can use the Design Space Profiler in JMP to determine your design space, your optimal operating region. Let's talk a little more specifically about the Design Space Profiler in JMP. First of all, as I mentioned, it's new in JMP 17, and it resides within the Prediction Profiler. Hopefully everyone is familiar with the Prediction Profiler in JMP. It's a wonderful tool for exploring, visualizing, and optimizing any model that you produce within JMP. The goal of the Design Space Profiler is to determine a good design space by finding the largest hyper-rectangle that fits into the acceptable region defined by your critical quality attribute spec limits applied to your prediction model. Once you have that hyper-rectangle, it gives you the lower and upper limits of your critical process parameters that define a good design space for maintaining a quality product. Now, I just wanted to mention very quickly that a design space does not have to be rectangular, but having a rectangular design space makes it really convenient for checking each factor one at a time. The problem with this approach is that the acceptable region is usually nonlinear, and finding the largest hyper-rectangle in a nonlinear region is a very difficult mathematical problem. How does the Design Space Profiler work, then? Well, instead of using the mathematical approach to finding the largest hyper-rectangle, it uses a simulated approach. It generates thousands of uniformly distributed points throughout the space defined by your initial critical process parameter factor limits. Then it uses the prediction model to simulate your responses for your critical quality attributes. Note that it's always best to include some error in your prediction model, because your prediction model is not perfect; it's not without error. Then, once you have that simulated set of data, you calculate the in-spec portion by counting, out of the points that lie within the space defined by the current CPP factor limit settings, your current design space, how many are in spec for all of the responses, all your critical quality attributes.
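To give a feel for that simulated approach, here is a rough Python sketch of the same idea, not JMP's actual implementation. The prediction functions, noise levels, factor ranges, and spec limits below are all made-up placeholders.

```python
# Sketch only: Monte Carlo flavor of an in-spec portion calculation.
import numpy as np

rng = np.random.default_rng(3)
N = 100_000

# 1) Uniform points over the initial critical-process-parameter ranges (placeholders)
emulsifier = rng.uniform(700, 900, N)
lipid      = rng.uniform(10, 30, N)
lecithin   = rng.uniform(1, 5, N)

# 2) Simulate each CQA from hypothetical prediction models plus model error
vesicle = 120 - 0.05 * emulsifier + 1.5 * lipid + rng.normal(0, 2.0, N)
release = 60 + 0.02 * emulsifier - 0.8 * lecithin + rng.normal(0, 1.3, N)

in_spec = (vesicle < 110) & (release > 70)          # placeholder spec limits

def in_spec_portion(lo_e, hi_e, lo_l, hi_l):
    """In-spec and volume portions within candidate limits on two of the factors."""
    inside = ((emulsifier >= lo_e) & (emulsifier <= hi_e) &
              (lipid >= lo_l) & (lipid <= hi_l))
    return in_spec[inside].mean(), inside.mean()     # (in-spec portion, volume portion)

print(in_spec_portion(700, 900, 10, 30))   # full ranges
print(in_spec_portion(705, 900, 10, 30))   # after tightening one lower limit
```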
Now  the  easiest  way  to  see  how  this  works is  to  go  to  an  example  and  go  to  JMP. That's  what  we're  going  to  do. We're  going  to  look  at  an  example for  a  pain  cream  study. The  goal  of  this  study  is  to  repurpose a  habit-forming  oral  opioid  drug into  a  topical  pain  cream  that  provides the  same  relief  as  the  oral  drug. The  first  thing  they  did  was  figure  out their  quality  target  product  profile, and  then  they  were  able  to  determine  what their  critical  quality  attributes  were. It  turns  out  there  were  three  of  them: entrapment  efficiency,  vesicle  size, and  in vitro  release, and  these  are  the  spec  limits that  give  a  quality  product   for  these  critical  quality  attributes. Next,  they  ran  experiments  to  determine what  are  the  process  parameters that  affect   these  critical  quality  attributes. It  turns  out  there  were  three   critical  process  parameters: emulsifier,  lipid,  and  lecithin. These  are  the  initial factor  limit  settings. Once  they  did  that, they  used  the  Custom  Designer  in  JMP   to  design  response  surface  model  designs. They  ran  the  experiment, and  then  they  used  Fit  Least  Squares to  fit  response  surface  models   for  the  three  critical  quality  attributes in  terms  of  the  three   critical  process  parameters. Once  they  did  all  of  that, now  we  can  go  to  the   design space profiler in  JMP  to  determine  the  design  space. Let's  go  to  JMP. This  is  my  data  table  that  was  created from  Custom  Designer  when  I  created a  design  for  the  response  surface  models of  my  three  critical  quality  attributes as  responses  in  my  three   critical  process  parameters  as  factors. It's  what  I  have  right  here. I  wanted  to  note   that  I  went  ahead  and  I  added the  critical  quality  attribute  spec  limits as  column  properties because  the   design space profiler has  to  know  what the  spec  limits  are  to  work. If  you  don't  add  them   as  column  properties, you'll  be  prompted  to  add  them when  you  launch  the   design space profiler. But  I've  already  done  that. I've  also  already  saved  the  script for  my  models  that  I  created using  Fit  Least  Square. I'm  going  to  go  ahead  and  run  that. It  automatically  launches Fit  Least  Squares. I  have  the  models  closed  and  just the  Prediction  Profiler  open. Notice  that  it  looks  like  the  Prediction Profiler  that  you're  used  to  seeing. I've  also  added  the  spec  limits as  reference  lines  just  so  I  can  see  them   in  relation  to  my  models. To  get  to  the   design space profiler, all  you  do  is  go to  the  Prediction  Profiler  menu, turn  it  on,  and  a few  down, you'll  see  there's  a  new  option called   design space profiler. If  I  click  on  that, the   design space profiler  will  appear right  below  the  Prediction  Profiler. Now  notice  that  because  I  had  added the  spec  limits  as  column  properties, it  automatically  brought  those  in. If  you  go  to  the  bottom  right-hand  side, you'll  see  where  it's  brought   in  my  spec  limits for  my  three  responses   or  three  critical  quality  attributes. You'll  also  notice   that  it's  brought  in  some  error, which  it's  going  to  use  as  my  error   for  my  prediction  model. This  has  come  in  from  the  Least  Square model's  root  mean  squared  error. 
For example, up here, if I go to my in-vitro release model, you can see the root mean square error is 1.2972, and that's the same value that's listed down here as the error standard deviation. Now, you can change these if you think they are not accurate, too big, or whatever. You can even completely remove them and have no error, but we highly recommend that you do add some error to your models. Okay, there are several things that you might notice are different about this Design Space Profiler compared to other profilers. One of the first things you'll notice is that over here on the Y-axis, instead of having a value like you normally do, like up here in the Prediction Profiler, it just says in-spec portion. The actual value for the Y-axis is actually over here to the right. My in-spec portion for all three critical quality attributes, when I have the initial factor limits set at the full range, is 71.2%. Then you can also see over here that it says the volume portion is 100%. That's because I'm using 100% of the entire simulated data set starting out, because everything's at the full factor range. Another thing that's different is that for each factor cell, you'll see that there are two curves instead of the usual one curve. That's because we're looking for factor limits, so a lower and an upper limit for each factor. The blue curve represents the in-spec portion as the lower limit changes. There's a handy legend over here to help me. The red curve represents the in-spec portion as the upper limit changes. I can interact with this profiler to change my factor limit settings, or my design space. I can move these markers. I obviously want to move this to get a higher in-spec portion, and so you would want to find a curve that has an upward slope. I can move this one inward by dragging this marker. I could also just enter values down here below the cells, or I could enter values here in these fields next to the factors. Or there's also another way, which is to use these buttons, move inward and move outward. This move inward button, if I click on it, is going to make the move that gives the largest increase in in-spec portion. It's going to look for the curve with the steepest path. If I click on move outward, if I were trying to increase the size of my design space, it would give me the least decrease; it would look for the least steep path downward. Let me go ahead and click move inward. I notice when I click that, the lower limit for emulsifier went from 700 to 705, and the in-spec portion went up. Now, this is the in-spec portion for all three critical quality attributes, but if I want to look at them individually, they're to the far right-hand side of each response, each critical quality attribute. I can see the in-spec portion for each one individually, but up here it's finding it jointly for all of them. Let's click move inward again and notice that emulsifier went in again, and now it's up to 710 and my in-spec portion is up to 78%.
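The move inward button described here behaves like a greedy search: among the candidate one-step moves, it applies the one that raises the in-spec portion the most. The sketch below illustrates that idea only; it is not JMP's implementation, and the factor names, step sizes, and toy objective are placeholders (in practice the objective would be the simulated in-spec portion from the earlier sketch).

```python
# Sketch of a greedy "move inward" step: try raising each lower limit or
# lowering each upper limit by one step and keep the move with the best
# objective value. `objective` stands in for the simulated in-spec portion.
def move_inward(limits, objective, steps):
    best_limits, best_val = None, -float("inf")
    for factor, (lo, hi) in limits.items():
        step = steps[factor]
        for new_lo, new_hi in [(lo + step, hi), (lo, hi - step)]:
            if new_lo >= new_hi:
                continue                      # limits would cross; skip this move
            candidate = dict(limits, **{factor: (new_lo, new_hi)})
            val = objective(candidate)
            if val > best_val:
                best_limits, best_val = candidate, val
    return best_limits, best_val

# Toy usage; a real objective would recount the in-spec portion for the
# candidate limits rather than use this placeholder.
limits = {"emulsifier": (700, 900), "lipid": (2, 6)}
steps = {"emulsifier": 5, "lipid": 0.1}
toy_objective = lambda lim: -(lim["emulsifier"][1] - lim["emulsifier"][0])
print(move_inward(limits, toy_objective, steps))
```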
The volume portion now is down to 89.79%. Click it again. Now my lecithin lower limit went up. My goal is to see if I can get my in-spec portion all the way up to 100%, if I can still maintain factor limits that seem reasonable and realistic. But before I do that, there are several options in the menu of the Design Space Profiler that I really like to use while I'm doing this. The first one is that I like to turn on make and connect random table. This is going to create a new table of random data. It's going to do 10,000 points by default; I'm going to leave that. I'm going to add random noise. It's the same random noise added in the Design Space Profiler, based on these error standard deviations. Click OK, and I get a new set of 10,000 data points, uniformly distributed throughout the factor space. These data points are color coded. The green points are in-spec for all of my critical quality attributes, and the red are out. Anything that's selected is still within my design space, my current factor limit settings. What I really like about this table is the graphs that are created by the scripts that are automatically saved to the table. I really like to look at the scatterplot matrix Y. Let me turn that on. This gives me a view of my data with all combinations of my response variables, my critical quality attributes. It has the spec limits drawn on, it has the in-spec region shaded green, and all the red points are out of spec while the green points are in-spec. I also like to look at the factor space, and I can do that by looking at the scatterplot matrix X. This is going to give me my factor space, my critical process parameter space. Once again, same color coding, and what's shaded is within the current factor settings. Let me see if I can situate this so we can look at both of these while I'm adjusting my factor limits over here. Okay, you can see as I move inward how the shaded area is shrinking for my factor space, my design space, and the number of out-of-spec points is also shrinking. Now, I also really like to turn on the connect/hide mode, and what that does is hide any points that are no longer within my current factor limit settings, my current design space. Now if I keep clicking move inward, you can see how the red points are starting to disappear. One other option you can use, if you prefer, is to look only at the points that are in-spec, or only at the points that are out of spec. I'm going to turn Y out of spec on to look only at the points that are still out of spec. I'm going to see if I can hit 100%. Move inward. I'm going to keep going. Still looks good. Keep going. Okay, now I've hit 100%, and I think this design space, these factor limits, do look like they're probably reasonable. You might notice that I still have a red point over here because this is a separate set of simulated data, but I'm not worried about that one random point. Okay, let's examine this design space that I have set here.
What  I  want  to  do  is  I  want  to  send the  midpoint  of  this  design  space back  to  the  Prediction  Profiler to  see  what  that  looks  like. I  can  easily  do  that  with  this  option that  says,  "Send  midpoints  to  profiler." When  I  do  that,  it  automatically  sends the  midpoint  of  this  current  design  space back  to  the  profiler so  I  can  look  at  it. I'm  going  to  turn  on  the  desirability, and  I  think  that  looks  quite  good. I'm  going  to  go  ahead and  save  this  setting. I  want  to  do  that  just   so  I  can  compare  this  against the  optimal  setting  if  I  were  to  optimize by  maximizing  all   of  my  critical  quality  attributes, which  is  something  you  may or   not  want  to  compare  against. But  it's  very  easy  to  do   and  I  can  look  at  the  difference. You  can  see  there's  not  a  huge  difference in  the  desirability  between  the  center of  my  design  space  and  the  optimal  value if  I  maximized  all   my  critical  quality  attributes. I'm  pretty  good  with  this  center   of  this  design  space. I'm  going  to  turn  it  back  to  that  point. Another  thing  that  you  can  do  is... Well,  I  don't  really  believe... The   design space profiler uses  uniformly  distributed  points, but  I  think  that  my  critical  process parameters  actually  follow a  normal  distribution. It's  very  easy  to  send  these  limits   back  to  the  profiler and  back  to  the  simulator so  I  can  simulate  using   the  normal  distributions. If  I  click  Send  Limits  to  Simulator and  choose  normal with  limits  at  three  sigma, what  it's  going  to  do  is  it's  going to  send  the  limits  back   to  the  simulator  and  figure  out what  my  standard  deviation  would  be  if that  design  space  was  set  at  three  sigma. Of  course,  you  can  change  these  values, you  can  change  the  distribution. I'm  going  to  use  these  settings   and  see  when  I  simulate what  my  defect  rate  looks  like. Looks  like  it's  at  zero  every  time I'm  clicking  this,  which  is  great. I  also  want  to  see   what  happens  way  out  at  the  tails. I  can  easily  do  that  by  using the  normal  weighted  distribution. Let  me  turn  that  on  for  each  of  these. This  is  a  way  to  check what  happens  to  the  tails. Now  when  I  click  simulate, it's  not  quite  zero  because  I'm  testing way  at  the  tails,  but  it's  still  very  low. I'm  very  happy  with  this. I  also  like  to  run  a  sanity  check for  capability, and  I  can  easily  do  that  by  using the  Simulate  to  Table  option, which  is  going  to  simulate  using the  normal  distribution for  my  critical  process  parameters and  the  prediction  models  with  the  error, the  same  error  I  used  before. I  click  Make  Table, I  get  a  simulated  table and  it  has  a  save  distribution  script  that  if  I  run, will  automatically  open  up  capability because  I  saved  my  spec  limits   as  column  properties. When  I  check  these  capability  reports for  the  three  critical  quality  attributes, it  looks  very  good. Of  course,  this  is  simulated  data, so  you  want  to  check  it  on  real  data, but  I'm  pretty  happy  with  what  I'm  seeing about  my  design  space  right  now. 
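The "normal with limits at three sigma" option described above amounts to treating the design-space limits as the mean plus or minus three standard deviations, so the standard deviation is one sixth of the limit width. Below is a hedged one-factor sketch of that calculation and a simulated defect rate; the limits, toy model, error size, and spec limits are hypothetical, not the values from the demonstration.

```python
# Sketch of the "normal with limits at three sigma" idea: treat the design-space
# limits as mean +/- 3 sigma, simulate the factor from that normal distribution,
# and estimate a defect rate against the CQA spec limits. All values are made up.
import numpy as np

rng = np.random.default_rng(7)
lower, upper = 705.0, 860.0                        # hypothetical design-space limits for one factor
mid, sigma = (lower + upper) / 2, (upper - lower) / 6

x = rng.normal(mid, sigma, 100_000)                # simulated critical process parameter
y = 0.13 * x + 1.0 + rng.normal(0, 1.3, x.size)    # toy prediction model plus model error
defect_rate = np.mean((y < 95) | (y > 110))        # hypothetical spec limits
print(f"simulated defect rate: {defect_rate:.4%}")
```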
I  want  to  save  this  design  space, and  you  can  easily  do  that  by  going to  the   design space profiler  menu and  checking  Save  X  Spec  Limits, and  this  will  save these  factor  limit  settings  back to  the  original  data  table as  spec  limits  in  your  critical   process  parameter  columns. I  want  to  click  that  and  I  go back  to  my  original   data  table  just  back  here. Close  these. Okay,  there  it  is. When  I  get  back  to  my  original  data  table, you  can  see  that  these  spec  limits have  been  saved  to  my  critical process  parameter  columns. This  is  a  great  way  to  save  this information  if  I  save  the  data  table. It's  also  nice  because  if  I  do  that and  I  save  my  predictions  as  formulas, I  can  look  at  the  design  space in  terms  of  the  contour  profiles. I  can  do  that  by  using the  graph  contour  profiler. I've  already  gone  ahead and  saved  a  script  for  this. I'm  going  to  run  it  and  I  have  it  set  up to  show  the  contour  profilers in  terms  of  all of  my  critical  process  parameters. You  can  see  this  faint  rectangle  is my  design  space and  the  shaded  contours or  the  contours  are  my  spec  limits. You  can  see  how   my  design  space  is  well  within... It's  well  within  my  spec  limit  contours. I  have  that  nice  buffer   because  I  added  the  error, so  I  can  look  at  that  in  terms of  all  of  my  critical  process  parameters. I'm  very  happy  with  this  design  space. Let's  go  ahead  and  look at  a  different  example. Okay,   this  is  an  example that's  outside  of  the  realm of  the  pharmaceutical  industry. It's  a  polymer  manufacturing  study, and  the  goal  of  this  study  is  to  improve the  quality  of  their  white  polymer. I  wanted  to  use  an  example  to  show that  these  methods  can  be generalized  in  any  industry. It  doesn't  have  to  be  pharmaceutical. This  example  was  inspired by  an  example  that's  in  the   Visual  Six Sigma  Second  Edition  book  that  uses  JMP. It's  a  great  book. It's  in  my  references. I  highly  recommend  you  check  it  out if  you  haven't  ever  looked  at  it. They  want  to  improve  their  white  polymer and  they  figured  out  that   they  had  two  critical  quality  attributes: melt  flow  index  and  color  index. These  are  the  spec  limits  which  will ensure  quality  for  the  white  polymer. They  ran  experiments  to  figure  out which  process  parameters affected  these  critical   quality  attributes. There  were  three  of  them, Amps  for  slurry  tank  stirrer, viscosity  modifier percent ,   and  percent  of  filler. These  were  the  initial factor  limit  settings. Another  thing   that  was  different  about  the  study is  that  they  used  historical  data to  find  prediction  models for  their  critical  quality  attributes in  terms  of  their  critical process  parameters. They  used  two  different  platforms and  two  different  types  of  models for  these  critical  quality  attributes. They  used  generalized  regression  platform and  the  Lasso  method  to  find   a  model  for  the  melt flow  index, and  they  used  the  fit  neural  platform to  find  a  neural  model   for  the  color  index. Because  they've  used   two  different  platforms and  two  different  types  of  models, they  need  to  use  the  profiler  platform that's  underneath the  graph  menu  in  JMP. 
To  use  that,  you  need  to  save your  models  as  formula  columns. I  want  to  show  you  how  you  can  still  use the  Prediction  Profiler  platform to  use  the   design space profiler. Let's  go  back  to  JMP. I'm  going  to  open  up  the   polymer  data. This  is  historical  data. I've  got  my  two   critical  quality  attribute  columns, my  three  critical process  parameter  columns. I've  gone  ahead  and  saved the  spec  limits  as  column  properties, and  these  are  my  two  models. This  is  the  model  I  created with  generalized  regression. This  is  the  model  I  created  with  neural. I've  gone  ahead  and  saved both  of  these  models   to  the  table  as  formulas. I've  also  saved the  Prediction  Profiler  platform  script, which  I'm  going  to  run, and  you  can  see  this  is  my  Prediction Profiler  in  terms  of  these  two  models. The   design space profiler,  once  again, I  go  to  the  Prediction  Profiler  menu, turn  on  the   design space profiler. It's  slightly  slower  here  just  because  I'm using  formulas  and  I  have  a  neural  model. But  once  again,   it  opens  up  and  it  shows  me  that with  the  initial  factor  limit  settings, my  in-spec  portion  is  about  65 %. It  looks  like  what  I'm  going  to  want to  do  is  I'm  going  to  want  to  decrease the  upper  limit  of  my  XF  factor critical  process  parameter. Indeed,  if  I  click  move  inward, see  how  my  in-spec  portion  is  going  up by  decreasing  the  XF  upper  factor  limit. I  want  to  see  if  I  can  get  to  100 %. If  I  can  get  there   and  have  reasonable  factor  limits, see  what  things  look  like  if  I  keep  heading  for  that. Okay,   I've  gone  to  100 %. My  factor  limits  on  XF  are  a  little  tight, so  let's  see  what  this  looks  like in  the  prediction  profiler. I'm  going  to  send  a  midpoint  up  there. I  think  it  still  looks relatively  reasonable. But  one  thing  I  wanted  to  point  out  is if  you  do  end  up you're  not  happy  with  things, you  can  actually  also  use  this  design space  profiler  to  do  what  if  scenarios, if  it  were  realistic  to  be  able   to  change  your  spec  limits, let's  say  you  were  able to  make  them  wider, or  what  if  your  consumer  demands   that  they're  tighter? You  can  check  out   what  if  scenarios  as  well by  changing  these down  here  in  these  fields. But  in  this  case,  I  think  I'm  pretty happy  with  this  design  space. I  think  it  will  work. I'm  going  to  go  ahead  and  save  the  spec limits  to  the  original  data  table. See,  they  got  saved  here. Once  again,  I've  saved  a  script for  the  contour  profiler. I'm  going  to  run  that  really  quickly. Once  again,  I  can  see  my  design  space   in  terms  of  my  contours. That's  the  faint  rectangles  here, and  the  shaded  regions  are  my  spec  limits. I  could  see  how  the  design  space   is  well  within  my  specs, so  I'm  pretty  happy  with  this. Let's  go  back  to  PowerPoint  just to  give  you  some  takeaways. Okay,   some  quick  takeaways about  the   design space profiler. First  of  all,  the  in-spec  portion  that's reported  in  the   design space profiler, those  values  should  not  be  considered probability  statement  unless  you  think that  your  critical  process  parameter factors  follow  a  uniform  distribution within  the  limits  because  that's  what's being  used  to  create  that  statement  there. 
Also, the Design Space Profiler is not meant for models that have a large number of factors or very small factor ranges, and that is because of the simulated nature of the approach it takes. It's also recommended, as I mentioned a couple of times in my talk, to always use random error for your prediction models, for your responses, because your models are not without error. Finally, finding a good design space is applicable to more than just the pharmaceutical industry, even though that's where the idea came from. That second example was just to demonstrate how it can be used in any industry where you care about having a robust process and maintaining quality. These are my references, and here's my contact information. I wasn't able to show you everything about the Design Space Profiler, so I hope that you will check it out. If you have any questions or any feedback, please contact me. Thank you so much.
This presentation demonstrates the use of JMP for a practical real-world application in high-throughput analytical chemistry; it also highlights the power of even basic JMP functionality to reduce costs without affecting data quality. This talk provides a compelling introduction to the synthetic biology industry, which is at the cutting edge of scientific discovery, aiming to revolutionize how to manufacture chemicals sustainably. At Amyris, hundreds of thousands to millions of different strains are created each year. Each strain is screened using various analytical platforms to identify improvements, allowing us to link genetic coding to specific phenotypes. This scale pushes the boundaries of today's automation and analytical technologies, consequently challenging the scalability of conventional approaches to analytical chemistry. This talk highlights not only the use of JMP as a powerful tool to make a complex analysis easy but also demonstrates the use of the workflow builder to enable general users to perform these analyses. A business case is shown in which JMP helped analyze and improve analytic calibration protocols in the lab.

Today we're going to be talking about using JMP, our favorite software, and applying it to a real-world problem in our analytical sciences department at Amyris, which is a synbio company. Before we jump into that, I wanted to introduce myself as well as Scott, who helped me along this journey. I'm Stefan, an associate director of R&D data analytics at Amyris. I have twelve years of industry experience and a fairly diverse background; I've worked in various labs, from analytical chemistry to fermentation science, and in more recent years I've focused more on the quality and data science side of things. Scott has helped me with a lot of the content here and has been working with Amyris for a number of years, and he is one of the JMP pros working for JMP. I'd like to start off by just saying thank you to Scott for helping us out here. We're going to split the talk today into three parts. I'm going to give a bit of background and context both on synthetic biology, if you haven't heard of that before, and on analytical chemistry. The main part of the talk is really going to be focused on applying JMP to a specific question we had, and then finally we'll wrap it up by briefly touching on automation and then the impact of the analysis and the case study we'll look at together. Some of you may not be familiar with synthetic biology or analytical chemistry, and I really like to understand context and background; it's going to be relatively important for the case study we look at, so we'll focus on that today and start there. Synthetic biology really leverages microorganisms, as we like to call them, as living factories. We use mainly yeast in the case of Amyris, which we precision engineer, and we use the process of fermentation, which is not a new thing; it's something people have been using for thousands of years, mainly to make alcohol and bread in a lot of cases. In our case, we're using the yeast in fermentation, feeding it sugar and converting that sugar into a variety of target ingredients and chemicals.
Those  ingredients and  chemicals  we  can  then  make higher  purity  so  they  may  be higher  performing  lower  costs and  in  a  more  sustainable  fashion. To  give  an  example, this  isn't  just  a  fairy  tale. This  is  reality,  it's  not  an  idea,  we  have 18  molecules  today  that we  manufacture  at  scale  and  I'm  showing a  subset  of  those  here. There's  an  example  on the  top  left  you  have  Artemisinin. That's  an  antimalarial  drug, it  was  our  first  molecule and  that's  how  our  company  was  founded. In  the  top  middle,  we  have  Biophene which  is  actually  a  powerful building  block  that  we  then  convert into  other  chemicals  and  applications. One  example  being  Squalene  which  is  a very  popular  emollient  used in  the  cosmetics  industry and  traditionally is  sourced  from  shark  livers and  one  that  might  be  familiar in  the  bottom  middle,  we  have  patchouli. Some  people  associate  that with  the  hippie  smell,  it's  a  fragrance, but  it's  actually  really  ubiquitous in  the  fragrance  industry as  a  base  note, so  it  goes  into  thousands  of  products. Things  like  Tide  detergent  have  patchouli in  it  and  we  can  manufacture  this, which  is  traditionally extracted  from  plants with  our  synthetic  biology  platform. I  work  in  the  R&D  function, and  so  our  goal  is  really  to  identify the  best  E  strains  that  we  can then  use  at  manufacturing  scale, and  that  requires  research  at  scale. We  run  highly  automated a  lot  of  high- throughput  workflows at  Amyris  in  Emeryville,  and  so from  the  left  there  we  start  screening our  yeast  strains at  a  capacity  of  about 600,000  strains  per  month. We  take  those  top  performers and  we  promote  them to  what  we  call our  bench- scale  bioreactor  fermentations, which  you  can  see pictured  on  the  right  there. Throughout  all  of  this, we're  creating  a  lot  of  strains, which  means  we  also  need  to  understand what's  happening  in  those  strains, what  are  they  producing,  how  much,  and that's  really  where  analytics  come  in. Those  analytics  need  to  be  run at  a  scale  to  match  that  so  we  can  really get  the  data  to  understand what's  happening. With  this  scale  of research,  there's  a  lot  of  opportunities, and  a  lot  of  those  opportunities  come from  looking  at  conventional approaches  and  reconsidering how  to  do  those. I will talk  a  little  bit  about analytical  chemistry. Again,  that's  not  anything  that's unique  to  synthetic  biology. It's  pervasive  in  a  lot  of  industries, petroleum  industry, environmental  sciences,  pharma, very  common  way  just  to  measure  things. I'll  talk  here  really about  chromatography, and  as  an  example,  I'll  take  fermentation that  we  do  on  the  bioreactive  scale. From  this  fermentation,  we're  going to  sample  that  while  it's  running. We're  going  to  get  a  dirty  sample  from that  which  we  then  can  further prepare  and  dilute. We  have  this  mixture of  components  in  this  final  form. We'll  then  take this  mixture  of  components, we'll  run  it  across some  separation  techniques. That's  a  chromatograph. What  that's  going  to  do  is  based on  the  property  of  those  components, might  be  size,  it  might  be  polarity, it'll  allow  us  to  separate  those  out. We  then  feed  that  into some  detection  mechanism. 
There's a variety that you can use, and what that gives you is a separation of these components over time and then some intensity of response. The last piece, and where we're going to focus today, is that intensity isn't really a useful thing for you or me to make decisions on. We need to translate that into something useful, like a concentration. The calibration curve allows us to translate that intensity into a concentration, and of course, you can imagine that if you get that translation wrong, your data is going to be wrong and it's going to mislead you. Calibration curves are where we'll focus today, and that's the heart of the question. A calibration curve is created by running standards with varying levels of your known component. In the example I'm showing here, we have a low, mid, and high, so a three-level calibration. We know what the concentration is in those because we prepared them, and we measure the response on these instruments. From there, we can fit some calibration curve. In this example, I'm showing just a simple linear fit, and then we can run unknown samples, read the response off our instrument, and do an inverse prediction. We're taking our response from the Y and predicting what the quantity is in that sample. It's a very common way to quantify things in unknown samples. That's our background. We're going to jump into the case study, looking at this key question we had around optimizing a part of our process in our analytics. A bit more background here: when we do calibration in our labs, there's a cost associated with it. Maintenance of these calibrations, calibration curves, and calibration standards is expensive, both due to people's time but also materials. These materials can often cost thousands, even tens of thousands, of dollars per gram. With the scale that we're doing our research at, it really pushes us again to reconsider those conventional approaches. We're running millions of samples per year, and we have a really diverse set of analytical methods; we currently have, in our lab in Emeryville, over 100 different analytical methods measuring all components. One place we looked at is the conventional approach. We see this with most people we hire; this is where people start. Conventionally, calibration curves often have five to seven levels, whether they're linear or not. People think about it and say, okay, five to seven levels, linear fit. In theory, the minimum you might need is two, and there's a cost to each additional level, in materials, preparation costs, and maintenance. This is where we wanted to look and ask the question: can we actually reduce this number for an existing method without significant impact on our actual data quality, the way we quantify our unknown samples? This is where JMP comes in. We're going to use JMP here to simulate some alternative calibration schemes, in this case reducing the number of calibration levels. To reiterate what we've walked through, our problem ultimately is that calibration maintenance is costly. That's exacerbated by the scale we do it at.
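As a concrete illustration of the calibration idea just described, here is a minimal Python sketch of an unweighted linear calibration fit and the inverse prediction used to turn an instrument response into a concentration. All of the concentrations and signal values are made up for illustration.

```python
# Sketch of a calibration curve and inverse prediction, assuming a simple
# unweighted linear fit. Standard levels, responses, and the unknown signal
# are invented numbers.
import numpy as np

conc = np.array([1.0, 5.0, 10.0, 25.0, 50.0, 100.0])          # known standard concentrations
signal = np.array([12.1, 60.5, 119.8, 301.2, 598.7, 1204.3])  # measured instrument responses

slope, intercept = np.polyfit(conc, signal, 1)                 # fit signal = slope*conc + intercept

def inverse_predict(y):
    """Translate an instrument response into a concentration."""
    return (y - intercept) / slope

unknown_signal = 250.0
print(f"predicted concentration: {inverse_predict(unknown_signal):.2f}")
```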
Our general approach is really going to be to look at how we can optimize this. Let's look at reducing the levels of those calibrations, and then our specific solution is using JMP to ask the question: if we went back in time and, theoretically, had run two or three calibrators instead of six or seven, how would that have impacted our data? In our case, we're going to focus on a single method today. This is a real method we've been running for about six months. We have 22 batches of samples we've run on this method, so it's about 1,000 samples. Our existing calibration, which I show here on the right, is a linear calibration. It has six levels, and we've estimated that if we can reduce this to the minimum of two levels, we could save an estimated $15,000 a year. There's a real, measurable motivation to understand if we can pursue this. I'm showing here the general workflow that we came up with. I'm going to go through it really quickly right now, but no worries, we're going to walk through it step by step together. We're really just going to pull the historical data, recreate our historical calibration in JMP to validate that what we're doing in JMP matches what we've done historically, and then say, okay, let's eliminate some of these levels, recreate the calibration with those reduced levels, and then evaluate what impact that has on our targets. Now, I think in this case it's also really important to emphasize that we have two pass-fail forks in the road. Often when we're doing analysis on data in hand, we're looking for statistical significance; with studies like this, it's really important to determine what your practical requirements are. In this case, what does that mean? We're talking about impact on the measurement of unknown samples. Ultimately, we want to make sure that reducing the calibration is not going to bias the measurement one way or the other. We want the measurement to be the same. As many people will tell you, "the same" is not really a quantifiable thing; it depends on your sample size and the noise in your process. We need to define what "no different," "same," or "no impact" means. Here we set our acceptance criteria ahead of time: for the first step, accuracy within half a percent, and for the second step, accuracy within 1%. We'll see these come back as we walk through this. For our first step, and on every page here I'm going to show in the top right what step we are at in the process, as well as highlighting what JMP platforms we're using. For our first step, we're going to be pulling our historical data from a database. In our case, we have a LIMS system that already has the data in a structured format. You could also import this from CSV, however you can access the data. In our case, we're pulling it using raw SQL and JSL, and it arrives in a structured format. I'm showing a subset of the columns we have, but what you'll notice is that in this case we have our six calibrators as well as a number of unknown samples.
We're pulling in the historical data as the core data set we're working with. The first step is recreating and validating the same calibration curve, that same six-point calibration, in JMP. Now, you might ask why we have to do this. There are two main reasons. One is that calibration curves can have a lot of caveats. They can have weighting, they can have anchor points, they could be forced through zero, they could be nonlinear. This is a good way to validate that you're using the right parameters in JMP to recreate them. The other reason is that we don't expect these values to be exactly the same, because a lot of this analytical software uses some proprietary regression that is not exactly like, let's say, ordinary least squares regression. To do this, we're going to use the Specialized Modeling Fit Curve platform and really just recreate our calibration curve, just like I showed earlier, where we have the known quantity of our six standards on the X and our raw intensity or signal response on the Y. In our case, we have 22 batches. I'm not showing all of them here, but we're reproducing this for 22 different sequences in essentially one click, with what I call the power of the Control key. If you don't know this trick, it will save you a ton of time: if you hold down the Control key and click on the red triangle button, whatever you do is going to apply to every analysis in that window. I recently learned that's apparently called broadcasting, so you could use that term as well. We're recreating a calibration curve for each of our batches, and then, in the same Specialized Modeling platform, we're saving the inverse prediction formula, because we're predicting from Y to X, if you remember back to our calibration intro, to be able to save the predicted values back to our data table. This then looks like this: on our data table we have first our historical quantity, what we pulled from the database, and now we have our raw quantity that we generated from these newly created calibration curves in JMP. We have a multiplier we have to apply, due to the sample prep we do, which we pull from the database, so that's already there and it's going to stay constant. We simply need to apply a calculated column here to have a comparable value to our historical data. If you look at this first row, our value is very close to, but not exactly the same as, our historical data. Next up, we do a visual comparison, plotting our historical values against the JMP recreation of that calibration, and this is a good check, again, to look through your data. What you would expect, or hope for, is a line that essentially looks like Y equals X. Now, we don't want to stop at a visual analysis. We, of course, want to bring some statistics into it. This is where we introduce the Passing-Bablok regression. It's actually something that was just added into the base JMP functionality, I think with JMP 17; it used to be an add-in for a long time. I'm glad it's there now.
This is a specialized regression that's non-parametric and robust to outliers, and it's really designed specifically for comparing analytical methods. For many of you it's probably irrelevant and you're never going to have to use it, but we need to use it in the world we're working in. What this regression does is give you two hypothesis tests, to test for constant bias as well as proportional bias. Starting with constant bias, where we're seeing if there's bias that moves the line up or down by the same amount across the range, we're evaluating whether the confidence interval of our intercept does or does not include zero. For proportional bias, where the bias would change based on the response, we're evaluating whether the confidence interval of our slope does or does not include one. Now, in our case, we reject the null hypotheses in both of these cases, which tells us that we do have statistically significant bias, both constant and proportional, in our data set. From here you might say, okay, we're done, there's bias, we can't move on. But thinking back, this is why it's really important to define what the practical significance is, because any statistician will tell you that with 1,000 samples in our data set, you're going to be looking at very tight confidence intervals. You're going to be able to detect very small differences. We have a statistically significant difference, but does it matter? That brings us to our last step: we're going to calculate, again using a column formula, the relative difference between the two methods, and I'm showing a distribution of that below. That distribution then gives us access to this test of equivalence. This allows you to test a distribution of values against a constant that you define, within some confidence. Here in this window, we'll enter our target mean as zero, because we hypothesized that they're going to be the same, so no difference. Now we get to enter our acceptance criterion, which was 0.5%. This gives us this very nice output with our final two hypothesis tests where, if we reject these, we can determine essentially that the mean of this data set is equivalent to zero within plus or minus 0.5%. On this one you might say, hey Stefan, this is doing a t-test, your distribution is not exactly normal, and I think you'd be right; if I went back, I might actually use the test mean platform because that gives you access to non-parametric equivalence tests. Regardless, this is a really useful and direct way to test for practical significance. We've pulled our historical data from the database, we've recreated and evaluated the calibration curve, and we've established that it passes our acceptance criteria. If it had failed, it could be an issue with the data set, or you might not be using the right calibration parameters; there are a number of reasons. We generally would pretty much always expect this to pass; it usually just requires some investigation into what's going on in the way you recreated the calibration. Our next step is down-sampling, or reducing the number of levels of our calibration.
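The equivalence check described above can be illustrated with a generic two one-sided tests (TOST) sketch against the plus or minus 0.5% acceptance criterion. This is a stand-in for the JMP equivalence test used in the talk, not a reproduction of it, and the relative differences below are simulated rather than the real per-sample values.

```python
# Sketch of a two one-sided tests (TOST) equivalence check: is the mean relative
# difference between the recreated and historical quantities within +/- 0.5%?
# The per-sample differences are simulated stand-ins for the real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rel_diff_pct = rng.normal(0.02, 0.15, 1000)   # hypothetical relative differences, in percent

margin = 0.5                                   # acceptance criterion: +/- 0.5%
t_lower, p_lower = stats.ttest_1samp(rel_diff_pct, -margin, alternative="greater")
t_upper, p_upper = stats.ttest_1samp(rel_diff_pct, margin, alternative="less")

# Equivalence is claimed only if BOTH one-sided nulls are rejected.
print(f"p(mean > -{margin}%) = {p_lower:.3g}, p(mean < +{margin}%) = {p_upper:.3g}")
print("equivalent within +/-0.5%" if max(p_lower, p_upper) < 0.05 else "not shown equivalent")
```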
Now, if we tried to do this without JMP, we would have to go into every single sequence in our analytical software, manually remove calibrators, and recalculate things. It would be a really long and tedious process. In JMP, this is as easy as using the data filter. In our case, with this six-point calibration, we have a linear fit, and we know that the minimum number of points we need for a linear fit is two. We're picking the highest and the lowest calibrators and just filtering down to those. From here I'm going to go pretty quickly, but really all we're going to do is recreate this calibration with two points in JMP. Again, we're using the Specialized Modeling platform, doing a fit line; the only difference now is that we have two points instead of six. We're applying that inverse prediction formula back to the data table, which again gives us our inverse prediction, and then we apply the multiplier. Because I know I'm going to test the practical significance, I'm going to preemptively calculate a relative difference between the two-point calibration and the historical value. Again, we go through the Passing-Bablok and, not so surprisingly, considering the size of our data set, we reject the null hypotheses here and establish that we have statistically significant bias, both proportional and constant. We move on to test our acceptance criteria. Remember, now our threshold is 1% instead of 0.5%, and that comes from working with the stakeholders of the data to establish what an acceptable equivalency is. That's always important pre-work to do, and we're going to test that equivalency. Here we find that these two methods are equivalent within plus or minus 1% on the unknown samples, and that's really important. We're using those historical, real-world samples to ask the question: what if we went back in time, reran all these calibrations with two points, and reported the data for these unknown samples? How would those values change? On average we see that they change very little, and so it gives a lot of credence to considering reducing those calibration levels. We've essentially demonstrated this now, and so the calibration on the left and the calibration on the right, we're saying, are equivalent and are going to provide equivalent quantitation within 1%, and so we have essentially the evidence we need to push for this change. We passed our first check, we reran the evaluation with the two points, we passed that, and now we're at our final step of implementing those changes in our process. Now, this is arguably the most important part: if you do an analysis and just leave it sitting there, it doesn't do much good. This can sometimes be the hardest part. You have to go out, you have to convince people, especially in cases like this, and you have to take into consideration whether there are additional things that this analysis didn't consider.
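The down-sampling step itself is straightforward to sketch: refit the calibration using only the lowest and highest standards and compare the inverse predictions for unknown samples against the full six-point fit. The standards, signals, and unknown responses below are invented numbers, intended only to show the shape of the comparison feeding the 1% equivalence check.

```python
# Sketch of the down-sampling comparison: refit the calibration with only the
# lowest and highest standards, then look at the relative difference in
# predicted concentrations for unknown samples. All numbers are illustrative.
import numpy as np

conc = np.array([1.0, 5.0, 10.0, 25.0, 50.0, 100.0])           # six calibration levels
signal = np.array([12.1, 60.5, 119.8, 301.2, 598.7, 1204.3])

def fit_and_predict(x, y, unknown_signals):
    slope, intercept = np.polyfit(x, y, 1)
    return (unknown_signals - intercept) / slope                 # inverse prediction

unknowns = np.array([80.0, 250.0, 640.0, 990.0])                 # responses from unknown samples

full = fit_and_predict(conc, signal, unknowns)                   # six-point calibration
two_point = fit_and_predict(conc[[0, -1]], signal[[0, -1]], unknowns)  # lowest + highest only

rel_diff_pct = 100 * (two_point - full) / full
print(np.round(rel_diff_pct, 3))                                 # compare against the 1% criterion
```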
I'm happy to talk to anyone about that, but we're not going to go in depth on the other considerations we had to think about before putting this into action. With this example, we did actually end up reducing the calibration levels from six to two, and that reduced the annual cost of running that method by about $15,000. From there we might say, okay, what now? Are we done? Of course not. We did it for one method; we have a suite of another 100-plus methods that may also have these many-level calibrations that might be overkill for what we need. We want to look at repeating the analysis for other methods, and that's where I think automation comes in. It is a really great way to scale these one-off analyses for ourselves, but also for others. My rule of thumb is: if I find myself doing an analysis more than two or three times, let's build that out in automation and save future me a lot of time by spending a little time now. I'll touch on this very briefly, and I want to shout out Scott here for helping me with a lot of the workflow builder work and the scripting, but these native automation tools in JMP are really powerful and very user-friendly. There are a lot of code-free options, and there are really different ways you can do this. You can do it, as on the left side, in the classic way, doing all the scripting; it even allows you to save global variables, so it gives you a place for users to enter their acceptance criteria, which might change. Or you can leverage the workflow builder, which is a bit of a newer feature but really lets you build out this automation. Even if you just want to script it raw, you can build the framework that you can then flesh out. The two things I will say about this are, first, that how much you can automate, or how much effort you have to put into it, is going to be limited to some extent by how rigid that workflow is. If users need it to be really flexible and need to interact with it, it can become very challenging to automate, and of course data consistency is key as well. Second, this is really a great tool to help others reproduce the analysis, but you really do have to train them and document the work, and make sure they know what it's actually doing. As we all know, every analysis has its caveats. You need people not just to click and get a report, but also to understand a little bit what potential things could come up, especially if you're trying to future-proof the work. I'd like to bring it back together and wrap it up there, and I hope today I've shown you that with JMP you don't have to do crazy complex or sophisticated things; you can piece together a lot of simple functionality to create really impactful workflows. Whether you're working in a lab or elsewhere at your organization, look to identify these improvements in existing workflows. If you are in the situation most of us are in, there's more data than we know what to do with.
Look at the data that no one is looking at, and then challenge the conventional thinking. The way we're working is always changing; ask why we do it this way. In our case, for a long time, this was simply the way we did it: a five or six-point calibration. Ask why. What if we didn't? What would the impact be? Of course, I don't have to tell anyone listening here: use JMP for the scalable analysis, and then use automation to make it easy. It really doesn't have to be fancy; it just has to work for what you need it to do. Finally, you can use that to implement impactful change and use data to drive those decisions. It's probably one of the most convincing tools that we have today. If you're talking to management, do it in units of dollars, because they love that. I'll wrap it up there. The last thing I'd like to say is just a thank you to the JMP Discovery Summit committee and all the people organizing, a special thank you to Scott for all the help he has given me in the past with Amyris, but also with this talk and this analysis, and then a number of people at Amyris who were involved with this. And with that, I will wrap it up. Thank you. Bye.
Several authors have addressed validating statistical software. More work is needed to assess the usability of such software since modern statistical software is increasingly in the hands of users with limited statistical training to address complex real-world problems. This poster presents a usability case study for a design of experiments tool in JMP.   The study focuses on a dialog used to specify a designed experiment. Such specifications require users to identify factors, responses, the linear regression model, and constraints. The cost of conducting experiments is usually a primary concern, so users typically iteratively refine specifications before experimenting. As a result, the ability to easily specify and change specifications is essential to users.   We begin with the challenges of the current dialog related to discoverability, clarity, and tedium when performing everyday tasks. We present details of the usability study, including dialog prototypes that address various challenges. We discuss how these prototypes were dynamically generated with the JMP Scripting Language and how the usability of each prototype was assessed by using simple and complex tasks. We discuss the variables and regression model used to assess the potential effect of each variable or combinations of variables. We also explain how we used JMP Custom Design to determine the best configurations for each subject, with the order of the configurations accounting for the subject’s learning over the study, as well as how qualitative data were collected by using an integrated questionnaire that was presented after all tasks were completed.     Hello.  I'm  Mark  Bailey.   I'm  here  with  my  colleague  Jacob  Rhyne, to  talk  about  some  recent  work  on  the  new user  interface  in  JMP  Custom  Design. The  primary  focus  of  our  work was  a  usability  study to  evaluate  the  new  interface. First,  I'm  going  to  talk   a  little  bit  about  the  current  interface and  our  motivation  for  the  change. We  think  of  experimentation  as  a  workflow. In  the  very  first  step,  we  define our  factors,  responses,  and  goals. This  critical  step  determines so  much  in  what  follows. Adding  factors  is  a  very  important  step when  you're  designing  an  experiment. We  learned  that  customers have  some  difficulty with  common  tasks  around  defining  factors. There's  confusion  about  adding, deleting  or  renaming  factors. They  find  it  difficult  to  change the  factor  type,  the  factor  levels, even  reordering  the  factors  in  the  list. Figure  2  shows  the  original User  Interface  for  comparison. The  goals  for  this  prototype were  to  decrease  confusion, to  reduce  the  number  of  operations  in  the time  that  it  took  to  achieve  the  result. Our  usability  study  wanted to  compare  different  controls that  might  be  used to  achieve  these  goals. Figure  3  shows  the  prototype for  the  new  user  interface. This  prototype is dynamically  configurable. That  is,  we  can,  that  will  turn different  user  controls  on  and  off. This  led  to  a  prototype with  384  possible  configurations. It  was  impossible  to  include  all of  them  in  a  usability  study. A practical  study  required a  smaller,  optimal  design. Looking  at  figure  3, especially  along  the  top, you  see  the  User  Controls that  we're  going to  primarily  focus  on in  our  usability  study. 
You see the Trash icon, the Delete icon, the Plus and Minus buttons, the Undo and Redo buttons, the Add N Factors control, and what we call our Buffet menu button. In figure 4, you see just two of the 384 possible configurations. In order to make sure that the usability study was efficient and effective, we used Custom Design to make most of the runs for our study. In this case, some of the User Interface controls were treated as binary factors; that is, they were turned on or off in the prototype. Other User Interface controls could have more than just two states. Each time, the subject was asked to perform several different tasks repeatedly with different configurations for comparison. In the table in the lower left, you see all of the factors that were included in our study. The first seven have to do with controls in the User Interface that were turned on or off or set to a different state. The last factor is the one that determined what we wanted them to do with a particular configuration. That is, we asked them to delete a factor, change the type of a factor, or change its position in the list. On the right, in figure 5, you see the completed Custom Design platform with the response defined. We're going to measure the time it takes for them to complete each task, and we're trying to minimize that. It also shows all of the factor definitions and, finally, the model that we want to use to quantitate the time to complete each task. I'm now going to turn it over to Jacob, who's going to talk about the resulting Custom Design, how it fit into our study, and our findings. Thank you, Mark. Mark has laid out how we used the Custom Design platform in JMP to set up our usability study. In figure 6, you can see all the runs in the study for the first participant. I'll note that we did include Participant and Run Order in our design as covariates because we wanted to include these in our analysis as well. You'll also note from figure 6 that we included a few of what we call special tasks in our study. These were tasks we knew would take longer to complete, and we didn't want to ask participants to do them more than once. If you look at the final tasks of the study, these were special exercises where we asked the participants to complete a series of tasks using a version of the prototype and the same series of tasks using the Custom Design platform. This allows us to get a direct comparison between the new configurable prototype and Custom Design. To apply this Custom Design, we developed a custom JSL application that allowed us to conduct an unmoderated usability test. Now I'll give an example of how this works in figure 7. From the top of figure 7, you can see that, for Run Order 2, the task was to delete. As you can see in the second part of the screenshot, the instructions the participant saw in the study instructed them to click the Launch Prototype button and then delete the factor charge.
Now I'll give an example of how this works in figure 7. At the top of figure 7, you can see that for Run Order 2 the task was to delete. As you can see in the second part of the screenshot, the instructions the participant saw in the study told them to click the Launch Prototype button and then delete the factor Charge. When the participant clicked the Launch Prototype button, the version of the prototype that appeared, shown in the third part of the screenshot, was entirely determined by the other factors in our study. For example, you can see the Trash icon is set to off, which means you don't see the Trash can icon down here in the prototype. The Delete icon is on, so you do see the Delete icon. For every exercise in our study, the version of the prototype that appears and the instructions that the participant sees are entirely determined by our Custom Design. Let's look at some of the qualitative results from our study. In our usability study script, we included an integrated questionnaire at the end, where we asked participants to give us some feedback on what they liked, what they didn't like, et cetera. I'll highlight a few of the options in figure 8. We asked participants whether they preferred using the Plus button or the Add N Factors control to add factors, and you can see from the top of figure 8 that participants overwhelmingly preferred the Plus button to the Add N Factors control. We also asked participants how they liked deleting factors, either using the Delete icon or the Trash icon. In this case, the Delete icon was overwhelmingly preferred. We also asked participants what their favorite row icon was, the icon you see beside the rows in the factor list. The Pin icon was the most popular, getting five of the 10 votes, compared to the Oval icon, which only got two of the 10 votes. Finally, if you look toward the bottom of figure 8, you can see that the participants overwhelmingly agreed that the user interface of the new prototype was easy to learn. The new interface was also greatly preferred to the original factor list in Custom Design. Now let's look at some of the quantitative results from our study. We got these results by using the Generalized Regression platform in JMP to fit models for the time it takes to complete each task, where the effects in the model are the different elements of the prototype. If you look at figure 9, the output of our generalized regression, you can see that the factors Row States, Trash icon, and Pin Column Type were found to have a significant impact on time. You can also see from the profiler that Run Order did not have a significant impact on time in this case. Let me draw your attention to the Pin Column Type factor. You can see that the time to complete the task of changing a factor's order was minimized when the Oval icon was used to complete the task. I'm pointing out the Pin Column Type factor specifically because, in this study, the way participants would change the order of a factor is that they would click the row icon beside the factor in the factor list and drag that icon to the appropriate spot. In this case, the Pin Column Type is what I wanted to focus on.
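As a rough sketch of this kind of fit, the JSL below models time to complete against a few of the interface factors. The authors used the Generalized Regression personality; this sketch substitutes ordinary standard least squares to keep the example simple, and all column names are hypothetical.

Fit Model(
	Y( :TimeSeconds ),                                       // response: task completion time in seconds
	Effects( :RowStates, :TrashIcon, :PinColumnType, :RunOrder ),
	Personality( Standard Least Squares ),
	Emphasis( Minimal Report ),
	Run
);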
Looking at figure 10, you can see that when participants were asked to tell us their favorite, only two of the 10 participants reported liking the Oval icon. Yet when it came to completing the task in which participants interacted with the icons the most, the Oval icon was actually the best-performing icon. Now let's look at the results for the task of deleting a factor. In this case, participants were asked, given a version of the prototype, to remove one factor from the factor list, and they could do that by using the Delete icon, the Trash icon, or, in some configurations, either one. Again, we fit models using generalized regression with time as the response, and you can see that the icon that was used and Run Order both had a significant impact on time. The time for this task was minimized when the Delete icon was used, as opposed to using the Trash icon or having the option to use both. In contrast to the previous slide, our quantitative results here match the qualitative results, because as you can see in figure 12, the participants overwhelmingly preferred the Delete icon to the Trash can icon. I'll end by commenting on the results of the last two exercises in the study, where we had participants complete a series of exercises using the prototype and then complete the same series of exercises using Custom Design. The instructions for the exercises were the same; the only difference is that in one case you use the prototype, and in the other case you use Custom Design. When participants were given the prototype, it took them an average of 68 seconds to complete the exercise. When participants used the Custom Design platform, it took them an average of 316 seconds. I'll also note that this 316 seconds is right-censored, because a couple of the early participants we gave the study to reported that they gave up. After this, we started popping up a notification after five minutes saying, "We thank you for completing this exercise. You can move on to the next one." That's all we have to share today. We want to thank you for your interest in our poster, and please let us know if you have any questions.
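A minimal sketch, assuming a 300-second cap, of how the elapsed task time and the right-censoring flag mentioned above could be recorded in JSL; the variable names and the message text are illustrative.

startTime = Tick Seconds();
// Pop a reminder after five minutes so a participant who is stuck can move on.
Schedule( 300, New Window( "Time check",
	Text Box( "We thank you for completing this exercise. You can move on to the next one." )
) );
// When the participant finishes (or gives up), record the result:
elapsed = Tick Seconds() - startTime;   // seconds spent on the exercise
censored = elapsed >= 300;              // 1 means the five-minute cap was reached (right-censored)
Show( elapsed, censored );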
This project studied JMP Black Belt exam patterns among students to understand exam difficulty and discrimination across platforms in JMP 16. Through the preliminary scoring distribution, the instructor can identify each exam question's difficulty and discrimination. These statistical parameters can then be used to determine which questions should be included in which level of examination, such as Green Belt or Black Belt. By comparing three different Black Belt STIPS exams with a dual plot, questions can be designed for different types of exams. The characteristic curves relate to the difficulty and discrimination properties of each question and are then compared with each student's ability and probability of answering correctly. Dual plots use the same scale to show the relative difficulty of each question against a group of students' ability. Parameter estimates are used to measure the difficulty and discrimination of each question. The sample questions in the three STIPS exams use four kinds of questions and four levels of difficulty and discrimination. By using the JMP Item Analysis platform, instructors can adapt any learning program and their instruction to accommodate different levels of students according to the difficulty and discrimination of test questions. Hello, everyone. My name is Kemp Wu. I come from Taiwan, and I work at Applied Materials in the Continuous Improvement Group. Today, I will present BB JMP Exam Item Analysis. First, a little background. In 2023, the statistical software used in the GB/BB training course changed to JMP, so modifying the GB/BB exam questions to be based on JMP is necessary. In this project, we use Item Analysis to study the 2022 BB JMP exam questions to support this change to the GB/BB exam questions. Item Analysis helps us identify the difficulty and the discrimination of each question. For GB, we need easier questions, because it is an entry-level requirement for beginners; we don't want too many hard questions in the GB exam. When we move to the BB exam, the questions should be harder and more discriminating. Last year, the BB JMP training program had three chapters: step one is Intermediate Statistics, step two is Graphical Analysis, and step three is MSA, PCA, and SPC. Based on that, we would like to identify each question's difficulty and discrimination, look for any opportunity to modify the training material and exam questions based on the results, and assess the capability of the questions to judge the ability of the trainees. These are last year's BB JMP exam results. Step two, Graphical Analysis, is very basic: reading and drawing graphs. Step three, MSA, PCA, and SPC, is more advanced. For the basic exam, the average score is higher; for the advanced one, the score is lower. If we rely only on the raw scores, it is very difficult to separate GB questions from BB questions. That's why we need JMP Item Analysis: it delivers more information for the analysis, such as each question's difficulty and discrimination.
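Before launching Item Analysis, each answer has to be scored as correct or incorrect. Here is a minimal JSL sketch, with a hypothetical table and column names, of turning raw multiple-choice answer columns into 1/0 indicator columns.

dt = Current Data Table();   // assumes the active table has one row per student
// Score one hypothetical question, Q1, whose correct answer is choice "D".
dt << New Column( "Q1Correct", Numeric, "Nominal",
	Formula( If( :Q1 == "D", 1, 0 ) )
);
// Repeat for the remaining questions, e.g. Q2 with correct answer "B".
dt << New Column( "Q2Correct", Numeric, "Nominal",
	Formula( If( :Q2 == "B", 1, 0 ) )
);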
How do we use the JMP Item Analysis application on the GB and BB exams? First, let's look at the characteristic curve. The characteristic curve shows the relationship between the probability of correctly answering an item and the student's ability, for different probability levels on the ability scale; it also shows the connection between the item's difficulty and discrimination attributes. There are two parameters in the characteristic curve. The first is the a parameter, which describes the discriminating power. The other is the b parameter, which describes the item difficulty. In the chart at the bottom left, the y-axis is the probability of answering correctly, from 0 to 1, where 1 means 100%. The x-axis is the ability, from minus 4 to plus 4; a higher number means more ability, and an item centered farther to the right is more difficult. For the a parameter, the steeper the curve, the more powerful the discrimination. The b parameter is about the difficulty: in this chart, the right-hand curve is more difficult than the left-hand curve. For example, for a student with ability zero, the right-hand curve gives only about a 10% chance of answering correctly, while the left-hand curve gives about a 90% chance. Looking at the example of question 24, its nearly vertical curve shows strong discrimination, while question 12 on the right-hand side shows difficulty, with its [inaudible 00:06:28] curve rising gradually from the bottom left to the upper right.
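In symbols, the characteristic curve just described is the two-parameter logistic (2PL) item response model,

$$P(\text{correct} \mid \theta) = \frac{1}{1 + e^{-a(\theta - b)}},$$

where \(\theta\) is the student's ability, \(b\) is the item difficulty (the ability at which the probability of a correct answer is 50%), and \(a\) is the discrimination (the steepness of the curve at \(\theta = b\)). The numbers that follow are only illustrative, not the exam's actual estimates: with \(a = 1\) and \(b = 2\), a student at \(\theta = 0\) has \(P = 1/(1 + e^{2}) \approx 0.12\), close to the roughly 10% chance quoted for the harder curve, while with \(b = -2\) the same student has \(P = 1/(1 + e^{-2}) \approx 0.88\), close to the roughly 90% chance for the easier curve.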
Next, we look at the parameter estimates. The parameter estimates report shows, for each question, the difficulty and the discrimination, both as numbers and as histograms. For the difficulty, an easy question is less than zero and a hard question is greater than zero. For the discrimination, a flat curve means the question has essentially no discrimination. In this example of our results for step one, we have nine questions shown in the parameter estimates, and the same numbers also appear in the characteristic curves. We separated the questions into groups based on difficulty and discrimination. Group 1 is the green zone: easy with no discrimination. You can see that the red vertical reference line is less than zero, so the question is easy, and the flat blue curve means there is no discrimination. Group 2 is easy with discrimination: the gray reference line is less than zero, and the curve is nearly vertical, so a student above that line is almost certain to answer correctly, while a student below it is very unlikely to. Group 4 is the hard questions with discrimination: the ability needed is above zero, again shown by the vertical reference line on the blue curve. Group 3, hard with no discrimination, had no sample in this step, so there is nothing to show for that category. Next we go to the Dual Plot. The Dual Plot report contains a plot that shows question difficulty and student ability on the same scale, on a y-axis running from minus 4 to 4. There are three zones. The first zone is the question difficulty: the higher a question sits, the more difficult it is, and the lower it sits, the easier it is. The middle zone is the student ability: the higher a student sits, the more able they are, and the lower they sit, the less able. How do we compare question difficulty with student ability? For example, compare student 2 with question 28: because student 2 sits higher than question 28, student 2 has a high probability of answering that question correctly. The opposite holds for student 1: comparing student 1 with question 15, it is very hard for student 1 to answer it correctly. The right zone shows the students as a histogram, here split into two groups, one from minus 2 to 0 and another from 0 to 2. In this example the distribution is bimodal: the part above zero is for the GB and the part below zero is for the BB exam. We also show the results of the three STIPS exams with dual plots. First we look at the cell plot to see the reference for each question: the x-axis is the questions and the y-axis is the students; red means the answer was right and blue means the answer was wrong. Then we come to the dual plots. For step one, the pattern shows that student 5 performed better than the others. For step two, there are two very good students. For step three, the students in this group are all better than zero; student 6 is not that good, but compared with student 1 there is a very large difference, which means student 1 is an outlier. From this we see four ways to improve. First, we can use ability zero to separate the questions: the upper side is for the GB and below zero is for the BB. Second, we can use discrimination to reveal different kinds of patterns; for example, in step one we can find the best-performing student, and in step two we can separate the people who studied from those who did not. Third is question difficulty: for one question, none of the students answered correctly, so maybe the training material needs to be improved, or it covers an area we did not teach the students. Last, we can also find which student needs extra coaching, such as student 1; maybe we need to pay more attention to him. All of these decisions are up to you: you can look at the dual plot to see whether the exam achieves your expectations.
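The dual plot, the characteristic curves, and the parameter estimates all come from the Item Analysis platform. A hedged JSL sketch of the launch is shown below, using the hypothetical scored-column names from the earlier sketch; the exact launch options may differ by JMP version, and the model choice is left at its default here.

// Launch Item Analysis on the 1/0 scored question columns.
Item Analysis(
	Y( :Q1Correct, :Q2Correct, :Q3Correct )
);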
Now I will show you real cases for each group of questions. Group 1 is easy with no discrimination. Here is the question and its answer choices; for this question, the answer is D. It is a very easy and simple JMP operation, and it can be found in the training material. You can see from the cell plot that most people answered it correctly. Group 2 is easy with discrimination. This question is about a parallel plot, and the answer is B, negative correlation. Even though the parallel plot is easy, if students don't really understand how to read a parallel plot, they cannot choose the correct answer. Next is group 3, hard with no discrimination. This question is hard, and the training material even guides you toward answer D. We can see that most people answered this question wrong, so maybe we need to reconsider whether the training material needs to be corrected. The last one is group 4, hard with discrimination. The answer choices all look very similar and are hard to decide between, but if you pay attention to the details of the case, you will understand that the answer is B. With Item Analysis, we can analyze each question's difficulty and discrimination and dig into the questions for continuous-improvement opportunities. Based on these question groups, we can deliver a GB and BB proposal. For the GB exam, we would use about 30% easy, non-discriminating questions as basic questions for the students, about 60% easy, discriminating questions to separate the students who have studied from those who have not, and about 10% hard questions for the high-performing students. For the BB exam, we would not use the basic questions, because it needs to be more difficult; we would use about 30% easy questions with discrimination, since not every question needs to be hard, and about 70% hard questions with discrimination. The hard questions with no discrimination are kept only for reference. Of course, the percentages can be adjusted depending on the exam requirements. Last, the takeaways: JMP Item Analysis can help organize GB and BB exam questions based on JMP; it helps detect the difficulty and discrimination of the exam questions so the instructor can identify any gap in teaching efficiency; the GB/BB training material can be modified based on the JMP Item Analysis models; and the approach can be proliferated to other internal training programs, such as the 8D program or Lean training. Thank you very much.
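As a small worked example of the proposed mix, the JSL below computes the number of questions of each type for a hypothetical 20-question GB exam and a 20-question BB exam; the exam lengths are assumptions, not part of the proposal.

// Proposed GB mix: 30% easy/no discrimination, 60% easy/discriminating, 10% hard.
nGB = 20;
gbEasyNoDisc = Round( 0.30 * nGB );   // 6 basic questions
gbEasyDisc   = Round( 0.60 * nGB );   // 12 questions that separate studied from not studied
gbHard       = Round( 0.10 * nGB );   // 2 questions for high performers
// Proposed BB mix: 30% easy/discriminating, 70% hard/discriminating.
nBB = 20;
bbEasyDisc = Round( 0.30 * nBB );     // 6 questions
bbHardDisc = Round( 0.70 * nBB );     // 14 questions
Show( gbEasyNoDisc, gbEasyDisc, gbHard, bbEasyDisc, bbHardDisc );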