JMP Discovery Summit Series: Abstracts
Measuring Change in MAT Prescribing by Project ECHO Participants Using Administrative Claims (2022-US-30MP-1157)
Monday, September 12, 2022
The Institute for Health Policy and Practice (IHPP) launched a new Project ECHO® Hub at the University of New Hampshire in 2019. Project ECHO is "an evidence-based method using web-based teleconferencing to link specialist teams with community-based sites to help community providers improve their ability to manage complex conditions (Ryer, West, Plante et al., 2020)." The Partnership for Academic Clinical Telepractice Medications for Addiction Treatment (PACT-MAT) was IHPP's first ECHO. It was formed through collaboration between IHPP and UNH's Department of Nursing, and its aim was to increase knowledge and confidence among Medication Assisted Treatment (MAT) prescribing providers. To evaluate the effectiveness of the PACT-MAT ECHO, the PACT-MAT team sought to analyze MAT prescribing practices for participants before and after their participation in the PACT-MAT ECHOs. IHPP's Health Analytics and Informatics team was brought into the project to facilitate data use permissions and aggregate administrative claims data. UNH's Department of Mathematics and Statistics provided statistical analysis and modeling using JMP. This presentation provides an overview of our experience using healthcare claims data to measure the impact of an innovative model such as Project ECHO, and highlights our use of JMP for the final analysis. Source: Ryer J, West K, Plante E-L, et al. Planning for Project ECHO® in NH: The New Hampshire Project ECHO Planning for Implementation and Business Sustainability Project Summary Report. NH Citizens Health Initiative, Institute for Health Policy and Practice; 2020.

Hello, and thank you for joining us today for our presentation, "Measuring Change in Medication Assisted Treatment for Participants in Project ECHO Using Administrative Healthcare Claims." We are excited to share how we used JMP as a key tool in our analytic work. My name is Erica Plante, and I am a senior scientific data analyst at the Institute for Health Policy and Practice at the University of New Hampshire. I am joined by Dr. Michelle Capozzoli, Senior Lecturer in the Department of Mathematics and Statistics, also at the University of New Hampshire. Neither Michelle nor I have conflicts of interest to disclose.

Before we describe our work, I would like to give a brief overview of Project ECHO. Project ECHO was founded in 2003 by Dr. Sanjeev Arora at the University of New Mexico. Dr. Arora is a physician specializing in gastroenterology and hepatology. He was seeing patients with hepatitis C die at alarming rates because they could not access care for this treatable disease in a timely manner. He sought to bring providers together to form a community of practice where doctors and other specialists could learn from each other. ECHO is an "all teach, all learn" model, and the sessions are often centered around a key issue or condition. The University of New Hampshire launched its Project ECHO hub in 2018 and has since produced a number of ECHO programs, including the Partnership for Academic Clinical Telepractice, Medications for Addiction Treatment, or PACT-MAT.

The primary goal of the PACT-MAT ECHO is to increase the number of nurse practitioner students in graduate and postgraduate programs who receive waiver training, apply for the waiver, and subsequently prescribe MAT. Secondarily, the project seeks to increase provider self-efficacy in managing patients with substance use disorder (SUD).
The program developed a learning community that fostered a culture that understood addiction as a chronic disease and was prepared to address the range of issues that emerge during treatment. Specifically, the program focused on all participants becoming proficient and culturally competent in prescribing MAT and treating SUD, as well as enhancing the capacity and quality of services available to patients in their communities through their providers.

After the first PACT-MAT session was completed, the team wanted to answer some questions about the PACT-MAT ECHO through claims data analysis. Here is some core information about the analytic project. The analytic period of interest ran from 2018 through June 2020, to capture data before and after the first PACT-MAT ECHO session. The project was funded by the Substance Abuse and Mental Health Services Administration (SAMHSA) as part of a $150K, three-year grant, and the principal investigator was Dr. Marcy Doyle. We wanted to ask a few questions about the ECHO program itself and whether provider practices changed after participating in the ECHO. The Center for Health Analytics and Informatics (CHA) at the Institute for Health Policy and Practice at UNH is fortunate to have access to healthcare administrative claims and enrollment data for commercial and New Hampshire Medicaid policies. The CHA team was therefore brought in to collect and aggregate the data, and UNH's Department of Mathematics and Statistics was brought in to build models and perform the analysis.

Our core research questions were: Did the PACT-MAT ECHO series have an impact on participants' MAT prescribing practices? And can we successfully perform a case/control study on providers using administrative claims data?

When collecting healthcare claims, we included all members ages zero to 64 who had medical and pharmacy enrollment in the month of interest. PACT-MAT participants self-reported their name, titles, NPI, organization name and address, and waiver status. We cross-referenced that data against CMS's National Plan and Provider Enumeration System, also known as the NPPES Registry. Where the self-reported data and the NPPES registry disagreed, the self-reported data was considered the most up to date and was used for the analysis. Information on our control group was sourced only from the NPPES.

Claims were flagged if one or more service lines included an MAT procedure code or drug code; they were also flagged if an opioid-related disorder diagnosis code was found. Medical providers were classified as having billed for MAT if at least one service line included their NPI and at least one of the MAT CPT codes. Prescribing providers were classified as having prescribed MAT if at least one pharmacy service line included their NPI as the prescriber and at least one of the MAT NDC codes. The case and control populations each had two pairs of datasets, one pair for each insurance type (commercial or New Hampshire Medicaid). The first dataset included the provider's NPI and information, as well as monthly totals of providers, patients, and claims: patients with opioid use disorder (OUD), patients with any medication assisted treatment, and patients with both OUD and MAT.
The second dataset provided the same data as the first, aggregated at the member demographic level, such as age category, county, and sex. No identifiable member data was supplied to the statisticians. Now I will pass the presentation to my colleague, Dr. Capozzoli.

Thank you, Erica. Once we obtained the data, it was analyzed by Rebecca L. Park, a UNH master's student, under the supervision of myself and Dr. Philip Ramsey. We received three datasets: one with practitioner demographics such as name, National Provider Identifier, title, and practice address, and two claims datasets, one for Medicaid and one for commercial. The original data was extremely large for both commercial and Medicaid: for each practitioner, it tracked each of their MAT patients' history over the study period. The original thought was to use the pattern of behavior of their patients over time. It quickly became apparent that this approach was problematic, and we also had to respect privacy rules requiring that all identifying markers be removed. So we homed in on a few of the demographic variables: the National Provider Identifier, the title, and the city of the practice. From the claims data we focused on the month and year, and the phase of the program: Pre (before ECHO), Ongoing (during ECHO), and Post (after ECHO).

We then aggregated the data, so instead of looking at every single visit, we looked at months and patient totals: the total number of patients for that practitioner during that month, the number of patients with opioid use disorder, the number of patients who had any MAT, and the number of patients with OUD who had any MAT. When comparing Medicaid with commercial care during exploratory analysis, it became apparent that we needed to focus on the Medicaid data because of the low patient numbers in commercial care; on average there was about one patient per month with any MAT, and such sparse data is not conducive to fitting models. We also had to reduce the initial 20 providers down to nine, because some providers had too many months of missing data, some had no MAT patients in most months, and, as we started to fit models, it became apparent that we also needed a minimum of 10 total patients per month. So we ended up with a much smaller sample size than we had originally expected. Another thing we noticed while exploring the data with the tools JMP provides is that the total number of patients differed across the nine practitioners, ranging from about one to 161 total patients.
Between that and the trend over time, we looked at the average total number of patients (the blue line), which in general increases over time. We also looked at the average number of patients with opioid use disorder, those who had any MAT, and those diagnosed with OUD who had any MAT; all four show a similar increasing trend. Because what we want to know is whether the practitioners are increasing their prescribing, not simply seeing more patients over time, we normalized the data by creating a new variable: the proportion of patients who had any MAT out of the total number of patients. We used patients with any MAT, rather than OUD, because an OUD diagnosis carries a certain stigma, so we felt it was better to look at any MAT.

Analysis considerations: from the beginning we knew we had a small sample of practitioners, only nine, and the graphical representations of the data made clear that we also had a lot of noise. Taking this into consideration, we decided to attempt several different approaches. The first approach was a simple means comparison: ANOVA and matched pairs. Then we brought in the time variable and looked at the data in several ways, starting with simple linear regression. We considered structural equation models; Dr. Laura Castro-Schilo from JMP had come to one of Dr. Ramsey's classes and given a talk on these models, and originally we thought they might be appropriate, but it quickly became apparent that they were not working for us given the difficulties we had. Next we worked with segmented regression, chosen because previous work with claims data suggested it would work well for pre/post comparisons, and we have pre, ongoing, and post phases. We also looked at exponential regression and logistic regression, and at generalized regression models, including regular and zero-inflated Poisson, binomial, and negative binomial models; we considered the count-based models because the data are inherently counts. Today I will focus on the means comparison, the segmented regression, and the zero-inflated Poisson, both for the ECHO group alone and for the matched pairs comparison of the control group with the ECHO group.

The first analysis ignores the time variable and simply compares averages: the average proportion of patients in the pre, ongoing, and post phases. From the means and the graph it is quickly apparent that the pre phase is definitely lower than the ongoing and post phases. Because we have a small sample of practitioners, we also checked the variances and noted an issue with unequal variance, so we used Welch's test instead of the traditional ANOVA to test for differences between the phases, and there clearly are differences.
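The authors ran this comparison in JMP's Oneway platform; as a rough illustration of the same idea outside JMP, the sketch below computes Welch's ANOVA (unequal-variance F test) for the pre/ongoing/post proportions. The column names and the simulated data are placeholders, not the study's data; for the follow-up all-pairs comparison, something like statsmodels' pairwise_tukeyhsd could play the role of JMP's Tukey-Kramer test.

```python
# Minimal sketch of Welch's ANOVA for pre/ongoing/post proportions.
# Data and column names are made up for illustration only.
import numpy as np
import pandas as pd
from scipy import stats

def welch_anova(groups):
    """Welch's F test for k groups with unequal variances."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                              # precision weights
    grand = np.sum(w * m) / np.sum(w)      # weighted grand mean
    num = np.sum(w * (m - grand) ** 2) / (k - 1)
    h = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    den = 1 + 2 * (k - 2) * h / (k ** 2 - 1)
    F = num / den
    df1, df2 = k - 1, (k ** 2 - 1) / (3 * h)
    return F, df1, df2, stats.f.sf(F, df1, df2)

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "phase": np.repeat(["pre", "ongoing", "post"], 30),
    "prop_mat": np.concatenate([rng.beta(2, 12, 30),   # illustrative proportions
                                rng.beta(3, 9, 30),
                                rng.beta(3, 9, 30)]),
})
groups = [g["prop_mat"].to_numpy() for _, g in df.groupby("phase")]
F, df1, df2, p = welch_anova(groups)
print(f"Welch F({df1:.0f}, {df2:.1f}) = {F:.2f}, p = {p:.4f}")
```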
To determine statistically which phases differed, we used the All Pairs Tukey-Kramer test. It confirmed that the pre phase was different from the ongoing and post phases, while ongoing and post were very similar. This indicates that we do have some differences; the ECHO program is making a difference.

The segmented regression, as I noted before, was suggested because we have three phases: pre, ongoing, and post. Dr. Ramsey suggested we use a script written by David Burnham of Pega Analytics. If you are interested in this script, the link is on his website; it gives you the code along with a very detailed, line-by-line description of what the code does. Running the script, we were able to fit separate regressions to the three phases, get the fit and an R² for each, and test the significance of the slopes. Again we see the pattern we saw with the ANOVA: the pre phase is definitely lower than the ongoing and post phases. Unfortunately, the slopes for all three phases were not significant, and you can also see a significant amount of variability.

Next, we considered the fact that we had two categories of practitioners: nurse practitioners and physician assistants, and then physicians. Even though the sample sizes were small, we decided to see whether we could detect some kind of signal. When we looked at the nurse practitioners, you again see the pre phase lower than the ongoing and post phases, and for the pre phase we see marginal significance (p = 0.07), suggesting there is a signal. For the ongoing and post phases we do not see significance, but we do see the same trend. Interestingly, when we looked at the physicians, the slopes were clearly not significant and there did not seem to be any difference across pre, ongoing, or post. So while it may not be statistically significant, of practical interest is the fact that the ECHO program does seem to be benefiting the nurse practitioner and physician assistant group.

The next thing we tried was the zero-inflated Poisson model. We chose the Poisson model because we had counts, and we fit both the regular and the zero-inflated versions, choosing the zero-inflated one because we had a lot of zeros in our data. With this model you see a slight increasing trend in the proportion of MAT patients over time, and in the parameter estimates, month is significant. Note that the zero-inflation parameter is essentially zero, so the zero-inflation component was not doing much, but it was informative for us. Unfortunately, when we evaluated the fit of the model, it quickly became apparent that it was a poor fit; the generalized R², for example, is very low.
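For readers who want to reproduce the idea of the zero-inflated Poisson fit just described outside JMP's Generalized Regression platform, here is a rough Python sketch using statsmodels. The data, predictor, and settings are invented for illustration and are not the study's model.

```python
# Rough sketch of a zero-inflated Poisson count model with month as predictor.
# Simulated data; stands in for JMP Generalized Regression's ZI Poisson fit.
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(3)
month = np.arange(60, dtype=float)
lam = np.exp(0.5 + 0.02 * month)           # mean count rises slowly over time
counts = rng.poisson(lam)
counts[rng.random(60) < 0.2] = 0           # extra zeros to motivate the ZI part

X = sm.add_constant(month)                 # intercept + month for the count model
zip_fit = ZeroInflatedPoisson(counts, X,
                              exog_infl=np.ones((60, 1)),   # constant-only inflation
                              inflation="logit").fit(maxiter=200, disp=False)
print(zip_fit.summary())
```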
Looking at the actual versus predicted plot for this model, the predicted values range between about 0.1 and 0.4 while the actual data range between zero and one, so the data are really stretching the model. What we would have liked to see is the predictions following the identity line. Having said that, we did see some trends, and even without statistical significance we had some practical interpretation.

Moving on, we did have a control group. We took nine control providers and compared them directly to the nine ECHO providers; they were matched on title and city. We matched whether they were a nurse practitioner, physician assistant, or physician, and the primary city of their practice, because the demographics are very different as you move from southern to northern New Hampshire, and we wanted to capture that. For the matched pairs test we created a confidence interval for the difference in the proportion of patients with any MAT out of the total number of patients, control versus ECHO providers. First, zero is not in the confidence interval, so we do have a difference. Looking at the actual means, the treatment (ECHO) group has about a 0.20 proportion of patients with MAT, while the control group has only 0.13, a difference of 0.07. So the ECHO group is prescribing MAT more frequently than the control group.

Next we brought the time variable back in and fit the zero-inflated Poisson model with both groups, the ECHO group in red and the control group in blue. Graphically we see an increase over time, and it is evident that the ECHO group is slightly higher in its proportion of prescribing than the control group. The parameters in the model are significant. Unfortunately, when we assessed the model we got a very similar result as with the ECHO group alone, which is not surprising since we are using similar data: the generalized R² is very low, and on the actual versus predicted plot the model predicts between roughly 0.05 and 0.4, whereas we were hoping it would predict along the full range from zero to one.

Findings: overall, we were able to detect a difference in provider prescribing patterns before and after they participated in Project ECHO. We saw a small difference in prescribing patterns between the providers who participated in Project ECHO and those who did not. One thing we did not control for was the total number of patients, so that may be something to consider later. We also noted that the impact of Project ECHO may differ by provider title, and we need to delve into that further. This is an ongoing project, and these are the next steps we are considering. First, we would like to include additional providers to increase the size of the database.
One way we are looking at doing that is to include more of the ECHO periods; one cohort was finishing up this year, so we hope to add practitioners from that period. We may also want to reconsider the methodology for detecting MAT in the medical and pharmacy claims data. In addition, we would like to analyze at the practice level with case/control studies to help combat the small sample size. And again, the overall goal is to fit an appropriate model.

Thank you for taking the time to listen to our presentation on the PACT-MAT ECHO. If you have any questions, please contact us at the emails shown. Enjoy the rest of your conference. Thank you.
Labels:
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Exploring JMP DOE Design and Modeling Functions to Improve Sputtering Process (2022-US-30MP-1126)
Monday, September 12, 2022
In an AMAT Six Sigma Black Belt project on the PVD sputtering process, the project goals were to optimize several film properties. A Pugh Matrix was used to select the most feasible hardware design. PVD sputtering process parameters (Xs) were chosen based on the physics of the PVD sputtering process. To improve the design structure, definitive screening design and augmented design were used to avoid confounding at Resolution II and III. An RSM model and least squares fit were used to construct predictive models of these film properties. In addition to main effects, several interaction effects were found to be significant in the reduced RSM model. Each interaction effect uncovered potential insights from two competing PVD sputtering physics models. To further optimize the multiple-step sputtering process, a group orthogonal supersaturated design (GOSS) DOE was used to optimize each sputtering process block. The film properties (Ys) compete with each other when searching for the optimal design. Three JMP techniques were used to find the optimal robust design: set up a simulation-based DSD to optimize the desirability functions; conduct Profiler robust optimization to find the optimal design; and run a Monte Carlo simulation to estimate the non-conforming defect percentage. By using these JMP 16 modeling platforms and techniques, this Black Belt project was very successful, not only in improving the film properties but also in furthering the understanding of how the physics interact in the process. This multifaceted JMP DOE, modeling, and simulation-based approach became the benchmark for several new JMP projects taking a similar approach.

Hello, everyone. This is Cheng Ting, and the other presenter today is Cui Yue. Today we are going to share our experience of exploring JMP DOE design and modeling functions to improve the sputtering process. First, a brief background and introduction. This is a Black Belt Six Sigma DMAIC project on the sputtering process, and the project goals were to optimize several film properties. In the Define phase, we identified three CTQs and three corresponding success criteria. CTQ1 is the most important and most challenging one; its success criterion is a measurement result larger than 0.5. CTQ2 and CTQ3 are equally important; the success criterion for both is a measured result less than 0.05.

Different JMP tools were applied extensively throughout the Measure, Analyze, and Improve phases, and we will share our experience of using them here. In the Measure phase, we did an MSA and finalized the data collection for the baseline hardware. Three tuning knobs, X₁, X₂, and X₃, are involved. After data collection, we used a Monte Carlo simulation in JMP to analyze the baseline capability; this is the first tool we introduce today. To establish the baseline model, we used augmented DOE, RSM, the prediction profiler, and the interaction profiler in JMP. In the Analyze phase, for root-cause analysis and capability analysis, we used the goal plot, desirability functions, multivariate methods, and graphical analysis tools. In the Improve phase, we did a hardware modification in which another tuning knob, X₄, was introduced; the interactive graph, augmented DOE, RSM, desirability functions, and interaction profiler were used in this section to further improve the process.
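As a rough stand-in for the Monte Carlo baseline-capability check mentioned above (JMP's simulator and capability platforms), the sketch below pushes random draws of two inputs through a toy transfer function and computes a one-sided Ppk and the percent out of spec. The transfer function, input distributions, and spec limit are illustrative assumptions, not the project's model.

```python
# Monte Carlo capability sketch: simulate inputs, evaluate a toy model for CTQ1,
# then report Ppk against a lower spec limit and the % nonconforming.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(10.0, 0.5, n)        # assumed input distributions
x2 = rng.normal(5.0, 0.2, n)
ctq1 = 0.03 * x1 + 0.04 * x2 + rng.normal(0, 0.02, n)   # toy model, not the real one

lsl = 0.5                            # CTQ1 success criterion: result > 0.5
mu, sigma = ctq1.mean(), ctq1.std(ddof=1)
ppk = (mu - lsl) / (3 * sigma)       # one-sided Ppk (lower spec only)
pct_fail = (ctq1 < lsl).mean() * 100
print(f"Ppk = {ppk:.2f}, {pct_fail:.1f}% below the lower spec limit")
```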
The GOSS stepwise fit and desirability functions were used in the robust DOE modeling to further improve the process; some of these tools will be demonstrated by Cui Yue later. The Control phase is not covered in today's presentation.

As mentioned, in the Measure phase, after the baseline data collection, we used the Monte Carlo simulation to understand the baseline process capability. These are the baseline capabilities for the three CTQs. All of them are normally distributed, indicating a strong linear correlation with the input parameters. For CTQ1, the sample mean is outside the spec limit, so we have a very negative Ppk value; 99% of the baseline process results cannot meet the CTQ1 spec at all. Among the three CTQs, CTQ2 is closest to spec: its sample mean is below the upper spec limit, so it has a positive Ppk, but 48% of the baseline process results still do not meet the CTQ2 spec. The sample mean for CTQ3 is out of spec as well, so it has a negative Ppk, and 64% of the baseline process results did not meet the spec. This baseline capability confirmed that the biggest challenge for this project is CTQ1.

Clearly the baseline process condition, which we call process condition 1 here, cannot meet our CTQ success criteria, especially for CTQ1, so we need to tune the process condition. Before that, we need to know whether the current hardware can meet the requirement at all. The subject matter experts (SMEs) proposed two hypotheses and advised us to shift the process to condition 2 based on the second hypothesis. Before doing so, we needed to check whether the prediction model is good for both conditions, so we used a scatter plot to check the structure of the collected data. As seen here, the collected data is not in an orthogonal structure. This is because we used a two-step evaluation design and widened the process range to meet the CTQ1 success criterion. We have weak prediction capability in the widened area, but we still have good prediction for condition 1 and condition 2. We also did a confounding analysis; there is a certain confounding risk at Resolution II between X₁ and X₃.

Nonetheless, we built a prediction model using the response surface method, so the main effects, interactions, and quadratic terms are fitted together. Based on the R square, the adjusted R square, and the p-value, it is a valid prediction model. From the effect summary, only the significant terms are included in the model. The interaction profiler shows two interactions that correspond to the two hypotheses mentioned before. With the prediction profiler, we picked process condition 2. At this condition, the 95% confidence interval for CTQ1 lies between 0.5 and 0.6, so this CTQ has been tuned into spec; however, CTQ2 and CTQ3 are now out of spec. We therefore used the goal plot to compare the two process conditions and realized that while CTQ1 improved from condition 1 to condition 2, getting closer to target with a smaller standard deviation, CTQ2 and CTQ3 were compromised, with larger standard deviation and greater distance to target. There is a tradeoff between the three CTQs, so we tried to find an optimized solution with the desirability functions in JMP.
For CTQ1, the success criterion is more than 0.5, so we used the maximum plateau shape when setting the desirability function: any value greater than or equal to the target is equally preferred. We also highlighted the importance of CTQ1 by giving it an importance factor of 1.5. For CTQ2 and CTQ3, the success criterion is less than 0.05, so we used the minimum plateau shape: any value less than or equal to the target is equally preferred. However, after maximizing the desirability, the calculated optimum was only about 0.02, and none of the three CTQs could meet its success criterion. We therefore concluded that there is a hardware limitation.

After discussion with the SMEs, we decided to introduce Y₄ into the data analysis. Y₄ is not a CTQ; it is a measurement that reflects an intrinsic property of the process, and this intrinsic property affects CTQ2 and CTQ3 directly. If Y₄ is greater than zero, it is a positive process; if Y₄ is less than zero, it is a negative process; and if Y₄ is close to zero, we call it a neutral process, which leads to smaller CTQ2 and CTQ3 together. Here is the distribution of Y₄ for the baseline hardware: Y₄ is always greater than zero, so we always have a positive process. The multivariate graph shows the relationship among Y₄, CTQ2, and CTQ3; they are strongly correlated, so a smaller Y₄ gives smaller CTQ2 and CTQ3.

To obtain a wider range of Y₄, we decided to add another factor, X₄, in the improved hardware, and the SMEs proposed another two scientific hypotheses. We collected data on the new hardware and compared the Y₄ distributions for the two hardware configurations. On the baseline hardware, without X₄, data was collected orthogonally over a certain range of each factor; that is the Y₄ distribution under those conditions. On the improved hardware, with X₄ introduced, we collected data over the same ranges of X₁, X₂, and X₃, and this time Y₄ at different X₄ values was also collected. Comparing the two distributions: without X₄ there is a single cluster of Y₄ with its peak above zero, while with X₄ we observe a bimodal distribution, one peak with mean above zero and another with mean below zero. The process conditions that make Y₄ less than zero drew our attention: under those conditions we have a negative process, which may help improve CTQ2 and CTQ3 if we cannot meet all CTQs in a single process, because a neutral process benefits CTQ2 and CTQ3. We did a simple screening of the process conditions that give a negative process, which led us to a certain range of X₄; that is why we collected more data in this range, our condition of interest. We concluded that X₄ does impact Y₄ and can therefore impact CTQ2 and CTQ3. We can now study the impact of X₄ on CTQ1 and build another model for the improved hardware. Prior to data collection, we prescreened the conditions of interest using the interactive graph in JMP; we collected more data in a certain range of X₄ because that range gives negative Y₄ values and also covers most of the range of Y₄.
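To make the desirability setup described above concrete, here is a rough Python sketch of plateau-style desirability functions combined with a weighted geometric mean, in the spirit of the standard Derringer-Suich formulation that JMP's Profiler is generally assumed to follow. The targets and bounds are the values quoted in the talk; the ramp end-points and weights are illustrative assumptions.

```python
# Plateau desirabilities and a weighted geometric-mean overall desirability.
import numpy as np

def d_larger_plateau(y, low, target):
    """0 below `low`, ramps to 1 at `target`, stays 1 above (maximum plateau)."""
    return np.clip((y - low) / (target - low), 0.0, 1.0)

def d_smaller_plateau(y, target, high):
    """1 at or below `target`, ramps to 0 at `high` (minimum plateau)."""
    return np.clip((high - y) / (high - target), 0.0, 1.0)

def overall(ds, weights):
    ds, w = np.asarray(ds, float), np.asarray(weights, float)
    return float(np.prod(ds ** w) ** (1.0 / w.sum()))

d1 = d_larger_plateau(0.52, low=0.30, target=0.50)      # CTQ1: want > 0.5
d2 = d_smaller_plateau(0.08, target=0.05, high=0.15)    # CTQ2: want < 0.05
d3 = d_smaller_plateau(0.06, target=0.05, high=0.15)    # CTQ3: want < 0.05
print(overall([d1, d2, d3], [1.5, 1.0, 1.0]))           # CTQ1 weighted 1.5
```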
Before the data collection, as we can see here, this is not the most orthogonal structure, since we collected more data at the conditions of interest; but after a design evaluation we found a low confounding risk, so the data structure is still good for modeling. This is the model we constructed, and it is an adequate one: only factors with p-values less than 0.05 are included in the model, the R square is above 0.8, the difference between the adjusted R square and the R square is less than 10%, and the p-value for the whole model is well below 0.001. Through the interaction profiler, hypotheses 1 to 4 were validated.

Can we now find an optimized solution? Again we ran the desirability functions. On the left is the optimized solution for the baseline hardware before X₄ was installed, and on the right is the optimized solution for the improved hardware with X₄ installed. Compared with the baseline hardware, the improved hardware does provide an optimized solution with higher desirability and better results for each CTQ. However, the desirability is still low, only 0.27, and not all CTQs meet their success criteria in a single step, so we still did not find an adequate one-step solution for the project. But as mentioned previously, since we have a cluster of process conditions that gives a negative process with Y₄ less than zero, we can propose a potential two-step process.

A two-step solution is not as straightforward. If we could find the optimized solution in one step, all we would need to do is run the process at the conditions that give the maximum desirability, and the result would be predictable since we have a known model for it. With a two-step process, we have a known model for each step once its process condition is determined; however, because the durations of the two steps differ, we need a new model for the final result. In this new model there are nine variables in total: X₁ to X₄ for each step, plus the duration of each step. The question is how to find a proper solution for the two-step process. We have two strategies. The first is to do DSD modeling for the nine variables; in that case we would need at least 25 runs for the first trial. We would of course have an orthogonal data structure and could build the RSM model, but the cost would be very high. The other strategy is to screen first with the group orthogonal supersaturated design, the GOSS design. In this case we can screen the impact of seven variables with six runs per block; that is why it is supersaturated, with more variables than data points. Of course, we need to screen out two variables before the GOSS, and we again used the interactive graph for that; the details are covered in the next slides. The GOSS design provides two independent blocks, one for each process step; there are no interactions between factors across blocks, and the data structure is orthogonal within each block, making it possible to screen effects with a supersaturated data collection. However, the GOSS shows the impact of main effects only; no interactions are considered. It is a low-cost approach, and we can follow up with further DOE design afterwards, such as a DSD, an augmented design, or OFAT.
Each of these has its own pros and cons, which will not be covered in this presentation. To save cost, we decided to proceed with strategy 2, starting with a GOSS design. As discussed, we have nine variables, but the GOSS can only include seven, so to narrow down the parameters we did a simple screening with the interactive graph. For step 1, we chose process conditions that give a good CTQ2, and after screening we decided to fix X₂ based on previous learning. As seen here, when CTQ1 is greater than 0.5, we always have a positive process with Y₄ above 20%. Hence, for step 2 we chose process conditions that give Y₄ less than -0.5, so that we have a negative process; adding the two steps together, the final Y₄ is closer to zero, which improves CTQ2 and CTQ3. After screening, we decided to fix X₁ for this step, again based on previous learning.

After data collection, we did a stepwise fit with main effects only, since the GOSS considers only main effects. All three CTQs validated the model, with p-values less than 0.05, adjusted R square around 0.8, and VIF less than 5. After maximizing the desirability, the model provides an optimized solution with desirability above 0.96, far higher than 0.27, not to mention 0.02. We can therefore lock the process parameters and further refine the optimized solution in the next step, for which we chose OFAT; that will not be covered here.

To summarize what we discussed in this presentation: we shared our experience of using different JMP tools in data analysis through the different stages of the DMAIC project. For the baseline capability analysis, we used Monte Carlo simulation and the goal plot. For root-cause analysis, we used multivariate methods and graphical analysis. To support the DOE, we used augmented DOE, GOSS, and design diagnostics. To build good models and predictions, we used different fitting functions, the prediction profiler, and the interaction profiler; these profilers were used not only for modeling and prediction but also to gain a deeper understanding of the process itself. To screen out the conditions of interest, we used the interactive graph, which is simple but very useful and powerful. For decision making, we used the desirability functions. That is our experience of how JMP helped us do the analysis. Last but not least, we would like to thank Charles for his mentorship throughout the project. Thank you. My partner, Cui Yue, will now share a demonstration in JMP, showing how we use the interactive graphical analysis as well as the stepwise fitting in the GOSS model.

Okay, thank you, Cheng Ting. I think you can see my screen now. The first thing I would like to introduce is the interactive plot Cheng Ting just mentioned. This is actually one of my personal favorite functions in JMP: simple, but very powerful. Here, our purpose is to screen which factor is most related to Y₄.
Which one contributes the most to Y₄, from X₁ to X₄? We can simply select all the factors of interest and click OK. Now we have the distribution of all the factors. As Cheng Ting mentioned, we want to know what contributes most to Y₄ on the negative side, for example here. We can then see that only X₄ from 13 to 14, X₃ from 0 to 1, X₂ in this range (2.5 to 5), and X₁ at 19 to 20 can make this happen. This makes it easy to read off the range of each factor, how it contributes to Y₄, and how to choose the factors if we want Y₄ to reach a certain level. Similarly, if we want Y₄ slightly above 0%, we can simply click that area. To solve the problem in one shot, we just select these two regions together; then we can see that X₄ should be in roughly the 10 to 14 range, X₃ should definitely concentrate at 0 to 1, X₂ has a slightly wider distribution, and X₁ has only two candidate settings in this direction. From this, we can easily and intuitively find the contributing factors we want. That is the first function I wanted to demonstrate, the Distribution platform; it can do many other things, including viewing the data distribution and running various tests [inaudible 00:24:49], et cetera, so I won't go into them here.

The second thing I want to share is the GOSS, specifically the GOSS fit stepwise. We have three CTQs, and the stepwise analysis dialog for each is open here; all three have separate dialogs. The most straightforward way is to hold Control and click Go, which selects the factors for all three CTQs at once, using the same stopping rule, minimum BIC. Then we click Run Model, which gives us separate fits for CTQ1, CTQ2, and CTQ3. Next we reduce each model one by one. Our criterion is a p-value below 0.05; when the p-value is below 5%, the factor is significant. Here we can remove X₄; for CTQ1 we are then done. For CTQ2, we can remove X₁ and X₂, and now both remaining terms are below 0.05. For CTQ3 we do the same, removing X₄ first; now all factors have p-values below 0.05, and we have the reduced model.

At the bottom we have a prediction profiler; if you don't have it, you can add it from the Profiler menu. Then we want to find the optimum condition, using the desirability functions. The first step is always to set the desirability, which is already set here with the appropriate maximize and minimize goals. Then we use Maximize Desirability; if we use Maximize and Remember, here is our optimal condition. We can then run the process at this condition and validate the result again. Those are the two functions I wanted to introduce. Thank you.
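As a rough analogue of the "fit stepwise with minimum BIC" step demonstrated above, the sketch below does forward selection over main effects only, keeping the model with the lowest BIC. It is not JMP's GOSS machinery: the factor names and data are made up, and twelve runs are used (rather than a six-run supersaturated block) simply so that ordinary least squares stays well posed for the illustration.

```python
# Forward stepwise selection of main effects by minimum BIC (illustrative data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(12, 7)), columns=[f"x{i}" for i in range(1, 8)])
y = 0.8 * X["x2"] - 0.5 * X["x5"] + rng.normal(0, 0.2, 12)   # toy response

selected, remaining = [], list(X.columns)
best_bic = sm.OLS(y, np.ones(len(y))).fit().bic              # intercept-only model
improved = True
while improved and remaining:
    improved = False
    trials = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().bic
              for c in remaining}
    cand, bic = min(trials.items(), key=lambda kv: kv[1])
    if bic < best_bic:                                       # keep only BIC improvements
        selected.append(cand)
        remaining.remove(cand)
        best_bic, improved = bic, True
print("selected main effects:", selected, "BIC:", round(best_bic, 2))
```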
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Utilizing Several JMP Platforms To Conduct Measurement System Analysis in DMAIC Black Belt Project (2022-US-30MP-1106)
Monday, September 12, 2022
The goal of this Six Sigma BB project was to improve the pin gauge measurement capability for the diffuser hole size, which is critical to production failure analysis. Several JMP analysis objectives fall in the Measure phase. First, compare the Excel Xbar-R method with the JMP ANOVA crossed method. Second, choose the GRR criterion between the precision-to-tolerance (P/T) ratio and the precision-to-total-variation (P/TV) ratio. Third, use SPC control charts to monitor GRR stability, determine the process long-term sigma for calculating the GRR P/TV ratio, and identify the rational subgrouping for the process. Fourth, evaluate the pin gauge wear risk to determine whether the pin gauge measurement is a destructive test and whether the GRR nested model should be used. Finally, use GRR misclassification to assess both the alpha and beta risks to production yield and cost. JMP platforms helped us execute this Six Sigma BB project effectively.

Hello, everyone. My name is Kemp, from Taiwan; I work at Prime Material in global continuous improvement. The other author is Wayne, also from Taiwan; he works at Prime Material as a customer quality engineer. Today we will present how we utilized several JMP platforms to conduct measurement system analysis throughout the project. The project concerns measurement system errors between supplier A and supplier B. I will cover the first two topics, the introduction and the supplier A measurement system, and then Wayne will cover the next two. In the introduction, I will go through the background, the SIPOC and in/out scope analysis, and the relation of the MSA CTQs. For the supplier A measurement system, we will cover the Excel Xbar-R versus JMP ANOVA crossed methods, the Gauge R&R criteria (P/T and P/TV ratios), the variability chart and Gauge R&R mean plots, and Gauge R&R misclassification. For the supplier B measurement process, we will cover the C&E diagram, the Fit Y by X contingency platform, the one-sample t and sample size/power tests, and Gauge R&R. The conclusion will show the SPC control chart for Gauge R&R and a summary of the analysis, followed by the takeaway learnings from JMP.

First, the introduction. The project was based on the problem statement and the voice of the customer (VOC). Both supplier A and supplier B measured the same parts, and supplier B unexpectedly got much worse results than supplier A. The customer requested that we find the gap and validate the measurement systems of suppliers A and B. We then used the SIPOC to understand our CTQs and found three. CTQ1: standardize the supplier B measurement, which must meet the criterion of resolution less than 10%. CTQ2: verify the gap between supplier A and supplier B; the bias must be zero. CTQ3: supplier B Gauge R&R, which must meet the criterion of a ratio less than 30%. This is the SIPOC structure: we start with the customer voice, work back to the output CTQs, and then back to the process items; all of the processes concern the measurement system. The sequence runs from calibration to operation: pin gauge calibration, then sampling of the diffuser holes for measurement, and finally validating hypotheses in the same way as the PQP Excel file. As you can see, this is all about the MSA. The following slide shows the relation among the SIPOC, the VOP, and the customer VOC. There are two measurement outputs, one for supplier A and one for supplier B.
Our goal is that the supplier A and supplier B measurement systems give the same results. Measurement process variation has two parts: accuracy and precision. Accuracy covers bias, linearity, and resolution (stability is not in our scope); these map to CTQ1 and CTQ2. CTQ3 addresses the precision problem: repeatability and reproducibility.

Now the second topic, the supplier A measurement system. For our Gauge R&R we use continuous data, the crossed layout, and a non-destructive test. What is the Xbar-R method versus the ANOVA method? The Xbar-R method is the one in the PQP Excel file. It has two disadvantages: first, it cannot detect the part-to-operator interaction; second, it assumes the Gauge R&R data follow a normal distribution, so it is affected by any outlier, and it is also not good when a sample has more than five data points. The ANOVA method not only detects the part-to-operator interaction but also uses the standard deviations directly, so normality does not have to be assumed. In our project the interaction is high, so we must use the ANOVA method for the Gauge R&R analysis.

First, we took the raw data from the PQP file, which uses 10 data points for the study; as I just mentioned, Xbar analysis is not good for more than five data points. We still loaded the same data into JMP ANOVA as well. The first summary, from the PQP Excel file, tells us there are two ratios for Gauge R&R: P/TV is 24%, which is marginal [inaudible 00:06:52], and P/T is 9%, which is pretty good. What is the difference between P/TV and P/T? There are three sources of variation: equipment variation (repeatability), operator variation (reproducibility), and part variation, which depends on the sample selection. In the P/TV definition, P is the precision (Gauge R&R), equal to EV plus AV, and TV is the total variation, equal to EV plus AV plus PV. There is a risk here: the gauge sample range strongly affects the P/TV result. In the P/T definition, T is the tolerance, i.e., the part spec range. When the spec is well defined, the P/T ratio is our preferred Gauge R&R success criterion.

For the second summary [inaudible 00:08:17], JMP has no Xbar-R method, only ANOVA. There are two models: the main effect model and the crossed model. The main effect model has no part-to-operator component; that variation is mostly assigned to repeatability and attributed to machine issues. The crossed model, on the other hand, can correctly derive the part-to-operator interaction component, which tells us when the Gauge R&R problem is a process issue. We prefer the crossed model over the main effect model. On the Gauge R&R variability chart, the left side shows the measurement means; here we want to see more than 50% of the points outside the control limits, which indicates the gauge is capable of detecting part-to-part variation [inaudible 00:09:32]. In our case 100% are outside the limits, which is good. In the standard deviation chart, operator B on part A has repeatability issues: 25% of the data are at zero, which indicates the measurement resolution is not adequate.
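As a rough illustration of the ANOVA (crossed) Gauge R&R calculation described above, the sketch below estimates variance components from a parts-by-operators ANOVA and forms P/T and P/TV. The data, spec limits, and the 6-sigma multiplier are illustrative assumptions; JMP's study-variation multiplier and any negative-component handling may differ in detail.

```python
# Crossed Gauge R&R via the ANOVA method: variance components, then P/T and P/TV.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(6)
parts, ops, reps = 10, 3, 3
rows = []
for p in range(parts):
    true = 90.0 + 0.06 * p                        # toy part-to-part variation
    for o in range(ops):
        bias = 0.01 * o                           # toy operator effect
        for r in range(reps):
            rows.append({"part": f"P{p}", "operator": f"O{o}",
                         "y": true + bias + rng.normal(0, 0.02)})
df = pd.DataFrame(rows)

aov = anova_lm(smf.ols("y ~ C(part) * C(operator)", data=df).fit())
ms = aov["sum_sq"] / aov["df"]                    # mean squares
var_rep = ms["Residual"]                                          # EV (repeatability)
var_int = max((ms["C(part):C(operator)"] - ms["Residual"]) / reps, 0)
var_op  = max((ms["C(operator)"] - ms["C(part):C(operator)"]) / (parts * reps), 0)
var_prt = max((ms["C(part)"] - ms["C(part):C(operator)"]) / (ops * reps), 0)

grr, tv = var_rep + var_op + var_int, var_rep + var_op + var_int + var_prt
usl, lsl, k = 90.7, 89.9, 6                       # assumed spec and sigma multiplier
print(f"P/T  = {k * np.sqrt(grr) / (usl - lsl):.1%}")
print(f"P/TV = {np.sqrt(grr / tv):.1%}")
```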
On the Gauge R&R mean plot by operator, operator C measures higher than operators A and B. On the measurement-by-part plot, the data go up and down, indicating large differences between parts. Also, the sample range is 1.2, equal to only 20% of the tolerance range, which is too narrow and results in a higher P/TV ratio. The last plot is the part-to-operator interaction: on part 7, operator B's line crosses those of operators A and C, indicating an interaction between appraiser and part.

Finally, the last part of the Gauge R&R: type I and type II errors. A type I error is a false reject, the alpha risk: on the manufacturer's side, a part [inaudible 00:11:04] is good but we reject it. A type II error is a false accept, the beta risk: on the customer's side, a part [inaudible 00:11:16] is bad but we accept it. Here we see that alpha is zero and beta is 39%. If that is still not clear, here is an example: the manufacturer produces 300 parts, 200 good and 100 bad. With a beta risk of 39%, 39 of the bad parts would be delivered to the customer, which is not acceptable. We need to consider adjusting the spec limits to balance the alpha and beta risks. In our data the Gauge R&R samples are all good (100%), and only beta risk was observed. Here is the Gauge R&R summary: we recommend the P/T ratio instead of the P/TV ratio, and the JMP crossed method to allocate the Gauge R&R errors. Even when P/T looks good, we still have to watch the operator-to-part interaction and the misclassification for the sake of part quality. That finishes my part; the next two sections will be presented by Wayne.

Hello, this is Wayne. Let me cover the last two sections on the supplier B measurement process. This is the C&E (fishbone) diagram, used to identify potential causes across the standard procedure, supplier B, and supplier A. After discussion with engineering, we concluded there are five potential causes to standardize and validate. The first item is the pin measurement sequence: supplier B went from larger pin to smaller pin [inaudible 00:13:59], while supplier A went from smaller to larger. The second item is the precheck by calibration: supplier B did not do this step of the standard procedure, while supplier A did. The third item is the pin gauge resolution: supplier B used a larger pin gauge increment, 50% resolution, while the standard and supplier A used 10% resolution. The fourth item is whether to shake the pin before entering the diffuser hole. The last item is the pin holder weight: supplier B used a heavier one than the standard procedure and supplier A.

Now we run the hypothesis tests one by one and validate the data with chi-square contingency analysis. The first hypothesis is to verify whether sequence one differs from sequence two. In sequence one the pin goes from smaller to larger size; sequence two is the reverse. Here are the measurement results for sequence one and sequence two; we run the chi-squared test to see whether they differ [inaudible 00:15:25]. In the contingency table, the p-value is less than 5%, so we reject the null hypothesis: sequence one really is different from sequence two. Therefore we must watch this key factor in pin gauge measurement.
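The chi-squared contingency test just described (JMP's Fit Y by X contingency platform) can be sketched outside JMP as follows. The go/no-go counts in the table are invented for illustration, not the project's data.

```python
# Chi-squared test of independence for go/no-go counts under two measurement sequences.
from scipy.stats import chi2_contingency

#                 go   no-go
table = [[40,  8],    # sequence one (smaller pin first)
         [29, 19]]    # sequence two (larger pin first)
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")   # p < 0.05 -> sequences differ
```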
The second hypothesis concerns prechecking the pin size with a calibration tool before the measurement. Here is the diffuser hole size data. The assumption is that, without the precheck, we mistake 90 for 90.6; 11 holes would then become no-go. Based on the contingency table and the chi-squared test, we reject the null hypothesis, which means calibration before measurement is significantly important. The third hypothesis test compares the coarser-resolution measurement tool with the 10%-resolution tool in terms of measurement variation; using the same approach, we can see why the higher resolution should be used. The same goes for item four, shaking the pin before entering the hole: we measured the same 48 holes with and without shaking to see the go/no-go effect, and the null hypothesis is rejected, so shaking the pin matters for the measurement. However, for item five, the pin vise weight, the chi-square p-value is greater than 5%, so we cannot reject the null hypothesis; there is no difference in the measurement between the unit weight and five times the unit weight.

Because we validated all five key measurement items above, FMEA can further estimate the RPN before and after the improvement from our recommended actions, based on severity, probability, and detectability. For the [inaudible 00:17:55] pin gauge measurement sequence, severity is high because of wrong go/no-go calls, and probability is also high because of dislocation from the hole center; detectability was scored against sequence one. After we change to sequence two, the probability and detectability scores drop by half, so the RPN is reduced to 54. As you can see, all the scores are under 100, meeting our goal.

Now let's move on to CTQ2, the bias between supplier B and supplier A. From the FMEA work we are confident the diffuser pin hole size is around 90 to 90.6. However, supplier A's FA report shows the pin holes at 90.6. We ran a distribution mean test, and the p-value suggests the gap between supplier B and supplier A is real. We therefore had further communication with supplier A and concluded there are two potential causes. One potential cause could be that some holes are 90 and some are 90.6: in the current sampling method, a part is stratified into 24 rows and one hole is sampled randomly per row, so is that enough to find the larger size? We confirmed this with a sample size and power test. The other possible cause could be that the hole is enlarged during the measurement, which we validated by repeating the pin gauge measurement. Here is the sample size and power result: in the exact binomial test, the power is 99.7%, more than 90%; in the normal approximation, the power is 97.6%, also more than 90%. In other words, power above 90% would require only 18 measured points. Therefore the current stratified random sampling has good detectability, and there is little chance we would fail to catch a 90.6 diameter with the current 24-row stratification if the parts really do have larger hole sizes.
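The exact-binomial power check mentioned above can be approximated outside JMP's Sample Size and Power platform with a short calculation; the null and alternative proportions below are assumed values for illustration, not the ones from the project.

```python
# Power of an exact one-sample binomial test with n = 24 samples.
from scipy.stats import binom

n, alpha = 24, 0.05
p0, p1 = 0.05, 0.30          # assumed null vs true proportion of oversized holes

# smallest critical count c such that P(X >= c | p0) <= alpha
c = next(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha)
power = binom.sf(c - 1, n, p1)       # P(X >= c | p1)
print(f"reject H0 when count >= {c}; power = {power:.3f}")
```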
Now, after the gap was found, we further checked whether our CTQ3 Gauge R&R is qualified or not. From the left table, you can see the hole diameter increased by only 4% compared to the tolerance, and the resolution is less than 10%, so we can use the nondestructive crossed method to qualify this measurement capability. Although the operator-to-parts interaction accounts for 13.6%, the P/T ratio is still 18%, which meets the Gauge R&R criterion of less than 30%. Later we will say more about the interaction when we address the P/TV ratio. The sigma and variance components are also high and cannot really be quantified, because the sample hole size selection is not wide enough, so those are for reference only. Okay, on to our conclusion and the SPC control chart used to monitor Gauge R&R stability. We use the Levey-Jennings chart to get the process long-term sigma for the Gauge R&R P/TV ratio calculation. The phase before, at supplier B with a measurement resolution of 50%, shows a larger sigma and accordingly wider control limits. The phase after, at supplier B with a measurement resolution of 10%, shows a smaller sigma and narrower control limits, so the measurement precision was improved with respect to common-cause variation. Here is the summary. For CTQ1, we standardized the supplier B measurement with a resolution of less than 10%; only the pin holder weight was not significant, and the other four measurement items were all significant. For CTQ2, the sample size is good, and the hole enlargement issue during measurement was found. For CTQ3, supplier B demonstrated qualified measurement capability, with Gauge R&R less than 30%, repeatability less than 10%, and reproducibility less than 20%. As takeaway learnings, we used these JMP tools, the C&E diagram, Fit Y by X contingency tables, the sample size and power test, the distribution mean test, ANOVA, and crossed Gauge R&R, to standardize our pin gauge measurement. That's all about our JMP practice. Thank you for listening.
Labels
(12)
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
0 attendees
0
0
0 attendees
0
0
Using JMP 16 Data Mining and Text Mining Platform To Analyze Display Production Cycle Time (2022-US-30MP-1104)
Monday, September 12, 2022
JMP platforms have significantly helped find the right parameters to determine an optimal process. This presentation demonstrates production cycle time analysis using JMP 16 data mining and text mining: using the distribution platform to set up histogram conditions for systematic root cause analysis, and building three partition platform models to improve the R square. We then optimize the partition model for both success and failure analysis, create a Neural model, and use the Text Explorer platform to search for key words that trigger the modeling parameters used in the predictive model. Hello, good morning and good evening, everyone. My name is Raisa. I'm a manufacturing quality engineer at Applied Materials, Taiwan. I started to learn JMP at the beginning of this year, and this July I passed the certification exam with a score of 925. Today, I'd like to give a short presentation about QN immediate fix time analysis with JMP. As we know, once a quality notification (QN) is created, it takes additional time, more or less, to fix the issue, and that may impact production planning and scheduling. Therefore, we'd like to find the worst case through analysis. For the analysis, here are five subtopics on the agenda: first, the root cause analysis of QN fix cycle time; the graphical root cause analysis summary; comparing the Fit Model, Partition and Neural models; then hybrid text mining and data mining analysis; and finally, takeaway learnings. Okay, let's get started. The histogram is the first layer of the root cause analysis of QN fix cycle time. Before the investigation, think about what scenarios impact QN fix cycle time and how long a fix is endurable. First, we define five days as the criterion and a key condition: within five days is in spec and meets the success criterion; over five days is out of spec and goes to failure analysis. Later, I break the data down directly into SA (success analysis) and FA (failure analysis). Notice the shape of the distribution for SA versus FA. Meanwhile, look at the mosaic plot for the proportion of each category to infer a potential root cause. For the success analysis, we can see that workmanship and MFG rework seem to have quick response and better fix cycle time; for FA, dimension issues take more fix cycle time. There is obvious variation in the FA time distribution compared with SA, so we suppose defect type is one of the key factors (X₁) impacting the fix cycle time. The box plot. The box plot is a graph of the distribution of a continuous variable, so we plot the continuous fix cycle time against the nested structure of categorical country under containment to search for other factors that impact fix cycle time. It displays the five-number summary of a set of data, and it is a non-parametric tool that uses the median as the measure of central tendency. Besides that, there are some observations to make on the box plot graph. First, at least seven points are needed to detect the first outlier; otherwise it becomes a whisker (skew) problem when the sample size is less than seven. Second, observe skewed distributions from the box width or whisker length, and consider how to handle marginal outliers, which we treat as roughly two-sigma GRR noise beyond the whisker. Back to the root cause analysis: it is not difficult to find that the fix cycle time for the Replacement containment is much longer than for the other containments. With that, we add X₂ containment and X₃ country.
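A minimal pandas sketch of the five-day split and the five-number summary described above; the column names and values are assumptions standing in for the real QN table.

```python
# Classify QN records as SA (fixed within 5 days) or FA (over 5 days) and
# look at the five-number summary per group. Column names are assumptions.
import pandas as pd

qn = pd.DataFrame({
    "defect_type": ["Workmanship", "Dimension", "Damage", "Workmanship", "Dimension", "Damage"],
    "fix_cycle_time_days": [2, 12, 34, 4, 7, 21],
})

qn["analysis_group"] = qn["fix_cycle_time_days"].apply(
    lambda days: "SA" if days <= 5 else "FA")      # five-day success criterion

# Five-number summary (min, Q1, median, Q3, max) per group, the same
# statistics a box plot displays.
five_number = qn.groupby("analysis_group")["fix_cycle_time_days"].quantile(
    [0.0, 0.25, 0.5, 0.75, 1.0]).unstack()
print(five_number)
```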
A heatmap is another graphical tool that depicts data values by color. Again, up to now we have gathered three input factors, defect type from the histogram and containment and country from the box plot, and we want to study how these inputs jointly impact the fix cycle time. Here we put the categorical defect type on the Y axis, color by cycle time, and keep the nested structure of categorical country under containment on the X axis and X group. Then an 8-by-9 layout makes it easy to quickly catch the maximum and minimum cycle time scenarios. For FA, it is easy to find the little red area; it highlights the longest fix cycle time. With that, Replacement / Taiwan / damaged parts is the worst case for cycle time, and Replacement / United States / dimension-issue parts is the second worst scenario. For SA, it is mostly dimension and damage defects; the others are quick to fix. Next, the Pareto chart. To further analyze the FA and SA findings from the heatmap, we use a two-dimensional Pareto chart on the two variables defect type and country under a specific containment. Here are the X₁ defect type, X₂ containment and X₃ country we mentioned before; then we add an additional factor, workstation (X₄), to the Pareto chart to visualize event frequency. Now, for FA (failure analysis), we see that damage issues from the Taiwan supplier requiring replacement frequently happen at the CVD service workstation, and dimension issues from the United States supplier requiring replacement often happen at the CVD workstation. In the same way, for the SA analysis, instead of dimension or damage defects, functional and workmanship issues from the United States supplier can be fixed quickly at the CVD module test. Currently, we have four input factors and the SA and FA frequencies. What we are more interested in, though, is the pass or fail frequency versus the pure cycle time, so next comes Tabulate. Here we put our previously mentioned factors X₁ to X₄ into Tabulate, and tabulate the pure cycle time and the frequency count for further comparison. For FA, at the CVD service workstation, a damage issue from the Taiwan supplier requiring replacement does take a longer cycle time: although the frequency is not the highest, about seven occurrences here, the mean cycle time of 34 days is much longer than the others. For SA, at the CVD module test, a measurement issue from the United States supplier fixed by MFG rework shows only one day on the table, but the frequency is far too low to rely on. Here I summarize the main points. First, following root cause analysis, we used different graphical JMP platforms in an engineering and logical sequence to conduct the preliminary root cause analysis; in the previous slides I showed the histogram, box plot, heatmap, Pareto chart and Tabulate. Second, we identified potential input Xs to predict the QN fix cycle time: according to the Tabulate, the FA cases come from damage issues, replacement, Taiwan suppliers and the CVD service workstation.
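Going back to the heatmap and Tabulate steps described above, a minimal pandas sketch of the same kind of cross-tabulation; the column names and rows are illustrative, not the actual QN data.

```python
# Mean fix cycle time and frequency by defect type x (containment, country),
# the same layout as the heatmap and Tabulate views. Data is illustrative.
import pandas as pd

qn = pd.DataFrame({
    "defect_type": ["Damage", "Damage", "Dimension", "Workmanship", "Damage"],
    "containment": ["Replacement", "Replacement", "Replacement", "MFG rework", "Replacement"],
    "country": ["Taiwan", "Taiwan", "United States", "United States", "Taiwan"],
    "fix_cycle_time_days": [30, 38, 15, 2, 34],
})

mean_ct = qn.pivot_table(index="defect_type",
                         columns=["containment", "country"],
                         values="fix_cycle_time_days", aggfunc="mean")
freq = qn.pivot_table(index="defect_type",
                      columns=["containment", "country"],
                      values="fix_cycle_time_days", aggfunc="count")
print(mean_ct)   # color this table to reproduce the heatmap view
print(freq)      # event frequency, as in the Tabulate comparison
```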
Next, we build a model to predict QN fix cycle time and validate the root cause. Before going into each model's details, I'd like to introduce model selection and comparison. For the Fit Model, consider the data structure and distribution. Here are some challenges with the Fit Model: for the skewed distribution we used a log transformation, but it did not help; all the input variables X₁ to X₄ are categorical; and after I filtered out 60% of the workstation categories, R² increased by only 6%. We checked dependency among the categorical variables with a correspondence analysis plot; the risk is low, because the closer things are to a region, the less distinct they probably are, and in other words, the farther away, the more distinct. Proximity between labels probably indicates similarity. For the partition tree model, the plus points are that it is a distribution-free model, it splits based on the available data, and there is little overfitting concern; the minus point is the recursive splitting. Therefore, we use JMP Predictor Screening, which uses a random forest to average over the recursive splits, to find the five input factors and their ranking. It is a convenient and quick way to find the important inputs for optimizing or improving the model. Regarding the neural network, the plus points are that it is a strong transformation model with a two-step training and validation process; the minus is a significant overfitting concern. Which model is more appropriate to believe? Let's go through each model's results. Coming back to the Fit Model, with main effects only, the R² isn't high, only around 30%, because the data is severely right-skewed; a significant lack of fit was observed, so a maximum R² of around 47% is not worthwhile. We also used a log transformation of the cycle time variable to avoid a negative lower bound on the 95% confidence interval, but it did not help much, so the log choice is out. Next is the partition tree model. Here are three partition models: the baseline model, model augmentation, and model simplification. Through a series of improvements driven by engineering and logical thinking, R² improved to 62% from the 38% baseline. The details are shown step by step in the following slides. Model augmentation: in this step we improved the model by about 20%. Where did that come from? First, we changed QN age to immediate fix cycle time based on prior experience, but that did not help by itself. Second, 6% came from adding one X factor, workstation; remember, that is the factor we discovered from the Pareto chart, and the UD code becomes less critical, dropping from 26% to only 8%. Third, another 4% came from changing from UD code to containment. Now check the contribution ranking: number two becomes workstation instead of country. In model simplification, we improved R² by an additional 6%. Before simplification, the plus is that all scenarios are under consideration, but the minus is that too many categories might dilute the predictive power. In the simplified model, we filtered out minor categories with fewer counts, for example removing 60% of the workstation categories, and the total row count decreased from 426 to 270. Checking the contribution ranking again, defect type and workstation are still number one and number two. Now we have more confidence to use the model to predict FA and SA. For the partition tree model optimization, as I mentioned before, the major contributors are defect type and workstation, around 80%, following the Pareto concept. Compare defect type in the SA and FA predictions: for SA it is labeling issues, which makes sense, since we don't spend much time fixing those; for FA it is damage, which does take much more cycle time. For the workstation comparison between SA and FA, PVD mechanical versus CVD module tester, that currently still needs further analysis and understanding. About the FA country: the prediction profile is flat. Does country really not impact QN fix cycle time? To answer that question, I introduce the model limitation: recursive partitioning carries a sequential dependency risk. The factor country is split six times, and only one of those splits happens in the higher cycle time cluster. Such a recursive dependency limitation may impact the predictive model.
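As a rough illustration of the partition idea described above, outside JMP, here is a minimal scikit-learn sketch with one-hot encoded categorical inputs; the factor names and rows are assumptions.

```python
# Partition-style model of fix cycle time on categorical inputs X1-X4.
# Factor names and rows are illustrative, not the actual QN data set.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

qn = pd.DataFrame({
    "defect_type": ["Damage", "Dimension", "Workmanship", "Damage", "Labeling", "Dimension"],
    "workstation": ["CVD service", "CVD", "CVD module test", "CVD service", "PVD mech", "CVD"],
    "country": ["Taiwan", "United States", "United States", "Taiwan", "Taiwan", "United States"],
    "containment": ["Replacement", "Replacement", "MFG rework", "Replacement", "MFG rework", "Replacement"],
    "fix_cycle_time_days": [34, 12, 2, 30, 1, 9],
})

X = pd.get_dummies(qn.drop(columns="fix_cycle_time_days"))   # one-hot encode X1-X4
y = qn["fix_cycle_time_days"]

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=1).fit(X, y)
print("R^2 on training data:", round(tree.score(X, y), 2))

# Feature importances play the role of the contribution ranking in the talk.
importance = pd.Series(tree.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head())
```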
The third model is the neural network (artificial intelligence). Here we observe a severe overfitting concern between the training and validation R²: if the gap in R² between the training set and the validation set is over 20%, there is an overfitting concern. Besides, we find that in the neural model the number one ranking is workstation, which is different from the previous partition model. For SA, the workstation is staging (CCT staging), where materials are brought together before entering the MFG flow; it does not have a complex operation process, so once an issue happens, it can be fixed quickly, which makes sense. For FA, at the CVD service workstation, there is a complex operation process, and yes, it does take a longer cycle time to treat difficult issues. By now we have three models, the Fit Model, Partition and Neural models, so which one is more appropriate and matches reality? That brings us to model comparison and selection. From the graphical root cause analysis, the combination of damage issue, replacement, Taiwan and the CVD service workstation is the worst scenario, with the longest fix cycle time. The Neural model identifies the same scenario as the graphical root cause analysis, but the one concern is the overfitting risk. Besides, the three models give very close predictions for the worst-case cycle time, within 1.2 days of each other. Finally, I will introduce the text mining and data mining hybrid. The QN database still contains text message fields, so we want to search for more information about long cycle times in the QN text variables. We use JMP Text Explorer to discover some frequent keywords; here I circled replace, rework, dimension and the F10246 project for further analysis. We then convert them to binary indicators and conduct further data mining and root cause analysis on the F10246 case via a heatmap graph. Here I put the dimension indicator under F10246, and the containments replace and rework, in Y. According to the heatmap results, judging by the color, F10246 did suffer much more fix cycle time than other projects; and checking the dimension indicator, we observe it is not only dimension issues but also various other defects that cause long cycle times, even when they are just fixed by rework. In the end, here are my takeaway learnings. JMP graphical platforms are very powerful for conducting deeper root cause analysis through an engineering, logical, data-driven process. Compare and select the more appropriate JMP model from the classic Fit Model, Partition and Neural Network, knowing each model's limitations and risks and connecting back to the earlier graphical root cause analysis. And conduct a hybrid text mining and data mining root cause analysis on a complicated QN database. Finally, I'd like to thank GCI MBB Charles Chen as my project mentor. That's all of my presentation. Thank you for your time and attention. Thank you.
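A minimal sketch of the keyword-to-indicator step described above, assuming a hypothetical free-text QN description column; the keywords follow the ones circled in the talk.

```python
# Convert free-text QN descriptions into binary keyword indicators, then
# cross them with cycle time. The text column name is an assumption.
import pandas as pd

qn = pd.DataFrame({
    "qn_text": ["replace diffuser, dimension out of spec, F10246",
                "rework bracket",
                "replace pump, damage found, F10246"],
    "fix_cycle_time_days": [21, 3, 30],
})

for keyword in ["replace", "rework", "dimension", "F10246"]:
    qn[f"kw_{keyword}"] = qn["qn_text"].str.contains(keyword, case=False).astype(int)

# Mean cycle time with and without each keyword, the kind of comparison
# the heatmap in the talk makes visually.
print(qn.groupby("kw_dimension")["fix_cycle_time_days"].mean())
```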
Labels
(9)
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
0 attendees
0
0
0 attendees
0
0
Use Case: JMP in the Building of an Image Analysis Pipeline (2022-US-30MP-1096)
Monday, September 12, 2022
Sidewall damage of subtractively etched lines is a common problem in semiconductor manufacturing. This presentation aims to show how JMP's various analysis platforms can be used with a simple image analysis pipeline structure built into JMP using a Python/JSL integration approach to build a highly predictive model, along with a method to integrate and simplify the model's output for use in a standard trend chart used for monitoring in manufacturing. Our methodology uses a metric extraction protocol from images inspected and dispositioned in manufacturing. These extracted metrics and dispositions are then passed through a series of model building and tuning platforms in JMP. Our approach involves using dimensional reduction platforms (PCA, PLS) to identify relevant metrics and building a neural network model using K-folds validation to compensate for class imbalance problems often encountered when working with manufacturing data. Further refinement of the model is done by identifying outliers using the multivariate platform and reprocessing the data. This approach resulted in a highly predictive model that is easy to integrate. Hi, my name's Steve Maxwell. I am an engineer, and I work in the semiconductor industry. Today, I'm going to give a presentation on how I used JMP in the building of an image analysis pipeline. A little bit of background on the problem: two of our module groups at our facility ran into an issue with a feature where there was erosion on the sidewall of the feature. The SEM system that we use to measure the size of this feature wasn't detecting it, and it didn't have the capability to characterize it. It's an older piece of equipment, but it's also paid for, so we like it. The approach I employed to help them out was to build a separate analysis pipeline to analyze the images, classify them as having damage or not, and produce a metric that could be sent to the SPC system for tracking purposes. I used JMP primarily to analyze the data and to build a model for deployment in manufacturing, as well as, more recently, to integrate some of the actual image analysis into JSL. I'll show some examples of that. This is a sample of what we're looking at: a good image on the left-hand side and a bad image on the right-hand side. We can see here the variation and erosion; that's what we were looking for in these images. We can also see that the measurement, which is done on the inside of this feature and lands at the same spot every time, wouldn't really be sensitive to these features, so we're not able to tell by just using our normal methods. Here is a little more detail about what we were seeing. This is an example of what some of that data looked like: you can see that although the process was having issues, we were not detecting them with the measurements we were using. So we launched an approach to address this issue. I use the standard ETL format for most data science problems, extract, transform, load, and I break it into four categories: configure, collect, analyze and report. Configure and collection are standard methods. The analysis is the key to this: that's how I'm reprocessing the images and then extracting metrics from those images that we can use to build a model to determine whether or not we can detect the issue at hand. The approach for that analysis is to convert the images from RGB, which is how they come off the measurement system, to grayscale.
Then the images are cropped to remove any measurement details and similar annotations. The next step is to apply a median filter to these images; the reason for that is to blur the images and remove some of the noise. This is a key step in image analysis and image processing. The approach was experimental, trying different kernel sizes and different kernel shapes, but in the end I settled on a disk-shaped kernel with pixel radii of 5, 7, 10, and 15 pixels. For those unfamiliar with the terminology, a kernel is basically just an array that's used to process the image during reprocessing. After that, extraction of metrics from the reprocessed images was developed, using several standard metrics: the structural similarity index (SSIM), mean squared error, and Shannon entropy, plus a few others such as a Gaussian-weighted version of SSIM, as well as the means and standard deviations of the SSIM image gradients from the Gaussian-weighted images. From there, I used dimensional reduction techniques to determine which of those metrics was most relevant, built a predictive model, and then converted that into a metric that could be easily interpreted in manufacturing while running production. One follow-up on this: when doing the metric extraction, the reference image is the image processed with a kernel of pixel radius 5, and the kernel-radius 7, 10, and 15 images are compared to that image. This slide shows a little more detail about what the analysis pipeline approach looks like. For the offline approach, we start with image acquisition; obviously, this is done during the manufacturing process. Initially, this work was done in Python: convert to grayscale, crop, denoise the images, and then do the feature generation and metric extraction, the SSIM, mean squared error, entropy calculations, etc. JMP JSL has a Python wrapper that you can use for integration, and what's great is you can actually move a lot of this code into the JMP environment. I'll go into a little more detail about what that is and how I used it to speed this process along. The final component was to develop a quality metric; this is a very JMP-intensive step as well. Once a metric is determined, you move that over into an online process in which the manufacturing data is analyzed and the information is pushed to the SPC system. A little more about the Python/JSL integration: I integrated the Python analysis code into JSL using the Python/JSL wrapper. The data transfer occurs by converting the JMP table into a pandas DataFrame; the data frames are passed into the analysis code, and the results are returned from Python to JMP as a data table. What's great about this is that it enables using common image analysis libraries such as skimage, scipy, PIL, OpenCV, etc., to perform the work while keeping the data within the JMP framework. This, in my opinion, was key, because I'm not moving data from one system or platform to another and back and forth; I was able to keep it all in one spot. Here is a little more detail about what the analysis pipeline is doing when it reprocesses the images. This is what I was able to move from running slowly in Python to running in JSL while still using the Python libraries.
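A minimal sketch of the preprocessing and metric extraction described above, assuming a recent scikit-image; the file name and crop window are placeholders, not the actual recipe.

```python
# Grayscale -> crop -> median blur with disk kernels -> image metrics.
# The file name, crop window and exact metric set are placeholders.
from skimage import color, io, measure, metrics
from skimage.filters import median
from skimage.morphology import disk
from skimage.util import img_as_ubyte

img = io.imread("sample_sem_image.png")        # placeholder file name
gray = img_as_ubyte(color.rgb2gray(img))       # RGB -> 8-bit grayscale
gray = gray[50:450, 50:450]                    # placeholder crop window

blurred = {r: median(gray, footprint=disk(r)) for r in (5, 7, 10, 15)}
reference = blurred[5]                         # radius-5 image is the reference

for r in (7, 10, 15):
    ssim = metrics.structural_similarity(reference, blurred[r], data_range=255)
    mse = metrics.mean_squared_error(reference, blurred[r])
    entropy = measure.shannon_entropy(blurred[r])
    print(f"disk {r}: SSIM={ssim:.3f}  MSE={mse:.1f}  entropy={entropy:.2f}")
```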
You can see here's a bad image and here's a good one. We cropped the image to take basically only this component of it, converted it to gray, and then applied median filters to blur the images with different kernel sizes; as we increase the kernel size, they become a little more blurred. The takeaway I want to show here is that for a good image, the kernels with disk sizes 5, 7, 10 and 15 all look pretty similar to one another, whereas for a bad image, disk 5, disk 7, disk 10, disk 15, by the time you get to disk 15 the image is starting to look similar to the good one. Removing the noise and denoising the images incrementally with different kernel sizes is what enables us to successfully extract metrics that we can use to differentiate between the good and the bad. Sorry, hold on here, let me go back up. At this point, I'm going to launch a demo that shows how I'm moving this data into JMP and what it looks like as it reprocesses these images. The first thing I need to do is build a data table. There's a great feature in JMP: you go to File, Import Multiple Files. I've got a sample set of just five images that I'm going to combine into one data table. What it does is import the images as well as the image file names. You go here, click the file column name, click Import, and this is what we get: an image of the picture we want to analyze, with the file name tag next to it, imported into the data table. Now, the next step is to run our analysis code against each of these images and start extracting metrics. I've got a modified version of that code here. The way this code works is that there's a Python initialization that occurs in JSL, and once that's done, the data table you want to analyze is passed to Python as a data frame. At that point, under Python Submit, you basically convert over to Python code, and then you use standard approaches and coding procedures for Python to analyze the images; as you can see here, we've got our library imports, etc. The way I wrote this code was basically to define a bunch of functions. Those functions return an output that gets added back into the data frame, which is then pushed back out to JMP as the data table. That's done using this lambda function: the lambda applies the function to all rows of the specified column in the data frame, or equivalently the JMP table. When we run this, we take the table I just built, reprocess the images, and extract the metrics. What we see here are the images shown earlier: we've taken our picture, converted it to gray, and then started blurring it with a 5-pixel-radius kernel, then 7, 10 and 15. From there, we extract all of our different metrics: SSIM, the structural similarity index, for 7 compared to 5, 10 compared to 5, and 15 compared to 5. That's basically what we're doing: calculating the structural similarity index between each blurred image and the reference image.
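A minimal sketch of the row-wise pattern described above, applying a per-image metric function to every row of the data frame that comes over from JMP; the column name and metric are placeholders.

```python
# Apply a per-image metric function to every row of the data frame passed
# over from JMP. The column name and the metric used are placeholders.
import pandas as pd
from skimage import io
from skimage.measure import shannon_entropy

def image_entropy(path):
    """Load one image and return a single scalar metric for it."""
    return shannon_entropy(io.imread(path, as_gray=True))

df = pd.DataFrame({"file_name": ["img_001.png", "img_002.png"]})   # comes from JMP
df["entropy"] = df["file_name"].apply(lambda p: image_entropy(p))
# df is then handed back to JMP as a data table by the JSL wrapper.
```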
If we go back to the presentation: from here, I used several different approaches for interrogating the data. The first was the multivariate platform; there are lots of correlations to work with here. The red indicates bad images, blue indicates good images. Then an outlier analysis, just looking at the Mahalanobis distances and jackknife distances. That's just to see whether or not we're heavily skewed; you can see that we've got more outliers in the bad population compared to the good, although I think if you separated these two out, that would come down a little. I'm showing this just to show that you can go in here and start trimming some of these out if you want to, if you're not seeing good fits. But for the purpose of this exercise, I didn't have to remove or hide any data in the analysis; I was able to use all of it as it is. Here is how I did that. This is the main data table, with all the data I looked at and all the images. I just go to Analyze, Multivariate Methods, and run the Multivariate platform. For the classification, I did use a conversion from a character-based classification to a numeric one. There's a great feature for that: you go to Cols, Utilities, and then Make Indicator Columns, and click Append Column Name. What it does is say, okay, this row is bad, not good, so the bad indicator is one and the good indicator is zero, and vice versa for the other rows. You can use that information to feed into some of the different modeling platforms that we'll look at in a few minutes. Back to the Multivariate: going in here, we run all of these. Let's run this platform. That's interesting; let's go here, my bad. We can see our correlations. Then, to get the outlier analysis, you go into the Multivariate menu and down to Outlier Analysis; you run Mahalanobis Distances and you run Jackknife Distances. There are different approaches this can be based on, and I've had success with this one. As a first pass, it looks like we've got something to work with here. The next step is to break these down a little more and look at how each of the different metrics responds to the different kernel sizes. What we're really looking for here is whether, as we start to see larger changes in the kernel sizes, the lines are still overlapping with one another or we're starting to see separation. Looking at kernel 15 on bad versus 15 on good for structural similarity and MSE, we can start to see that we're getting separation here, and I think that's a good indicator that we should be able to extract something from this. That was done using Graph Builder. I've got a data table built specifically for this Graph Builder, which is a great feature. I'm dropping in kernel size, looking at the metric value, splitting these into groups based on the good-versus-bad classification codes, and then my image metric is used as the page. Here I just create a linear fit and look under Advanced for the prediction R²; it will pop the equation in there, and you can see the goodness of the fit. I do like looking at the points as well; for this, I used a centered grid and moved it along, which is just easier on the eyes for presentation purposes. You can scroll through here, and for each of these it gives really nice-looking graphs to preview. Based on the multivariate analysis and these linear fits, I'm feeling pretty good about the data we extracted from these images. The next step is this: we've got a lot of metrics extracted from these images; do all of them matter, or only some? The next couple of slides go over dimensional reduction. Specifically, I used PCA and partial least squares. For the PCA, this is with the classification, bad versus good.
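A minimal sketch of the indicator-column step mentioned above, recasting the character classification as 0/1 columns with pandas rather than JMP's Make Indicator Columns utility.

```python
# Recast a character-based classification into numeric indicator columns,
# the same idea as Cols > Utilities > Make Indicator Columns in JMP.
import pandas as pd

images = pd.DataFrame({"classification": ["good", "bad", "good", "bad"]})
indicators = pd.get_dummies(images["classification"], prefix="classification")
images = pd.concat([images, indicators.astype(int)], axis=1)
print(images)
# classification  classification_bad  classification_good
#           good                   0                     1
#            bad                   1                     0   ...
```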
This goes back to what I was demoing earlier with the data table. When you're building a PCA, you want to have, obviously, your inputs and your outputs recast as vectors in that analysis. The reason you want that is that you're looking for non-orthogonality between your inputs and your outputs: the more orthogonal they are to one another, the less likely they're playing a role in what's going on. Recasting a character-based classification into a number makes that easy to work with. We can see here that we've got some metrics that look like they could be interesting, and some of these are interesting; I highlighted those. To run this analysis I simply took my data table and went to Analyze, Multivariate Methods, Principal Components. We cast the good/bad classification and all of our different metrics into the analysis. See here, these look interesting, and these were particularly interesting. What's nice is that when you highlight these and go back to your data table, it highlights which columns you selected. Oftentimes these plots can get really busy, so it's handy to be able to cross-reference and say, okay, these are the ones I'm actually interested in. Let's get back to here. After this comes the PLS. This is more for targeting how many factors we think are playing a role in the data, as well as helping us identify additional metrics that might be interesting. For this particular data set, it looks like it settles in at probably a minimum of five factors; that comes from the "Prob > van der Voet T²" metric. Based on the JMP manual, which references academic research, anything above about 0.1 is typically where you want to start. From there, we run the PLS factor identification. This is the VIP versus Coefficients plot, and from there you can highlight the different metrics that seem to be driving this, the ones that are most relevant. The way to read these charts is to look out at the extremes: those are interesting, and as points get closer and closer to a zero coefficient, they're less interesting. I'm not going to run the PLS platform here because it actually takes a while to run, and I don't think anyone wants to just watch my computer churn. Based on the dimension reduction, I ended up with these factors being the most relevant with regard to what we want to feed into the neural network. With some experimentation, I found that it was mainly the PLS results, and when the standard 10 metric from the PCA was added in as well, it really helped pull the model together. Training versus validation: obviously, I'm using a neural network to build the model for the classifier. The reason I'm using that platform specifically is that it has K-fold validation built in as standard in the JMP platform, which is great. I've found that K-folds helps to compensate when you're teetering on class imbalance. That's a really common problem when you're studying manufacturing processes: you're not going to have a lot of defective data to work with, so you've really got to make the best of what you've got, and I find that sometimes K-folds helps with that. The output from the neural network was pretty good, actually. The way I do this is I'll run the model about five times and then compare all five of them to one another.
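A hedged scikit-learn sketch in the same spirit as the K-fold-validated neural network described here; the three hidden nodes and five folds mirror settings quoted later in the demo, and the data is synthetic.

```python
# Small classifier with 5-fold cross-validation, echoing the 3-hidden-node,
# 5-fold neural network setup described in the talk. Data is synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))                    # stand-ins for the image metrics
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=120) > 0).astype(int)

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=0))
scores = cross_val_score(model, X, y, cv=5)      # 5-fold validation
print("fold accuracies:", np.round(scores, 2), "mean:", round(scores.mean(), 2))
```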
I'll take basically the median misclassification rate, and then go down here and look at our false positive and false negative rates. With regard to the false positive rates, I look at the ROC curve, the receiver operating characteristic curve. You want these to be, as I describe it, high and tight: the closer these lines get to the middle diagonal line, the closer you're getting to a coin toss in your prediction. Obviously, that is not what we're seeing here. Then the prediction profiler has some great features, because you can reorder the factors based on which are playing the biggest role in the prediction, as well as visualize what each individual metric is doing with regard to its role in the prediction itself. In this case, we're looking at the transitions of the dominant dimensions here. This one is a little bit weird; I'm not sure what's going on with that. As you get back here to the less dominant dimensions, you can start to see they become flatter. I'll run the demo for the neural network. We go here, Analyze... actually, I've got this one built already. Let me launch this analysis. We've included our factors; our response is the classification. The neural network doesn't care whether you're using a numerical or character-based response; it will work with either one. Here, for the validation method, we've got K-fold, using five folds. The gold standard for most AI-type analyses is ten, but I found that five was fine, so there was no reason to push it. I use three hidden nodes. Hit Go, and here's the output. Like I said, I'll run this probably five times and take the median value. You can see here our classification rate is pretty good. We'll see a little bit of variation on the validation set, but it's all within reason for what I'm looking for with regard to the profiler and the ROC curve. You get down here to the ROC curves, and you can see, as I described earlier, the high-and-tight characteristic of the curves. Validation may be a little bit less, but we can work with it. Then, with regard to the profiler: the profiler output looks like this. It doesn't arrange the factors by which is most relevant, so you go here to Assess Variable Importance, Independent Uniform Inputs. Go back up here and we can see a summary report. From the Variable Importance menu, you can reorder the factors by main effect importance; for this model, it reordered them this way. Colorize: you can see the darker color is most relevant, and then it moves to the right from there. Let's see: converting the model to an SPC metric. To make the data easier to interpret, we need to take the result from the model and convert it into a metric that can be posted to the SPC system and plotted on an SPC chart. The activation component of the neural network model is a softmax function, which is standard in JMP. It calculates the probability of an image being good or bad, with the higher of the two determining how the image is classified, and what we're going to do here is introduce a third category, unknown. I'll show a little bit about how that is done. One of the nice things about the neural network platform is that the SAS DATA Step output file is actually fairly Pythonic. Typically, for the metric system, you want this in Python rather than having it all live in JMP. The code from the neural network here, the actual model itself, can be output in this format: you just go to the model and choose Make SAS DATA Step. Here we go.
I think in JMP Pro you could just output it to Python directly, but not everyone has access to JMP Pro. This gets you about 80% of the way there, not 100%; it just requires a few modifications to the actual code. On the left-hand side we've got the SAS DATA Step format; on the right-hand side is its conversion to Python. It's just a matter of importing the math library and applying it so we can take the TanH of things. Then you get to this part here: what we want to do is threshold our softmax probability calculation. Rather than just saying whichever of these is highest is your actual result (if it's 0.51 bad and 0.49 good, then it's bad), we instead say: okay, if the softmax output is greater than 0.75, then it gets that classification, because the two probabilities are complementary; if it's 0.75 bad, then it's bad. But if the number is less than that, it won't be classified as good or bad; it will just be classified as unknown, and we'll pass that into our system. What that tells us to do is have someone look at those images, because the classifier can't decide whether they're good or bad. The output, and this is the finished product, is to convert the final metric into something that can be reported in our SPC system. Basically, what we're doing here is reporting what our softmax probability is, taking the inverse of that, and then applying the threshold to determine whether things are known or unknown. Based on that, this is the initial data I showed at the beginning of the presentation, what the chart looked like for the measurement of the defect, and this is what we get when we go in, reanalyze that data, and apply this new metric, which tells us whether or not it thinks there is sidewall damage. You can see, looking at the same data set, that yes, it very clearly identifies each image as either good or bad. The green indicates good; the red indicates those classified as bad, and you can see here on the SPC data that they would instantly get flagged as bad as well. These blue ones hanging out here are the unknowns, where it's telling us, go look at this. That is my presentation. I've posted the information to the user community site if anyone wants access to the code used for the image analysis. Thank you very much.
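A minimal sketch of the softmax-threshold idea described above: classify only when the winning probability clears 0.75, otherwise report unknown. The raw two-class scores are placeholders for what the exported model code would compute.

```python
# Turn a two-class softmax output into a good/bad/unknown SPC disposition.
# The raw scores stand in for what the exported neural-network code returns.
import math

def classify(score_good, score_bad, threshold=0.75):
    e_good, e_bad = math.exp(score_good), math.exp(score_bad)
    p_good = e_good / (e_good + e_bad)           # softmax over the two classes
    p_bad = 1.0 - p_good
    if p_bad >= threshold:
        return "bad", p_bad
    if p_good >= threshold:
        return "good", p_good
    return "unknown", max(p_good, p_bad)         # send for human review

print(classify(0.2, 1.5))    # confidently bad
print(classify(0.6, 0.7))    # too close to call -> unknown
```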
Labels
(11)
Labels:
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
0 attendees
0
0
0 attendees
0
0
Accelerating Process Characterization with Agile Development of an Automation Tool Set (2022-US-45MP-1091)
Monday, September 12, 2022
Process characterization (PC) is the evaluation of process parameter effects on quality attributes for a pharmaceutical manufacturing process that follows a Quality by Design (QbD) framework. It is a critical activity in the product development life cycle for the development of the control strategy. We have developed a fully integrated JMP add-in (BPC-Stat) designed to automate the long and tedious JMP implementation of tasks regularly seen in PC modeling, including automated set up of mixed models, model selection, automated workflows, and automated report generation. The development of BPC-Stat followed an agile approach by first introducing an impressive prototype PC classification dashboard that convinced sponsors of further investment. BPC-Stat modules were introduced and applied to actual project work to both improve and modify as needed for real world use. BPC-Stat puts statistically defensible, flexible, and standardized practices in the hands of engineers with statistical oversight. It dramatically reduces the time to design, analyze, and report PC results, and the subsequent development of in-process limits, impact ratios, and acceptable ranges, thus delivering accelerated PC results. Welcome to our presentation on BPC- Stat: Accelerating process characterization with agile development of an automation tool set. We'll show you our collaborative journey of the Merck and Adsurgo development teams. This was made possible by expert insights, management sponsorship, and the many contributions from our process colleagues. We can't overstate that enough. I'm Seth Clark. And I'm Melissa Matzke. We're in the Research CMC Statistics group at Merck Research Laboratories. And today, we'll take you through the problem we were challenged with, the choice we had to make, and the consequences of our choices. And hopefully, that won't take too long and we'll quickly get to the best part, a demonstration of the solution BPC- Stat. So let's go ahead and get started. The monoclonal antibody or mAb fed- batch process consists of approximately 20 steps or 20 unit operations across the upstream cell growth process and downstream purification process. Each step can have up to 50 assays for the evaluation of process and product quality. Process characterization is the evaluation of process parameter effects on quality. Our focus is the statistics workflow associated with the mAb process characterization. The statistics workflow includes study design using Sound DOE principles, robust data management, statistical modeling and simulation. This is all to support parameter classification and the development of the control strategy. The goal for the control strategy is to make statements about how the quality is to be controlled, to maintain safety and efficacy through consistent performance and capability. To do this, we use the statistical models developed from the design studies. Parameter classification is not a statistical analysis, but it can be thought of as an exercise of translating statistically meaningful to practically meaningful. The practically meaningful effects will be used to guide and inform SME (subject matter expert) decisions to be made during the development of the control strategy. And the translation from statistically to practically meaningful is done through a simple calculation. It's the change of the attribute mean, that is the parameter effect, relative to the difference in the process mean and the attribute acceptance criteria. 
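A minimal sketch of that impact ratio calculation, with made-up numbers standing in for an attribute's process mean, its acceptance limit, and the effect of moving a parameter across its studied range.

```python
# Impact ratio: the change in the attribute mean caused by a parameter
# (its effect across the studied range) relative to the gap between the
# process mean and the attribute acceptance criterion. Numbers are made up.
def impact_ratio(effect, process_mean, acceptance_limit):
    headroom = abs(acceptance_limit - process_mean)   # room before failing the criterion
    return abs(effect) / headroom

# Example: moving a parameter across its range shifts an impurity mean by
# 0.8 units while the process mean sits 2.0 units below the upper limit.
print(impact_ratio(effect=0.8, process_mean=3.0, acceptance_limit=5.0))   # 0.4
# A ratio near 1 means the parameter effect uses essentially all of the
# available headroom, flagging a practically meaningful effect.
```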
How much of the acceptance criteria range gets used by the parameter effect determines whether that parameter has a practically meaningful effect on quality. So we had a defined problem: the monoclonal antibody PC statistics workflow, from study design through control strategy. How were we going to implement the statistics workflow in a way that the process scientists and engineers would actively participate in and own these components of PC, allowing us, the statisticians, to provide oversight and guidance and to extend our resources? We had to choose a statistical software that included data management, DOE, plotting, and linear mixed model capabilities; that was, of course, extendable through automation; and that was intuitive, interactive and fit-for-purpose without a lot of customization. JMP was an obvious, clear choice for that. Why? Because it has extensive customization and automation through JSL; for many of our engineers, it's already their go-to software for statistical analysis; we have existing experience and training in our organization; and it has industry-leading DOE with data interactivity. The profiler and simulator in particular are very well suited for PC studies. Also, the scripts that are produced are standard, reproducible, portable analyses. We're using JMP to execute study design, data management, statistical modeling and simulation for one unit operation, or one step, that could have up to 50 responses. The results of the analysis are moved to a study report and used for parameter classification. And we have to do all of this 20 times, and we have to do it for multiple projects, so you can imagine it's a huge amount of work. When we started doing this initially, we found that we were doing a lot of editing of scripts before the automation. We had to edit scripts to extend or copy existing analyses to other responses, we had to edit scripts to add Where conditions, and we had to edit scripts to extend the models to large scale. We wanted to stop editing scripts. Many of you may be familiar with how tedious the simulator setup can be: you have to set the distribution for each individual process parameter, you have to set the added random noise, you have to set the number of simulation runs, and so on and so on. There are many different steps, including the possibility of editing scripts if we want to change from an internal transform to an explicit transform; the simulator is adding random noise on a log-normal type basis, for example, for the log transform. We wanted to stop that manual setup. Our process colleagues and we were spending enormous time compiling information for the impact ratio calculations that we use to make parameter classification decisions. We were using profilers to verify those and assembling all this information into a heat map, and it then became a very tedious exercise to trace the information back to the original files. We wanted to stop this manual parameter classification exercise. Last but not least, we have to report our results, and in the past the reporting involved copying and pasting from other projects. Then, of course, you have copy-and-paste errors: copying and pasting from JMP, you might put in the wrong profile or the wrong attribute, and so on. We wanted to stop all this copying and pasting. We clearly had to deal with the consequences of our choice to use JMP. The analysis process was labor-intensive, taking weeks, sometimes months, to finish an analysis for one unit operation.
It was prone to human error and usually required extensive rework, and it posed an exceptional challenge to train colleagues to own the analysis. We developed a vision, with acceleration in mind, to enable our colleagues with a standardized yet flexible platform approach to the process characterization statistics workflow. Along the way we had some guiding principles. As we mentioned before, we wanted to stop editing JMP scripts, so for any routine analysis, no editing is required. Our JMP analyses need to stand on their own: they need to be portable without having to install BPC-Stat, so that they live on requiring only the JMP version in which they were created. We collected constant feedback, we constantly updated, and we tracked issues relentlessly, updating sometimes daily and meeting weekly with Adsurgo, our development team. We made sure our interfaces were understandable; we use stoplight coloring, such as green for good, yellow for caution, and red for flagged issues. We had two external inputs into the system: an external standardization, which I'll show a bit later, where the process teams define the standard requirements for the analysis file; and our help files, which we decided to keep entirely external so that they can continue to evolve as the users and the system evolve. We broke our development into two major development cycles. In the early development cycle, we developed proofs of concept: we would have a problem, we would develop a proof-of-concept working prototype to address that problem, and we would iterate on it until it solved the problem and was reasonably stable. Then we moved it into the late development cycle and continued the agile development; in this case, we applied it to actual project work, did very careful second-person reviews to make sure all the results were correct, and continued to refine the modules based on feedback, over and over again, to get to a more stable and better state. One of these proof-of-concept wow factors that we did at the very beginning was the Heat Map tool, which brought together all kinds of information in a dashboard and saved the teams enormous amounts of time. I'll show you an example of this later, but you can see the quotes on the right-hand side: they were very excited about this. They actually helped design it, and so we got early engagement, motivation for additional funding, and a lot of excitement generated by this initial proof of concept. In summary, we had a problem to solve, the PC statistics workflow. We had a choice to make, and we chose JMP. And our consequences: copying and pasting, manual work, mistakes, extensive reworking. We had to develop a solution, and that was BPC-Stat. I'm extremely happy to end this portion of our talk and let Seth take you through a demonstration of BPC-Stat. We have a core demo that we've worked out, and we're actually going to begin at the end. The end, which is that proof-of-concept Heat Map tool we briefly showed a picture of, is when the scientists have completed all of their analyses and the files are all complete: the information for the impact ratios is all available and collected into a nice summary file, which I'm showing here. For each attribute and each process factor, we have information from the profiler, its predicted means and ranges as the process parameter changes, so we can compute changes in the attribute across that process factor, and we can compute the impact ratio we mentioned earlier.
Now, I'm going to run this tool and we'll see what it does. First, it pulls up one of our early innovations by the scientists: we organized all of this information that we had across multiple studies. This is showing three different process steps, and you can see on the x-axis we have the process steps laid out. In each process step, we have different studies contributing to it, we have multiple process parameters, and they are all crossed with the attributes that we use to assess the quality of the product being produced. And we get this heat map here. The white spaces indicate places where the process parameter either dropped out of the model or was not assessed, and the colored ones are where we actually have models built; of course, the intensity of the heat depends on the practical impact ratio. This was a great solution for the scientists, but it still wasn't enough, because we had disruptions to the discussion. We could look at this and say, okay, there's something going on here, we have this high impact ratio, and then they would have to track down where that is coming from. Oh, it's in this study. It's this process parameter. I have to go to this file. I look up that file, I find the script among the many scripts, I run the script, I have to find the response, and then finally I get to the model. We enhanced this so that now it's just a simple click and the relevant model is pulled up below. You can see where that impact ratio of one is coming from: here, the gap between the predicted process mean and the attribute limit is the space right here, and the process parameter trend is taking up essentially the entire space. That's why we get that impact ratio of one. The scientists can also run their simulators, which have already been built and are ready to go. They can run simulation scenarios to see what the defect rates are, they can play around with the process parameters to establish proven acceptable ranges that have lower defect rates, and they can look at interactions and the contributions from those. They can also put notes over here on the right and save them in a journal file to keep track of the decisions they're making. Notice that all of this is designed to support them in maintaining their scientific dialogue and to prevent interruptions to it. They can focus their efforts on particular steps: if I click on a step, it narrows down to that step. Also, because our limits tend to be in flux, we have the opportunity to update them on the fly and see the result. You can see here how this impact changed; now we have a low impact ratio, and we can ask how that actually looks in the model. The limit has been updated now; you can see there's much more room, and that's why we get the lower impact ratio and will get lower failure rates. That was the heat map tool; it was a huge win and strongly motivated additional investment in this automation. I started at the end; now I'm going to move to the beginning of the process statistics workflow, which is design. When we work with the upstream team, they run a lot of bioreactors in each of these studies. This is essentially a central composite design. Each of these runs is a bioreactor, and a bioreactor sometimes goes down because of contamination or other issues, so those runs are essentially missing at random. So we built a simulation, called Evaluate Lost Runs, to evaluate the impact of these potential losses.
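As a rough sketch of the idea behind this kind of run-loss evaluation (described in more detail in the next passage), here is a small Monte Carlo in Python: randomly drop runs from a coded central composite design and check how often the variance inflation factors become extreme. The design, the VIF cutoff, and the number of simulations are all illustrative assumptions, not BPC-Stat's defaults.

```python
# Monte Carlo check of how losing bioreactor runs inflates variance
# inflation factors (VIFs) in a small coded central composite design.
# The design, VIF cutoff and simulation count are illustrative assumptions.
import itertools
import numpy as np

rng = np.random.default_rng(1)

factorial = np.array(list(itertools.product([-1.0, 1.0], repeat=2)))
axial = np.array([[-1.414, 0.0], [1.414, 0.0], [0.0, -1.414], [0.0, 1.414]])
center = np.zeros((3, 2))
design = np.vstack([factorial, axial, center])           # 11 runs, 2 factors

def max_vif(d):
    x1, x2 = d[:, 0], d[:, 1]
    X = np.column_stack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])   # quadratic model terms
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr)).max()            # VIF_j from inverse correlation

n_lost, n_sim, vif_cutoff = 2, 2000, 10.0
extreme = 0
for _ in range(n_sim):
    keep = rng.choice(len(design), size=len(design) - n_lost, replace=False)
    try:
        if max_vif(design[keep]) > vif_cutoff:
            extreme += 1
    except np.linalg.LinAlgError:                         # model no longer estimable
        extreme += 1

print(f"P(max VIF > {vif_cutoff} when losing {n_lost} runs) approx {extreme / n_sim:.1%}")
```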
We can specify how many runs are lost. I'm just going to put something small here for demonstration and show this little graphic. What it's doing is going through, randomly selecting points to remove from the design, and calculating variance inflation factors, which can be used to assess multicollinearity and how well we can estimate the model parameters. When it's done, it generates a summary report. This one isn't very useful because I used very few simulations, but I have another example here: this is 500 simulations on a bigger design, and you can see we get this nice summary. If we lose one bioreactor, we have essentially a zero probability of getting extreme variance inflation factors or non-estimable parameter estimates, so that's not an issue. If we lose two bioreactors, that probability goes up to about 4%, so it's starting to become an issue. So we might say that for this design, two bioreactors is about the loss we can tolerate. And if we really want to get into the details, we can see how each model parameter's variance inflation is impacted for a given number of bioreactors lost, or we can rank-order all the combinations of lost bioreactors to see which specific design points impact the design the most, and do further assessments like that. That's our simulation, and it's now a routine part of the process that we use. I'm going to move on here to standardization. We talked about the beginning of the process statistics workflow and the end of the process statistics workflow; now I'm going to go back to what is the beginning of BPC-Stat itself. When people install this, they have to do a setup procedure, and the setup is basically specifying the preferences: the input standardizations that we talked about earlier, as well as the help file, the process area they're working with, and the default directory they're working with. That information is saved into the system and they can move forward. Let me show some examples of the standardization files and the help file. The help file can also be pulled up under the Help menu, and of course the help file is a JMP file itself. Notice that in this Location column, these are all videos that we've created that explain the features, and they're all timestamped, so users can just find what they're looking for, whatever feature it is, click on it, and immediately pull up the video. But what's even more exciting is that in all the dialogs of BPC-Stat, when we pull up a dialog and there's a Help button there, it knows which row in this table to go to for the help: if I click that Help button, it will automatically pull up the associated link and training to give immediate help. That's our help file. Now the standardization. We have standardizations that we work out with the teams, standardized either across projects or within a specific project, depending on the needs, and for process areas. We had this problem early on that we weren't getting consistent naming, and it was causing problems and rework; now we have this standardization in place. It also covers the reporting decimals that we need to use, the minimum recorded decimals, what names we use when we write this in a report, our units, and a default transform to apply. That's our attribute standardization. And then the group standardization is very similar, with identifying columns, except we have this additional feature here that we can require that only specific levels be present, and otherwise the value will be flagged.
We can also require that these columns have a specific value ordering, so that, for example, the process steps are always listed in process-step order, which is what we need in all our output. Okay, so I'm going to show an example of this. Let me see if I can close some of the previous things we have open. Okay, here's the file. The data has been compiled; we think it's ready to go, but we're going to make sure, so we're going to do the standardization. The first thing is the attribute standardization, which recognizes the attributes that match the standard names, and what's left is up here. We can see immediately that these are process parameters, which we don't have standardization set up for. But we also see immediately that something's wrong with this one, and we see in the list of possible attributes that the units are missing. We can correct that very easily; it tells us what it's doing, and we make that change. Then it generates a report, and with that stoplight coloring it says: we found these, this is a change we made, pay attention to this caution, and these are ones we didn't find. This report is saved back to the data table so it can be reproduced on demand. I'll go through the group standardization to take a quick look at that. Here it's telling me, with red stoplight coloring, we have a problem: you're missing these columns. The team has required that this information be included, so it's going to force those columns onto the file. With the yellow, we have the option to add additional columns. So we'll go ahead and run that, and it tells us what it's going to do. Then it does the same thing and creates a report. We look through the report and we notice something's going on here with process scale. Our process scale can only have the levels large, lab, and micro. Guess what, we have a labbb, with an extra B in there. So that's an error. We find that value, correct it, rerun the standardization, and everything is good there. I did want to point out one more thing. You'll see that for our attributes there are these little stars indicating column properties. Among the properties assigned during standardization is a custom property that passes information to the system about the reporting precision to use when it generates tables. Also, our default transformation for HCP was log, so it automatically created the log transform for us; we don't have to do that. Okay, that's the standardization; let's move on to much more interesting things now: the PC analysis. Before I get to that, I just want to mention that we have a module for scaled-down model qualification. Essentially, it uses JMP's built-in equivalence testing, but it enhances it by generating some custom plots and summarizing all that information in a table that's report-ready. It's beautiful. Unfortunately, we don't have time to cover that. I'm going to go now into the PC analysis, which I'm really excited about. The standardization has already been done. We have this file that contains lab results from an experimental design executed at lab scale, and we also have large-scale data in here. It's not feasible to execute DOEs at large scale, but we do have large-scale data at the control point, and we want to be able to project our models to that large scale.
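One simple way to picture extending a lab-scale model to large scale is to include process scale as a model term and then predict at the large-scale level. Here is a rough sketch in Python/statsmodels with invented data and column names; the actual BPC-Stat workflow handles this inside JMP rather than with code like this.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "temp":   [-1, 1, -1, 1, 0, 0, 0, 0],
    "ph":     [-1, -1, 1, 1, 0, 0, 0, 0],
    "scale":  ["lab"] * 6 + ["large"] * 2,   # large-scale runs only at the control point
    "yield_": [78, 84, 80, 90, 83, 82, 81, 83],
})

# Main effects for the lab-scale DOE plus a fixed scale offset
fit = smf.ols("yield_ ~ temp + ph + C(scale)", data=df).fit()

# Predict at the process set point, projected to large scale
setpoint = pd.DataFrame({"temp": [0.2], "ph": [-0.1], "scale": ["large"]})
print(fit.params)
print("Predicted large-scale yield at set point:", float(fit.predict(setpoint)[0]))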
Because we have different subsets and potentially different models (this one only has a single model, but we can have different models and different setups), we decided to create a BPC-Stat workflow, and we have a workflow setup tool that helps us build it based on the particular model we're working with. I can name each of these workflows, and I provide the information that it's going to track throughout the whole analysis: what our large-scale setting is, what our responses are. Notice this is already populated for me. It looked at the file and said: I know these responses, they're in your standardized file, they exist, I assume you want to analyze them, so they get pre-populated. It also recognized this one as a transform, because it knows that for HCP we want a log transform. It's going to do it as an internal transform, which means JMP will automatically back-transform it so that scientists can interpret it on the original scale. There are some additional options here. This PC block right now is fixed; in some cases the scientists want to look at the actual PC block means. But for the simulation, we're interested in a population-type simulation: we don't want to look at specific blocks, we want to see what the variability is. So we're going to change that PC block factor into a random effect when we get to the simulation. And we're going to add process scale to our models so we can extend them to large scale. The system will review the different process parameters and check the coding; if there are issues or coding is missing, it automatically flags that with the stoplight coloring. We have the set point here. Setting that used to be a very tedious, annoying exercise. We always want to show everything at the set point in the profilers, because that's our control, not JMP's default of the mathematical center, so we built this information in so that it can be added automatically. Then we can define the subsets for an analysis, and for that we use a data filter. I'll show this data filter here; there's an explanation of it in the dialog. We want to do summary statistics on small scale, so I go ahead and select that. It gives feedback on how many rows are in that subset and what the subset is, so we can double-check that it's correct. Then for the PC analysis, in this case I have the model set up so that, of course, it's going to analyze the DOE with the center point, but it's also going to include this single-factor study, or what they call OFAT, and the SDM block, separate center points that were done as another block; that's all built into the study as another block through that PC block. Lastly, I can specify subsets for confirmation points, which they like to call verification points, to check how well the model is predicting. We don't have those in this case. And for our large-scale subset, that would include both the lab and the large-scale data; since that's all the data in this case, I don't have to specify any subset. Now I have defined my workflow. I click OK, and it saves all that information right here as a script. If I right-click on that and edit it, you can see what it's doing. It's creating a new namespace, it's got the model in there, it's got all my responses, and everything I could need for this analysis. As soon as you see this, you start thinking, well, if I have to add another response, I can just stick another response in here.
But that violates the principle of no script editing. Well, sometimes we do it, but don't tell anybody. What we did is we built a workflow editor that allows us to go back into that workflow through point and click and change some of its properties, and so change the workflow. I'm going to go ahead now and do the analysis, and this is where the magic really comes in. When I do the PC analysis setup, it's going to take that workflow information and apply it across the entire set of scripts that we need for our analysis. You see what it just did there: it dropped a whole bunch of scripts and grouped them for us. Everything is ready to go. It's a step-by-step process the scientists can follow through. If there are scripts that are not applicable, they can remove them and they're just gone; we don't worry about them. For the scripts that are present, we have additional customization. These are essentially generator scripts, and you can see each generates a dialog that's already pre-populated with what we should need, but we have additional flexibility if we need it. Then we can get our report and enhance it as we need to, in this case with subsets I may want to include, and then resave the script back to the table to replace the generator script. Now I have a rendered script here that I can use, and it's portable. Then for the PC analysis we have data plots. Of course we want to show our data: always look at your data, generate the plots. There's a default plot that's built, and we only did one plot because we wanted the user to have the option to change things. They might go in here and, say, get rid of that title, change the size, add a legend, whatever; they can change the entire plot if they want to. Then one of their all-time favorite features of BPC-Stat seems to be this repeat analysis. Once we have an analysis we like, we can repeat it. What this is doing is hacking the column switcher and adding some extra features onto it. It takes the output, dumps it in a vertical list box or tab box, and allows us to apply our filtering either globally or to each analysis. Now I'm in the column switcher and I can tell it which columns I want it to run through; this works for any analysis, not just plotting. Click OK, it runs through the switcher, generates the report, and there I have it: all my responses are plotted. That was easy. I go down and there's the script that recreates that; I can drop it here, get rid of the previous one, done. Descriptive statistics: here we go, it's already done. I have the subsetting applied, so I have the tables I need. And look at this: it's already formatted to the number of decimals I need, because it's taking advantage of those properties we assigned based on the standardization. So that one is done. Then the full model. The full model is set up and ready to go for, what would you think, residual assessment. We can go through each of the models one at a time and take a look at how the points are behaving and the lack of fit. Does it look okay? Here we have one point that may be a little bit off that we might want to explore. Auto recalc is already turned on, so I can do a row exclude and it will automatically update, or we have a tool that will exclude the data point in a new column so that I can analyze it side by side.
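The "exclude the point into a new column" idea can be sketched as follows, in Python/statsmodels for illustration; the data, the run number, and the column names are invented.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({"temp": np.repeat([-1, 0, 1], 7)})
df["yield_"] = 80 + 3 * df["temp"] + rng.normal(0, 1, len(df))
df.loc[20, "yield_"] += 6                     # one suspicious run

# Same response, but with the suspect run blanked out, analyzed side by side
df["yield_wo_run21"] = df["yield_"]
df.loc[20, "yield_wo_run21"] = np.nan

for col in ["yield_", "yield_wo_run21"]:
    fit = smf.ols(f"{col} ~ temp", data=df).fit()   # rows with missing y are dropped
    print(col, "temp effect:", round(fit.params["temp"], 2),
          "RMSE:", round(float(np.sqrt(fit.mse_resid)), 2))

Fitting both columns in one grouped report is exactly what makes the side-by-side practical-impact comparison easy for the scientists.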
Since I've already specified my responses, in order to include that side-by-side response I would have to go back and modify my workflow, and we have the workflow editor to do that. I'm just going to skip ahead, to save some time, to where I've already done that. This is the same file and the same analysis, but I've added an additional response, and it's right here: yield without run 21. Now scientists can look at this side by side and say, you know what, that point, yes, it's a little bit unusual statistically, but practically there's really no impact. All right, let's take this further; this is our routine process. We do take it all the way through the reduced model, because we want to see if it impacts model selection. We have automated the model selection. It takes advantage of the existing stepwise platform for forward AIC selection, or the existing effects table where you can click to remove terms manually by backward selection if you want; this automates the backward selection, which we typically use for split-plot designs. We also have a forward selection for mixed models, which is not currently a JMP feature and which we find highly useful. Since this is a fixed-effects model, I'm going to go ahead and do that, and it gets the workflow information. I know I need to do this on the full model. It goes ahead and does the selection. What it's doing in the background is running each model and constructing a report of the selection it has done, in case we want to report that, and it's going to save those scripts back to the data table. There's the report right there that contains the whole selection process. Those scripts were just generated and dumped back here, and now I can move them back into my workflow section. I know the reduced model goes in there, and this is my model selection step history, so I can put that information in there. Okay, so this is great. Now when I look at my reduced model, it has gone through the selection, and I can see the impact of the selection on removing that extra point. Again, the scientists would likely conclude there's just no practical difference here. They could even go, and should go, and look at the interaction profilers as well and compare them side by side. This is great. We want to keep this script because we want to keep track of the decisions that we made, so there's a record of that. But we also want to report the final model, so we want a nice, clean report. We don't want that "without run 21" response in there, because we've decided it's not relevant and we need to keep all the data. Another favorite tool that we developed is the Split Fit Group, which allows us to break up these fit groups into as many groups as we want. We have the reduced model here; take the reduced model. In this case we're only going to split it into one group, because we're just using this to eliminate the one response we no longer want in there. Click OK. There's some feedback from the model fitting, and boom, we have it: the fit group is now there and the "without run 21" analysis has been removed. Now we have this ready to report. Notice that the settings for the profiler are all the settings we specified: it's not at the average point or the center point, it's at the process set point, which is where we need it to be for comparison purposes. It's all ready to go. That generates that script there when I did the split.
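Stepping back to the model selection for a moment, here is a rough sketch of forward selection driven by a corrected AIC, in Python/statsmodels for illustration; it covers fixed effects only, and the data and candidate terms are invented rather than taken from the actual analysis.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 40
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["temp", "ph", "feed", "do"])
df["y"] = 5 + 2 * df["temp"] - 1.5 * df["ph"] + rng.normal(0, 1, n)

def aicc(fit):
    # AICc computed from the log-likelihood; k counts slopes, intercept, and error variance
    k = fit.df_model + 2
    aic = -2 * fit.llf + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

candidates = ["temp", "ph", "feed", "do"]
selected = []
best = aicc(smf.ols("y ~ 1", data=df).fit())
improved = True
while improved and candidates:
    improved = False
    scores = {t: aicc(smf.ols("y ~ " + " + ".join(selected + [t]), data=df).fit())
              for t in candidates}
    term, score = min(scores.items(), key=lambda kv: kv[1])
    if score < best:             # keep adding terms while the criterion keeps dropping
        best, improved = score, True
        selected.append(term)
        candidates.remove(term)
print("Selected terms:", selected, "AICc:", round(best, 2))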
I can put that split script up here and just rename it: final models. Okay, very good. Now for some real fun. Remember we talked about how tedious it is to set up the simulation script. Watch how easy this is for the scientists. Before I do this, I want to point out that this was, of course, created by a script, obviously JSL, but this is a script that creates a script that creates a script, so it was quite a challenge for Adsurgo to develop. When I run this, I can pick my model here, final models, and in just a matter of seconds it generates the simulation script that I need. I run that, and boom, there it is, all done. It set up the ranges that I need for the process parameters, set them to the correct intervals, and set up the Add Random Noise. But there's even more going on here than what appears. Notice that process scale has been added; we didn't have that in the model before. That was added so that we could take these lab-scale DOE models and extend them to the large scale, so now we're predicting at large scale. That's important; it was a modification to the model, and previously very tedious editing of the script was required to do that. Notice that we also have the PC block in here as a random effect, as we had specified, because we don't want to simulate specific blocks. And the total variance is being plugged into the standard deviation for the Add Random Noise, not just the default residual noise. We also added this little set seed here so we can reproduce our analysis exactly. So this is really great, and again, notice that we're at the process set point, where we should be. Okay, the last thing I want to show is the reporting. We've essentially completed the entire analysis; you can see it's very fast. We want to report these final models out into a statistics report, and we have a tool to do that. This report starts with the descriptive statistics, so I'm going to run that first, and then we're going to build the report and export the stats to Word. I have to tell it which models I want to export. It's asking about verification plots; we didn't have any confirmation points in this case, so we're going to skip that. Then it defaults to the default directory that we set for the output. I'm going to open the report when I'm done. And this is important: we're leaving the journal open for saving and modification, because, as everybody knows, when you copy stuff, you generate your profilers, you dump them in Word, and there's some clipping going on. We may have to resize things, we may have to put something on a log scale. We can do all that in the journal and then just resave it back to Word; that saves a step. So we generate that, I click OK, it reads the different tables and the different profilers, and it generates this journal up here, which is actually the report that it's going to render in Word. It will be done in just a second. Okay, and then it opens up Word, and boom, there's our report. Look at what it did: it put captions here, it put our table in, already formatted to the reporting precision that we need, and it added this footnote, meeting our standards. Then for each response there is a section, and the section has the tables with their captions, the profilers, the interaction profiler, footnotes, et cetera, and it repeats for each attribute.
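Returning to the simulator for a moment, here is a rough sketch of that population-type simulation in Python; the fitted coefficients, set point, limit, and variance components below are invented numbers, and the real simulation runs through JMP's profiler simulator rather than code like this.

import numpy as np

rng = np.random.default_rng(2022)              # "set seed" for reproducibility

# Hypothetical fitted model on coded units: y = b0 + b1*temp + b2*ph + scale offset
b0, b1, b2, large_scale_offset = 82.0, 3.0, -1.5, -0.8
setpoint = {"temp": 0.2, "ph": -0.1}

# PC block treated as a random effect: total noise SD combines block and residual
sd_block, sd_resid = 0.9, 1.2
sd_total = np.sqrt(sd_block**2 + sd_resid**2)

lower_limit = 78.0                             # attribute (spec) limit
n = 100_000
mean = b0 + b1 * setpoint["temp"] + b2 * setpoint["ph"] + large_scale_offset
y = mean + rng.normal(0.0, sd_total, n)
print(f"Predicted large-scale mean at set point: {mean:.2f}")
print(f"Simulated defect rate (y < {lower_limit}): {np.mean(y < lower_limit):.2%}")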
The report also includes some initial text with summary statistics that the scientists can update, so it's pretty much ready to go and highly standardized. That completes the demo of the system. Now I just have one concluding slide that I want to go back to. In conclusion, BPC-Stat has added value to our business. It has enabled our process teams, it has parallelized the work, and it has accelerated our timelines. We've implemented a standardized yet flexible, systematic approach, with that faster acceleration and much more engagement. Thank you very much.
Labels
(13)
Labels:
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
0 attendees
0
0
0 attendees
0
0
Modeling of Product Quality Based on Certain Product Measurements (2022-US-EPO-1085)
Monday, September 12, 2022
This presentation highlights modeling approaches used to understand and predict product quality based on product measurements. The model development process involves four steps: data collection, model development, testing, and implementation. During data collection, parts are collected from the production line and product measurements are completed. These parts continue through the process and are subject to a quality test. These product measurements and quality test results are combined into the dataset and used for modeling. The quality test is the response and product measurements are predictors. In model development, the data is examined thoroughly to ensure it is as clean as possible. Variable clustering and stepwise regression are applied to remove highly correlated input variables and select the important variables. The final step is to apply generalized regression using a log-normal distribution and the adaptive lasso estimation method. The model must have an accuracy greater than a certain acceptable threshold. If the model meets this criterion, it's moved into the testing phase. This phase involves using the model under engineering control to determine how well the model predicts the product quality. Once satisfied, the model is implemented into production. The operations team receives instant feedback on how the part will perform and can adjust and tune the process in real time. With this information, we can also deem the product acceptable or not. If rejected, the product is disposed of and doesn't continue through the process. These predictive models identify unacceptable parts and process upsets in the upstream processes. Welcome, and thank you for joining my poster presentation at this year's JMP conference. My name is Kaitlin Shorkey, and I'm a senior statistical engineer at Corning Incorporated. How do you get a glimpse of product quality before it completes the production process? We chose to build a model that predicts the product quality outcome before the product has completed the entire process. There are two major benefits of this approach. One is that the operations team receives instant feedback on how the parts will perform and can adjust the process in real time. The second is that we can deem the product acceptable or not. Like I just mentioned, the main objective of this work is to build a predictive model using a few modeling approaches to understand and predict product quality based on certain product measurements. Our major steps in building this model are data collection, model development, testing, and implementation. First off, for the data collection phase, parts are collected at the end of the production line and the appropriate product measurements are completed. The parts are then subjected to the quality test. The product measurements and quality measurement results are combined into a data set and used for building the model. In this case, the quality measurement is the response and all the product measurements are the predictors. The dataset consists of 767 predictors and 990 observations, or parts. This step can take a long time to execute. Since we're building a model, it's important to get as large a range of product measurements and quality measurement results as we can. If we do this, the accuracy of the model predictions is more consistent across the range.
Essentially, this allows the model to accurately predict at all levels of the product quality results. Once the dataset is compiled, it is thoroughly examined to ensure it is as clean as possible. After data collection and cleaning, the second phase, model development, is started. For this, we begin with variable clustering and stepwise regression to remove highly correlated variables and select the most important ones. With so many predictors, we first apply variable clustering. This method allows for a reduction in the number of variables: it groups the predictor variables into clusters that share common characteristics, and each cluster can be represented by a single component or variable. A snippet of the cluster summary from JMP is shown, which indicates that 85% of the variation is explained by the clustering. Cluster 12 has 49 members, and V232 is the most representative variable of that cluster. The variables identified as the most representative ones are then used in the next method, stepwise regression. Stepwise regression is applied to the identified cluster variables to select the most important ones to use in the model, further reducing the number of variables. For this, the forward direction and the minimum corrected AIC stopping rule are used. The direction controls how variables enter and leave the model; the forward direction means that the terms with the smallest p-values are entered into the model. The stopping rule determines which model is selected. The corrected AIC is based on the -2 log-likelihood, and the model with the smallest corrected AIC is the preferred model. From this, 51 of the 99 variables available from the variable clustering step are entered into the model. At this point, we have reduced the number of variables from 767 to 51 using variable clustering and stepwise regression. The final method is to fit a generalized regression model. For this, the log-normal distribution is used with the adaptive lasso estimation method. The log-normal distribution is the best fit for the response, so it is chosen for the regression model. The adaptive lasso estimation method is a penalized regression technique that shrinks the size of the regression coefficients and reduces the variance in the estimates; this helps to improve the predictive ability of the model. The data set was also split into a training and a validation set: the training set has 744 observations, and the validation set has 246. From this, the resulting model produces a 0.81 generalized R-square for the training set and 0.80 for the validation set. These R-squares are acceptable for our process, so we will now evaluate the accuracy and predictability of the resulting model. Now that we have a model, we need to review its accuracy and predictability to see if it would be suitable to use in production. In doing this, a graph is produced that compares the predicted quality measurement for a specific part to the actual quality measurement. In the graph, the x-axis shows the predicted value and the y-axis shows the actual value. Also, the quality measurement is bucketed into three groups based on its value, which is shown by the three colors on the graph. In general, the model predicts the quality measurement well. It does appear that the model may fit better in the lower product quality range than in the upper, which may be due to having more observations in the lower range.
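As a rough illustration of the workflow described above (reduce correlated predictors, then fit a penalized regression on a log-scale response), here is a sketch using scikit-learn on simulated data; it is a simplified stand-in for JMP's variable clustering and generalized regression with a log-normal distribution and adaptive lasso, not the actual analysis.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 990, 120                        # many correlated predictors (simulated)
base = rng.normal(size=(n, 12))
X = base[:, rng.integers(0, 12, p)] + 0.3 * rng.normal(size=(n, p))
y = np.exp(0.5 * base[:, 0] - 0.3 * base[:, 1] + rng.normal(0, 0.2, n))

# 1) Cluster predictors by correlation and keep one representative per cluster
corr = np.corrcoef(X, rowvar=False)
dist = squareform(1 - np.abs(corr), checks=False)
labels = fcluster(linkage(dist, method="average"), t=0.3, criterion="distance")
reps = [np.where(labels == c)[0][0] for c in np.unique(labels)]

# 2) Penalized regression on log(y) as a stand-in for a log-normal lasso fit
model = LassoCV(cv=5, random_state=0).fit(X[:, reps], np.log(y))
print(f"{len(reps)} representative predictors; "
      f"{np.sum(model.coef_ != 0)} kept by the lasso; "
      f"R^2 on the log scale = {model.score(X[:, reps], np.log(y)):.2f}")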
As mentioned, the quality measurement was bucketed into three categories based on its value, and this was also done for the predicted quality measurement. For each observation, if the actual quality measurement category is the same as the predicted category, it is assigned a one; if not, it is assigned a zero. For both the training and validation sets, the average of these ones and zeros is calculated and used as the accuracy measure. We see that the training set has an accuracy of 87.5% and the validation set has an accuracy of 84%. For the model to be moved to the testing phase, the accuracy must be above a certain limit, and both of these accuracy values are, which allows us to move to the testing phase of the project. In addition, we look at the confusion matrix to visualize the accuracy of the model by comparing the actual to the predicted categories. Ideally, the off-diagonals of each matrix should be all zeros, with the diagonal from top left to bottom right containing all the counts. The matrices shown on the poster indicate that the higher counts are along that diagonal, with lower numbers in the off-diagonals, but discrepancies still exist among the three categories. For example, in the training set there are 29 instances where an actual quality measurement of three was predicted as a two; in the same case for the validation set, there are 12. The confusion matrix helps us understand where these discrepancies are, so further investigation can be done and improvements made. Overall, though, the model has an accuracy above the required limit, so our next steps are the testing and implementation phases. Now that our model is through the development phase, it's time to test it in live situations. For this, the model is used under engineering control to determine how well it predicts the quality measurement in small, controlled experiments. This is done by the engineering team, with support from the project team when necessary. Once the engineering team is satisfied with this testing, the model is fully implemented into production and monitored over time. In conclusion, this model development process has allowed us to build predictive models for the production process. The methods of variable clustering, stepwise variable selection, and generalized regression were the most appropriate and best suited for this application. With further research and investigation, other methods could potentially be applied to improve model performance even more. From a production standpoint, the benefit of this model is that the operations team receives instant feedback on how a part or group of parts will perform and can adjust and tune the process in real time. We can also deem the product acceptable or not; if rejected, the product is disposed of and does not continue through the process, which over time reduces production costs. Lastly, I'd like to give a huge special thank you to Zvouno and Chova and the entire project team at Corning Incorporated. Thank you for joining and listening to my poster presentation.
Labels
(10)
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Sharing and Communicating Results
0 attendees
0
0
0 attendees
0
0
Evaluating Which Plays and Players Provide the Best Opportunity for a Comeback in the NBA (2022-US-30MP-1167)
Monday, September 12, 2022
The National Basketball Association (NBA) has long been a league where impressive events happen on a regular basis. Because of this, the standard for noteworthy achievements has been raised time and time again. You will frequently see social media posts from top sports accounts about players scoring 30+ points or incredible dunks. However, you rarely see a post about a comeback victory unless the team overcame the seemingly insurmountable odds of being down by too many points to win. This presentation considers the individual plays, players, and salary figures that give a team the best chance at achieving a comeback victory. Descriptive analytics are used to gain valuable insights into how often a team can produce a comeback when trailing by ten or more points at half, which teams are achieving them most often, which players are most involved in producing comebacks, and the salaries of those players. The focus of the predictive portion of the analysis is on predicting the involvement of players based on their career stats and salary figures, and on the types of plays NBA teams can prioritize to achieve comeback victories more often. Today we're going to talk about what provides the best opportunity for a comeback in the NBA. My name is Weston Salmon. I'm currently a student at Oklahoma State University studying Business Analytics and Data Science in our Masters program. And my name is Zach Miller. I'm also a student at Oklahoma State University, also studying for my Masters in Business Analytics and Data Science. All right, so we're going to cover the table of contents real quick, which shows what we're doing throughout the presentation. First, we're going to begin with an introduction that discusses why we're here and what exactly the study is for. Then we're going to jump into our data and methods, which looks at what the data look like, how we completed our analysis, and the different ways we manipulated our data. We'll then look at the descriptive and predictive portions, which show what we can derive from the data as it sits and what our predictions look like. At the end, we'll conclude the presentation and look at the implications of the analysis and how it can be used by NBA teams in the future. Here's a quote by Gabe Frank, the Director of Basketball Analytics with the San Antonio Spurs. We thought this would be a good quote to include as it deals with [inaudible 00:01:16] and also the NBA in general. He said, "I think analytics have grown in popularity because it can give you a competitive advantage if you do it well. Every little bit helps." Throughout our presentation, we're going to discuss NBA analytics and how they can produce comebacks based on the data that we find. We thought this quote really spoke to the overall objective of our project. Now I'll pass it off to Zach to introduce the project as a whole. Thanks, Weston. Going into an NBA season, every team has one common goal, and that's to win the championship. A lot of teams hope for 40-plus wins in their 82-game season, but a great season typically results in 50-plus wins. Our primary interest for this presentation and analysis is in the hard-fought victories, also known as comeback victories, by these different teams. We define a comeback victory as the winning team trailing by 10-plus points at halftime.
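As a quick illustration of this comeback definition, here is a minimal sketch in Python/pandas with made-up game data and hypothetical column names.

import pandas as pd

games = pd.DataFrame({
    "game_id":      [1, 2, 3],
    "team":         ["IND", "BOS", "CLE"],
    "half_deficit": [12, 6, 15],    # points behind at halftime (positive = trailing)
    "won_game":     [True, True, False],
})

# Flag a comeback: trailed by 10+ at the half and still won the game
games["comeback"] = ((games["half_deficit"] >= 10) & games["won_game"]).astype(int)
print(games[["team", "half_deficit", "won_game", "comeback"]])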
Looking at teams down by 10 or more points at halftime, we were seeing that every so often there were teams that took the lead and ultimately won the game. Finally, our analysis utilizes play-by-play data and salary data sets, which we will go into in a little more detail in just a few minutes. Now we're going to discuss the two business questions that we want to answer using both data sets. First, looking at the play-by-play data, we want to know exactly what plays or sequences of plays within the game give a team the greatest chance at a comeback victory. We want to know what exactly players and coaches can do and draw up in the lineup to produce a comeback. Then, with the salary and career stats data, we want to see how those variables can be used to determine how involved a player should be in comeback victories, so not just how they actually perform, but how they should perform based on these variables: which players are underperforming and overperforming according to their contract and track record. Next we'll discuss the data and methods that we used for both data sets. Like I said, we had two data sets. The first one focused on play-by-play data. This data contains the process and outcome of every play within every game from 2015 through 2018, so it included all 30 NBA teams and exactly what they did on every single play throughout the games in those years. Zach will also talk about the salary data that we have. Right, so going into the salary data, it contains the salary information for each of the players mentioned in the play-by-play data. Whether a player mentioned in the play-by-play data played one minute of NBA time or hundreds of minutes, they appear in my salary data, and I could then see their career stats and their salary for the seasons we were looking at in the play-by-play data. Now we're going to look at the key variables that we found within the play-by-play data set. You can see things such as comeback, halftime deficit, shot distance, outcome type, and rebound type. But two that we want to focus on in particular are, first, the comeback variable [inaudible 00:04:51]; it was our flag variable. We flagged a one next to all games where a team trailed by 10 or more points at halftime and came back and won, and a zero if not, because, as we said, the overall goal of this presentation is to see what leads to comebacks in general. We also want to look at the halftime deficit, another variable that we created, which shows the number of points a certain team trailed by at halftime. If the deficit was greater than or equal to 10 points, those were the games we specifically looked at, and then we wanted to see if the plays made throughout those games led to a comeback in the end. Now looking at some of the key variables for the salary data. These, as a whole, are variables that we decided were going to be important for our analysis, but once again I want to focus on a couple of these variables, as I feel they are more important to point out and explain. The first is the player involvement variable. The player involvement variables are counts of individual involvement on key plays during comebacks.
These plays could include shots, rebounds, fouls, any of the actionable plays that we see throughout the play-by-play data. We wanted to take individual counts for players so we could see how many times a certain player was shooting the ball throughout the seasons, and really be able to compare these players to other players within the league. Then there's the comeback score. This is a min-max scoring method that we use to score overall player involvement; it's what we use to quantify how involved these players were, utilizing the player involvement variables I just explained. I wanted to go a little deeper into the comeback score calculation just to make sure that everyone understands how it was calculated. As I said, it was a min-max scoring method, used to determine the involvement of the players during their team's comeback victories. This min-max method creates the scores by taking a player's involvement relative to the range of values that appear for each variable. It takes the maximum count of each of these plays as the maximum, and a minimum that was typically zero, since certain players played very few minutes, anywhere from zero all the way up to hundreds of minutes; typically, the zero-minute players did not contribute much to these comeback wins. Below you see the formula we used for each player to create this comeback score. For example, the assist count is divided by the maximum assist count and multiplied by 0.1667, where 0.1667 is 1 divided by the 6 included variables, which is what we would call the weight in the formula. Each of these variables was weighted equally, and we took the min-max score for each variable and multiplied by 100 to get the final score. We'll now look at the play-by-play analysis method. When taking this data, we first began by merging. We had six CSV files, one identifying each individual year, and we combined them all into one central file so we could look at the play-by-play data from the six years that we had. We then transformed the data using flag variables. As we said, we created a column that specified whether there was a comeback or not. We first looked at the halftime scores and saw if teams were trailing by 10 or more at halftime; we would then take those games and see if a comeback actually occurred. If it did, we flagged a one and specifically pulled the plays that occurred. Then, for the descriptive analysis, we looked at different graphs in Tableau. These included things such as how far away the players were shooting from the basket, whether they were missing or making their shots, the rebound types, and things like that, to get an idea of what players were doing during the games and whether they were actually producing good outcomes to secure a comeback. Lastly, the predictive analysis was done in JMP Pro using a decision tree to see which plays and sequences of plays produce a comeback, and how teams can better use those in the future to produce more comebacks throughout the season. Now for the methods for the salary analysis. First off, we had to do some table joins. These joins were necessary to get all of the data tables together, as we needed them all together to really be able to dive into everything as a whole; separated, they weren't much help to us.
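To make the comeback-score formula above concrete, here is a small sketch in Python/pandas; the six involvement counts and the player data shown are invented, not the variables actually used in the study.

import pandas as pd

plays = ["shots", "assists", "rebounds", "steals", "blocks", "fouls_drawn"]
weight = 1 / len(plays)            # about 0.1667, equal weight per variable

players = pd.DataFrame({
    "player":      ["A", "B", "C"],
    "shots":       [40, 12, 0],
    "assists":     [18, 30, 1],
    "rebounds":    [25, 10, 2],
    "steals":      [6, 9, 0],
    "blocks":      [4, 1, 0],
    "fouls_drawn": [15, 20, 3],
}).set_index("player")

# Min-max style score: each count divided by its column maximum, weighted, summed, scaled to 100
score = (players[plays] / players[plays].max()).mul(weight).sum(axis=1) * 100
print(score.round(1))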
Then we moved on to some data transformation. We wrote SQL queries to gather the counts of the key metrics; this is how we got the counts of shots for the various players, along with other things such as rebounds or fouls. Then we moved on to some descriptive analysis that was completed in Tableau. With this descriptive analysis, one of the key things we were looking at was the comeback scores, both actual and predicted, versus the salary of the players, so we could see just how well they're performing relative to their salary. Finally, we had a predictive analysis: a linear regression completed in JMP Pro, which I will go into in a little more detail later on. Now we're going to jump into the descriptive and predictive analyses that we conducted, beginning with the descriptive analysis. Here we want to look at salary versus comebacks for each NBA team. If you look at the data points, you can see that most teams follow the trend line, meaning that as they spend more money on their team's salary, they also produce a greater number of comebacks. You can see that the Boston Celtics had the most comebacks at 14 towards the top, and the Cleveland Cavaliers have the highest salary paid but also one of the fewest comebacks, with only five. What we thought was most interesting was the Indiana Pacers, because not only did they pay such a low salary, but they were also able to produce 12 comebacks, which is the third most in the NBA. I wanted to hone in on the Indiana Pacers and see what exactly they were doing that allowed them to produce such a high number of comebacks with such a low salary. As Weston said, we wanted to focus on the Indiana Pacers. Here we see the salary of Pacers players versus their individual comeback scores. Several highly scored players are found on the Indiana Pacers roster, as you can see with Myles Turner, Carlson, Young, George, and Oladipo. The top-scored players are spread across the salary spectrum: you see some cheaper players such as Myles Turner, Carlson being more of a mid-range salaried player, and then more expensive players such as Paul George or Victor Oladipo further towards the top right of the graph. So you can really see how they've spread the wealth and are getting maximum performance out of their highly paid players, while also finding performance from lower-paid players. You can also see that they have several middle-tier players that come into play and provide big help to the Pacers, as they need some players to come off the bench, provide some key value plays, and produce comebacks. One of the key points I want to make is that Victor Oladipo, the highest-paid player on the team, is also the highest performing in terms of comeback score, so they're definitely getting their worth out of him as a player. All right, so now we're going to get into our predictive analysis. To begin with the play-by-play data, we decided to make a decision tree to predict the play types that lead to the Indiana Pacers producing a comeback, using the variables that you can see below. We'll see that only a couple of these variables actually had a big impact in predicting whether the Pacers would come back from 10 or more points down. In the decision tree, there are two nodes in particular.
One where the shot distance was greater than or equal to 26 feet from the basket and they were making those shots, and another with a shot distance of greater than or equal to 3 feet from the basket, meaning they're looking at more of a layup option. Now we're going to look at those two nodes in a little more detail. These branches, as I said, predict that the Pacers produce comeback victories. The overall model had a validation misclassification rate of 45.97%. As I said, the model predicts that made shots of 26 feet and further, and made layups 3 feet or further from the basket, lead to comebacks. What we would say is that they should really focus on the three-point aspect and on higher-percentage shots such as layups, because, as you can see, in both of those nodes the prediction was one, which in this case means the Pacers were able to produce a comeback. You can see that with the 26-feet-and-further node, the probability that it equaled 1 was 62.75%. Then, for layups 3 feet or further from the basket, the probability of a comeback was 75.9%. As I said, two variables seemed the most important of the 10 we looked at in predicting whether the Pacers were able to produce a comeback. The first was shot distance, which looked at how far players shot the ball, and the second was shot outcome, whether they made or missed the shot. With the distance, as I said, it's 26 feet or further, which is about the three-point range, or some of those higher-percentage shots in the paint for a layup; and if you're making more shots, you're producing a higher score, giving you a better chance of coming back. All right, now moving into the linear regression portion of our predictive analysis. This regression, as I said, was completed in JMP Pro, and it was done to predict the comeback scores of individual players based on the variables that you see on screen. A couple of the key ones to point out are the individual player salaries, the team name, and the career statistics, as you see with all those different variables there. It's also important to note that the variables were selected for this regression based on their level of significance; if a variable was not found to be significant, it was not included in the regression. Going into the summary of fit for this linear regression, I do want to point out that it has a low RSquare, but this is not a primary concern for our analysis. We knew that the comeback score was based on the comeback involvement statistics, but we now wanted to know what the score would be based on completely different variables. Instead of using the variables that we used to create the statistic initially, we're now using new variables, their salary and career stats, to try to predict what it should be. That means the predictions will vary from the original scores, and that was not only expected in our analysis but also desired, so we could come up with different scores that really show how players were supposed to perform. Based on this analysis, we were able to identify some of the most important variables. The first, which we saw was most important, was salary. We were seeing that higher-paid players were predicted to perform more, which is something you would definitely expect to see in the actual NBA.
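As a rough stand-in for the decision-tree step described above, here is a sketch using scikit-learn instead of JMP Pro; the play-level data are simulated, so the splits and misclassification rate shown are only illustrative.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 2000
plays = pd.DataFrame({
    "shot_distance": rng.uniform(0, 30, n),        # feet from the basket
    "shot_made":     rng.integers(0, 2, n),
})
# Simulated label: made threes and made close shots help the comeback
p = 0.3 + 0.3 * plays["shot_made"] * ((plays["shot_distance"] >= 24) |
                                      (plays["shot_distance"] <= 4))
plays["comeback"] = rng.random(n) < p

X_train, X_test, y_train, y_test = train_test_split(
    plays[["shot_distance", "shot_made"]], plays["comeback"], random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=["shot_distance", "shot_made"]))
print("Validation misclassification rate:", round(1 - tree.score(X_test, y_test), 3))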
We would expect players like Victor Oladipo or LeBron James, with higher salaries paid to them, to perform better than those with lower salaries. Moving on from there, we also have the team. This one definitely makes sense: some of the top teams for predicted comeback scores are the Golden State Warriors and the Indiana Pacers, which are a couple of the teams we saw with the highest number of comeback victories over the seasons we were looking at. We also had a couple of career stats that popped up as some of the most important variables for this regression: first, the players' career total rebounds, followed by career points. For players with higher career total rebounds and higher career points, we expected more value when it came to creating comeback victories. I'll also note that these variable importances were assessed through the LogWorth. Now we're going to look at the conclusions of the presentation. Okay, so going into some of the Indiana Pacers predictions, we specifically want to point out some of the Pacers' top performers and underperformers. The blue dots are the actual Pacers top performers that we saw in the earlier graph of actual comeback score versus salary, whereas now we are looking at the predicted score versus salary. The orange Xs mark the Pacers' underperformers. The underperformers marked with the orange Xs are predicted to be relatively much higher than their teammates, whereas with their actual scores they find themselves more at the middle-to-lower end of the pack relative to their teammates, which really shows us that they're not performing up to what their salary and career statistics say they should, particularly when it comes to creating a comeback victory. But it is important to point out that the team has done a great job of signing inexpensive players that produce comeback wins. We see players such as Myles Turner, Carlson, or Young that have somewhat lower salaries but also produce a lot of plays that can help create a comeback win. Then we also wanted to point out some of the Cleveland Cavaliers predictions and the faults that go with them. The Cavs should have multiple high-tier comeback players. One to specifically point out would be Kevin Love. Kevin Love is there at the top of the graph, and he has both an orange X and a blue mark next to his name. That means he was one of the actual top performers for the Cavaliers, but at the same time he's underperforming greatly. In our predictions, we can see that he's predicted to actually perform better than LeBron James, which is very interesting to point out. With our predictions based on salary and career stats, we would expect Kevin Love to outperform LeBron James when it comes to producing comeback wins. But in reality, he's actually quite far down the list; he still remains one of the top performers, but he does not produce nearly as much as LeBron James does. Now we also wanted to look at the best-value players in the NBA. We show the top five here.
Looking at their predicted comeback scores, you have people such as Karl-Anthony Towns, Joel Embiid, and Ben Simmons, who were predicted to be some of the higher-performing players in the entire league. But as you can also see, when this data was taken, they had relatively low salaries compared to other players. What we recommend here is definitely giving these players the contracts that they are deserving of, as they help teams produce comebacks and provide statistics that allow teams to perform their best; they're definitely producing more than their contracts suggest. Then we also wanted to look at the best lineup predictions for the Pacers. As I mentioned earlier, there's that three-point and high-percentage emphasis, so build a lineup of shooting threats from distance. You have people such as Robinson, Joseph, and George, whose average shot distance is about 15 feet and beyond, with the three-point line at about 25 feet, so that shows that they are shooting a lot of threes and also making them. Not only are they shooting from that far, but they're also more likely to make their shots, so those players would be good to have in the lineup whenever you are trying to produce a comeback, as they're more efficient. Also, because they can shoot from deep, you'd expect them to also have solid plays down low to get a quick layup and get those higher-percentage shots to go in as well. As I mentioned, an average distance of made shots near the three-point line is very important for the Pacers in particular in producing a high number of comebacks. This analysis confirms what is already going on in the NBA: typically, teams that find themselves down by a certain number at halftime will put up a few more three-point shots, but they don't really focus on the high-percentage look from down low at the basket. We also think they should focus on drawing up plays that let them get a quick layup and build momentum from that as they try to produce a comeback later on. All right, so that wraps up our presentation. We just want to say a quick thank you, and this is where we would open it up to questions.
Labels
(10)
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
0 attendees
0
0
0 attendees
0
0
Introducing Limits of Detection in the Distribution Platform (2022-US-EPO-1164)
Monday, September 12, 2022
When taking measurements, sometimes we are unable to reliably measure above or below a particular threshold. For example, we may be weighing items on a scale that is known to only be able to measure as low as 10 grams. This kind of threshold is known as a "Limit of Detection" and is important to incorporate into our data analysis. This poster will highlight new features in the Distribution platform for JMP 17 that make it easier to analyze data that feature detection limits. We will highlight the importance of recognizing detection limits when analyzing process capability and show how ignoring detection limits can cause a quality analyst to make incorrect conclusions about the quality of their processes. Hi, my name is Clay Barker, and I'm a statistical developer in the JMP group. I'm going to be talking about some new features in the Distribution platform geared towards analyzing limit of detection data. It's something I worked on with my colleague, Laura Lancaster, who's also a stat developer at JMP. To kick things off, what is a limit of detection problem? What are we trying to solve? At its most basic level, a limit of detection problem is when we have some measurement device that's not able to provide good measurements outside of some threshold. For example, let's say we're taking weight measurements and we're using a scale that's not able to make measurements below 1 gram. In this instance, we'd say that we have a lower detection limit of 1, because we can't measure below that, but in our data set we're still recording those values as 1. Because of those limitations, we might have a data set that looks like this: we have some values of 1 and some non-1 values that are much bigger. We don't really believe that those values are 1s; we just know that they are at most 1. This kind of data happens all the time in practice. Sometimes we're not able to measure below a lower threshold; sometimes we're not able to measure above an upper threshold. Those are both limit of detection problems. What should we do about that? Let's look at a really simple example. I simulated some data that are normally distributed with mean 10 and variance 1, and we're imposing a lower detection limit of 9. If we look at our data set here, we have some values above 9, and we have some 9s. When we look at the histogram, this definitely doesn't look normally distributed, because we have a whole bunch of extra 9s and we don't have anything below 9. What happens if we just model that data as is? Well, it turns out the results aren't that great. We get a decent estimate of our location parameter, our mu; it's really close to 10, which is the truth. But we've really underestimated the scale, or dispersion, parameter. We've estimated it to be 0.8, when the truth is that we generated the data with scale equal to 1. You'll notice that our confidence interval for that scale parameter doesn't even cover 1; it doesn't contain the truth, and that's generally not a great situation to be in. What's more, we fit a handful of distributions, the lognormal, the gamma, and the normal, and the normal distribution, which is what we used to generate our data, isn't even competitive based on the AIC. Based on those AIC values, we would definitely choose a lognormal distribution to model our response. So we haven't done a good job estimating our parameters, and we're not even choosing the distribution that we generated the data with.
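Here is a rough sketch of this simulated example in Python/scipy for illustration: generate normal data with a lower detection limit of 9, fit it naively, then fit it treating the recorded 9s as left-censored, which is the approach the rest of the talk builds up to. In JMP this is handled by the Distribution platform rather than code like this.

import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(42)
y = rng.normal(10, 1, 200)
dl = 9.0
censored = y < dl
y_rec = np.where(censored, dl, y)          # recorded values: 9s stand in for anything below 9

# Naive fit: treat the recorded 9s as real observations
print("naive:    mu=%.2f sigma=%.2f" % (y_rec.mean(), y_rec.std(ddof=1)))

# Censored MLE: observed values contribute the density, censored values the CDF at the limit
def negloglik(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    ll = stats.norm.logpdf(y_rec[~censored], mu, sigma).sum()
    ll += censored.sum() * stats.norm.logcdf(dl, mu, sigma)
    return -ll

res = minimize(negloglik, x0=[y_rec.mean(), 0.0])
print("censored: mu=%.2f sigma=%.2f" % (res.x[0], np.exp(res.x[1])))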
We just threw all those 9s into our data set, ignoring the fact that it was incomplete information, and that didn't work out well. What if, instead of ignoring that limit of detection, we just throw out all of those 9s? Well, now we've got a smaller data set and it's biased; we've thrown out a large chunk of our data, so we have a biased sample. Now if we fit our normal distribution, we're overestimating the location parameter, and we're still underestimating the scale parameter. We're actually in quite a bad position still, because we haven't done a good job with either of those parameters. We're also still unlikely to pick the normal distribution: based on the AIC, the lognormal and the gamma distributions both fit quite a bit better than the normal. We're still in a bad place. We tried throwing out the 9s, and that didn't turn out well. We tried just including them as 9s, and that didn't turn out well either. What should we do instead? The answer is that we should treat those observations at the limit of detection as censored observations. Censoring is a situation where we only have partial information about our response variable, and that's exactly the situation we're in here. If we have an observation at the lower detection limit, which I've denoted D sub L here, we say that observation is left censored: we don't say that Y is equal to that limit of detection, we say that Y is less than or equal to that DL value. On the flip side, if we have an upper limit of detection, denoted DU here, those observations are right censored, because we're not saying that Y is equal to that value, we're just saying it's at least that value. If you're looking for more information about how to handle censored data, one of the references that we suggest all the time is Meeker and Escobar's book Statistical Methods for Reliability Data; that's a really good overview of how you should treat censored data. If you use some of the features in the Survival and Reliability menu in JMP, then you're familiar with things like Life Distribution and Fit Life by X; these are all platforms that accommodate censoring in JMP. What we're excited about in JMP 17 is that we have now added features so that we can handle this limit of detection problem in Distribution as well. All you have to do is add a detection limit column property to your response variable and specify what the upper and/or lower detection limit is, and you're good to go; there's nothing else you have to do. In my simulated example, I had a lower detection limit of 9, so I would put 9 in the lower detection limit field here. That's really all you have to do. By specifying that detection limit, Distribution is going to say, okay, I know that values of 9 are actually left censored, and I'm going to do the estimation accounting for that. Now, with that same simulated example and the lower detection limit specified, you'll notice we get a much more reasonable fit for the normal distribution. Our confidence intervals for both the location and scale parameters now cover the truth, because we know, again, the location was 10 and the scale was 1, and that's a much better situation. The CDF plot here is a really good way to compare our fitted distribution to our data: the red line is the empirical CDF, and the green line is the fitted normal CDF.
In that CDF plot, as you can tell, the two lines are really close up until 9, and that makes sense, because that's where we have censoring. We're doing a much better job fitting these data because we're properly handling that detection limit. I just wanted to point out that when you've specified the detection limit, the report makes it really clear that we've used it. As you can see here, it says the fitted normal distribution with detection limits, and it lets you know exactly which detection limits it used. Now, because we're doing a better job estimating our parameters, inference about those parameters is more trustworthy. If we look at something like the distribution profiler, we can now trust the inferences based on our fitted distribution, and we feel much better about trusting things like the distribution profiler. With the simulated example, if we use our fitted normal distribution, because we properly handled censoring, we know that about 16% of the observations are falling below that lower detection limit. I also wanted to point out that when you have detection limits in Distribution, we're only able to fit a subset of the distributions that you would normally see in the Distribution platform. We can fit the normal, exponential, gamma, lognormal, Weibull, and beta. All of those distributions support censoring, or limits of detection, in Distribution. But if you were using something like the mixture of normals, that doesn't extend well to censored data, so you're not going to be able to fit that when you have a limit of detection. I also wanted to point out that if you have JMP Pro and you're used to using the Generalized Regression platform, Generalized Regression recognizes that exact same column property. The detection limit column property is recognized by both Distribution and Generalized Regression. One of the really nice things about this new feature is that it gets carried on to the capability platform. If you do your fit in Distribution and you launch capability, we're going to get more trustworthy capability results. Let's say that we're manufacturing a new drug, and we want to measure the amount of some impurity in the drug. Our data might look like what we have here. We have a bunch of small values, and we have a lower detection limit of 1 mg. These values of 1 that are highlighted here, we don't really think those are 1. We know that they're something less than 1. We have an upper specification limit of 2.5 milligrams, so this is a situation where we have both spec limits and detection limits. It's really easy to specify those in the column properties. Here we've defined our upper spec limit as 2.5 and our lower detection limit as 1. Now all you have to do is give Distribution the column that you want to analyze. It knows exactly how to handle that response variable. Let's look at the capability results. Now, because we've properly handled that limit of detection, we trust that our lognormal fit is good. We see that our Ppk value here is 0.546. That's not very good. Usually you would want a Ppk above 1. This is telling us that our system is not very capable. We've got some problems that we might need to sort out. Once again, what would have happened if we had ignored that limit of detection and we had just treated all those 1s as if they truly were 1s? Well, let's take a look. We do our fit, ignoring the limit of detection, and we get a Ppk of above 1.
Based on this fit, we would say that we actually have a decently capable system, because a Ppk of 1 is not too bad. It might be an acceptable value. By ignoring that limit of detection, we've tricked ourselves into thinking our system is more capable than it really is. I think this is a cool example, because we have a lower detection limit, which may lead you to believe that ignoring the limit of detection would be conservative, because you're overestimating the location parameter. That's true: when we ignore the limit of detection, we're overestimating that location parameter. But the problem is we're grossly underestimating the scale parameter. That's what makes us make bad decisions out in the tail of that distribution. By ignoring that limit of detection, we've really gotten ourselves into a bad situation. Just to summarize, it's really important to recognize when our data feature a limit of detection. Sometimes we think about data sets where maybe we've analyzed the response as is in the past, when really we should have adjusted for a limit of detection. Because, like we just saw, when we ignore those limits, we get misleading fits, and then we may make bad decisions based on those misleading fits. Like we saw in our example, we got bad estimates of the location and scale parameters, and our Ppk estimate was almost double what it should have been. But what we're excited about in JMP 17 is that the Distribution platform makes it really easy to avoid these pitfalls and to analyze this kind of data properly. All you have to do is specify that detection limit column property, and Distribution knows exactly what to do with it. Today we only looked at lower detection limits, but you can just as well have upper detection limits, and in fact, you can have both. Like I said, there are only six distributions that currently support censoring inside of the Distribution platform, but those are also the six most important distributions for these kinds of data. It really is a good selection of distributions. That's it. I just wanted to thank you for your time and encourage you to check out these enhancements to the Distribution platform. Thanks.
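For reference, here is a rough sketch in Python of one common percentile-based way to compute a one-sided capability index like the Ppk quoted above from a fitted nonnormal distribution. The lognormal parameters below are invented, and JMP offers more than one nonnormal capability method, so treat this only as a conceptual analogue, not a reproduction of the report.

```python
from scipy import stats

# Hypothetical fitted lognormal (shape and scale made up for illustration)
dist = stats.lognorm(s=0.5, scale=1.0)
USL = 2.5                                        # upper spec limit, as in the example

median = dist.ppf(0.5)
p99865 = dist.ppf(0.99865)                       # "upper 3-sigma equivalent" quantile
ppu = (USL - median) / (p99865 - median)         # one-sided, percentile-based index
print(ppu)
```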
Labels (6):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Easy DOE: Easy Enough for a Seven-Year-Old? (2022-US-30MP-1163)
Monday, September 12, 2022
JMP 17 introduces the Easy DOE platform, providing both flexible and guided modes to users, aiding them in their design choices. In addition, Easy DOE allows for the DOE workflow from design through data collection and modeling. This presentation offers a preview of the new Easy DOE platform, including insights from a seven-year-old using the new platform on a classic DOE problem. Hello. Thank you for joining us today. I'm Ryan Lekivetz, Manager of the DOE and Reliability team at JMP. And I'm Rory Lekivetz, rising second-grade student. Today, we're going to talk to you about Easy DOE. I posed the question: is it easy enough for a seven-year-old? Now, for a lot of you watching, you may not have heard of Easy DOE before. It's one of our new platforms in JMP 17, and it's in base JMP, so you don't need JMP Pro to use it. The idea with Easy DOE is that we're going to have one file, one workflow, that's going to contain both the design and analysis. If you're familiar with doing designed experiments in JMP, you're used to going under the DOE menu, creating a designed experiment, and then making a data table from there. And then you would do all your analysis on that data table. There was that separation between the design and the analysis. The idea with Easy DOE is that we're trying to aid novice users through the entire workflow. And so, unlike custom design, you're going to see a lot more hints and different defaults that are set to try to aid those users, and you're going to see that both on the design and the analysis side. In addition, you'll also see there is a flexible mode for those who are more comfortable with DOE. That said, my purpose in this talk was really to see: is Easy DOE going to be easy enough to use? Well, I said seven-year-old, but I guess you're more like seven and a half now, is that right? The idea here was to let Rory do the steering throughout. I wanted her to be the one using Easy DOE, putting everything in. I wanted to have as little input as possible, even when it came to decisions about what to do with the design, and things like that. Of course, I still did need to give her an introduction to DOE. If you've seen our DOE documentation, this figure might look familiar. So we went through the different phases. We talked about what's the difference between a goal, a response, and the factors. Specify: that's where we talked about what's going on with the model, and in particular, what's the difference between a main effect and an interaction? Why would you care about one or another? And we said, well, once we have the design, we actually have to go about collecting some data. Then we have to fit, then we have to do something with the model. We want to find out what's important and how we can use that model to do something further. You can imagine we spent maybe about half an hour just talking about some of those different things. I said, well, why don't we use some classic experiments? My suggestions were: why don't we try the old paper helicopter experiment? On our shelf we have the Statapult, the classic catapult experiment. And so what did Rory say? No. Rory's idea was actually to do paper airplanes. She'd started to try out flying some paper airplanes. She said, well, I want to do a DOE with paper airplanes. I said, well, that's great, let's see what we can do. Luckily, she knew the classic paper airplane, which, as you'll see, is called a Dart.
But we found a website that had instructions for the different types of paper airplanes you could make. And on top of that, it had suggestions for what you could do to try to make your plane fly better. So thankfully, instead of her having to figure out what some of her factors and levels were going to be, this website had some really nice suggestions for that. Before we get into this: Easy DOE, as I said, is that new platform, and you're just going to find it under the DOE menu. Underneath custom and augment design, there's Easy DOE right there. Now, what you'll see with Easy DOE is that the idea of going through that workflow is done via tabs. As soon as you launch Easy DOE, you'll see there's that guided and flexible mode. We're just talking about guided mode here. But the idea is we're going to go through these different steps by clicking on the tabs. There's the Define, and then we're going to go to the Model, Design, et cetera. One way to do that is to click on the tabs one at a time. And at the bottom of Easy DOE, you'll also find a set of navigation controls that will take you forward and backwards between the different tabs. And so the idea of our talk here is that we're going to go through these different tabs and both of us are going to give observations. Rory is going to give her thoughts first on those different tabs, followed by my own. Think of mine as more of a teacher's point of view, and Rory's as that of the novice user trying Easy DOE for the first time. Let's start with the Define tab. On this one, we had to type in the different factors and levels that we were going to do for a paper airplane. None of my factors had numbers, so I chose categorical. The different factors that I had were type, where the levels were dart and lock bottom. And then there was paper, and the levels were regular and construction. In throwing force, the levels for that one were hard and light. And paper clip was paper clip and no paper clip. The response that we had was distance. And for the distance, I wanted to see what could make the paper airplane go the farthest, and our JMP goal was maximize. Just to mention: when you go into Easy DOE, if you take a look at that screenshot from before, we currently have three different types of factors: continuous, discrete numeric, and categorical. I'll say I think she was able to identify that she needed the categorical type for all of those. Of course, she did need confirmation. When she said categorical, she looked back to me to see, "Am I doing that correctly?" She was able to identify the factor types and actually enter her factor names and her levels. Now, I will admit, though, she's used to using a touch screen. There did come a point, instead of her trying to click into the little box for the levels or the factor name and do a double click, where I told her you can just use the tab button to make your life easier. Again, I didn't want to have a whole lot of input. She picked the four factors and levels, and I think I had told her that three or four was a good number. But I will say, if you think back to that paper clip: she had mentioned it was the paper clip or no paper clip, and that one took a little bit of time. I think maybe if we had gone with paper clip yes or no, it might have been easier. But I think for her, the sticking point was this:
Well, how do we actually define a paper clip factor that also has a paper clip level? But I think once she had the idea that, okay, it was just a yes or no, it's in there or not, we were able to get through it okay. Moving on from the Define tab: the Model. For the Model tab, we picked the one which we thought would be the best. We decided that main effects and two-factor interactions were the ones that we wanted. So we picked the one that had main effects and two-factor interactions. The number of runs meant we had to make 16 airplanes, which didn't seem too bad. Now, I'll say in hindsight, from my own view on the Model tab, this required the most hand-holding of all the tabs. And this was really more because of trying to explain the difference between going with just main effects versus interactions. But again, this is a seven-and-a-half-year-old who's never taken any statistics course and has never done any kind of modeling before. If you have somebody who's familiar with the idea of main effects and interactions, then I think that tab wouldn't be nearly as bad. Now, I will say too, though, it was nice that the paper airplane website we were using spelled out the idea of what interactions really mean with those factors. You would see things where it said some of the airplanes will work better with a paper clip, or some airplanes are better when they're thrown hard, while others need that lighter touch. In some sense, that actually gave us a natural point to start talking about interactions. Now, I'll also say, from my own perspective, it's also something that we can improve upon. From our side, I want to think about, well, how can we help users distinguish between these choices? We did have the hints in there as well. But when I think of it, if I wasn't there to do the hand-holding, how might she have come up with that decision? Now, when we were done with that Model tab, the next tab that we clicked on was... Design. There's not really a lot I can say about this one, but I thought it was interesting that it was showing what kind of paper airplanes you were going to make. And then we had the hard work of making planes and flying them. Yeah, I think that probably took the most time, but yeah. I don't have too much to add on this one. I think, though, it was nice to have that design displayed, just to give her that sense of, well, what did that really mean when we put in those 16 runs? When we have those factors, what does that mean at the end of the day? I think from this, she could really get that sense: okay, then my plane one, it's the lock bottom with regular paper, I'm going to throw it lightly and put a paper clip on it. What was the next tab that we went to? Data Entry. So, for the Data Entry tab, we didn't really have to do that much in this one. We just put in the distance that the different kinds of airplanes flew. That was what we measured with our measuring tape. Is that right? Now, I'll say with this one, if you look, there were these factor plots. This is just for the main effects here, but this actually was already telling her something as soon as she was done. She saw those factor plots at the bottom, and she said the lock bottom is not the best. I think she might have used something stronger than that, but yeah, the lock bottom was not good for her. Also, this was a good teaching moment about randomization. The run order does come out randomized.
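As a rough illustration in Python (Easy DOE constructs its own design, so this is not what the platform does internally), here is one way to lay out 16 runs over these four two-level factors with a randomized run order; a full factorial is simply the most familiar 16-run arrangement that supports all main effects and two-factor interactions.

```python
import itertools
import random

factors = {
    "Type": ["Dart", "Lock Bottom"],
    "Paper": ["Regular", "Construction"],
    "Throwing Force": ["Hard", "Light"],
    "Paper Clip": ["Yes", "No"],
}

# 2 x 2 x 2 x 2 = 16 runs, then randomize the run order before flying anything
runs = [dict(zip(factors, combo)) for combo in itertools.product(*factors.values())]
random.shuffle(runs)
for i, run in enumerate(runs, start=1):
    print(i, run)
```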
I did have to warn her, when we had all the paper airplanes with us outside, I said, well, you don't want to throw all the dart type first, followed by the lock bottom, because you might get better as you were throwing, or it might start to storm, it might get windier. Now, I'll say, with some of these results, there were times where the hard throwing force needed a bit of practice. I'll admit there were a few there that probably were based on more than one throw, because we had some almost immediate crash landings. But I'll say it did seem straightforward for her to be able to enter the data right from there. I didn't have to say anything; she knew. When you come in here, the response column just had missing values, and so she had that intuition: well, this is where I need to go to put in the data. I'll also mention here, you'll see this export data and load response. That export data is if you actually just want to create a data table with all your stuff. Sometimes that will be useful if you want to go through what you're typically thinking of with your JMP workflow. If you just want a JMP data table, that's what that export data button is going to do. Likewise, load response: if you've actually just recorded your responses in a different data table, you can do that. Now that we had our data, what did we have to do? -What was our next step? -Analyze. For the Analyze tab, I already knew that the dotted ones weren't the most important ones. I figured out I just needed to click them to get rid of them. I found out dart type was one of the most important ones. Now, I'll say here, the Analyze tab I had thought was going to be the hardest one to explain, but it was surprisingly easy and effective. It was almost as if she clicked on the tab and just started doing her own thing, and I didn't really need to say an awful lot. Now, you'll notice here, when you come into that Analyze tab, when Rory saw it, it was the full model, but she saw a lot of the terms that were not significant. There were a lot that were dashed and close to that zero. And so what she did was just remove those ones, one at a time. And then you'll also see there is a best model button that's based on some type of a forward selection. Now, I'll say the best model may actually have more terms, but at the end, you can see those extra terms really weren't even significant. So sometimes I couldn't even argue with the results. Perhaps one might even argue that the model that Rory picked was better, because it was simpler and there wasn't a huge difference between the two. Now, one of the nice things with this Easy DOE platform, this Analyze tab for the guided users, is this idea of adding and removing terms easily. And so to add and remove these terms, all that you do is go to those confidence intervals in there, and a click will either add it or remove it, depending on whether it's currently in the model. I highly recommend trying that out when you get your hands on 17. Perhaps there was something to be said for the best model: when we see it in a minute with the profiler, it makes things a little bit more interesting, of course, if you have some additional terms in there. Rory's model, with only the three terms, made the profiler less interesting to try to explain things with. But again, for a first-time DOE, a model like that didn't disappoint me at all. Now that we had a model, what do we actually do with that? -So what was our next step? -Predict. The Predict tab, this tab tells you what the best paper airplane was.
If you click on the levels that you think are the worst, it shows what might happen with that airplane. As with that Model tab, if you have a user who's used the profiler before in something like Fit Model, of course, that'll be a lot easier. But I'll also say the profiler is intuitive in general. I think it was easy for her to pick up on once she started playing around with that profiler, and then we just needed a little bit of a discussion as to what that actually meant. But that's the nice thing with Easy DOE: the profiler is already intuitive. It was also a good teaching moment when it came to interactions. If you go back and look at that model, we actually had an interaction between the type and the type of paper. Paper and type had an interaction, and so then we could talk about, well, what happens when we change type, and then look at what happens with that paper? Again, this is where those extra model terms come in. In Rory's model, that hard and light had a zero effect. It was saying it didn't really matter what you did for hard or light. The best model had a small effect for it, so you could say, well, it's not going to make a big difference, but for some of these it was saying that it did. But again, I think the Predict tab, she seemed to do well in that. And "optimize" did need some explanation, I think just because that's a new word in her vocabulary. But I think that she had that sense of, okay, these were the settings that were going to be the best. I think that was everything we had to talk to you about today. I did just have a few final questions for you, if that's okay. What was your favorite part of the experiment? Flying the airplane. What was your least favorite part? Getting hot flying the airplane. The air was about 100 degrees Fahrenheit when we were flying the airplanes, but we didn't have a lot of choice. We had a lot of storms coming up, didn't we? If you were to tell somebody, what was the most important factor? What's the most important thing if you wanted to make the paper airplane? Maybe plane type, like, with a dart. Okay. Was there anything else that was important? Nothing you can really think of? Okay. I think we found the construction paper mattered a little bit, and I think it did actually say the no paper clip. I think you were saying that might be because of the weight. But I'll say, yeah, if we tried it again... I think you had the paper clip at the back part of the wing, and so we thought about maybe putting it at the nose if we were to try that again. But let's say, what do you want to do for your next experiment? Should we do another one? What would you want to do? Statapult. A Statapult? Yeah, that does look fun. And now, I didn't ask you to answer this in any way, did I? But was Easy DOE easy to use? Yes. It was. Okay. I'm glad to hear it. I think that's everything that we have left today. So thank you for your time. And please post any questions in the community forums below if you'd like to ask either of us anything. Thank you.
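For readers curious about the shape of the model behind the Analyze and Predict tabs, here is a rough sketch in Python using statsmodels (not Easy DOE itself): main effects plus two-factor interactions fit to the 16-run design, then used to pick the best predicted setting. The flight distances are invented purely for illustration.

```python
from itertools import product
import pandas as pd
import statsmodels.formula.api as smf

levels = {"Type": ["Dart", "LockBottom"], "Paper": ["Regular", "Construction"],
          "Force": ["Hard", "Light"], "Clip": ["Yes", "No"]}
design = pd.DataFrame(list(product(*levels.values())), columns=list(levels))
design["Distance"] = [30, 28, 22, 21, 25, 24, 18, 17,
                      14, 13, 12, 11, 13, 12, 10, 9]        # invented flight distances

# Main effects plus all two-factor interactions, as chosen on the Model tab
model = smf.ols("Distance ~ (Type + Paper + Force + Clip) ** 2", data=design).fit()
best = design.assign(Predicted=model.predict(design)).nlargest(1, "Predicted")
print(best)
```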
Labels (13):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Interactive HTML Improvements in JMP 17 (2022-US-30MP-1162)
Monday, September 12, 2022
Interactive HTML was introduced in JMP 11, and each succeeding version adds support for more interactive features and improves support for JMP Live. When used alone, Interactive HTML enables users to share JMP reports. JMP Live supports that sharing with collaboration, organization, security, automation, and significantly more interactivity. This paper explains the new features in JMP 17 Interactive HTML, both when it's used alone and as part of JMP Live. The journal shown in the presentation contains Example buttons that link to Interactive HTML files hosted on an internal JMP Live server. I've attached a journal formatted similarly to the one shown in the video, but rather than launch files from a JMP Live server, the Example buttons produce JMP reports that can be exported to Interactive HTML or published to JMP Live or JMP Public (when it is upgraded to JMP 17) from JMP 17, so you can see the improvements in Interactive HTML yourself. Hi, welcome to this talk. Interactive HTML is a format for sharing JMP reports and dashboards on the web with some of JMP's signature interactivity. It's also the format used in JMP Live and JMP Public. My name is John Powell, and I'm the Software Development Manager for the team that puts this feature together for JMP. With every release of JMP, we improve and add more of this capability. In this talk, I'm going to show you what's new and improved in Interactive HTML for JMP 17. Now, I'm not going to cover everything that we've done, just the highlights. I've organized this into three categories: new functionality, improved functionality, and improved appearance. What you're going to see are examples of this on JMP Live. Let's get started. I'm going to move this out of the way. Sorry about that. Move this out of the way, because I'm going to be bringing up the browser. Starting with new functionality. We've got packed bars, which is a feature that's been in JMP for a couple of releases now. I've got an example. I just need to bring my browser window over here. Here's my packed bars example in JMP Live. As you can see, it looks like it would in JMP. It's got labels, and it's got the bars: basically the most important ones in blue, and all the lesser important ones, or smaller data, in gray. It looks like it would in JMP. It supports a little bit of interactivity, basically these tool tips that you'll see for each bar, and even for the gray items as well. Then you also have some interactivity with selections, and it works with the local data filter. Right now, I'm looking at the commodity of corn. If I want to look at soybean, I can click on that, and the graph updates. That's it for packed bars. Parallel plots [support in Interactive HTML] is brand new in JMP 17. Here's an example using the Iris data set. With a parallel plot, you can drop in different types of variables. In this case, we're looking at the dimensions of the petals and the sepal length and width. Basically, they're all continuous data, and continuous data is drawn as curves. We've made this available in JMP Live now and in Interactive HTML. It supports selection and tool tips. There's a tool tip for one of the lines, and this is continuous. Then, one thing about this particular example is that it has a common axis. All the variables are around the same range, so they share an axis. The next example is using categorical variables. Here's the Titanic passenger database, and I've got passenger class, sex, and survived.
When you use categorical variables, they display very differently. You see these curved sections, and they're also selectable. You'll notice a little interactivity with the legend. I'll talk about that a little bit more later. Now, we also have tool tips in this one: passenger class, sex, male. On this side, you would get sex is female, and survived, yes. The next example is actually mixed with categorical and continuous, the same data set, Titanic passengers, but I've used some continuous variables on the left here and some categorical variables on the right. They also support selection and tool tips on both sides. The next new functionality is categorical response profilers. The difference between a categorical response profiler and a continuous response profiler is basically that you'll see multiple curves. Here's an example. That's interactive now. This would have been static in JMP 16. You see these curves are interacting, and there are tool tips as well. These appear in many different platforms. That was the... Sorry, this is, yes, the ordinal logistic. The next one is generalized regression. Each platform needed some work from my team in order to get them interactive. This one shows down at the bottom, and it's the same thing. It has categorical responses, and it shows the interactivity and tool tips as well. I'm not going to do all of them. Here's one more, Partition. [The profiler in] Partition also is down below. Here's a categorical profiler. You'll see it responds as you move one of the variables. All the other variables respond too. Now, the next one actually isn't a categorical response profiler, but I threw it in here because it's a new platform in JMP 17, Naive Bayes, and it has a profiler in it. It has also been made interactive in Interactive HTML and on JMP Live. The other two that we support are Bootstrap Forest and Boosted Trees. I'll leave those up to you to try when you get JMP 17. This, we hope, is going to be an interesting feature for people: we can resize graphs now in JMP Live and JMP 17. One of the first things you might do in JMP, if somebody sends you a report and it's not big enough, is go to the [bottom right] corner and drag it to resize it. We now support that on the web as well. That's a distribution example. Some examples are a little tricky, like this one here. I'm bringing up a profiler. What makes that tricky is that there are multiple graphs that all resize together. Of course, they stay interactive when they're larger as well. Let's see, the scatterplot matrix is similar too, in that there are multiple graphs. I don't know if I mentioned it, but you can also drag on the side to make them bigger. There you have it. Of course, these are still interactive with tool tips and so on. Now, since we did that, we felt it was necessary to also make it possible to resize panels in dashboards. Imagine if I took this graph here and expanded it wider. Now, it's getting pretty close to the edge. If I did the same with this one, now we're actually cutting off a little bit of the graph. We can resize the panel to give more room to that one. We can even shrink these if we didn't want to show them at the moment, or just to balance between the two panels. That's it for new functionality. I'm going to move on to improved functionality. I'll just take a drink here. All right. Now, the interactive legend: I did show it before, but I wanted to show that it's happening in more than just one graph. It's actually all graphs. I have an example here.
This is back with the Titanic passengers database again. To show you the interactivity: basically, if you click on the legend in JMP 17, you will see the selection in the graph as well. That's one part of this interactivity. The other part is that if you click in the graph to make selections, you'll see that the legend highlights as well. That's behavior that you would see in JMP. We tried to make that available on the web as well. Local data filter has been in Interactive HTML for a while, but there are a few additions to it, modifications and updates. Here's an example with the diamonds data set. As you can see, it's got a pretty busy-looking local data filter here. You can do lots with it. I'll stick with this example for a while. Imagine you wanted to limit what you see in terms of the price. I'm looking at the $4,900-$10,000 diamonds. Then you go, "Well, maybe I don't really want to do that. I want to do the inverse." Now, we have this inverse, which wasn't there in JMP 16; when you click on it, the graph will update. Nothing updates in the local data filter itself. We used to have a feature in this menu up here that would do that. It would invert all the settings. But JMP didn't have that. We prefer to stick to a model that's closer to what JMP has and behaves the way JMP does as well. Another thing that you might see with a big local data filter like this, with lots of options, is that every time I click on a setting here, it updates, right? But what if you want to make a lot of changes? You probably don't want it updating on every single change that you make in the local data filter. Now, we've added this auto-update feature. If we disable it, as you add more settings, nothing changes. That gives you a chance to make lots of changes. Maybe I'll even change something down here as well. Let's just choose Excellent and these V settings, and I'll leave these sliders where they were. Now, I'm ready to update. I hit the update button, and now we've got an updated graph. You may have noticed, or maybe you didn't, that there's a bunch of information being added to this URL every time I change a setting or update. The purpose of that is so that people can share these settings. If you actually selected all of this (there are multiple ways to select; you might try to double click), then you can copy it or use Control-C. Then you can put that in an email and send it to somebody. When they see this graph on JMP Live, they're going to get the settings that you had, not the original settings that you published it with. That's really going to be useful. Another thing about that is that the saving of settings is also done for column switchers now too. This example has a column switcher and a local data filter. Of course, if I change to lead, you would see those settings get updated at the top as well. What's interesting about that is you can grab this URL, and I like to store it in the comments, and I've annotated what I stored here, saying what the settings are going to do. This top link is with the column switcher set to the pollutant equal to carbon monoxide, and the local data filter regions set to California. I believe west is what the W stands for. When I click on that, or if I send that to somebody and they click on it, it will load the post with this pollution map, and then it will use those settings and update it right away. Another thing you can do with that URL is embed it in a journal, like I've done here.
If you look at this side here, I say I have a link with stored settings, with lead for the column switcher. The regions are South Texas, which is different than what you see right now. When I click on that URL, or that link in my journal, it will do the same. It loads up this posting, and then it applies the settings. Isn't that convenient? All right. Here's a feature that we added at publish time: you can actually choose whether you want to have interactivity or performance. The reason for that is that it takes a lot of data stored in the file to be able to provide the interactivity. Sometimes that makes loading slower. Or if you have a really big data set and you don't really want to load all the data, but you want to have it interactive, well, now you can do that with the performance mode. All the examples I showed so far had interactivity mode on so that I could show all the interactivity. This example I published is the same one that you just saw, but the graph is not interactive. It's a static image of the graph. But you also get to interact with that graph by using the column switcher and local data filter. This is because JMP Live rebuilds the graph. That's it for that section. I hope you're going to like those improved functionality features. The final section is things that affect the appearance. We hadn't really paid attention to your font size settings in JMP when you exported in the past, because we really wanted to have uniform font sizes across our web offerings. But now, we wanted to support this. Basically, this is everywhere in JMP that you can change font sizes. It was a big effort, basically, to go through, find them all, and then make sure that they came out as you set them before you published. I've got a few examples, but definitely not all the places where you can set fonts. Here's an example. You may not want to do this, but I did it in a way that you could see what the font differences are. In this case, I increased the carat weight title, but I did not increase the labels for the carat weight. Down on the X-axis, I increased the axis labels' font size and kept the title at the original size. Normally, you'd probably do a little bit more, but this is just for demonstration, so you can see the difference between the regular size and the increased size. Here's another example where font size shows up: labels in graphs, like in heat maps. In heat maps, there's a label that you can apply, and this one has been increased as well. You can tell it's much larger than the other fonts in this graph. I've also increased the size of the title and labels in the legend in this example. Next up, I've got a tree map that also has labels. This is back with the diamonds data set again. With this tree map, we did a little bit more than just font size. You can see that the labels are made larger, not in the legend this time, but the group labels. In JMP 17, they've added the ability to set the background color and the color of the text as well. We felt that we would want to support that too, to keep up with that customization ability. Of course, there are many other places where fonts can be customized in JMP. You can discover that as you get JMP 17 and start customizing font sizes, and then see them respond in your published files or when sharing by exporting to Interactive HTML. Here's a tabulate example. There's a new feature in JMP that allows you to combine columns, like cut and clarity. This is also the diamonds data set.
When you use the stack group columns feature, the items in the cells are indented for the secondary variable. In this example, I've also increased the font size of the titles of these columns in tabulate, so it's easier to see, and it shows yet another place where this can be customized. We did some other work supporting the customization of tables, and this shows up in many different ways. The first one that I'm going to show you is color-coded cells. In JMP 16, even if color coding was used, we didn't carry that through to the web. You would have lost the different colors here in the column that have meaning in this process screening example. Another place where color coding shows up in cell background colors is when you do text explorer. As you see in this graph down here, there's purple and orange being used for positive and negative sentiment. They are now supported in the table as well, not just in the graph. Of course, font size can also be updated. I mentioned that we supported customizing font size. If that's been done in a table, like in this one, where the cells have been increased in size, that is now respected. We used to emphasize small p-values, when they're small enough, by indicating it with just a bold font. But what JMP did was use this color, and the numbers now use that color, so we're looking more like JMP. This final table example shows not cell-color coding, but the actual text in the cells color coded by the correlation. The high, or positive one, correlation here happens to be blue, and here's negative correlation that's red. I'm going to use this example for the next feature, which is themes. We had themes before, but we've added the dark theme. The dark theme is good to show with this example, and I'll show you how you do that in JMP Live. Let me just switch to dark mode this way. Now, you can see those colors pop a little bit more with a black background. I think people will like this, maybe all the time. Actually, if you like it all the time, you can set that as a permanent preference in JMP Live, or set it as a preference when you export. Here, I'm going to open up just an image of a JMP dialog, which is the preferences dialog. If you look on the Styles page of the preferences dialog, down at the bottom, there's Interactive HTML Themes. It used to have light and gray. Now, we have this dark theme as well. Then, when you publish or export to Interactive HTML, that theme will be used. Last but not least, we think this will be really important. If you zoomed in on graphs [exported to Interactive HTML] from JMP 16, they would start to get blurry. I want to talk about two different types of zooming in here. In 16, we did have this magnifier. Let's say I'm zooming in this way. That hasn't changed. But what's changed is that this graph here is at a higher resolution than it used to be. If you use your browser zoom, which is usually Control Plus, and zoom in this way, this is when things would start to get blurry. But now that we've used double resolution, you can go pretty high without seeing any blurriness. We think that will be a good feature. You don't have to turn it on. It's just the default that it's going to be a higher resolution. It makes all your graphs look a little bit better. That's it for the things I have to show. Like I said, these are just the highlights. There are so many other things that my team did, and I'm really proud of the work that they did.
We had a lot of contributors, some of them part time, some of them full time. Here are the names of the contributors. We had Josh Markwardt, Bryan Fricke, Mayowa Sogbein, Praveena Panineerkandy, Paul Spychala, and Tommy Odom. Of course, we'd like to thank all the JMP developers themselves who work on the desktop product and help us figure out how to share these creations on the web. We get a lot of help from them. Of course, we also get a lot of help from our quality assurance team. I'd like to thank them as well, and anybody else on the JMP team who helps us make our contribution look good. Also, the JMP Live team, who host our content when we publish to JMP Live. That's it for my presentation. I thank you for your interest in this topic, and I'll see you at the conference.
Labels (7):
Automation and Scripting
Basic Data Analysis and Modeling
Content Organization
Data Exploration and Visualization
Mass Customization
Quality and Process Engineering
Sharing and Communicating Results
Query Builder (2022-US-EPO-1160)
Monday, September 12, 2022
Extracting and combining data from multiple, disconnected databases was one of the biggest challenges encountered at Samson Rope Technologies before our investment in JMP Statistical Software. With the incredibly useful JMP Query Builder, this challenge has been economically addressed. This presentation describes the way Samson uses the JMP Query Builder to easily extract desired records and fields from a single database (leaving behind the fields that are not of interest) and, more importantly, extract desired records and fields from multiple databases and combine them into a single "master table" file with all required data in one place, ready for rigorous statistical analysis, typically by an R&D Engineer, Quality Engineer or Manufacturing Engineer working on a continuous improvement project. One of the best JMP features is that JMP creates scripts behind the scenes, and if the scripts are properly managed, there is no need to spend time repeating the previous steps to create or update the "master table". And the best part of the JMP Query Builder is that we can create complex JMP scripts just by simple "copy and paste" and combination of the scripts JMP creates for each previous action. No background in scripting or coding is needed to create a JMP script! This poster will be highly graphical in nature, drawing the attention of the viewer to the practical, useful benefits of the JMP Query Builder. Any organization with multiple databases containing important information should benefit from this poster. Hello. My name is Canh Khong. I work at Samson Rope Technology in Ferndale, Washington State. Today, I am talking about how to build a Query Builder using JMP software. Let's start with a working table that has a few columns, such as date, operator initial, job number, and actual measurements. The working table is just for manual data collection. To analyze, we may need to add more information from other tables. How can I do that? The answer is using a Query Builder. Query Builder can extract desired columns from other tables and join these columns to the working table, producing what I call the Master Table. The Master Table has all of the information needed for analysis. Here are a few steps to build a Query Builder that joins the working table to another table. These steps are as simple as point, click, drag and drop. Now, the Master Table is ready for data analysis. However, if needed, a new column with a formula can be manually created and added to the Master Table, and JMP will automatically create the corresponding script for the new column. Copy and paste the script into the Post Query Script. If this is done, there is no need to spend time recreating the new column. The new column will be updated and shown every time the Query Builder runs. If desired, a report can be manually created, and JMP will create the corresponding script. Copy and paste the script into the Post Query Script. If this is done, there's no need to spend time recreating the report. The report will be automatically updated and shown every time the Query Builder runs. If desired, a chart can be manually created, and JMP will create the corresponding script. Copy and paste the script into the Post Query Script. If this is done, there is no need to spend time recreating the chart. The chart will be automatically updated and shown every time the Query Builder runs. One click runs the Query Builder. This will produce the updated Master Table, the report, and the chart as desired. Query Builder summary: combine tables.
Select only desired columns from tables. Create columns with formulas and create reports. Copy and paste the scripts of columns and reports into the Post Query Script. Scripting experience is not needed to build a Query Builder. Creating a Query Builder is as simple as "point, click, drag and drop". With this summary, that concludes my presentation. Thank you.
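As a conceptual analogue of the join described above (this is pandas, not the JMP Query Builder, and the table and column names are hypothetical), the same pattern looks like this: pull only the desired columns from a second table, join them to the working table on a shared key, and add a formula column to the resulting master table.

```python
import pandas as pd

# Working table: manual data collection
working = pd.DataFrame({
    "Job Number": [101, 102, 103],
    "Operator Initial": ["CK", "JS", "CK"],
    "Actual Measurement": [12.1, 11.8, 12.4],
})

# Second table with extra information about each job
job_info = pd.DataFrame({
    "Job Number": [101, 102, 103],
    "Product": ["Rope A", "Rope B", "Rope A"],
    "Spec": [12.0, 12.0, 12.0],
})

# Keep only the desired columns and join them onto the working table
master = working.merge(job_info[["Job Number", "Product", "Spec"]], on="Job Number")
master["Deviation"] = master["Actual Measurement"] - master["Spec"]   # formula column
print(master)
```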
Labels (13):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Different goals, different models: How to use models to sharpen up your questions (2022-US-45MP-1159)
Monday, September 12, 2022
The famous Stanford mathematician, Sam Karlin, is quoted as stating that "The purpose of models is not to fit the data but to sharpen the question" (Kenett and Feldman, 2022). A related manifesto on the general role of models was published in Saltelli et al. (2020). In this talk, we explore how different models are used to meet different goals. We consider several options available on the JMP Generalized Regression platform, including ridge regression, lasso, and elastic nets. To make our point, we use two examples. The first example consists of data from 63 sensors collected in the testing of an industrial system (from Chapter 8 in Kenett and Zacks, 2021). The second example is from Amazon reviews of Crocs sandals, where text analytics is used to model review ratings (Amazon, 2022). References: Amazon, 2022, https://www.amazon.com/s?k=Crocs&crid=2YYP09W4Z3EQ3&sprefix=crocs%2Caps%2C247&ref=nb_sb_noss_1 ; Feldman, M. and Kenett, R.S. (2022), Samuel Karlin, Wiley StatsRef, https://onlinelibrary.wiley.com/doi/10.1002/9781118445112.stat08377 ; Kenett, R.S. and Zacks, S. (2021), Modern Industrial Statistics: With Applications in R, MINITAB, and JMP, 3rd Edition, Wiley, https://www.wiley.com/en-gb/Modern+Industrial+Statistics%3A+With+Applications+in+R%2C+MINITAB%2C+and+JMP%2C+3rd+Edition-p-9781119714903 ; Saltelli, A. et al. (2020), Five ways to ensure that models serve society: a manifesto, Nature, 582, 482-484, https://www.nature.com/articles/d41586-020-01812-9 . Hello, I'm Ron Kenett. This is a joint talk with Chris Gotwalt. We're going to talk to you about models. Models are used extensively, and we hope to bring some additional perspective on how to use models in general. We call the talk "Different goals, different models: How to use models to sharpen your questions." My part will be an intro, and I'll give an example. You'll have access to the data I'm using. Then Chris will continue with a more complex example, introduce the SHAP values available in JMP 17, and provide some conclusions. We all know that all models are wrong, but some are useful. Sam Karlin said something different. He said that the purpose of models is not to fit the data, but to sharpen the question. Then this guy, Pablo Picasso, said something as well. He died in 1973, so you can imagine when this was said; I think in the early '70s: "Computers are useless. They can only give you answers." He is more in line with Karlin. My take on this is that this presents the key difference between a model and a computer program. I'm looking at the model from a statistician's perspective. Dealing with Box's famous statement: "Yes, some are useful. Which ones?" Please help us. What do you mean by some are useful? It's not very useful to say that. Going to Karlin, "Sharpen the question." Okay, that's a good idea. How do we do that? The point is that Box seems focused on the data analysis phase in the life cycle view of statistics, which starts with problem elicitation and moves to goal formulation, data collection, data analysis, findings, operationalization of findings, communication of findings, and impact assessment. Karlin is more focused on the problem elicitation phase. These two quotes of Box and Karlin refer to different stages in this life cycle. The data I'm going to use is an industrial data set with 174 observations. We have sensor data from 63 sensors. They are labeled 1, 2, 3, and so on up to 63. We have two response variables. These are coming from testing some systems. The status report is fail/pass. 52.8% of the systems that were tested failed.
We have another report, which is a more detailed report on the test results. When the systems fail, we have some classification of the failure. Test result is more detailed; status is go/no go. The goal is to determine the system status from the sensor data so that we can maybe avoid the costs and delays of testing, and we can have some early predictions on the fate of the system. One approach we can take is to use a boosted tree. We put the status as the response and the 63 sensors as the X factors. The boosted tree is trained sequentially, one tree at a time. The other model we're going to use is a random forest, and that's done with independent trees. There is a sequential aspect in boosted trees that is different from random forests. The setup of boosted trees involves three parameters: the number of trees, the depth of trees, and the learning rate. This is what JMP gives as a default. Boosted trees can be used to solve most objective functions. We could use them for Poisson regression, which deals with counts and is a bit harder to achieve with random forests. We're going to focus on these two types of models. When we apply the boosted tree, we have a validation setup with 43 systems drawn randomly as the validation set. A hundred and thirty-one systems are used for the training set. We are getting a 9.3% misclassification rate. Three failed systems, which we know failed because we have it in the data, were actually classified as pass. Of the 20 that passed, 19 were classified as pass. The false predicted pass rate is 13%. We can look at the variable column contributions. We see that Sensors 56, 18, 11, and 61 are the top four in terms of contributing to this classification. We see that in the training set, we had zero misclassification. We might have some overfitting in this boosted tree application. If we look at the lift curve, for 40% of the systems we can get a lift of over two, which is the performance that this classifier gives us. If we try the bootstrap forest, another option, again, we do the same thing. We use the same validation set. The defaults of JMP are giving you some parameters for the number of trees and the number of features to be selected at each node. This is how the random forest works. You should be aware that this is not very good if you have categorical variables and missing data, which is not our case here. Now, the misclassification rate is 6.9%, lower than before. On the training set, we had some misclassification. The random forest applied to the test result, which means we have the details on the failures, gives 23.4%, so bad performance. Also, on the training set, we have 5% misclassification. But we now have a wider range of outcomes, and that explains some of the lower performance. In the lift curve on the test results, we actually, with quite good performance, can pick up the top 10% of the systems with a lift of above 10. So we have over a tenfold increase for 10% of the systems relative to the grand average. Now this is posing a question (remember the topic of the talk): what are we looking at? Do we want to identify top-scoring good systems? The random forest would do that with the test result. Or do we want to predict a high proportion of pass? The bootstrap tree would offer that. A secondary question is to look at what is affecting this classification. We can look at the column contributions on the boosted tree. Three of the four top variables show up also on the random forest.
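For readers who want to try this kind of comparison outside JMP, here is a rough scikit-learn analogue of the workflow just described: a boosted tree and a random forest fit on a training set, then compared on a held-out validation set by misclassification rate and variable importance. The data are simulated stand-ins, and scikit-learn's implementations and importance measures differ from JMP's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulated stand-in for the 174-system, 63-sensor data set
X, y = make_classification(n_samples=174, n_features=63, n_informative=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=43, random_state=0)

for model in (GradientBoostingClassifier(random_state=0),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    misclass = 1 - model.score(X_valid, y_valid)
    top_sensors = np.argsort(model.feature_importances_)[::-1][:4]
    print(type(model).__name__, round(misclass, 3), "top sensors:", top_sensors)
```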
If we use the status pass/fail or the detailed results, there is a lot of similarity in the importance of the sensors. This is just giving you some background. Chris is going to follow up with an evaluation of the sensitivity of this variable importance, the use of SHAP values, and more interesting stuff. This goes back to questioning what your goal is, and how the model is helping you figure out the goal and maybe sharpen the question that comes from the statement of the goal. Chris, it's all yours. Thanks, Ron. I'm going to pick up from where Ron left off and seek a model that will predict whether or not a unit is good, and if it isn't, what the likely failure mode is. This would be useful in that if a model is good at predicting good units, we may not have to subject them to much further testing. If the model gives a predicted failure mode, we're able to get a head start on diagnosing and fixing the problem, and possibly we may be able to get some hints on how to improve the production process in the future. I'm going to go through the sequence of how I approached answering this question from the data. I want to say at the outset that this is simply the path that I took as I asked questions of the data and acted on various patterns that I saw. There are literally many other ways that one could proceed with this. There's often not really a truly correct answer, just a criterion for whether or not the model is good enough, and the amount that you're able to get done in the time that you have to get a result back. I have no doubt that there are better models out there than what I came up with here. Our goal is to show an actual process of tackling a prediction problem, illustrating how one can move forward by iterating through cycles of modeling and visualization, followed by observing the results and using them to ask another question, until we find something of an answer. I will be using JMP as a statistical Swiss army knife, using many tools in JMP and following the intuition I have about modeling data that has built up over many years. First, let's just take a look at the frequencies of the various test result categories. We see that the largest and most frequent category is Good. We'll probably have the most luck being able to predict that category. On the other hand, the SOS category has only two events, so it's going to be very difficult for us to be able to do much with that category. We may have to set that one aside. We'll see about that. Velocity II, IMP, and Brake are all fairly rare, with five or six events each. There may be some limitations in what we're able to do with them as well. I say this because we have 174 observations and we have 63 predictors. We have a lot of predictors for a very small number of observations, which is actually even smaller when you consider the frequencies of some of the categories that we're trying to predict. We're going to have to work iteratively by doing visualization and modeling, recognizing patterns, asking questions, and then acting on those with another modeling step, iteratively, in order to find a model that's going to do a good job of predicting these response categories. I have the data sorted by test result, so that the good results are at the beginning, followed by each of the different failure modes' data after that. I went ahead and colored each of the rows by test result so that we can see which observation belongs to a particular response category.
So then I went into the Model Driven Multivariate Control Chart and brought in all of the sensors as process variables. Since I had the good test results at the beginning, I labeled those as historical observations. This gives us a T² chart, and it has chosen 13 principal components as its basis. What we see is that the chart is dominated by these two points right here, and all of the other points are very small in value relative to them. Those two points happen to be the SOS points; they are very serious outliers in the sensor readings. Since we also have only two observations of those, I'm going to take them out of the data set and say, well, SOS is obviously so bad that the sensors are just flying off the charts; if we encounter it, we'll know. We're going to concern ourselves with the other values that don't have this off-the-charts behavior. Switching to a log scale, we see that the good test results are fairly well behaved, and then there are definite signals in the data for the different failure modes. Now we can drill down a little deeper and look at the contribution plots for the historical data, the good test result data, and the failure modes, to see if any patterns emerge in the sensors that we can act upon. I'm going to remove those two SOS observations and select the good units. If I right-click in the plot, I can bring up a contribution plot for the good units, and then I can go over to the units where there was a failure and do the same thing, so we can compare the contribution plots side by side. What we see here are the contribution plots for the pass units and the fail units. The contribution plots show the amount that each column contributes to the T² for a particular row; each bar corresponds to an individual sensor for that row. A contribution is colored green when that column is within three sigma, using an individuals and moving range chart, and red if it is out of control. Here we see that most of the sensors are in control for the good units, and most of the sensors are out of control for the failed units. What I was hoping for would have been that only a subset of the columns or sensors was out of control on the failed units, or that I could see patterns that changed across the different failure modes, which would help me isolate which variables are important for predicting the test result outcome. Unfortunately, pretty much all of the sensors are in control when things are good, and most of them are out of control when things are bad. So we're going to have to use some more sophisticated modeling to tackle this prediction problem. Having not found anything in the contribution plots, I'm going to back up and return to the two models that Ron found. Here are the column contributions for those two models, and we see that there's some agreement about which are the most important sensors, but the boosted tree found a somewhat larger set of sensors as being important than the bootstrap forest did. Which of these two models should we trust more? If we look at the overall model fit report, we see that the boosted tree model has a very high training RSquare of 0.998 and a somewhat smaller validation RSquare of 0.58. This looks like an overfit situation. When we look at the random forest, it has a somewhat smaller training RSquare, perhaps a more realistic one, than the boosted tree, and it has a somewhat larger validation RSquare.
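The T² chart described above can be approximated outside JMP by computing Hotelling's T² on principal component scores. This is a rough sketch, not JMP's Model Driven Multivariate Control Chart implementation; the 13 components follow the talk, and the data are synthetic placeholders.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
good = rng.normal(size=(120, 63))                                  # placeholder "Good" (historical) rows
all_rows = np.vstack([good, rng.normal(loc=3.0, size=(54, 63))])   # placeholder failures, shifted away

scaler = StandardScaler().fit(good)                                # center and scale on the historical set
pca = PCA(n_components=13).fit(scaler.transform(good))

scores = pca.transform(scaler.transform(all_rows))
t2 = np.sum(scores**2 / pca.explained_variance_, axis=1)           # Hotelling's T² per row
print(t2[:5].round(2), t2[-5:].round(2))                           # good rows small, shifted rows large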
The generalization performance of the random forest is hopefully a little better, so I'm inclined to trust the random forest model a little more. Part of this is based on the folklore around these two models. Boosted trees are renowned for being fast, highly accurate models that work well on very large data sets, whereas the hearsay is that random forests are more accurate on smaller data sets and are fairly robust to messy, noisy data. There's a long history of using these kinds of models for variable selection that goes back to a paper in 2010 that has been cited almost 1,200 times, so this is a popular approach for variable selection. I did a similar search for boosting, and I didn't see as much history around variable selection for boosted trees as I did for random forests. For this particular data set, we can do a sensitivity analysis to see how reliable the column contributions are for these two approaches, using the simulation capabilities in JMP Pro. What we can do is create a random validation column, a formula column that can be reinitialized and that partitions the data into random training and holdout sets with the same proportions as the original validation column. We can have JMP rerun these two analyses and keep track of the column contribution portions for each of these repartitionings, and we can see how consistent the story is between the boosted tree models and the random forests. This is pretty easy to do. We just go to the Make Validation Column utility, and when we make a new column, we ask it to make a formula column so that it can be reinitialized. Then we can return to the bootstrap forest platform, right-click on the column contribution portion, and select Simulate. It brings up a dialog asking which of the input columns we want to switch out. I'm going to choose the validation column and switch in this random validation formula column in its place. We're going to do this a hundred times. The bootstrap forest is rerun using new random partitions of the data into training and validation, and we look at the distribution of the portions across all of the simulation runs. This generates a data set of column contribution portions for each sensor. We can take that into Graph Builder and see how consistent those column contributions are across all these random partitions of the data. Here is a plot of the column contribution portions from each of the 100 random reshufflings of the validation column; those are the points we see in gray. Sensor 18 seems to be consistently a big contributor, as does Sensor 61. The red crosses are the column contributions from the original analysis that Ron did. The overall story this tells is that whenever the original column contribution was small, the resimulated column contributions also tended to be small, and when the column contributions were large in the original analysis, they tended to stay large. We're getting a relatively consistent story from the bootstrap forest in terms of which columns are important. Now we can do the same thing with the boosted tree, and the results aren't quite as consistent as they were with the bootstrap forest. Here is a bunch of columns where the initial column contributions came out very small, but they had a more substantial contribution in some of the random reshuffles of the validation column.
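The Simulate step can be mimicked, in spirit, by refitting a random forest on many random train/validation splits and tracking how the importance ranking moves around. The sketch below uses scikit-learn's impurity-based feature_importances_ rather than JMP's column contribution portions, and the data are synthetic placeholders.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(174, 63)),
                 columns=[f"sensor_{i+1}" for i in range(63)])      # placeholder sensors
y = np.where(X["sensor_18"] + rng.normal(size=174) > 0, "fail", "pass")

importances = []
for seed in range(100):                                             # 100 random reshuffles
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=43, random_state=seed)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_tr, y_tr)
    importances.append(rf.feature_importances_)

summary = pd.DataFrame(importances, columns=X.columns).agg(["mean", "std"]).T
print(summary.sort_values("mean", ascending=False).head())          # stable top sensors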
That also happened quite a bit with Columns 52 through 55 over here. Then there were also some situations where the original column contribution was quite large, and most, if not all, of the column contributions found in the simulations were smaller. That happens here with Column 48, and to some extent also with Column 11. The overall conclusion is that this validation shuffling indicates we can trust the column contributions from the bootstrap forest to be more stable than those of the boosted tree. Based on this comparison, I trust the column contributions from the bootstrap forest more, and I'm going to use the columns it recommended as the basis for some other models. What I'd like to do is find a model that is both simpler than the bootstrap forest model and performs better on the validation set for predicting pass or fail. Before proceeding with the next modeling step, I'm going to do something I probably should have done at the very beginning, which is to look at the sensors in a scatterplot matrix to see how correlated the sensor readings are, and also look at histograms to see whether they're outlier-prone, heavily skewed, or otherwise highly non-Gaussian. What we see is that there is pretty strong multicollinearity among the input variables generally. We're only looking at a subset of them here, but this high multicollinearity persists across all of the sensor readings. This suggests that we should try things like the logistic lasso, the logistic elastic net, or logistic ridge regression as candidates for our model to predict pass or fail. Before we do that, we should transform the sensor readings so that they're a little better behaved and more Gaussian-looking. This is actually really easy in JMP if you have all of the columns up in the Distribution platform, because all you have to do is hold down Alt and choose Fit Johnson, and this fits Johnson distributions to all the input variables. This is a family of distributions based on a four-parameter transformation to normality. As a result, there is a nice feature, which we can also broadcast using Alt-click, where we can save a transformation from the original scale to a scale that makes the columns more normally distributed. If we go back to the data table, we see that a transform column has been added for each sensor column. If we bring these transformed columns up in a scatterplot matrix with histograms, we clearly see that the data are less skewed and more normally distributed than the original sensor columns were. Now, the bootstrap forest model that Ron found only recommended a small number of columns for use in the model. Because of the high collinearity among the columns, the subset we got could easily be part of a larger group of columns that are correlated with one another. It could be beneficial to find that larger group and work with it at the next modeling stage. An exploratory way to do this is to go through the Cluster Variables platform in JMP. We're going to work with the normalized version of the sensors because this platform is PCA and factor analysis based, and it will provide more reliable results if we're working with data that are approximately normally distributed. Once we're in the variable clustering platform, we see that there are very clear, strong associations among the input columns.
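As a rough analogue of the Fit Johnson step, the sketch below fits a Johnson SU distribution to each column with scipy and maps the values to normal scores through the fitted CDF. JMP fits the full Johnson system (SU, SB, SL); restricting to SU here is a simplification, and the skewed data are placeholders.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sensors = rng.lognormal(mean=0.0, sigma=0.7, size=(174, 63))    # skewed placeholder data

normal_scores = np.empty_like(sensors)
for j in range(sensors.shape[1]):
    a, b, loc, scale = stats.johnsonsu.fit(sensors[:, j])       # fit a Johnson SU curve
    u = stats.johnsonsu.cdf(sensors[:, j], a, b, loc=loc, scale=scale)
    u = np.clip(u, 1e-9, 1 - 1e-9)                              # keep the probit finite
    normal_scores[:, j] = stats.norm.ppf(u)                     # approximately N(0, 1)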
It has identified seven clusters, and the largest cluster, the one that explains the most variation, has 25 members. The set of cluster members is listed here on the right. Let's compare this with the bootstrap forest. On the left we have the column contributions from the bootstrap forest model, which you should be familiar with by now, and on the right we have the list of members of that largest cluster of variables. If we look closely, we see that the top seven contributing terms all happen to belong to this cluster. I'm going to hypothesize that this set of 25 columns is all related to some underlying mechanism that causes the units to pass or fail. What I want to do next is fit models using the Generalized Regression platform with the variables in Cluster 1. It would be tedious to go through and individually pick these columns out and put them into the launch dialog. Fortunately, there's a much easier way: you can just select the rows in that table, and the corresponding columns will be selected in the original data table, so that when we go into the Fit Model launch dialog, we can just click Add and those columns are automatically added as model effects. Once I got into the Generalized Regression platform, I fit a lasso model, an elastic net model, and a ridge model to compare them to each other and also to the logistic regression model that comes up by default. We're seeing that the lasso model is doing a little better than the rest in terms of its validation generalized RSquare. The difference between these methods is that there are different amounts of variable selection and multicollinearity handling in each of them. Logistic regression has no multicollinearity handling and no variable selection. The lasso is more of a variable selection algorithm, although it has a little multicollinearity handling because it's a penalized method. Ridge regression has no variable selection and is heavily oriented around multicollinearity handling. The elastic net is a hybrid between the lasso and ridge regression. In this case, what we really care about is just the model that's going to perform the best, and we allow the validation to guide us. We're going to be working with the lasso model from here on. Here's the Prediction Profiler for the lasso model that was selected. We see that the lasso algorithm has selected eight sensors as being predictive of pass or fail. The profiler has some great built-in tools for understanding what the important variables are, both in the model overall and, new to JMP Pro 17, for an individual prediction. We can use the variable importance tools to answer the question, "What are the important variables in the model?" We have a variety of options for how this could be done, but because of the multicollinearity, and because this is not a very large model, I'm going to use the dependent resampled inputs technique. This gives us a ranking of the most important terms. We see that Column 18 is the most important, followed by Column 27 and then 52, all the way down. We can compare this to the bootstrap forest model, and we see that there's agreement that Variable 18 is important, along with 52, 61, and 53.
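Here is a hedged sketch of the same lasso / ridge / elastic net comparison using scikit-learn's penalized logistic regression instead of JMP's Generalized Regression platform. The 25 placeholder columns stand in for the Cluster 1 sensors, and the penalty strength C is arbitrary rather than tuned.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(174, 25))                       # placeholder: the 25 Cluster 1 sensors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=174) > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=43, random_state=0)

candidates = {
    "lasso": LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000),
    "ridge": LogisticRegression(penalty="l2", solver="saga", C=0.5, max_iter=5000),
    "elastic net": LogisticRegression(penalty="elasticnet", solver="saga",
                                      l1_ratio=0.5, C=0.5, max_iter=5000),
}
for name, model in candidates.items():
    fit = make_pipeline(StandardScaler(), model).fit(X_tr, y_tr)
    print(name, "validation accuracy:", round(fit.score(X_va, y_va), 3))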
But one of the terms we pulled in because of the variable clustering step, Sensor 27, turns out to be the second most important predictor in this lasso model. We've hopefully gained something by casting a wider net through that step: we've found a term that didn't turn up in either the bootstrap forest or the boosted tree. We also see that the lasso model has an RSquare of 0.9, whereas the bootstrap forest model had an RSquare of 0.8. We have a simpler model that has an easier form to understand and work with, and also has higher predictive capacity than the bootstrap forest model. Now, the variable importance metrics in the profiler have been there for quite some time. The question they answer is, "Which predictors have the biggest impact on the shape of the response surface over the data, or over a region?" In JMP Pro 17, we have a new technique called SHAP values, which is an additive decomposition of an individual prediction. It tells you how much each individual variable contributes to a single prediction, rather than talking about variability explained over the whole space. The resolution of the question answered by Shapley values is far more local than either the variable importance tools or the column contributions in the bootstrap forest. We can obtain the Shapley values by going to the red triangle menu for the profiler; we'll find the option over here, fourth from the bottom. When we choose the option, the profiler saves back SHAP columns for all of the input variables to the model, and this happens for every row in the table. What you can see is that the SHAP values are giving you the effect of each of the columns on the prediction. This is useful in a whole lot of different ways, and for that reason it has gotten a lot of attention in intelligible AI, because it allows us to see the contributions of each column to a black box model. Here I've plotted the SHAP values for the columns that are predictive in the lasso model we just built. If we toggle back and forth between the good units and the units that failed, we see the same story we've been seeing with the variable importance metrics: Column 18 and Column 27 are important in predicting pass or fail. We're seeing this at a higher level of resolution than we do with the variable importance metrics, because each of these points corresponds to an individual row in the original data set. But in this case, I don't see the SHAP values giving us any new information. I had hoped that by toggling through the other failure modes, I might find a pattern to help tease out which sensors are more important for particular failure modes, but the only thing I was able to find was that Column 18 had a somewhat stronger impact on the Velocity Type 1 failure mode than on the others. At this point, we've had some success using those Cluster 1 columns in a binary pass/fail model, but when I broke out the SHAP values for that model by the different failure modes, I wasn't able to discern much of a pattern. What I did next was fit the failure mode response column, test result, using the Cluster 1 columns, but I excluded all the pass rows so that the modeling procedure would focus exclusively on discerning which failure mode it is, given that we have a failure.
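For an open-source analogue of the SHAP option, the sketch below uses the shap package's linear explainer on a fitted logistic model. It illustrates the idea of one additive decomposition per row; it is not the code JMP Pro runs, and the eight placeholder columns stand in for the selected sensors.

import numpy as np
import shap                                        # pip install shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(174, 8))                      # placeholder: the 8 selected sensors
y = (X[:, 0] - X[:, 1] + rng.normal(size=174) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
explainer = shap.LinearExplainer(model, X)         # background data = the training rows
shap_values = explainer.shap_values(X)             # one additive decomposition per row
print(np.asarray(shap_values).shape)               # contribution of each column, per row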
I tried the multinomial lasso, elastic net, and ridge, and I was particularly happy with the lasso model because it gave me a validation RSquare of about 0.94. Having been pretty happy with that, I saved the probability formulas for each of the failure modes. Now the task is to come up with a simple rule that post-processes those prediction formulas to make a decision about which failure mode we have. I call this the partition trick. The partition trick is where I put the probability formulas for a categorical, or even multinomial, response in as Xs, and I use my categorical response as my Y. This is the same response that was used for all of these, except for pass. I retain the same validation column I've been working with the whole time. Now that I'm in the Partition platform, I'm going to hit Split a couple of times and hope that I end up with a decision rule that's easy to understand and easy to communicate. That may or may not happen; sometimes it works, sometimes it doesn't. So I split once, and we see that whenever the probability of pass is higher than 0.935, we almost certainly have a pass, and not many passes are left over on the other side. I take another split, and we find a decision rule on ITM that is highly predictive of ITM as a failure mode. Split again, and we find that whenever Motor is less than 0.945, we're predicting either Motor or Brake. We take another split and find that whenever the Velocity Type 1 probability is bigger than 0.08, we're likely in a Velocity Type 1 or Velocity Type 2 situation, and whenever Velocity Type 1 is less than 0.79, we're likely in a gripper failure mode or an IMP failure mode. What do we have here? We have a simple decision rule. We're not going to be able to break these failure modes down much further because of the very small number of actual events that we have. But we can turn this into a simple rule for identifying units that are probably good, and if they're not, we have an idea of where to look to fix the problem. We can save this decision rule out as a leaf label formula. We see that on the validation set, when we predict it's good, it's good most of the time; we did have one misclassification of a Velocity Type 2 failure that was actually predicted to be good. When we predict grippers or IMP, it's all over the place; that leaf was not super useful. Predicting ITM is 100%. Whenever we predict a Motor or Brake, on the validation set we have a motor or a brake failure. When we predict a Velocity Type 1 or 2, it did a pretty good job of picking that up, with the one exception of the single Velocity Type 2 unit in the validation set, which happened to have been misclassified. We have an easily operational rule here that could be used to sort product and give us a head start on where to look to fix things. I think this was a pretty challenging problem, because we didn't have a lot of rows, but we had a lot of different categories to predict and a whole lot of possible predictors to use. We got there by taking a series of steps, asking questions, sometimes taking a step back and asking a bigger question, and other times narrowing in on particular sub-issues. Sometimes our excursions were fruitful, and sometimes they weren't.
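A rough sketch of the partition trick outside JMP: fit a multinomial model, save its class probabilities, and grow a shallow decision tree on those probabilities to get a small, readable rule. The data, the class names, and the depth of three splits below are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
X = rng.normal(size=(130, 25))                                     # placeholder failed units
modes = np.array(["ITM", "Motor", "Brake", "Velocity I", "Velocity II"])
failure_mode = modes[np.argmax(X[:, :5], axis=1)]                  # synthetic failure labels

# Stage 1: a multinomial model gives one probability column per failure mode.
mlr = LogisticRegression(max_iter=5000).fit(X, failure_mode)
probs = mlr.predict_proba(X)

# Stage 2: a few splits on those probabilities yield an easy-to-communicate rule.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(probs, failure_mode)
print(export_text(tree, feature_names=[f"P({c})" for c in mlr.classes_]))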
Our purpose here is to illustrate how you can step through a modeling process, through this sequence of asking questions using modeling and visualization tools to guide your next step, and moving on until you're able to find a useful, actionable, predictive model. Thank you very much for your attention. We look forward to talking to you in our Q&A session coming up next.
Labels
(10)
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Access
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
An Introduction to Spectral Data Analysis with Functional Data Explorer in JMP Pro 17 (2022-US-30MP-1158)
Monday, September 12, 2022
Since the Functional Data Explorer was introduced in JMP Pro 14, it has become a must-have tool to summarize and gain insights from shape features in sensor data. With the release of JMP Pro 17, we have added new tools that make working with spectral data easier. In particular, the new Wavelets model is a fast alternative to existing models in FDE for spectral data. Drop in to get an introduction to these new tools and see how you can use them to analyze your data. Hi, my name is Ryan Parker, and I'm excited to be here today with Clay Barker to share with you some new tools that we've added to help you analyze spectral data with the Functional Data Explorer in JMP Pro 17. So what do we mean by spectral data? A lot of applications from chemometrics have motivated this work, but I would start off by saying we're really concerned with data that have sharp peaks. This may not necessarily be spectral data, but these are the types of data that we've had a hard time modeling in FDE up to this point. We really wanted to focus on opening up these applications and making it a lot easier to handle sharp peaks, potential discontinuities, and these large, wavy features in data. In this specific example, with spectroscopy data, we're thinking about the composition of materials; these peaks can represent those compositions, and we want to be able to quantify them. Another application is mass spectrometry, and here you can see these very sharp peaks all over the place in the different functions. These peaks represent proteins in the spectra, and they can help you, for example, compare differences between samples from a patient with cancer and patients without cancer. Again, it's really important that we model these peaks well so that we can quantify those differences. An example that Clay is going to show comes from chromatography. In this case, we want to quantify the difference between an olive oil and other vegetable oils, and the components of these oils are represented by all of these peaks, so we again need to model them well. The first thing I want to cover are four new tools to preprocess your spectral data before you get to the modeling stage. The first one is the Standard Normal Variate. With the Standard Normal Variate, we're standardizing each function by its individual mean and variance. We take every function one at a time, subtract off its mean, and divide by its standard deviation, so that each function has mean zero and variance one. This is an alternative to the existing Standardize tool, which uses a global mean and variance, so the data are scaled but certain aspects, like the individual means, are still there; with the Standard Normal Variate, we want to remove that for every function. The next tool is Multiplicative Scatter Correction. It is similar to the Standard Normal Variate, and the results end up being similar, but in this case we're thinking about data where we have light scatter. Some spectral data come with light scatter that differs from function to function relative to a reference function that we're interested in, usually the mean. So what we'll do is fit a simple model of each individual function against this mean function.
We get coefficients from that model, and we can subtract off the intercept and divide by the slope, which gets us to a similar standardization of the data, in this case focused on the light scatter. Okay, so at this point, what if we have noise in our data? What if we need to smooth it? The models we want to fit to spectral data, these wavelets, don't smooth the data for you. If you have noisy data, you really want to handle that first, and that's where the Savitzky-Golay Filter comes in. What this does is fit n-degree polynomials over a specified bandwidth to find the best model that will smooth your data. We search over a grid of settings for you, pick the best one, and present the results. I do want to note that the data are required to be on a regular grid, but if you don't have one, FDE will create one for you. We have a Reduce option you can use if you want finer control, but by default we look at the longest function, choose that as our number of grid points, and create a regular grid from that. The nice thing about the Savitzky-Golay Filter is that, because of the construction with these polynomials, we have easy access to the first or second derivative. Even if you don't have spectral data and you want access to derivative functions, this is your pathway to do that. If you do request, say, the second derivative, our search gets constrained to only polynomials that can give you a second derivative. But this would be the way to access derivatives, even if you weren't worried about smoothing. The last preprocessing tool I'll cover is Baseline Correction. With Baseline Correction, you're worried about some feature of your data that you consider a baseline and want to remove before you model your data. The idea is that we fit a baseline model, with linear, quadratic, and exponential options, and then subtract it off. But we know there are important features, typically the peaks, so we don't want to use those parts of the data when we fit the baseline model. We have the option here for the correction region; for the most part you would likely use the entire function. This just determines which part the baseline gets subtracted from: if you select "within regions," only things within these blue lines are subtracted. I've already added four regions here. Every time you click Add under baseline regions, you get a pair of these blue lines that you can drag to the important parts of your data, and when you fit, say, a linear baseline model, it's going to ignore the data points that are within those two blue lines. So for function one, we fit a linear model, but we exclude all these sharp peaks that we're actually interested in, and then we take the result from that linear model and subtract it off from the whole function. The alternative to that is an Anchor Point, which is for when you say, I really want this specific point to be included in the baseline model. Usually this is when you have smaller data and you know, okay, I want these few points; these are key; they represent the baseline. If I fit, say, a quadratic model to these points, that's what I want to subtract off. So it's an alternative.
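Two of these preprocessing steps have simple open-source analogues, sketched below with NumPy and scipy: Standard Normal Variate applied per spectrum, and Savitzky-Golay smoothing with an optional derivative. The spectra are random placeholders, and the window and polynomial order are arbitrary choices, not FDE's automatic search.

import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(5)
spectra = rng.normal(size=(20, 500))                         # 20 spectra on a regular grid

# Standard Normal Variate: center and scale each spectrum individually.
snv = (spectra - spectra.mean(axis=1, keepdims=True)) / spectra.std(axis=1, keepdims=True)

# Savitzky-Golay: local polynomial smoothing; deriv=1 or 2 returns derivatives instead.
smoothed = savgol_filter(snv, window_length=15, polyorder=3, axis=1)
second_deriv = savgol_filter(snv, window_length=15, polyorder=3, deriv=2, axis=1)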
When you click those, they'll show up as red, as an alternative to the blue. But this will allow you to correct your data and remove the baseline before proceeding. So that gets us to how we model spectral data in JMP Pro 17, and we're using wavelets. The nice thing about wavelets is that we have a variety of options to choose from. These graphs show what are called mother wavelets, which are used to construct the basis that models the data. The simplest is the Haar wavelet, which is really just step functions; it may be hard to see that here, but these are just step functions. This biorthogonal wavelet has a lot of little jumps, and you can start to imagine why these make it a lot easier to capture peaks in the data than the Haar wavelet does. All of these have parameters that change their shape and size, so I've just selected a couple here to show you the differences. But you can really see how, if I put a lot of these together, this is probably a lot better for modeling all the peaks in my data. Here's an example to illustrate that with one of our new sample data sets, an NMR design of experiments, looking at a single function. Let's start with B-Splines. This is sort of the go-to place to start in FDE for most data, but we can see it's really having a hard time picking up on these peaks. Now, we have provided tools to change where the knots are in these B-Spline models, so you could do some customization and probably fit this a lot better than the default. But the idea is that you've now had to go and move things around, and maybe it works for some functions but not others, and you need a more automated way. One alternative is P-Splines. That is doing a little of that for you, but it's still not quite capturing the peaks as well as wavelets, though it's probably doing the best for these data relative to wavelets. There is also an almost model-free approach, direct functional PCA, where we model the data directly on our shape components. It's maybe a bridge between P-Splines and B-Splines: not quite as good as P-Splines, but better than B-Splines. This is just a quick example to highlight how wavelets can really be a lot quicker and more powerful. So what are we doing in FDE? We construct a variety of wavelet types and their associated parameters and fit them for you. Similar to the Savitzky-Golay Filter, we do require that the data are on a regular grid, and the good news is we'll create one for you; of course, you can go to Reduce for more control if you would like. The nice thing is that once the data are on the grid, we can use a transformation that's super fast. P-Splines would also be what you would really want to use for these data, but they can take a long time to fit, especially if you have a lot of functions and a lot of data points per function. Our wavelet models essentially apply the lasso to all of the different basis functions that construct a particular wavelet type with its parameters, and that allows us to force a lot of the coefficients that don't really mean anything to zero, creating a sparse representation for the model. So those five different wavelet types that I showed before are available, and once we fit them all, we construct model selection criteria and choose the best one for you by default.
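To make the sparse-coefficient idea concrete, here is a small sketch with the PyWavelets package: decompose a peaky signal with a Symlet 20 wavelet, zero out small detail coefficients with a hard threshold (a simple stand-in for the selection FDE performs), and reconstruct a denoised fit. The signal and threshold are placeholders.

import numpy as np
import pywt                                        # pip install PyWavelets

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 1024)                        # a regular grid, as FDE requires
signal = np.exp(-((x - 0.3) / 0.01) ** 2) + 0.5 * np.exp(-((x - 0.7) / 0.02) ** 2)
noisy = signal + 0.05 * rng.normal(size=x.size)

coeffs = pywt.wavedec(noisy, wavelet="sym20", level=4)             # multiresolution coefficients
coeffs = [coeffs[0]] + [pywt.threshold(c, value=0.1, mode="hard") for c in coeffs[1:]]
denoised = pywt.waverec(coeffs, wavelet="sym20")[: x.size]         # sparse, denoised fit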
You can click through these as options to see the other fits. A lot of times these first few are going to look very similar, and it's really just a matter of preference; there are certain applications where people know, "Okay, I really want this wavelet type," so they'll pick the best one of that type to use in the end. The nice thing about these models is that they work on resolutions; they're modeling different resolutions of the data. So we have these coefficient plots where, at the top, they're showing low-frequency, larger-scale trends, like an overall mean, and as you go up in the resolutions, down in the plot, you're looking at high-frequency items, things that are happening on very short scales. You can see where it's picking up a lot of different features. In this case, a lot of these are zero for the highest resolutions; it's picking up some scales that are at the very end of this function, and it's picking up some of these differences here. This gives you a sense of where things are happening, both in location and in whether it's the high-frequency or low-frequency part of the data. The last thing we've added to complete the wavelet models, which is a little different from what we have now, is called Wavelets DOE. If you've used FDE before, you've likely tried Functional DOE, which allows you to take functional principal component scores and connect design of experiment factors to the shape of your functions. With wavelet models in particular, the coefficients, because they represent resolutions and locations, can be more interpretable and have a more direct connection to understanding what's happening in the data, in a way that a functional principal component may not. We have this energy function, and it's standardized to show you, "Okay, this resolution at 3.5," representing mostly this end point; that's where the most differences are across all of our functions, and it represents about 12%. You can scroll down, and we go up to where we get 90% of this energy, which is just the squared coefficient values, standardized. Energy is just how big these coefficients are relative to everything else. But, similar to Functional DOE, you can change the factors and see how the shape changes, and we have cases where both Wavelets DOE and Functional DOE work well. Sometimes Wavelets DOE just gets the structure better; it doesn't allow for some of the negative points that Functional DOE might allow in this example. They're both there, they're both fast, and you can use both of them to analyze the results of wavelet models. That's my quick overview, so now I'll turn it over to Clay to show you some examples of using wavelet models with example data sets. Thanks, Ryan. So, as Ryan showed earlier, we found an example where folks were trying to use chemometrics methods to classify different vegetable oils. I've got the data set opened up here. Each row of the data set is a function, so each row represents a particular oil, and as we go across the table, that's the chromatogram. I've opened this up in FDE just to save a few minutes; the wavelet fitting is really fast, but I figured we'd just go ahead and start with the fit open. So here's what our data set looks like. You can see those red curves are olive oils.
The green curves are not olive oils, so we can see there are definitely some differences between the two kinds of oils and their chromatograms. As Ryan said, we just go to the red triangle menu and ask for wavelets, and it will pick the best wavelet for you. But like I said, I've already done that, so we can scroll down and look at the fit. Here we see the best wavelet found was the Symlet 20, and we've got graphs of each fit summarized here. As you can see, the wavelets have fit these data really well. But in this case, we're not terribly interested in the individual fits; we want to see if we can use these individual chromatograms to predict whether or not an oil is an olive oil. What we can do is save out these wavelet coefficients, which gives us a big table, and there are thousands of them; in fact, there's one for every point in the function, and here we've got 4,000 points in each function, so this table is pretty huge: there are 4,000 wavelet coefficients. But as Ryan was saying, you can see that we've zeroed some of them out, so those wavelet coefficients drop out of the function. That's how we get smoothing: we fit our data really well, but zeroing out some of those coefficients is what smooths the function out. So how can we use these values to predict whether or not we have an olive oil? Well, you can come here to the function summaries and ask for Save Summaries. What that does is save out the functional principal components, but here at the end of the table, it also saves out these wavelet coefficients. These are the exact same values that we saw in the wavelet coefficient table in the platform. So let me close this one; I've got my own queued up just so that nothing unexpected happens. Here's my version of that table, and what we want to do is use all of these wavelet coefficients to predict whether the curve is from an olive oil or from a different type of oil. I'm going to launch the Generalized Regression platform, and if you've ever used that before, it's the place we go to build linear models and generalized linear models using a variety of different variable selection techniques. Here my response is type; I want to predict what type of oil we're looking at, and I want to predict it using all of those wavelet coefficients. I press Run. In this case, I'm going to use the Elastic Net, because that happens to be my favorite variable selection method, and I press Go. So really quickly, we took all those wavelet coefficients and found the ones that do a good job of differentiating between olive oils and non-olive oils. In fact, if we look at the confusion matrix, which is a way to look at how often we predict properly, for all 49 other oils, we correctly identified those as not olive oils, and for all 71 olive oils, we correctly identified those as olive oils. So we actually predicted these perfectly, and we only needed a pretty small subset of those wavelet coefficients. I didn't count, but that looks like about a dozen. We started with thousands of wavelet coefficients, and we boiled it down to just the 12 or so that were useful for predicting our response. What I think is really cool is that we can interpret these wavelet coefficients to an extent. This coefficient here is resolution two at location 3001.
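The same "coefficients as predictors" step can be sketched outside JMP with a penalized logistic model. Below, elastic net logistic regression from scikit-learn stands in for Generalized Regression; the wavelet coefficient matrix and labels are synthetic placeholders, so the perfect separation from the talk should not be expected here.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(7)
W = rng.normal(size=(120, 4000))                   # placeholder wavelet coefficients per oil
is_olive = (W[:, 10] + W[:, 200] + rng.normal(size=120) > 0).astype(int)

enet = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5,
                          C=0.1, max_iter=10000).fit(W, is_olive)
print("nonzero coefficients:", int(np.sum(enet.coef_ != 0)))
print(confusion_matrix(is_olive, enet.predict(W)))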
So that tells us there's something going on in that part of the curve that helps us differentiate between olive oils and other oils. What I've done is also created a graph of our data. Here the blue curve is the olive oils, the red curve is the non-olive oils, and this is the mean chromatogram, averaging over all of our samples. These dashed lines are the locations where the wavelet coefficients are nonzero, so these are the ones that are useful for discriminating between oils. As you can see, some of these nonzero coefficients line up with peaks in the data in a way that really tends to make sense. Here's one of the nonzero coefficients, and you can tell it's right at a peak where olive oil is peaking but non-olive oils are not. That may be meaningful to someone who studies chromatography and olive oils in particular. We like this example because it's a really good example of how wavelets fit these chromatograms really well, and then we can use the wavelet coefficients to do something else: not only have we fit the curves well, but we've taken the information from those curves and done a really good job of discriminating between different oils. And I've got one more example. Ryan and I are both big fans of Disney World, so this is not a chromatogram, and it's not spectroscopy. Instead, we found a data set of wait times at Disney World, and we downloaded a subset of wait times for a ride called the Seven Dwarfs Mine Train. If you've ever been to Disney World, you know it's a really cool roller coaster right there in Fantasyland, but it also tends to have really long wait times; you spend a lot of time waiting your turn. So we wanted to see if we could use wavelets to analyze these data and then use the Wavelets DOE function to figure out whether there are days of the week, months, or years where wait times are particularly high or low. We launch FDE, and you can see that for each day in our data set, we have the wait times from the time the park opens, here, to the time the park closes, over here. To make this demo a little easier, we've finessed the data set to clean it up some, so this is not exactly the raw data, but I think the trends we're going to see are still true. I'm going to ask for wavelets, and it'll take a second to run, but not too long. Now we've found that a different basis function is the best: it's the Daubechies 20, and I apologize if I didn't pronounce that right. I've been avoiding pronouncing that word in public, but that's not the case anymore. We've found that's our favorite wavelet, and what we're going to do is go to the Wavelets DOE analysis. It's going to use the supplementary variables that we've specified, day of the week, year, and month, to see if we can find trends in our curves using those variables. So we ask for Wavelets DOE, and what's happening in the background is that we're modeling those wavelet coefficients using the Generalized Regression platform, and then it puts it all together in a Profiler for us. Here we've got our time of day variable, and we can see that in the morning the wait times start at around an hour.
It gets longer throughout the day, peaking at about 80 minutes, and then, as you would expect, as the day goes on and kids get tired and go to bed, wait times get a little shorter toward the end of the day. Now, what we thought was interesting is looking at some of these effects, like year and month. We can see that from 2015 the wait times gradually went up, right until 2020. And what happened in 2020? They increased in February, and then, shockingly, they dropped quite a bit in March and April. I think we all know why that might have happened in 2020: because of COVID, fewer people were going to Disney World; in fact, it was shut down for a while. So you can very clearly see a COVID effect on Disney World wait times, really quickly, using Wavelets DOE. One of the other things that's interesting is that we can look at what time of year is best to go. It looks like wait times tend to be lower in September, and since Disney World is in Florida, that's peak hurricane season, and kids don't tend to be out of school. So it's really cool to see that our model picked that up pretty easily. But don't start going to Disney World in September; that's our time, and we don't want it getting crowded. With just a few clicks, we were able to learn quite a few things about wait times for the Seven Dwarfs Mine Train at Disney World. We really wanted to highlight that these methods were focused on chromatography and spectrometry, but there are a lot of applications where you can use wavelets. I think that's all we have, so thank you. And thank you, Ryan.
Labels
(12)
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Access
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Cluster Analysis to Improve Local Government in Massachusetts (2022-US-30MP-1154)
Monday, September 12, 2022
The town of Sharon, Massachusetts, created a Governance Study Committee to recommend changes to municipal by-laws and governance within the town, particularly with an eye to elevating civic engagement among residents. I am a member of that committee, and in one phase of our work we sought to confer with officials in similar communities across the state to learn from best practices elsewhere. There are 351 cities and towns in the Commonwealth of Massachusetts, and we had limited time and no budget for comprehensive research. We quickly confronted the issue of how to best identify a modest group of communities closely comparable to Sharon, which in turn raised questions about which characteristics are most relevant to citizen participation in local governance. Using JMP with publicly available data, I conducted a two-stage project to select key variables and then used those variables to run cluster analysis to identify other communities for our research. Because we are an all-volunteer, appointed public body, the research had to be presentable in public forums and comprehensible by a lay audience. Several visualization features of JMP 16 were particularly valuable in that regard. This talk walks through the analysis, as well as my strategy to make clustering understandable. Resources: an academic case study on this topic, and a description to use with the data attached here. Hello. My name is Rob Carver, and today I want to share a story about a project I've been working on in my small town in Massachusetts. At the outset, I'll point out that the slides and the JMP data table are up on the Discovery website, and there's a new academic case study on this very topic that will be posted very soon. What I'm hoping to do in 30 minutes is spend most of our time with a JMP demo, but you're going to need some context and background. I want to provide a little bit of a scenario, give you a sense of the problem that I'm trying to solve, talk about the research strategy, and then get into the demo and wrap up with some conclusions. I live in a town called Sharon, which is an archetypal New England town. Here you see a picture of our town center. It was incorporated in 1765, so it's an old community. Like many New England communities, the legislative function of the town is managed by the annual town meeting, in which anyone can come and speak, memorialized in the Norman Rockwell images. From the start, we have used an open town meeting, and the executive function is carried out by a three-member board known as the Select Board. But since Norman Rockwell's day, municipal government has become technologically, financially, and legally more complex, even for the most fundamental services a town provides, and attendance at the town meeting has really dwindled. About a year and a half ago, the Select Board created a governance study committee, of which I'm a member. We are doctors and lawyers and accountants, and teachers, and marketing people, and local business people; I'm the resident stats guy. Over time, the population of the town has evolved and grown. It's more diverse than it was 100 years ago. We've gone from an agrarian and manufacturing community to a bedroom community for the city of Boston, with lots of professionals working in hospitals, universities, and law firms in the city. People tend not to live and work in the town, and that has impacts on participation in town governance. The charge to the governance study committee is to find ways to boost citizen engagement.
We've been doing our due diligence. We've been researching, we've surveyed residents, we've read the literature, and we've interviewed town officials. One part of our research, and that's what this talk is about, is that we wanted to reach out to towns like Sharon to find out what they're doing and what their experience has been; there's some comparative research. There are 350 towns in Massachusetts. We have time constraints, and so we're looking for a way to identify a smallish number of communities that are similar to us. We didn't want to reinvent the wheel, but we thought that modernizing it some would be a good idea. The driving question covered in this research is: which towns are similar to this town? A little bit about Sharon. We sit in southeastern Massachusetts. We are not too far from Plymouth, which is where the 1620 Mayflower landing happened. This community was originally populated by Wampanoag peoples; Europeans arrived in 1637. We're about halfway between Boston and Providence, and for the sports fans out there, we are next door to where the New England Patriots play football. The population is about 18,500, which is quite average in Massachusetts. A great percentage of the population are registered voters, yet out of all those people, we get 2% at a town meeting. Most recently, in May of 2022, this was the scene, and a lot of that is COVID related; there were social distancing rules in effect. Turnout is low, partly because of COVID, partly because of factors that we don't fully understand. One task for the governance study committee is to consider alternatives to town meeting, or tweaks and enhancements to town meeting. Under state law, to participate in an open town meeting you have to be in the room. It's broadcast on local television, but you have to be present to speak or to vote. State law also says there are three ways to run local government. 74% of the communities, the large majority, do what Sharon does: open town meeting once or twice a year. A small number have what's called representative town meeting, in which voters elect their neighbors, maybe a few hundred of them, to participate and vote in town meeting. Traditionally, cities have had small councils with a mayor or administrator of some kind. Increasingly, that's being adopted by towns, and so we're looking into that. For this talk, the task is to identify peer towns that we could then interview, consult with, and reach out to. I mentioned some of the state legal constraints. One other constraint is that town boards, like a governance study committee, have to have open meetings. Anything we do and decide and deliberate about has to be in public, which is a good thing. We have no budget. We have some wonderful staff in the town hall, but they are busy doing other things as well. Data availability was a mixed story: plenty of data is available about characteristics of communities, but we're really interested in how many people participate in local government, and there's no centralized data about that, so we needed to hunt for proxies. We also had no ability to compel folks in other towns to meet with us, advise us, or share data with us. And we're operating in a topic area that is heavily governed by tradition; people really cleave to that Norman Rockwell image. We came up with a three-stage plan. As a committee, we brainstormed variables: why do people participate? Why don't people participate? Why are different towns different?
I then grabbed some data on voter turnout in a recent statewide election to use as a proxy for citizen engagement and ran some models in JMP to identify those variables that seemed to have predictive value. The committee then discussed and added some more variables that they thought were important on the town meeting dimension. That generated 20 predictor columns, which I knew was far more than I wanted to deal with. I consulted my brain trust of academic colleagues; special thanks go to Mia Stevens and Ruth Humble at JMP, who advised me on principal components analysis, which I'll note at the outset was not part of my comfort zone, so they coached me a little bit on that. Then I ran cluster analysis, and that's the main event today. People on the committee understood that we probably want to be talking to towns of comparable size, but there's more to similarity than size, and there's more to similarity than being a geographic neighbor. Part of the work involved instructing the committee a little bit on cluster analysis, and just in case anybody watching doesn't have much background in this, here's how I did it. I said, well, we can look at population and something else at the same time, and maybe that something else has an impact on participation. In this case, the Y axis was single-family property tax bills. You can see that there's a bunch of towns similar in size to Sharon, but which might have very different tax impacts. The idea in cluster analysis, if you are going to work in two dimensions, is to choose two attributes that you think are relevant to your query, spread the towns out on those two dimensions, and then identify a reasonable number of towns that are reasonably similar to Sharon. That's the big idea in cluster analysis. Fortunately, we're not limited to two attributes or two dimensions; we can have more than that. With that, I think you now know enough to follow the demo. Walking into this demo, I had gathered data from a variety of state and publicly available sources, used Query Builder to build a large data table, and inspected it for outliers and missing data. The one real outlier is the city of Boston, which is just unique; that's excluded from all the analysis. There's a little bit of missingness, but nothing terrible. I'm going to be showing you a JMP project, so let me switch gears and move into the demo, and I hope that I do this correctly. What we're looking at is my data table of 351 cities and towns. The first several columns are identification: the size of the Select Board, their legislative option, the name of the community. The next 20 columns are our predictors. Just to ground us a bit, if we look at some basic descriptives of the communities, towns in Massachusetts tend to be on the small side; the median is only 10,000 people, and Sharon is quite near the mean community size. In terms of legislative function, 74% use open town meetings, so we are in good company, and in terms of the size of the Select Board, which is another thing the governance committee is looking at, it's just about 50/50: half of the towns with a Select Board have three members, and half have five. We've got these 20 predictors. One issue that comes up pretty early in the analysis is collinearity. Here I have five variables that all speak to the size of the electorate and the size of the town, and you can see that there are some very strong correlations. Generally speaking, we don't want to deal with so much collinearity. One way out is principal components analysis.
At this point, I'm not quite ready to jump into clustering; I want to take those 20 columns and distill them down, conserving as much information as possible but reducing the redundancy and collinearity across columns. To do that, principal components analysis is an excellent option. I don't have the ability today to give a full crash course in principal components analysis, but we can see that we have variables that seem to be overlapping in terms of their message. We also can see that when you give a PCA 20 columns, it initially comes up with 20 components, the first few of which seem to capture most of the variability. We have to make a decision about how many principal components to use and what they represent. For this, the scree plot is helpful, and we're looking for a kink or an elbow in the plot. That seems to happen somewhere down here, around 4, 5, or 6 components. If we consult the loading matrix to see how the different variables load onto the different components, we can begin to subjectively assign meaning to the components. I'll cut to the chase: we selected six principal components as being informative for the purposes of cluster analysis, and came up with some interpretations that made sense to us, things like how big is the town, how affluent is the town, how fast is it growing. Now we're ready for clustering. Back to JMP. There are two basic approaches to clustering; you might think of them as top-down and bottom-up. Both of them take the raw data, standardize it, and then compute Euclidean distances for each pair of rows in the data table, for each pair of communities, taking into account the six factors, the six principal components: size, affluence, education, things like that. Which towns are similar to Sharon in hierarchical clustering? The report starts us off with a dazzling graph that, with 350 rows, is hard to interpret. We'll come back to it; let's begin with something that's a little easier to interpret, which is the cluster summary. In the hierarchical method, JMP has found 16 clusters for us. I can tell you, because I peeked, that Sharon turns out to be in cluster 15: Sharon and 23 other towns. For example, if you scan down the affluence column, and again these are standardized scores, we see that these are the most affluent towns in the state. If we come over here to the growth column, and this is largely growth in housing units and population, these have the least growth, in fact some negative growth. If we come back up here, having looked at the cluster summary, all of the clusters have been identified and colored, and JMP gives us a cut point. If we zoom in on cluster 15 and make it a little bigger, Sharon is here in the center. Its nearest Euclidean neighbor is Winchester, which is about an hour's drive away. We now have a provisional list of towns to consult with. All right, so that's a crash course in hierarchical clustering; I'll move on to K-Means. Hierarchical is bottom-up. We start with 350 individual towns as clusters. We interrogate the distance matrix, all the pairwise distances, find the two towns that are nearest to each other, and they form a cluster. We take the mean of that cluster. Now we're either looking for the next two nearest towns or the next town that's closest to that cluster, and we iteratively build the tree until we have one gigantic cluster of 350 towns. With K-Means clustering, we flip the process: we start with 350 towns in one cluster and then begin slicing and dividing in multiple dimensions.
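A compact sketch of the two-stage recipe described above, using scikit-learn and scipy rather than JMP: standardize the predictors, keep six principal components, and run Ward's hierarchical clustering on the scores, cutting the tree into 16 clusters. The town-level data below are random placeholders.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(8)
towns = rng.normal(size=(350, 20))                       # placeholder: 350 towns x 20 predictors

scores = PCA(n_components=6).fit_transform(StandardScaler().fit_transform(towns))
tree = linkage(scores, method="ward")                    # bottom-up clustering on Euclidean distances
clusters = fcluster(tree, t=16, criterion="maxclust")    # cut the dendrogram into 16 clusters
print(np.bincount(clusters)[1:])                         # cluster sizes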
In this approach, same Euclidean distances, same distance matrix, we end up with Sharon being in cluster number 4, with a full complement of 33 towns. We automatically get a cluster means picture. Again, very affluent, low growth; not necessarily the lowest, but low growth again. We get slightly different results. In the interest of time, I will show you one other graph. There are various things to look at, but let's look at the parallel coordinate plots. What is this tool? We have 16 clusters, and by the way, in K-Means it's up to the user to specify the number of clusters. I chose 16 as a starting point because that's what hierarchical gave us. Here we are, cluster 4. The dark brown line is Sharon. Here we see the six characteristics, the six principal components. For example, if we compare, how is cluster 4 different from cluster 3, let's say, or cluster 5? Maybe similar in size. Cluster 5: less affluent, property values are a little lower. Permanent population captures things like communities with universities, hospitals, prisons, and so forth, or vacation homes with snowbirds who leave for the winter. Towns differ in terms of their permanent populations; in cluster 3, it's much lower. Here's where we find our university towns. I just picked out the town of Shirley, Massachusetts, which has the state's largest maximum security prison. I don't know if we consider these folks permanent residents or not, but in any event. We've done two different clustering methods. Let's take a look at how the results compare. Within each clustering method, I saved the cluster assignments for each town and created binary variables: are you in the same cluster as Sharon or are you in a different cluster? Also, just as an aside, JMP has lots of wonderful built-in geographic maps. It does not have a built-in map showing municipality borders within the state of Massachusetts. But it turns out that with JMP it's fairly easy to create a new geographic map. I was able to do this without very much work at all. Here are the results of hierarchical clustering, cluster 15. Sharon is here. Its similar towns are in blue. I was pleased to see that my little tiny hometown of Marblehead is similar to the place I moved to. Hierarchical clustering, cluster 15, gives us these 24 towns. K-Means clustering adds some more towns, but there's an awful lot of overlap. I also, just out of curiosity, looked to identify the 33 towns, about 10% of the state, that are most similar to Sharon. This is a larger group. Again, an awful lot of repeats. That last approach also gave us some other advantages. I want to now shift back to PowerPoint and talk about some of those. One last point before I finish the demo. We were also curious to ask... Mostly our goal was: who shall we interview? Who shall we call in to meet publicly with our committee? But while we're at it, let's see what our peers do in terms of governance. Statewide, open town meeting (OTM) dominates at 74%, and there is no dominant board size. If we look at who's in our cluster, let me use K-Means because it's a little bit larger group, that 74% jumps up to 85% with open town meeting. By a two-to-one margin, towns have five-member Select Boards. Now, this isn't definitive. To channel my mother: if all the other towns jumped off the Empire State Building, we wouldn't necessarily want to jump off. But it's interesting to note that the towns most similar to Sharon favor the five-member Board and are even more inclined to open town meeting. With that, let me get back to some conclusions.
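For completeness, the K-Means pass just described could be scripted roughly as follows. Argument and message names here are approximate assumptions and may differ by JMP version; treat this as a sketch rather than the presenter's actual code.

// Rough sketch of the K-Means step on the same six component scores.
dt = Current Data Table();
km = dt << K Means Cluster(
	Y( :Prin1, :Prin2, :Prin3, :Prin4, :Prin5, :Prin6 ),
	Number of Clusters( 16 )
);
km << Go;              // fit the 16-cluster solution
km << Save Clusters;   // adds a cluster column used for the map comparison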
So what did we learn? One thing we learned was that geographic proximity is uninformative: in those maps, none of the abutting towns came up blue. Our most similar communities are not our next-door neighbors. As I just noted, open town meeting and five-member boards really predominate. So what did we do? This work actually happened several months ago. We were able to prioritize our outreach and begin contacting those towns most similar to us. Many were extremely cooperative and shared a lot of information and data. We also didn't want to assume that open town meeting was the only option to consider, so we wanted to talk to towns with representative town meeting or councils. Those Euclidean distances became instructive in terms of, okay, none of our immediate neighbors, our closest neighbors, use town council or representative town meeting. But which RTM town, which council town, is most like us? And we contacted those folks as well. We went from having to contemplate outreach to 350 towns, in a limited amount of time and with no money and staff, to a focused sampling method. Then, because town officials talk to one another and they are professionally active, that led us to other interviewees. With that, I think that's about my time. I hope this has been interesting and constructive. Thank you for coming, and I hope you enjoy the rest of the program.
Labels (11):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Exploratory Data Analysis and Visualization of a Large, Online Vaccine Database using JMP Pro 16 (2022-US-45MP-1153)
Monday, September 12, 2022
Recent events have brought about much discussion in both the popular press and the scientific literature about the safety and efficacy of some recent vaccination programs. One frequently referenced data source is the Vaccine Adverse Event Reporting System (VAERS), which now covers more than 20 years. In this session, we will demonstrate the Exploratory Data Analysis and Data Visualization capabilities of JMP Statistical Discovery software. We will begin by using the Subset, Join, and Concatenate platforms from the Tables menu, followed by Graph Builder from the Graph menu. Finally, we will make use of some Screening platforms from the Analyze menu. In general, we will show how to use JMP's "front end" data selection and management tools, drag and drop interactive graphs, and linked analyses to speed the time to discovery for a large, complex dataset. Good morning, good afternoon, and good evening, everyone, wherever you are. My name is Stan Siranovich, and I am the principal analyst at Crucial Connection LLC, and I'm doing this presentation from Jeffersonville, Indiana, right across the river from Louisville, Kentucky. Today I'm going to talk about how to do an exploratory data analysis and visualization of an online vaccine database using JMP Pro 16. That online database is the VAERS system. So what is VAERS? That is the Vaccine Adverse Event Reporting System. It's a national early warning system to detect possible safety problems in US-licensed vaccines. That is the same database that has been in the news for the last year or two. Let's talk about why it was developed. First of all, it was to detect new, unusual, or rare vaccine adverse events, to monitor increases in known events, to identify potential risks, and to assess the safety of newly licensed vaccines. Nowhere in those goals do you see anything about making analysis of the vaccine data set easy, which is why we're doing this presentation. Now, it's maybe structured a bit differently than some of us are used to. I came out of the lab and production facilities, and I'm used to, for lack of a better word, chemical or scientific, rather small data sets. On the rare occasions when I did work on something larger, somebody in the corporate hierarchy cleaned all the data for me. That is not the case here. It's organized by year, and there are three tables per year: there's vax, there's data, and there's symptoms. What I did was, the first week in June, I downloaded all the data as of May 31st, and this is what it looks like. You can download to your heart's content; you do have to sign in. Over on the right side of the screen, let's see what I got. Now, notice in the years here, 2018, 2019, 2020, the zip files are roughly the same size, maybe a 12 to 14 percent increase from year to year. By the way, this zip file is simply these three files here zipped into one. That is normally the best way to work on it: download the zip and unzip it. But notice what happens between 2020 and 2021. We go from 11.2 MB up to almost 169 MB. So something's going on, and we would like to take a look at it. This is what it looked like on my desktop. Now let's talk about the tables there. I mentioned there are three tables per year. There's VAERSVAX, which contains all the vaccine information. It's got information such as the VAERS ID at the top, which is almost 100% unique, the manufacturer, lot, and type. And VAERSDATA is where a lot of the data that we're going to be interested in resides.
Notice it's got the VAERS ID again, and it's got some different columns that we're interested in, such as state, age in years, sex, symptom text, which is a free-form text field that sometimes seems to go on forever, whether or not the patient died, et cetera. Then VAERSSYMPTOMS contains just the VAERS ID and the symptoms, which go from 1 to 5, and sometimes continue on from 6 to 10; we will address that issue towards the end of this presentation. Let me get out of that and drag that over. Right now, you should see the JMP window open. I am going to present from a JMP project. Let me go over that very quickly. By the way, I assume everybody watching this has seen a JMP window before. This is my workspace. I dragged the three files in and opened them up in JMP down here as contents. I opened up a new instance of this. When I'm working on a project, I drag my links and maybe some PDFs or whatnot into that space. Then at the bottom, we have recent files. But mainly what we want to do is look at this window, the main window here. Notice we have tabs here; you click on a tab just like a spreadsheet, and we can see our different sheets. Let's start cleaning. First thing I'm going to do is make... Where is it? There it is, VAX. Make that the home table, and you'll see why in just a minute. So I'm going to start there. Now notice in VAX_TYPE, everything's mixed in together. We've got the COVID-19, which we are going to focus on, HPV9; scroll down a little bit, we have unknown, we have Flu X, et cetera. So what we want to do is separate out the COVID-19. The way we do that: I am going to go up to Rows, Row Selection, Select Where. By the way, in JMP there's almost always several ways of doing things, but to keep things uniform, I will always go up to the menu bar to do this type of thing. What we want to do is separate COVID-19. So I select VAX_TYPE. From the dropdown, I'll just leave the default of equals, and I'll put in COVID-19, and I'll come down here and uncheck the Match case box. Let's check the window. It tells us what we're going to do: select rows in the data table that match specified criteria. That looks right. Click OK. And there it is. Notice it selected all the right rows. It skipped the HPV9 and the unknown here, et cetera. What we want to do, now that we've selected them, is to separate them out. So I will go to Tables, Subset, and it tells us what we're going to do: we're going to create a new data table from selected rows and columns. I went to Selected Rows, which is checked for us already. Notice it says here we can save the script to the table. Normally, that's a good practice; it makes it much more convenient to repeat things. But I'll leave that unchecked for now, since this is a demo. Is there anything else I need to do? Yes, it says Output table name. For a simple analysis, JMP will take care of that for you. But for anything that starts to get a little bit complicated, I recommend deciding on some sort of a naming scheme. So rather than Subset, I am going to name that COVID-19 only. Click the OK button. Here it is, COVID only. We can scroll down and verify that. Notice down here in All rows, we've selected 129,975. Keep that one in mind, because over here, we started off... Oh no, it's data. Where was it? Back. Right here. We started off with 146,500. We've just got the COVID right there. We'd like to see what's going on. Notice here, we don't have any symptoms, we don't have any adverse events.
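As a scripted aside, the Select Where and Subset steps just performed would look roughly like this in JSL. The table name follows the VAERS download convention and the argument names are assumptions, so treat this as a minimal sketch rather than the presenter's actual script.

// Minimal sketch of the point-and-click steps above; names are assumptions.
vax = Data Table( "2021VAERSVAX" );
// Case-insensitive match on the vaccine type
vax << Select Where( Uppercase( :VAX_TYPE ) == "COVID19" );
// New table built from just the selected rows
covidOnly = vax << Subset(
	Selected Rows( 1 ),
	Output Table Name( "COVID-19 only" )
);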
What we're going to have to do is get them out of the data, and that's right here. Now let's take a quick look at that. We've got the columns displayed over here. We've got all sorts of things: whether they died, length of stay, onset date, et cetera, and here's the VAERS ID. Now what we're going to do is join the two tables on the VAERS ID. Let's go back over here to our COVID-19 data only. There's VAX. We've got that. Now what I'm going to do is go up here to Tables. We want to come down here to Join tables. I won't get into the database terminology, but let's just say we want to join tables. I went to VAERSVAX and did it there because it says, up here, "Join this with VAERSVAX," and down here, what I have to do is a couple of things. First of all, let me save that for later. Go down here and select VAERSDATA. Notice we have some windows that pop up here. That shows us all the rows in the second data table. Now we have to match the rows. We're going to come here to the COVID-19 data table, pick the VAERS_ID, and then do the same in the other table. So we're going to click here; they are two separate windows, so you don't have to hold down Control. What we want to do is match them so that things don't get mixed up. Let's look down here, and we get to some database stuff. It tells us that it is an inner join. An inner join selects rows common to both. But I'm not sure about this data, so what I want to do is come over here to Main Table and select Left Outer Join. That's going to keep all the rows in the original table, which was the VAERSVAX table, and all of the matching entries in the VAERSDATA table. Let's look a little bit. Again, we can save the script. Let's give it a name. Let's call that one VAX join DATA and see if there's anything else we need. Yeah, you know what, we could do this in a two-step process, but why don't we do it all in one? Let's go back up here to the VAERSVAX side, and what we'll do is keep the VAERS_ID, because we want to know what that is. We don't need VAX_TYPE, because they're all COVID-19. We may want to look at the different manufacturers and the lot, whether it's dose series one, two, three, or four, whatever, and VAX_ROUTE and VAX_SITE. We won't worry about those for now. Come down here to the other data table and see... What do I want to check? We don't need the VAERS_ID again. Let's check STATE, age in years. Notice there are a couple of other age columns, but we'll let those go for now; we don't really need them. We want to know the response by sex, probably. I don't know, SYMPTOM_TEXT? Yeah, we'll keep that in. We certainly want to know whether or not the patient died, so DIED. Let's do HOSPDAYS, which is how many days they spent in the hospital, if they went to the hospital. Let's do NUMDAYS. We know, from checking the VAERS website, that NUMDAYS is simply the difference between the vaccination date and the symptom onset date, so that tells us how long after the vaccination the symptoms appeared. And we've got a whole bunch of other fields here, and I don't think we need any of those. Let me scroll up and make sure everything's still checked. I hope I have everything that I need. And I'm going to click this one here. It says "Select columns for joined table," in case you can't see it. I'm going to hit Select and put them all in, I hope, and we'll scroll down to check. Again, we can save the script to the table, but let's hit OK. We named it VAX join DATA. There it is, VAX join DATA over here. Now notice we have the manufacturers, the dose, the state. Let's look at state.
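Before looking at the STATE column, here is a hedged JSL sketch of the join just performed. Option names are approximate, the column-picker step is omitted, and the table names follow the VAERS files.

// Left outer join of the COVID-only VAX table with VAERSDATA on VAERS_ID.
covidOnly = Data Table( "COVID-19 only" );
vaersData = Data Table( "2021VAERSDATA" );
joined = covidOnly << Join(
	With( vaersData ),
	By Matching Columns( :VAERS_ID = :VAERS_ID ),
	Include Nonmatches( 1, 0 ),          // keep every row from the main (VAX) table
	Output Table Name( "VAX join DATA" )
);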
The STATE column has a number of missing values, and there are some other considerations in that column too, but we'll get to those later. Let's see: SYMPTOM_TEXT, HOSPDAYS. We're all set. Let's look at this. Let's expand this. This is just a free-form column. Let's make use of one of my favorite features: Show Header Graphs. Let's hit that. And wow! Let's see what we have here; let me pull that over. Let's look at manufacturers. We want to see if maybe one manufacturer or another has more adverse events. It looks like BIONTECH is by far the largest. Then we have some unknowns. Look at this: we have almost 8,000 VAX_LOTs. So if you want to do an examination by lot, that's going to be rather difficult. We see VAX_DOSE_SERIES here, and it looks like Series 1 has the most, though that's a little strange; then comes 2, then comes 3. How did that happen? For now we just note that and move on. Notice for STATE, we have 1, 2, 3, 4, 5 plus 54. Well, last I heard we didn't have 59 states, so we're going to have to check that. AGE_YRS looks okay. It looks like we have a whole lot more females than males. SYMPTOM_TEXT, lots of stuff there. We've noted all that. We can close that, and let's break up the monotony of the cleaning here. Let's do an analysis. Let's go to everybody's favorite platform: Graph Builder. We go up to the menus, select Graph, and select Graph Builder from that. Let's see what we have here. I did bring in hospital days, I hope. Yes. Let's do hospital days. We'll select it, put it on the Y, and let's see if there are any differences among the states. Didn't want to do that; just wanted to select STATE. Put that on the X, and I'll select Bar Graph. There's the bar graph. Notice we do have some anomalies here. Let's look down at the x-axis. We have State AS, State MH, State PW, State QW. That isn't right, so we know we have to do a little bit more cleaning on that. But this is a demo, so we'll leave those in there for now. That's a pretty easy thing to figure out. What we want to do is... come on, it keeps popping up; you have to be a little careful about that. Come up here: I right-click on the x-axis, come up to Order By, and order by hospital days, descending. There it is. We have some unusual results here. It looks like Wyoming. Oh, I should point this out too: JMP automatically selected the mean for us, which for our purposes is probably okay. It looks like the mean hospital stay for the people who suffered an adverse event in Wyoming is, I don't know, 20-and-a-fraction days. Which, man, that's high. Come down here to the next one, Vermont, and it gives us the number of rows; JMP put that in automatically for the hover label. After that, it's Mississippi, and we see Oklahoma and Utah. Let me find... let's see, New York, Pennsylvania; they're all down in here. For some reason, I don't know if it's just chance or what, but we make note of the fact that a couple of the sparsely populated states seem to have longer hospital stays. Let's go back to here. And this is another reason why I like to use JMP projects: you don't have to go hovering over closed windows. Let's go to Graph. We want to do that graph [inaudible 00:20:22]. Yeah, let's do one more example. Go to Graph Builder, and we'll take STATE, put that on the x-axis, and we look at what JMP did for us. It looks like we have some high numbers here. We hover over California, for example, and it says it's got 10,629 rows, and it lists them, and it gives us the state.
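Pausing on the first chart for a moment: the bar chart of mean hospital days by state corresponds roughly to the Graph Builder script below. The element options are approximations of what JMP would generate, not the presenter's saved script.

// Approximate script for the interactive Graph Builder chart described above.
joined = Data Table( "VAX join DATA" );
joined << Graph Builder(
	Variables( X( :STATE ), Y( :HOSPDAYS ) ),
	Elements( Bar( X, Y, Legend( 1 ), Summary Statistic( "Mean" ) ) )
);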
What JMP did here was automatically put the number of rows in there for us, which is a reflection of the number of patients who reported an adverse event. Let's order by descending again and take a look at it. This makes sense. California is far and away the leader, unfortunately for them, with adverse events, but it is also a highly populated state. The next one is Michigan, and the same deal with Florida. So it seems to somewhat mirror the population. There's New York. We'll just leave that open for now and move on. Now let's take a look at the VAERS symptoms. Click that tab. The first thing we notice, and I did not plan this, is VAERS_ID in rows one and two: it's the same number. How can that be? That is supposed to be a unique identifier, but that is somewhat common in the VAERS data set. Let's take a look at why. By the way, I think this is an excellent reason to always spend some time, even if you're in a hurry, looking at the data to see if anything looks a little bit weird. And sure enough, two rows with the same ID number. Now notice the symptoms. This is the SYMPTOMVERSION, or rather, this comes from the MedDRA database. It's how the medical coders code the symptoms, so there's some degree of uniformity nationally and also internationally. This is the MedDRA version for this entry right here, and there are only two versions that I've run across so far in my work with VAERS: 24.1 and 25. So we have symptom one; we have symptom two, chest pain; we have symptom three, the heart rate; and it goes on and on. Then we come back and we have something that looks a little weird here, SARS-CoV-2 and whatever, in this duplicate row with the duplicate number ending in 266, which is really not a duplicate, because there is only one entry out of the five in that row. So that's a bit disconcerting. But we're going to take care of that. What we're going to do here is use a feature in JMP, and it's stacking the variables. If we wanted to do an analysis on symptoms from this table, we would have to go and run it on each one of the columns, et cetera, et cetera. But if we stack the columns, we won't have to do that. So let's stack the columns. We come up here to Tables again, we come up to Stack, and we select Stack. Let's pick SYMPTOM1; hold down Control, or Command if you've got a Mac, up here to five, and take a look at our check boxes. We want to keep everything. Again, save script to source table if we want to, and in case something goes wrong, we may want to keep the dialog box open, but I will not do that. Now, it asks us where to move the columns. Well, the default puts them last, but it would be much more convenient if we moved them after the VAERS_ID. So I'll click that radio button, click the VAERS_ID, and I'll put in a table name; we'll just label that, how about SYMPTOMS STACKED? We'll come up here, put them in, all five, and see if there's anything else. Oh yeah, new column names. If you're stacking several columns, which is often the case, especially if you're trying to pull in some data from a PDF file that was made for human consumption, that could be an issue here. But for right now, we'll just leave the stacked column labeled Data and the source column labeled Label, because it makes sense. One final check; I hit the OK button, and there it is. Well, let's take a look at that one. Here we go, the VAERS_ID, and sure enough, we have the label, and here's the data. Here we have our favorite row, the one that ends in 266. Let's take a look at that. What happened?
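We'll answer that in a moment. First, as a scripted aside, the stacking step just performed looks roughly like this in JSL; the column and table names follow the VAERS files, and the other argument names are assumptions.

// Stack the five symptom columns into one Data column plus a Label column.
sym = Data Table( "2021VAERSSYMPTOMS" );
stacked = sym << Stack(
	Columns( :SYMPTOM1, :SYMPTOM2, :SYMPTOM3, :SYMPTOM4, :SYMPTOM5 ),
	Stacked Data Column( "Data" ),
	Source Label Column( "Label" ),
	Output Table( "SYMPTOMS STACKED" )
);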
This one goes from 1 to 10. So we have ten instances of the same row: symptoms 1, 2, 3, 4, 5, which are filled in, and then it starts again with SYMPTOM1, and the rest of them are empty. That adds up to the 10. Here's our MedDRA data. That makes sense. I know, because I did it before: this is where we want to be. So let's go up to Rows again, Row Selection, Select Where. In case the nomenclature here sounds an awful lot like SQL, that's because that is what JMP is doing under the hood. We want to select some rows. So let's pick CHEST PAIN; we'll leave that all in caps, and we come down here again. There's this little check box hidden way down here, Match case. I don't want to match case, because I don't know how people entered the data and how the good people at the CDC cleaned the data before they posted it on the website. By the way, the VAERS data set is not loaded up instantaneously; the CDC usually updates it about once a week, and they clean the data, then update it. So, Match case unchecked, CHEST PAIN. I should say something about the dropdown: we want equals, but we could use does not equal, or whatever suits our purpose. So let's click OK. Wait a minute: Data equals. That should do it. There it is. If I hover over it, it says SYMPTOMS STACKED. And we look down here at rows again; I find myself referring to that quite frequently, just to get an idea of what's going on. We started off with 890,000-plus rows, and we have 3,667 that were selected. Let's subset. Go up here to Tables again... there it is, Subset; we will select Subset. Of course, we get our pop-up window that tells us what Subset is going to do for us, and we click on that. Let's see: we went to Selected Rows. We could link that to the original data table, save the script to the table. Subset of SYMPTOMS STACKED... let's go with our convention, and we'll call that CHEST PAIN of SYMPTOMS STACKED. I want to make sure I did everything right. No, I want to call that CHEST PAIN. I hope that's right. Click on the OK button, and there it is. We'll take a look here. Notice we only have one row ending in 266, and we have all our other rows with CHEST PAIN in them, and we've got 3,666 selected. Now let's go back to the data and do a bit more analysis after all that cleaning. VAX join DATA, that looks right. Let me drag that over there. Let's look at a couple of different variables, but rather than a graph, this time let's do summary tables. So let's go up to Tables, and Summary is at the top. And this is what it looks like. Well, what do we want? Age and HOSPDAYS and NUMDAYS. I am going to put those in here, but before I can do that, JMP wants us to tell it what statistic to use. Let's use the mean. I select Mean from the dropdown, and it's going to give us the mean of those three columns. Now, there are some other columns here that we have an interest in. I'd say, well, DIED: yeah, we're probably interested in whether or not somebody died; see what's going on there. So we'll select that. Now, we can't take a mean; it's a binary categorical variable. So let's select N and see. No, wait, let's look at STATE. Where was it? Where was STATE? There it is. Now, STATE: we have 50-plus of them, and we want a summary table, so 50-plus rows in the summary table. Let's put STATE into Group. And again, we can save the script, but we won't do it here. One final check; that looks right. We hit the OK button, and here is our summary table. Notice that STATE again: we have some serious cleaning to do.
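As a scripted counterpart, the grouped summary just built would look roughly like this; the output table name is a placeholder and the argument names are assumptions.

// Summary table of means by state, with a count of the DIED column.
joined = Data Table( "VAX join DATA" );
joined << Summary(
	Group( :STATE ),
	Mean( :AGE_YRS ),
	Mean( :HOSPDAYS ),
	Mean( :NUMDAYS ),
	N( :DIED ),
	Output Table Name( "Summary by STATE" )   // placeholder name
);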
There's no state here; there's State GU, there's State MH again. Some have missing values, et cetera, et cetera. And it gives us a mean age in years for all those states, and it looks like the mean age is somewhere in the 40s through the 50s. Come over here. Let's take a look at the mean hospital days by state. And that makes sense: 5, 6, 7 days. It looks like they're all hovering around a week. Just taking a quick look at it, I don't see any outliers. Come over here and do the same with NUMDAYS, and that's the number of days between vaccination and when they say the effect appeared, and let's see, number died. One thing we want to notice is right here. Fortunately, there are not a whole lot of people, thousands, dying. There are a couple with a hundred here, but notice here, N(DIED) with a blank state. Apparently there are a lot of died counts with the state missing. So if we want to analyze that in a bit more detail, we're going to have to be careful there, because there's a lot that can't be assigned to a particular state. What else can we do? Let's just take a look at all of that. We see our states again, all 59 of them. We have the number of rows up here. Let's see, what else can we look at? Mean(AGE_YRS): I haven't mentioned it yet, but JMP is interactive. So I click on that bar, and let's see, here are some people up here in the older age range, I should say. I can come up here; it's a little hard to select that one, maybe. I don't see anything that sticks out at me. Fortunately, for number died, there's a pretty big bar at zero, so that's good. And the mean for NUMDAYS, between the vaccine and the first event, is rather low, right here. We'll just leave it at that. I see that I am just about out of time. So what did we do? We looked at a large online database. We were able to download the ZIP files onto our desktop and open them up in JMP. We were able to do some rudimentary analysis after spending a lot of our time on data cleaning. Notice that even though we spent a lot of our time cleaning the data, we were able to do it all in JMP, which of course is very convenient. Because, number one, we didn't have to switch back and forth, switch here, switch there, run some SQL code and bring it back in. Number two, we could do our analysis inline while we're cleaning the data. We say, "Oh, that looks interesting," we go to Graph Builder, take a look at the data, and see if anything piqued our curiosity or if anything was out of place. When we finished our examination, we could continue with our data cleaning. In this case, we went on to the SYMPTOMS. So that concludes my presentation. I hope everyone enjoyed it and hopefully learned something. Maybe I should put a disclaimer here at the end: this is the way I would do it, after using JMP for a few years. It's not necessarily the way you should do it, and not necessarily the best way to do it. But using JMP, we were able to bring down some data, clean it, and analyze it; some data that's quite a bit in the news right now. I thank you all for watching.
Labels (13):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Functional DOE: Tips and Tricks for Design and Organizing Data (2022-US-45MP-1152)
Monday, September 12, 2022
The Functional Data Explorer (FDE) in JMP Pro allows for analysis of a DOE where the response is a curve. The entire functional DOE analysis workflow can be done within FDE, from smoothing response curves all the way to fitting the functional DOE model and optimization with the Profiler. But what about setting up the design and organizing your data for functional DOE analysis? This presentation will help you understand options around functional DOE design using the Custom Design platform and organizing your data using table manipulations such as Stack, Split, Join, and Update. Hi, everyone. My name is Andrea Coombs. I'm a Senior Systems Engineer, supporting customers from major accounts in the eastern part of the US. Today I'm going to be talking about functional DOE, and specifically about the design of your functional DOEs and how to prepare your data for analysis. I'm going to turn off my video for the presentation here. Let's go in and look at the goals. Really, the goals are very simple here. I want to cover some tips and tricks for setting up your functional DOE using the Custom Design platform and also give you some tips and tricks for adding functional data to your DOE data table. I am going to be using JMP Pro 16.2 during this presentation. Let's start off by defining what is functional data, what is functional data analysis, and specifically, what is functional DOE. Functional data is really curve data. It's any data that unfolds over a continuum, and there's a lot of data that is inherently functional in form. You can think about time series data, sensor streams from a manufacturing process, spectral data produced by lots of different types of equipment, or measurements taken over a range of temperatures. And the Functional Data Explorer in JMP Pro makes it really easy to solve many kinds of problems with functional data. Here we have an example of some functional data: a plot of the Home Price Index for New Jersey, from 1990 to 2021. Like much functional data, we don't get it as the smooth curve we see here, but rather as a series of discrete index values. So we get one value that represents the X, the year, and the value for Y, the Home Price Index. With functional data analysis, it isn't typically just one point of the curve that we are interested in, or even the collection of points from a single curve. We typically have a collection of curves. Here you can see we have the Home Price Index over time for the 50 states plus the District of Columbia. And when we're doing functional data analysis, we want to understand the variability around these curves. Often we want to understand which variables drive the variability in our curves, or maybe we want to use the variability in our curves to predict another outcome. Functional data analysis is going to use all of the information contained in the curves; we're not going to leave any information behind. To model the curves directly, we can treat the curves as first-class data objects, in the same way that JMP treats traditional types of scalar data. When I'm thinking about functional data analysis, I like to break this down into four steps. The first step is to take the collection of curves and smooth the individual curves. Next, we determine the mean of those curves and the shape components. These shape components represent the variability around the mean. Next, we extract the magnitude of each of these shape components.
Knowing the magnitude of the shape components, the function that describes each shape component, and the function that describes the mean, we are able to reproduce all of our curves just by knowing these two shape component scores. Now we can use the shape component scores in an analysis. Here what I've done is a cluster analysis, and I've defined four groups of my curve shapes. In the Functional Data Explorer itself, there are two primary questions that can be answered. The first is about how to adjust process settings and product inputs to achieve a desired function or spectral shape. We call this functional DOE analysis, or FunDOE for short. The second question we can ask is, how can I use entire functions to predict something like yield or quality attributes? We call this functional machine learning, or FunML for short. Today we're going to be focusing on functional DOE. Let's take a closer look at the functional DOE workflow. The first thing you want to do is set up your design using the Custom Design platform. Then you can go out, run your DOE, collect the results, and organize your data to get it ready to put into the Functional Data Explorer. The remaining steps in our functional DOE workflow are done all within the Functional Data Explorer. In the Functional Data Explorer, we can process our data, smooth our individual curves, extract our shape components, and then use our shape component scores for DOE modeling, and we can use the Profiler to address the goals of our DOE. All of this is done within the Functional Data Explorer. Now, there are many presentations at this Discovery Summit, at previous Discovery Summits, and even in our Mastering JMP series on jmp.com, that go over lots of details about the Functional Data Explorer. I'm not going to be talking specifically about the Functional Data Explorer today. What I want to talk about is how to set up your design for a functional DOE and how to organize your data. To do this, I'm going to use this bead mill case study. In this example, we're essentially milling pigment particles for LED screens. You start off with beads and pigment in a slurry in this holding tank. It flows through this milling chamber and comes back to the holding tank in a continuous process. So if we were doing a DOE on this process, some factors that we could look at are the percent of beads we're starting off with here in the holding tank, the percent of pigment particles we're starting off with, the flow rate through the system, and also the temperature. When we're looking at the goal of this DOE, we essentially want to achieve an optimal size-over-time curve. So let's take a look at that optimal curve. The optimal curve is represented by this green curve here. Essentially, we want our pigment particle sizes to decrease so they fall within specification quickly, and our specification range is represented by this green shaded area. We want those particles to remain within specification throughout the duration of the run. That is our optimal curve. Let's go ahead and take a look at data prep. I'm going to talk about data prep first, and then we'll move backwards and talk about the DOE design. For data prep, there are three main tips and tricks I want to share with you. First of all, I want you to understand that the Functional Data Explorer accepts data in different formats. The Stacked Data format is the default format, and it's the most versatile.
But you can also use Rows as Functions. I'm going to go over some table manipulations, such as Stack, Split, Join, and Update, to show you how you can get your data ready for analysis. And then I'll also show you how you can quickly import multiple files if your curve data is stored in separate files. What data format is FDE expecting? Well, there are actually three different formats: Rows as Functions, Stacked Data, and Columns as Functions. Let me open up a data table here and launch the FDE platform to show you that there are different tabs up here for these different formats. The Stacked Data format is the default. We have Rows as Functions and Columns as Functions. This example here happens to be Rows as Functions: each row contains a full function. Here we have the first run from our DOE, and the function is represented here in these columns. Each column represents an X variable, or an input, and then the value within the cell is a Y variable. When we go to populate the Functional Data Explorer, we can come in here, go to Rows as Functions, our Y output is represented in these columns, we can put in our DOE factors and our ID function, and then we can go ahead and analyze that data. So this is Rows as Functions. One thing to know is that Rows as Functions assumes that observations are equally spaced in the input domain, unless you have an FDE X column property. The FDE X column property is something that comes into play when we design our DOE, which we're going to talk about in a second. But I just want to show you here, next to each of these columns, I have a column property associated with it. You can see here's the FDE X column property, and the X input value is two here. If you want to use the FDE X column property, I'll show you at the end how you can use the JMP Scripting Language to assign it. So that's Rows as Functions. Now let's look at Stacked Data. Here's an example of Stacked Data, where I have one run, or one curve, over multiple rows, and each row is an observation of the curve. So in row one here, I have a value for X and a value for Y, and that continues over multiple rows. This is the most common and the most versatile way of organizing your curve data. And when we populate the Functional Data Explorer, we're here in the Stacked Data format; we'll put in our X and our Y of our function, put in our ID of the function, and then we can put in our DOE factors here as supplementary variables. The last type of format that the Functional Data Explorer can use is Columns as Functions. I've never seen data organized this way, and it's a little perplexing; it's hard for me to get my mind around why you would organize your data this way, but I'll show it to you even though it's not very common. In this example, each row is a level of the X variable of the function. So here we have a column for time, and each row represents the X measurement, and then each column represents a run. Let's go ahead and launch the Functional Data Explorer. We'll come over here to Columns as Functions. We can put in our X variable, which is time, and all of our output variables, which are each of our batches. And you'll notice in here we cannot input supplementary variables, because we don't have any way of defining which factor or which treatment is associated with each of these runs. So you cannot do functional DOE with Columns as Functions. Now let's talk about getting your data into your DOE data table. To do this, we're going to use the Tables menu.
We have lots of different platforms here where we can manipulate our data. The two things that we may want to do with our data are reshape it, using Sort or Stack, or add data, using Join, Update, or Concatenate. And especially if you're new to JMP, some of these table manipulation platforms can be a little confusing when you start using them. The little icons next to each of the platforms can be very helpful for knowing which platform does what. So what I've done in my journal is I've taken each of these icons and blown them up here, so we can take a closer look at them to understand what each of these platforms does. Let's first talk about reshaping with Stack and Split, starting with Stack. Stacking is going from wide to tall data. In this example, you have data in multiple columns, and you want to combine that data into one column. Let's look at an example here. Here we have wide data; we have rows as functions. Let's say we want to stack this data so we can use it in the Stacked Data format in the Functional Data Explorer. We're going to come up here to the Tables menu and go to Stack. I'm going to pick all of those columns I want to stack. Here I have 50 measurements in each of my functions, so I'm going to select all 50 of those columns and say I want to stack them. I can come down here and define what my new column names are going to be. The data that I'm stacking is actually my size data. My label column, which happens to be my column name here, refers to my time. Now, two things when I'm doing table manipulations: first, I always give my output table an explicit name. Otherwise, JMP will call it Untitled, and it will iterate through untitled numbers, so I like to give each of my tables an explicit name. And then there's keeping the dialog open. You can check this box to keep the dialog open, so when you hit Apply and see your results, if you didn't get the results you were expecting, you have your dialog here to review what you did and fix what you need to fix to get the desired output. Now this data is stacked and ready to go. Let's go through an example of Split. Split is when you're starting with tall data, or stacked data, and you want to split it out into different columns. In this example here, I have stacked data, and let's say I want to split it out so I can use Rows as Functions in FDE. I'm going to come up here to the Tables menu and use Split. This Split dialog is probably the most confusing of all of them; even after using JMP for many, many years, I always have to step back and think about how to populate it. But Split By Columns is essentially what's going to become your new column headers. So I want Time as my new column headers, I want to split out my size data, and I want to be sure to group this by Run Order. Again, give this an explicit name, and I can keep the dialog open to see how I split this data. Now my data is wide, and I can use Rows as Functions in the FDE. That is reshaping your data: going from wide to tall or from tall to wide. Now let's talk about adding data. A lot of times you're starting out with a DOE data table that you created, such as this. Let me just delete this column out. I want to be able to join my curve data to this table. Here's my curve data in a separate table. Essentially, what I want to do is add the columns in the second table to my first table. I'm going to use Join. Join adds columns. I'm going to start here with my DOE table.
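Before going on with Join, here is a hedged JSL version of the Stack and Split steps just shown. The Time, Size, and Run Order column names mirror the demo but are assumptions, as are the placeholder table and measurement column names.

// Wide (rows as functions) to tall (stacked), then back again.
wide = Data Table( "Curves Wide" );          // placeholder table name
tall = wide << Stack(
	Columns( :T2, :T4, :T6 /* ...all 50 measurement columns... */ ),
	Stacked Data Column( "Size" ),
	Source Label Column( "Time" ),
	Output Table( "Curves Stacked" )
);
// Split: Time values become the new column headers, grouped by Run Order
backToWide = tall << Split(
	Split By( :Time ),
	Split( :Size ),
	Group( :Run Order ),
	Output Table( "Curves Wide Again" )
);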
I'm always going to start with my DOE table because my DOE table has all of these scripts in it that I can use to analyze my data. These are very important, so you always want to start with your DOE table. We're going to use Join and join it with our curve data. You always want to make sure that you're matching up based on your row numbers, so the right curve for the right run goes with the correct factors for that run. And I'm going to select all the columns in my DOE table and my functions from my curve table. Again, I can use an explicit name here when I create this table. Now I have my table ready for analysis. That's an example of Join. Let's talk about Concatenate. I don't use Concatenate too much for my DOE data prep. Concatenate is what you use when you want to add rows to a data table. With DOE, we typically have all of our rows, all of our runs, in our data tables, so we don't need to concatenate, but I just want to run through this example real quick. Let's say I have my data for my first 16 runs. I have 10 observations per run, so I have 160 rows. Then I run my 17th run, and it looks like this. That's the 160; here's my 17th run with the 10 observations from that run. Essentially, what I want to do is concatenate: I want to add these rows at the bottom of this data table. I can start here and come to Concatenate. We're going to add this. With Concatenate, you have this option to append to the first table. I'm just going to add these rows, appending to this data table. Now we have 170 rows of data. That's Concatenate. I want to end with Update. Update can be a very handy tool when you're populating your DOE table with curve data. Here's an example of the DOE data table I created; I have columns here to populate with my curve data. Here's my curve data; here's my DOE data table. Essentially, what I want to do with Update is populate my blank cells with the information I have in this data table. I can do that by matching run order, and then JMP will automatically match up the columns with the same names and update this data table. Let's come here to Tables, Update. Select my table that has my curve data, match on Run Order, and say OK. Now this data table is updated. Those are some table manipulations you can do to get your data ready for analysis. The last thing I want to talk about is importing multiple files. Let's say that your curve data gets stored as separate files for each batch. I have this example here where I have my curve data in 17 different files, and they happen to be CSV files. I want to be able to import each of these CSV files and concatenate them together so I have one data table. You can easily do this by using the Import Multiple Files function under the File menu. When you use Import Multiple Files, you can click on this button here to select the folder that contains all of those files. Here's a list of all those files. Now, the file name itself actually contains my batch number, and this is data that I actually want to pull out of the file name. I'm going to add the file name as a column. We'll import. Here's my curve data with time and size, and here's my file name. Now I can come up to the Columns menu and use the column utility to convert my text to multiple columns. I just have to put in the delimiters I want. I'm going to use the underscore that's before the batch number and the dot that's after the batch number, and I can say OK. That gave me three columns: the curve, the batch number, and the file extension.
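As a scripted aside before finishing the import example: the adding-data steps can also be done in JSL. This sketch covers Update, Concatenate, and Import Multiple Files, with placeholder table names, a placeholder folder path, and approximate option names.

// Placeholder table names throughout; option names are approximate.
doe    = Data Table( "Bead Mill DOE" );
curves = Data Table( "Curve Data" );
// Update: fill the blank response columns in the DOE table, matching on Run Order
doe << Update( With( curves ), Match Columns( :Run Order = :Run Order ) );
// Concatenate: append the 10 rows from run 17 to the first 160 rows
run17 = Data Table( "Run 17" );
doe << Concatenate( run17, Append to first table );
// Import Multiple Files: read every per-batch CSV in one folder and keep
// the file name so the batch number can be parsed out afterwards
mfi = Multiple File Import(
	<<Set Folder( "C:/bead mill/curve files/" ),   // placeholder path
	<<Set Add File Name Column( 1 )
);
mfi << Import Data;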
This is the data that I want, so I'm just going to delete these other columns here. Now I have all of my curve data for all my 17 runs in one file with the batch number. That is what I wanted to show you for data prep. Now let's talk about setting up your DOE design. There are a couple of tricks that I want to show you. In your DOE dialog, there are two things to think about. First of all, we want to make sure we're removing the default response, and then we're going to talk about how to define the functional response based on the format of your curve data. Let's go ahead and launch our DOE dialog. We're going to come up here to DOE and go to Custom Design. Here's our DOE dialog. Now, the DOE dialog, like I said, will have this default Y response. If we just have a functional response in our DOE, we don't need this default response, so we need to get rid of it. What we don't want to do is just delete the name, because that response is still there. What we want to do is select that default response and actually use Remove to get rid of it. Then we want to add a functional response. I'm going to come here and add a functional response. When we're defining our functional response, we can give it a name, the number of measurements per run, and the values. Let's go ahead and do this for our DOE. Our response is size; this is what's on the y-axis of our function. Then we can tell the DOE platform what our X values look like. We can define the number of measurements with the number of X values and what those X values are. Let's say I'm going to measure the size every 2 hours; I'm just going to type in here every 2 hours up to 20 hours. That looks good. The next thing I need to do is add my factors. I have saved my factors and my factor ranges to this factor table, so I'm just going to load in these factors. I have my factors up here. The next thing I want to do is specify my model. I'm going to choose a response surface model, which will add all my two-way interactions and all my quadratics. Finally, I can enter the number of runs. JMP is recommending a default number of 21, but let's say I only have enough time and resources to do 17. I'll say 17, and I will ask JMP to make the design using all of those inputs that I entered. It just takes a couple of seconds for JMP to create this design for me. Here's my design. Here are my 17 runs with the treatment I want to apply for each of those runs. When I'm creating my DOE table, I always want to use this Make Table button. I always like to include the run order column, because the order that these runs are executed in is very important. I'm going to include that run order column and make our DOE data table. Here's our DOE data table. We have our treatments, we have a place for us to enter in results for our function, and I have my run order column here at the end. I also have my scripts that reflect the functional DOE and also the model I specified. Whenever I'm adding my curve data, like I said before, I always want to add it here to my DOE data table, because it contains information about the functional data analysis and the DOE model that was specified. That's a quick overview, but I want to give you some tips about defining the functional response based on the format of your curve data. Let's come back here. We'll go back. Let's come back up here to Responses. The way that you populate this information here will define how your DOE data table looks.
I want to give you a couple of tips for what to enter here, depending on what your curve data looks like, because we want the data prep part to be as easy as possible. There are a couple of things to consider. First of all, is your curve data wide or tall? We talked a lot about this, right? Do you have wide data, or do you have tall data? Is it stacked, or are you going to have rows as functions? The other thing to consider is whether your data is equally spaced, with the same X measurements, or whether your measurements are asynchronous. What do I mean here? Well, let me pull up a couple of examples. In this example here, I just have a few measurements per run: I have 10 measurements per run, they are all equally spaced, and I have the same measurements for each run. When I go to enter this information, there are just a few values to populate here; it's not that difficult. But you might have a scenario where you have asynchronous data, in other words, different measurements for each of your runs, and you might have a situation where you're collecting a lot of data points. My rule of thumb is: if you're around 10, or less than 20, go ahead and populate your values here. But once you start getting above 20, and certainly into the hundreds, that's a lot of information to add to your response here. The other thing to consider is whether your data will be manually entered: are you going to manually enter the responses, or are you going to use Join or Update? Let's run through some scenarios. Let's say you have rows as functions, you have a few measurements, and you're going to manually enter your data. Well, if you're going to do that, then set up your functional response like this. This is what your DOE data table is going to look like. You can manually enter your results in here, and then you can use this script here to run your functional data analysis. That's the first scenario. Now let's say you have rows as functions, you have a few measurements, but you want to use Update to add your data. Again, let's come back and take a look at this example here. In this example, since I defined the name of my response, I get Time in my column header. Let's say when I bring in my data, I just have the number in my column header. If I were going to use Update, these column names would not match. To make these column names match, what I want to do is come back here, remove the name, and then when I create my DOE data table, I just have the number. Then I can use Update to update this data table. Let's go ahead and do that. Here's my data. Let's go ahead and update with my curve data. I'm going to update based on matching the row numbers, and now my data is in here. I can use this script here to go into the Functional Data Explorer and start analyzing this data. That's that scenario. Next, let's say I have rows as functions, again wide data, but I have many measurements; many more than 10. Let's say I have 50. Entering the 50 values in here doesn't make a whole lot of sense. What I'm going to do is set the number of measurements to one, and I can just set the values to one as well. When I go to make my DOE data table, it will look like this: I will get my run order, I will get my factors, and I'll get this blank column. All I need to do is delete that column. Here are my 50 measurements for each of my curves. Again, I'm going to use Join, like I showed you up above.
I'm going to match based on run order, bring in everything from my DOE table and my functions from my curve table, give it an explicit name, and say OK. In this example, since I've used Join, I'm essentially ignoring what I set up as the details around my functional response, so this script here is not going to work. In this scenario, I will have to come back here, open up the Functional Data Explorer, and enter my variables: this is rows as functions, so I enter my supplementary variables, my run ID, and my curves. When I go to do the functional DOE analysis, it will come back and look at the model that's specified here. It's a generalized regression script, so it will remember the model that I specified when I set up my DOE. That is that example. Let's talk about the stacked format. With the stacked format, we typically are going to be adding the data using Join. Again, what I populate here doesn't really matter, as long as I have a functional response entered in here. Again, I get this same data table. I can delete out the response; oops, I added to it, delete column. I can remove that response. Now I can use Join to bring in my stacked curve data by matching on run order, bringing in everything from my DOE table and my function data from my curve table. Again, for this example, running this script is not going to work, so I'm going to have to manually launch the Functional Data Explorer, bring in my X, my Y, my supplementary variables, and my run order, and then I can go ahead and execute the Functional Data Explorer when I go to do the functional DOE. Again, as before, it will look at the model that's included in the generalized regression script, which is based on the model that you specified when you designed your DOE. The last thing I want to mention real quick is the FDE X column property that I talked about before. This is a scenario where I want to bring in my curve data with rows as functions. I have rows as functions, I have many measurements, and I want to add the curve data using Join, but my column headings contain text. In this case, I have the units of measurement for each of my X values here in my column headers. I can join this data together, bringing in my curve data by matching on run order, bringing in all my data from my DOE table and all of my curve data. Let's say I want the Functional Data Explorer to recognize the number in my column header. Well, to do that, I need the FDE X column property. But when I go in here to Column Properties, you're not going to find the FDE X column property. What I can do is use a script to define my FDE X column property; it's going to be based on this. I can run this script, and now I have a column property assigned where I have the number that the FDE will recognize as the X value. That was my last tip or trick. Let's just do a quick wrap-up, a review of the tips and tricks, starting with your DOE design. You always want to remove that default Y response before you add your functional response. You're going to define your functional response based on the format of your curve data, because you want to make your data prep as easy as possible. You always want to add your curve data to the DOE data table to take advantage of not only the FDE script, but also the model script that is created for you by the Custom Design platform.
When you're preparing your data for analysis and bringing in your curve data, just know that the Functional Data Explorer accepts different formats: stacked data and rows as functions. You can use Stack, Split, Join, and Update to get your data ready for analysis. And if your curve data is stored in separate files, use Import Multiple Files. I just want to acknowledge a couple of people. Ryan Parker is the developer of the Functional Data Explorer, and I want to acknowledge him for all of his help with understanding all of the wonderful things that FDE can do. I also want to thank Chris Gotwalt for his leadership, and also for some of the slides that I used at the beginning of my presentation. With that, I thank you very much. If you have any questions, I'd love to hear about them in the chat. Thank you.
Labels: Advanced Statistical Modeling, Automation and Scripting, Basic Data Analysis and Modeling, Consumer and Market Research, Content Organization, Data Blending and Cleanup, Data Exploration and Visualization, Design of Experiments, Mass Customization, Predictive Modeling and Machine Learning, Quality and Process Engineering, Reliability Analysis, Sharing and Communicating Results
Cluster Analysis of Carbon/Carbon-Free Mixed Oxide Nanocomposites by Textural Properties (2022-US-30MP-1150)
Monday, September 12, 2022
A series of C/MxOy/SiO2 nanocomposites has been synthesized through pyrolysis of resorcinol-formaldehyde polymers pre-modified with metal oxide/silica nanocomposites. In turn, the binary nanooxides (MxOy/SiO2, where M = Mg, Mn, Ni, Cu, and Zn) were synthesized via thermal oxidative decomposition of metal acetates deposited onto fumed silica. These materials are promising for adsorption and concentration of trace amounts of organic substances or heavy metals. The nanocomposites exhibit mesoporosity and a narrow pore size distribution, as seen from their distribution profiles. The porosity depends on the composition of the materials. Hence, the textural parameters of the carbon-containing and carbon-free types (or classes) served as input data to develop classification models of both material classes using an unsupervised method: hierarchical clustering. Once the cluster formula was derived, it was established that the surface area and volume of micropores (Smicro, Vmicro), together with the volume of mesopores (Vmeso), have the highest R² (0.83–0.90) and enable successful clustering. Macroporosity demonstrates the lowest fit (R² < 0.1), and its two respective parameters (Smacro, Vmacro) are considered the weakest contributors to the two-cluster model. In parallel, principal components analysis was helpful to distinguish the subject classes of the nanomaterials with a reduced number of variables (three components at eigenvalue > 1). K-Means clustering with two clusters, run on three and on one principal component, was conducted to assign each composite to its proper class. Thus, the case of two multivariate classes of nanomaterials can be described by various independent methods of data science. Hello. I'm Dr. Michael Nazarkovasky, a Ukrainian researcher in chemistry and materials science based in Brazil, involved in data-driven solutions in my area of knowledge and expertise. This presentation is made in collaboration with Ukrainian colleagues from the National Academy of Sciences of Ukraine, supporting and promoting their scientific research programs during such a difficult period. The subject focuses on statistical and data analysis approaches that deepen the understanding behind the experimental results and phenomena. In particular, unsupervised methods are helpful for multivariate cases like this one, when two or more classes of materials are characterized by a large body of parameters measured or calculated in the course of laboratory analysis. This case is about hybrid materials which contain mixed nanooxide and carbon phases. The nanohybrids combine properties of both components: well-ordered micro- and mesoporosity, a large surface area and high porosity in general, high corrosion resistance, thermal and mechanical stability, hydrophobicity and high conductivity due to the presence of carbon, and developed active sites attributed to the metal oxide nanoparticles. Hence, the importance of these nanomaterials lies in the variety of their applications: catalysis, adsorption, sensors, the energy field, and hydrogen adsorption. Typical preparation of the binary, carbon-free oxide nanocomposites consists of three stages. In the first stage, a homogeneous dispersion of silica in the aqueous solution of the corresponding metal acetate was prepared under stirring at room temperature. In the second stage, the dispersions were dried at 130 °C for five hours and sifted through a 0.5-millimeter mesh. In the last stage, all of the ten powders were [inaudible 00:02:26] for 2 hours at 600 °C in air.
A reference sample of fumed silica without metals was also treated under the same conditions, going through the same three stages: homogenization of the aqueous dispersion, drying, grinding, sieving, and carbonization at the same temperature. Through the process of impregnating fumed silica with an aqueous solution of metal acetate and subsequently removing the solvent, the adsorbed acetates are distributed uniformly over the matrix surface. Modification of the resorcinol-formaldehyde polymer by the oxide nanocomposites was carried out by mixing resorcinol-formaldehyde with the pre-synthesized binary nanocomposites (or the reference silica) at a set weight ratio in an aqueous solution under stirring at room temperature. All these samples were sealed and placed in a thermostatic oven for polymerization, and all synthesized composites were then further dried at 90 °C for 10 hours. Carbonization of the samples was carried out in a tubular furnace under a nitrogen atmosphere, heating from room temperature up to 800 °C at a heating rate of 5 °C per minute and keeping the samples at the maximum temperature for two hours. Hence, the carbonized samples are labeled as C, and the initial materials which do not contain carbon are [inaudible 00:04:29]. The textural properties, or in other words the parameters of porosity, were calculated using the modified Nguyen-Do method from the low-temperature nitrogen adsorption-desorption data. This is a standard [inaudible 00:04:49] method for porosity, and it is called the [inaudible 00:04:51]. The calculated parameters are assigned as variables for further data processing using JMP. To be more exact, the specific surface area and total pore volume were derived from the BET measurements and then served to calculate the respective portions of micro-, meso-, and macroporosity. In this case, we have a set of input variables. The Nguyen-Do method was developed initially for the calculation of carbon materials with slit-like porosity; afterward, one of the co-authors of the presentation, Professor Vladimir Gun'ko, modified and extended the method to a large variety of other materials, which may contain cylindrical pores, slits, and voids among the particles within the aggregates and agglomerates of aggregates. The method therefore serves not only for carbon materials but also for hybrids as well as for individual materials. So let's start from the basics. Multivariate analysis indicates that not all the parameters are well correlated with each other in the case of the carbon-free materials. Specifically, mesoporosity dominates over the series, as shown on the heat map and in the table. The parameters corresponding to microporosity show correlation only within their own group and, surprisingly, with macroporosity. By contrast, the heat map of the carbonized nanohybrids speaks for a more consistent and more ordered structure, with almost complete correlation between all the types of porosity, while the role of microporosity is significantly increased. Comparison between the parameters' variances reveals the differences, especially for microporosity, for the specific surface area, and for the total pore volume; for example, the total pore volume and the volume of the mesopores can also be considered as factors that mark the difference between the two types of nanomaterials.
All eight parameters were taken to perform hierarchical clustering, and it is easy to see that a minimal and at the same time optimal model can be offered for three clusters, based on the clustering path and on the constellation plot. The zinc oxide sample cannot be separated within the carbon-free group but can instead be attributed to the carbonized cluster. The main parameters, as seen from the summary, are the volume and surface of the micropores; in other words, microporosity. Indeed, some parameters, the surface and volume of the macro- and mesopores, fall outside the group of carbon-free nanomaterial samples, namely for the zinc oxide. In any case, the mean values for both parameters do not match over the whole parallel plot of their clusters. Principal components analysis helps to represent all the variables as three linear combinations. According to the [inaudible 00:09:02], eigenvalues less than 1 are not taken into consideration; this is why we have only three components whose eigenvalues are higher than 1. The first two describe almost 80% of the samples, or nine samples out of 12; with the help of the third component, the least important one, almost all the samples fit such a model. The first component comprises both micropore parameters [inaudible 00:09:39] and the volume of the mesopores. The other variables take a secondary role in the second and third components, together describing the other half of the samples. Mapping the points in the 3D score plot, it is easy to conclude that the two groups, carbonized and non-carbonized, can be separated into two clusters defined by the three principal components; the two almost flat clusters, comprising points situated on a plane, are set by the two main clustering algorithms and are described by these three components. Taking a closer look at the results of predictor screening by boosting, microporosity is again identified as the main property. For a simplified analysis, two variables can be extracted and plotted with the PCA clusters, that is, using the same PCA but with a completely different number of components, one instead of three, so that a single principal component serves to describe the cluster model. The cluster formula and the equation for the principal component are provided; they can be recommended for future classification or for a qualitative analysis of synthesized samples. As conclusions, I can say that the presented synthesis method makes it possible to obtain mesoporous carbon nanocomposites uniformly filled with metal and metal oxide phases which were pre-synthesized in the silica matrix. With carbonization, the portion of micropores grows and the specific surface area increases, while the total porosity (total pore volume) decreases, giving highly ordered hybrid carbon-oxide nanocomposites with a large specific surface area. The controlled size distribution of the modifier is important from the practical point of view and significantly expands the scope of application of the synthesized materials. The textural parameters are effective variables for identifying and classifying the nanooxide materials. Data visualization has given insights for selecting adequate approaches to analyze the experimental data. K-Means clustering, self-organizing maps, and hierarchical clustering have proven to be powerful tools for classification of the subject materials by textural properties. Principal component analysis, in turn, has reduced the set of variables for a definite and simple classification.
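As a rough illustration of how the clustering and PCA steps described here could be launched in JMP, the JSL sketch below may help; the column names (Smicro, Vmicro, and so on) and the platform options are assumptions based on the parameters named in the abstract, not the authors' actual script.

dt = Current Data Table();

// two-cluster hierarchical clustering (Ward linkage) on the textural parameters
dt << Hierarchical Cluster(
	Y( :Smicro, :Vmicro, :Smeso, :Vmeso, :Smacro, :Vmacro ),
	Method( "Ward" ),
	Number of Clusters( 2 )
);

// principal components on the full set of textural parameters,
// keeping the components with eigenvalue > 1 for the score plots
dt << Principal Components(
	Y( :SBET, :Vtotal, :Smicro, :Vmicro, :Smeso, :Vmeso, :Smacro, :Vmacro )
);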
The study offers two-cluster models, described by three or even one principal component, to classify the carbonized and carbon-free hybrid nanocomposites. I am also thankful for the financial support from the Brazilian funding agency. I am very thankful to my colleague David Kirmayer from the Hebrew University of Jerusalem and to one of the co-authors, Maria Galaburda, who actually synthesized these samples. Thanks also to the Polish foundations and the International Visegrad Fund. Thank you very much for your attention. It was a pleasure for me to make this presentation.
Labels: Advanced Statistical Modeling, Automation and Scripting, Basic Data Analysis and Modeling, Consumer and Market Research, Content Organization, Data Blending and Cleanup, Data Exploration and Visualization, Design of Experiments, Predictive Modeling and Machine Learning, Quality and Process Engineering, Reliability Analysis, Sharing and Communicating Results
Estimating the Parameters of the Poisson Regression Model Under the Multicollinearity Problem (2022-US-30MP-1148)
Monday, September 12, 2022
In this paper, we review the most prominent methods for estimating the parameters of the Poisson regression model when data suffer from a semi-perfect multicollinearity problem, such as the ridge regression and Liu estimator methods. The estimation methods were applied to real data obtained from the Central Child Hospital in Baghdad, representing the number of cases of congenital defects of the heart and circulatory system in children for the period 2012-2019. The results showed the superiority of the Liu estimator method over the ridge regression method, based on the Akaike information criterion (AIC) as the criterion for comparison. Keywords: Poisson regression, Liu estimators, multicollinearity problem. Hello everyone, my name is Raaed Fadhil Mohammed. I am a statistician and a lecturer at the University of Mustansiriyah. My paper title is Estimating the Parameters of the Poisson Regression Model Under the Multicollinearity Problem. Outline: the Poisson regression model, the multicollinearity problem, the ridge regression estimator method, the Liu estimator method, a real data example, conclusions, and references. Poisson regression model: this is one of the regression models that fall under the log-linear family, since by taking the natural logarithm of the distribution formula it turns into a linear form. The random errors in the model follow a Poisson distribution with parameter mu. The model is based on two essential assumptions: the distribution of the random errors, which differs from the distribution of random errors in the linear regression model, and the Poisson distribution parameter mu being a function of the predictor variables. The multicollinearity problem: multicollinearity occurs when two or more predictor variables are tied by a strong linear relation, so that in practice it is difficult to separate the effect of each predictor variable on the dependent variable, or when the value of one of the predictor variables depends on one or more of the other predictor variables in the model under study, as can happen when the data take the form of a time series or cross-sectional data. The multicollinearity problem can be classified into two types. Number 1, perfect multicollinearity: the determinant of the information matrix is zero, that is, det(X'X) = 0. It follows that it is impossible to estimate the parameters of the regression model, due to the inability to calculate the inverse of the matrix X'X. The best approach in this case is to make use of principal component analysis. Number 2, semi-perfect multicollinearity: in this case the determinant of the information matrix is very small, close to zero, and the estimated parameters have considerable variance. The best methods in this case are the ridge regression method or the Liu estimator method. The variance-covariance matrix of the estimated parameters can be expressed by the formula shown here. Perhaps the best statistical measure of the intensity of multicollinearity is the variance inflation factor, VIF, whose formula is as follows: VIF = 1 / (1 - R²), where R² is the coefficient of determination. Ridge regression estimator method: this is one of the important alternatives for estimating the parameters of the regression model when there is multicollinearity between the predictor variables. The method was established following the principle of Hoerl and Kennard, which is to add a small positive quantity to the main diagonal elements of the information matrix.
The ridge regression estimators are biased when k is greater than zero, and the amount of bias can be expressed by the formula (Z - I)β. Liu estimator method: the researcher Liu (1993) laid the foundations of this method to address the variance inflation of the estimated parameters in the presence of a multicollinearity problem. The Liu estimator for the Poisson regression parameters can be expressed by the following formula. Liu estimators are also biased when d is greater than zero, and the magnitude of the bias is (Z - I)β; the reason for the bias is the added value d, which ranges between zero and one. Also, the mean squared error calculated according to the Liu estimator method is less than the mean squared error of the same parameters estimated according to the maximum likelihood method. Real data example: we obtained real data concerning congenital defects of the heart and circulatory system in newborns from the Central Child Teaching Hospital in Baghdad, Iraq, where the distribution of the dependent variable y, representing abnormalities of the heart and circulatory system in children, was studied, along with revealing the existence of a multicollinearity problem among the predictor variables under study. The cases of congenital disabilities arriving at the Central Child Teaching Hospital are recorded in a form prepared by the hospital's Statistics Division as count data, totaled within semi-monthly periods. The sample was taken for the period from 2012 to the end of 2019, and a Poisson regression model was built as one of the appropriate models to describe this phenomenon, with the following formula: yi = exp(β0 + β1 xi1 + β2 xi2 + β3 xi3 + β4 xi4 + β5 xi5 + β6 xi6 + β7 xi7 + ui). Here y represents the total number of children with congenital heart and circulatory defects in each period; xi1 is the total weight of affected children within each period; xi2 is the total age of the fathers of affected children within each period; xi3 is the total age of the mothers of affected children within each period; xi4 represents the number of affected male children within each period; xi5 represents the number of affected female children within each period; xi6 is the number of affected children born from consanguineous marriages within each period; and xi7 is the number of affected children whose mothers were exposed to radiation or to lifestyle influences, such as taking certain medications and drugs during pregnancy. Beta 1 through beta 7 are the slope parameters in the model, beta 0 represents the constant term, and ui represents the random error in the model. This table, Testing Data and Diagnosing Multicollinearity, is used to find the probability distribution that the response variable follows. We used JMP Pro 16.2, and it was found that the dependent variable y follows the Poisson distribution with distribution parameter mu equal to 6.5. To verify the suitability of the Poisson distribution for the response variable y, a goodness-of-fit test was conducted for the variable of the total number of children with congenital heart and circulatory defects in each period. The test shows that the Poisson distribution is the most appropriate distribution for the dependent variable to follow: the goodness-of-fit test value is 4958.4579, with a significance level close to zero, as in the table in front of you.
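Because the formulas above are hard to follow in spoken form, here they are written out in conventional notation. This is a hedged reconstruction of the standard Poisson ridge and Liu expressions, not the speaker's exact slides; the working weights and response come from the usual iteratively reweighted least squares fit.

y_i \sim \mathrm{Poisson}(\mu_i), \qquad \mu_i = \exp\!\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_7 x_{i7}\right)

\mathrm{VIF}_j = \frac{1}{1 - R_j^2}

\hat{\beta}_{\mathrm{ML}} = (X^{\top}\hat{W}X)^{-1}X^{\top}\hat{W}\hat{z}, \qquad \hat{W} = \operatorname{diag}(\hat{\mu}_i)

\hat{\beta}_{\mathrm{ridge}}(k) = (X^{\top}\hat{W}X + kI)^{-1}X^{\top}\hat{W}X\,\hat{\beta}_{\mathrm{ML}}, \qquad k > 0

\hat{\beta}_{\mathrm{Liu}}(d) = (X^{\top}\hat{W}X + I)^{-1}(X^{\top}\hat{W}X + dI)\,\hat{\beta}_{\mathrm{ML}}, \qquad 0 < d < 1

Both shrinkage estimators are biased, but they trade that bias for a substantially smaller variance (and mean squared error) than the maximum likelihood fit when X'X is nearly singular.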
To detect whether there is multicollinearity among the predictor variables under study, we can calculate the correlation matrix between the predictor variables. From the figure, we observe that the correlation coefficients are significant and large for all predictor variables, as each variable is associated with all the other predictor variables through a strong direct linear correlation. This table shows the values of the variance inflation factors; the largest of them are those of the predictor variables X2, X3, and X4, whose variance inflation factors exceed 18. From this we conclude that there is linear multicollinearity between the predictor variables and the [inaudible 00:16:24]. Application of the Poisson ridge regression method: estimating the parameters of the Poisson regression model using ridge regression, we observe that the total number of children with congenital heart and circulatory defects in each period depends on the increase in all parameters of [inaudible 00:17:11]. However, most variables (X1, X4, X5, and X6) are insignificant because of the effects of semi-perfect multicollinearity. Also, the results indicate that the bias parameter is k = 0.12 and the [foreign language 00:17:53] is 86.4959. This table can be obtained using JMP. When applying the Liu estimator method to estimate the coefficients of the Poisson regression model in the presence of the multicollinearity problem, we use a JMP script to connect between the R language and JMP. This script connects to and runs, from JMP, the Liu regression package in R. Applying it to [inaudible 00:19:12] the coefficients, we obtained this result. We observed that the total number of children with congenital heart and circulatory defects in each period depends on the extent of the increase in all parameters of the model, despite the insignificance of the variable xi1, because all the variables under study increase the number of children with congenital disabilities; and the estimator has very good properties. Also, the results indicate that the Akaike information criterion is 35.29 and the bias parameter d equals 4.1. When comparing the two methods, ridge regression and the Liu estimator, we see that the Liu estimator approach gives a lower value of the information criterion and has more significant coefficients when compared to the ridge regression method. Conclusions: in this paper, we reviewed the most prominent methods for estimating the parameters of the Poisson regression model when the data suffer from the problem of semi-perfect multicollinearity; we took the ridge regression method and the Liu estimator method and compared the two based on the Akaike information criterion as the criterion for comparison. By applying the regression analysis methods, in the presence of a semi-perfect multicollinearity problem, to real data regarding congenital heart and circulatory defects in newborns, obtained from the Central Child Teaching Hospital for the period from 2012 to 2019, we find that the Liu estimator method is more efficient than the ridge regression method because it has a lower Akaike information criterion; it also gives more reliable results and more accurate p-values. Number 3: through the Liu estimator method, it is clear that all predictor variables under study are influential in the regression model, even if they are not significant, as all parameters contribute to increasing the number of children with congenital disabilities, but in varying proportions. Thank you.
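The transcript suggests the Liu fit was run through JMP's R integration. Below is a minimal JSL sketch of that bridge under that assumption; the R-side function poisson_liu() and all object names are hypothetical placeholders, since the actual package and routine are not named in the talk.

dt = Current Data Table();
R Init();                                  // start the R session from JSL
R Send( dt );                              // the table arrives in R as a data frame named after the JMP table
R Submit( "\[
	# hypothetical call -- substitute the real Liu/ridge estimation routine
	# and the actual data frame name created by R Send
	fit <- poisson_liu( y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = defects )
	liu_coef <- coef( fit )
]\" );
liu_coef = R Get( liu_coef );              // pull the estimates back into JMP
R Term();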
Labels: Advanced Statistical Modeling, Automation and Scripting, Basic Data Analysis and Modeling, Consumer and Market Research, Content Organization, Data Exploration and Visualization, Design of Experiments, Predictive Modeling and Machine Learning, Quality and Process Engineering, Reliability Analysis, Sharing and Communicating Results
Explainable AI: Unboxing the Blackbox (2022-US-45MP-1147)
Monday, September 12, 2022
Artificial intelligence algorithms are useful for gaining insight into complex problems. One of the drawbacks of these algorithms is they are often difficult to interpret. The lack of interpretability can make models generated using these AI algorithms less trustworthy and less useful. This talk will show how utilizing a few features in JMP can make AI more understandable. The presentation will feature performing “what if” hypothesis testing using the prediction profiler, testing for model robustness utilizing Monte Carlo Simulations, and analyzing Shapley values, a new feature in JMP Pro 17, to explore contrastive explanations. Welcome to the talk E xplainable AI: Unboxing the Black box. Let's introduce ourselves and let's start with Laura. Hello, I'm Laura Lancaster, and I'm a S tatistical Developer in JMP and I'm located in the Cary office. Thanks. What about you, Russ? Hey, everyone. Russ Wolfinger. I'm the Director of Life Sciences R&D in the JMP Group and a Research fellow as well. Looking forward to the talk today. And Pete? My name's Peter Hersch. I'm part of the Global Technical Enablement T eam and I'm located in Denver, Colorado. Great and my name is Florian Vogt. I'm Systems Engineer for the Chemical Team in Europe and I'm located in beautiful Heidelberg, Germany. Welcome to the talk. AI is a hot topic at the moment and a lot of people want to do it. But what does that mean for the industries? Does it mean that scientists and engineers need to become coders, for processes in the future run by data scientists? A recent publication called Industrial Data Science - a review of Machine Learning Applications for Chemical and Process Industries explains industrial data science fundamentals, reviews industrial applications using state of the art machine learning techniques and it points out some important aspects of industrial AI. These are the accessibility of AI, the understandability of AI and the consumability of AI, and in particular the output. We'll show you some of the features that we think are contributing to this topic very well in JMP. Before we start into the program of today, let's briefly review what AI encompasses and what our focus today is located on. I've picked a source that actually separates it into four groups. Those groups are: first, supporting AI also called Reactive Machines and this aims at decision support. The second group is called Augmenting AI or Limited Theory and that focuses on process optimization and the third group is Automating AI, or Theory of Mind, which, as the name suggests, aims on automation, and the fourth is called autonomous AI or self aware AI, which encompasses autonomous optimizations. Today's focus is really on the first and the second topic. We had a brief discussion before. Russ, what are your thoughts on these also with respect to what JMP can cover? Well, certainly the term AI gets thrown around a lot. It's used in many kind of different nuanced meetings. I tend to prefer meetings that are definitely more tangible and usable and more focused. Like we're going to zoom in on today with some specific examples. The terminology can get a little confusing though. I guess I just tend to kind of keep it fairly broad, open mind whenever anyone uses the term AI and try to infer its meaning from the context. Right. That's in terms of introduction, now we'll get a little bit more into the details and specifically into why it is important to actually understand your AI models. Over to you, Pete. Perfect. Thanks, Florian. 
I think what Russ was hitting on there and Florian's introduction is, we oftentimes don't know what an AI model is telling us and what's under the hood. When we're thinking about how well a model performs, we think about how well that fits the data. If we look here, we're looking at a neural network diagram and as you can see, these can get pretty complex. These AI models are becoming more and more prevalent and relied upon for decision making. Really, understanding why an AI model is making a certain decision, what criteria it's basing that decision on, is imperative to taking full advantage of these models. When a model changes or updates, especially with that autonomous AI or the automating AI, we need to understand why. We need to confirm that this model is maybe not extrapolating or basing it on a few points outside of our normal operating range. Hold on. Let me steal the screen here from Florian, and I'm going to go ahead and walk through a case study here. All right, so this case study is based on directional drilling from wells near Fort Worth, Texas. The idea with this type of drilling is unlike conventional wells, where you would just go vertically, you go down a certain depth, and then you start going horizontal. The idea is these are much more efficient than the traditional wells. You have these areas of trapped oil and gas that you can get at with some special completion parameters. We're looking at the data here from these wells, and we're trying to figure out what are the most important factors, including the geology, the location, and the completion factors, and can we optimize these factors to increase or optimize our well production? To give you an idea, here's a map of that basin, so like I mentioned, this is Fort Worth, Texas. You can see we have wells all around this. We have certain areas where our yearly production is higher, others where it's lower. We wanted to ask a few questions looking at this data. What factors have the biggest influence on production? If we know certain levels for a new well, can we predict what our production will be? Is there a way to alter our factors, maybe some of the completion parameters and impact our production? We're going to go ahead and answer some questions with a model. But before we get into that, I wanted to ask Russ, since he's got lots of experience with this. When you're starting to dig into data, Russ, what's the best place to start and why? Well, I guess maybe I'm biased, but I love JMP for this type of application, Pete, just because it's so well suited for quick exploratory data analysis. You want to get a feel for what target you're trying to predict and the predictors you're about to use, looking at their distributions, checking for outliers or any unusual patterns in the data. You may even want to do some quick pattern discovery clustering or PCA type analysis just to get a feel for any structure that's in the data. Then also be thinking carefully about what performance metric would make the most sense for the application at hand. Typically the common one, obviously for continuous predictors would be like root means square error, but there could be cases where may be that's not quite appropriate, especially if there's direct cost involved, sometimes absolute error is more relevant for a true profit- loss type decision. These are all things that you want to start thinking about as well as how you're going to validate your model. I'm a big fan of k-fold cross validation. 
Where you split your data into distinct subsets and hold one out and being very careful about not allowing leakage or also careful about overfitting. These are all concerns that tend to come top of mind for me when I start out with a new problem. Perfect. Thanks, Russ . I'm going to walk through this in JMP some of the tools we can use to start looking at our problem. Then we're going to cover some of the things that help us determine which of these factors are having the biggest impact on well production. I'm going to show Variable Importance and then Shapley Values and we'll have Laura talk to that and how we do that. But first, let's go ahead and look at this data inside a JMP. Like we mentioned here, I have my production from these wells. I have some location parameters so where it is latitude and longitude, I have some geologic parameters. This is about the rock formation we're drilling through. Then I have some completion parameters and this is factors that we can change while we're drilling as well, some of these we can have influence on. If we wanted to go through and this dataset only has 5,000 rows. When talking to Russ in starting to prep this talk, he said just go ahead and run some model screening and see what type of model fits this data best. To do that, we're going to go ahead and go under the Analyze menu, go to Predictive Model and hit Model Screening. I'm going to put my response, which is that production, take all of our factors, location, geology and completion parameters, put those into the X and grab my Validation and put it into Validation. Down here we have all sorts of different options on types of models we can run. We can pick and choose which ones maybe make sense or don't make sense for this type of data. We can pick out some different modeling options for our linear models. Even, like Russ mentioned here, if we don't have enough data to hold back and do our validation. That way we can utilize a k-fold cross validation in here. Now to save some time, I've gone ahead and run this already so you don't have to watch JMP create all these models. Here's the results. For this data, you can see that these tree based methods: Boosted Tree, Bootstrap f orest and XGB oost all did very well at fitting the data, compared to some of the other techniques. We could go through and run several of these, but for this one, I'm going to just pick the Boosted Tree since it had the best RS quare and root average square error for this dataset. We'll go ahead and run that. After we've run the screening, we're going to go ahead and pick a model or a couple of models that fit well and just run them. Al right, so here's the overall fit in this case. Depending on what type of data you're looking at, maybe an RS quare of .5 is great, maybe an RSquare of .5 is not so great. Just depending on what type of data you have, you can judge if this is a good enough model or not. Now that I have this, I want to answer that first question. Knowing a few parameters going in, what can I expect my production level to be? An easy way to do that inside a JMP with any type model is with this profiler. Okay, so we have the profiler here, we have all of the factors that were included in the model, and we have what our expected 12 month production to be. Here I can adjust if I know my certain location. I know the latitude and longitude going in maybe I know some of these geologic parameters. 
I can maybe adjust several of these factors and figure out of the completion parameters and figure out a way to optimize this. But I think here with a lot of factors, this can be complex. Let's talk about the second question, where we were wondering which one of these factors was having the biggest influence. You can see based on which of these lines are flatter or have more shape to them, what is the biggest influence. But let's JMP do that for us. Under Assess Variable Importance, I'm going to just let JMP go through and pick the factors that are most important. Here you can see it's outlined the most important down to ones that are less important. I like this feature colorize the profiler. Now it's highlighted the most important factors and gone down to the least important factors. Again, I can adjust these and see, oh, it looks like maybe adjusting the depth of the well, adding some more [inaudible 00:15:51], it might improve my production. That is the way we could do this but we have a new way of looking at the impact of each one of these factors on a certain well. We can launch that under the red triangle in JMP 17 and Shapley values. I can set my options or save out the Shapley values. Once I do that, it will create new columns in my data table that save out the contributions from each one of those factors. This is where I'm going to let Laura talk to some Shapley values. I just wanted going to talk briefly about what Shapley values are and how we use them. Shapley values are a model agnostic method for explaining model predictions and they are really helpful for Black box models that are really hard to interpret or explain. The method comes from cooperative game theory and I don't have time to talk about the background or the math behind the computations, but we have a reference at the bottom of the slide and if you Google it, you should be able to find a lot of references to learn more if you're interested. What these Shapley values do, they tell you how much each input variable is contributing to an individual prediction for a model. That's away from the average predicted value across the input dataset and your input dataset is going to come from your training data. Shapley values are additive, which makes them really nice and easy to interpret and understand. Every prediction can be written as a sum of your Shapley values plus that average predicted value, which we refer to as the Shapley intercept in JMP. They can be computationally intensive to compute if you have a lot of input values, input variables in your model or if you're trying to create Shapley values for a lot of predictions. We try to give some options for helping to reduce time in JMP . These Shapley values, as Peter mentioned, were added to the prediction profiler for quite a few of the models in JMP Pro17 and they're also available in the graph profiler. They're available for Fit Least S quares Nominal Logistic, O rdinal Logistics, Neural, Gen Reg, P artition, Bootstrap Forest and Boosted Tree. They're also available if you have the XB oost Add -In. Except in that Add-I n, they're available from the model menu and not from the prediction profiler. Okay, next slide. In this slide I want to just look at some of the predictions from Peter's model. This is from a model using five input variables. These are stack bar charts of the first three predictions coming from the first three rows of his data. On the left you see a stack bar chart of the Shapley values for the first row. 
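Written out, the additivity Laura describes is the following decomposition (standard Shapley notation, not a JMP-specific formula; the worked bar-chart numbers continue below):

\hat{f}(x) = \phi_0 + \sum_{j=1}^{p} \phi_j(x)

Here \phi_0 is the average prediction over the training data (the Shapley intercept in JMP), and \phi_j(x) is the Shapley value of input j for the row being explained; each colored segment in the stacked bar charts is one \phi_j(x), and stacking the segments on the intercept reproduces the prediction.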
That first prediction is 11.184 production barrels in hundreds of thousands. Each color in that bar graph is divided out by the different input variables. Inside the bars are the Shapley values. If you add up all of those values, plus the Shapley intercept that I have in the middle of the graph, you get that prediction value. This is showing you that first of all, all of these are making positive contributions to the production and they show you how much, so the size, I can see that longitude and proppant are contributing the most for this particular prediction. Then if I look to the right side to the third prediction, which is 2.916 production barrels and hundreds of thousands, I can see that two of my input variables are contributing positively to my production and three of them are having negative contributions, the bottom three here. You can use graphs like this to help visualize your Shapley Values . That helps you really understand these individual predictions. Next slide. This is just one of many types of graphs you can create. The Shapley values get saved into your data table. You can manipulate them and create all kinds of graphs in Graph Builder and JMP. This graph is just a graph of all the Shapley values from over 5,000 rows of the data split out by each input variable. It just gives you an idea of the contributions of those variables, both positive and negatively to the predictions. Now I'm going to hand it back over to Peter. Great. Thanks, Laura. I think now we'll go ahead and transition to our second case study that Florian is going to do. Should I pass the screen share back to you? Yeah, that would be great. Make the transition. Thanks for this first case study and thanks for the contributions. Really interesting. I hope we can bring some light onto a different kind of application with our second case study. I have given it the subtitle, Was it Torque? Because that's a question we'll have hopefully answered by the end of the second case study presentation. This second case study is about predictive maintenance and the particular aspects of why it is important to understand your models in this scenario. Most likely everybody can think that it's very important to have a sense for when machines require maintenance. Because if machines fail, then that's a lot of trouble, a lot of costs, when plants have to shut down and so on. It's really necessary to do preventative maintenance to keep systems running. A major task in this is to determine when the maintenance should be performed and not too early, not too late, certainly. Therefore, it's a task to find a balance which limits failures and also saves costs on maintenance. There's a case study that we're using to highlight some functions features and it's actually a synthetic data set which comes from a published study. The source is down there at the bottom. You can find it. It was published in the AI for Industry event in 2020. The basic content of this dataset is that it has six different features of process settings, which are product or product type which denotes for different quality variants. Then we have air temperature, process temperature, rotational speed, torque, and tool wear. We have one main response and that is whether the machine fails or not. When we think of questions that we could answer with a model or models or generally data. There's several that come to mind. Now, the most obvious question is probably how we can explain and interpret settings, which likely lead to machine failure. 
This is something that [inaudible 00:24:38] to create and compare multiple models and then choose the one that's most suitable. Now, in this particular setting where we want to predict whether a machine fails or not. We also have to account for misclassifications that is either a false positive or a false negative prediction. With JMP's decision threshold graphs and the profit matrix, we can actually specify an emphasis or importance to which outcome is less desirable. For example, it is typically less desirable to actually have a failure when the model didn't predict one compared to the opposite, misclassification. Then besides the binary classification, of course, you'd be also interested in understanding what drives failure typically. There are certainly several ways to deal with this question. I think visualization is always a part of it. But when we're using models we can consider, for example, this self explaining models like decision trees or we can use built- in functionality like the prediction profiler and the variable importance feature. The last point here when we investigate and rate which factors are most important for the predictive outcome, we assume that there is an underlying behavior. The most important factor is XYZ, but we do not know which factor actually has contributed to what extent to an individual prediction. A gain, Shapley values are a very helpful addition that can allow us to understand the contribution for each of the factors in individual prediction. On a general level, now, let's take a look into three specific questions and how we can answer those with the software. The first one is how do we adjust predictive model with respect to the high importance of omitting false negative predictions? This assumes a little bit that we've already done a first step because we've already seen model screening and how we can get there. I'm starting one step ahead. Let's move into JMP to actually take a look at this. We see the dataset, we can see it's fairly small, not too many columns. It looks very simple. We only have these few predictors and there's some more columns. There's also a validation column that I've added, but it's not shown here. As for the first question, let's assume we have already done the model screening. Again, this is accessible on the analyzed predictive model screening where we don't specify what we want to predict and the factors that we want to investigate. Again, I have already prepared this. We have an outcome that looks like this. It looks a little bit different than in the first use case because now we have this binary outcome and so we have some different measures that we can use to compare. But again, what's important is that we have an overview of which of the methods are performing better than other ones. As we said, in order to now improve the model and emphasize on omitting these false negative predictions. Let's just pick the one and see what we can do here. Let's maybe even pick the first three here, so we can just do that by holding the control key. Another feature that will help us here is called decision threshold and it's located in the red triangle decision threshold. The decision threshold gives us several contents. We have these graphs here, these shows the actual data points. We have this confusion matrix and we have some additional graphs and matrix, but we will focus on the upper part here. Let's actually take a look at the test portion of the set. 
When we take a look at this, we can see that we have different types of outcomes. The default of this probabilities threshold is the middle, which would be here at .5. We have now several options to see and optimize this model and how effective it is with respect to the confusion matrix. Confusion matrix, we can see the predicted value and whether that actually was true or not. If we look at when o failure is predicted, we can see that here, with this setting, we actually have quite a high number of failures, even though there were no predicted. Now we can interactively explore how adjusting this threshold actually affects the accuracy of the model or the misclassification rates. Or in some cases, we can also put an emphasis on what's really worse than an other failure. We can do this with the so called profit matrix. If we go here, we can set a value on which of the misclassifications is actually worse than the other one. In this case, we really do not want to have a prediction of no failure when there actually is a failure happening. We would put something like 20 times. More importantly, we do not get this misclassification and we set it and hit okay, and then it will automatically update the graph and then we can see that the values for the misclassification have dropped now in each of the models and we can use this as an additional tool to select a model that's maybe most appropriate. That's for the first question of how we can adjust a predictive model with respect to the higher importance of omitting false negative predictions. Now, another question here is also when we think of maintenance and where we put our efforts into maintenance, is how can we identify and communicate the overall importance of predictors? What factors are driving the system, the failures? Let's go back to the data table to say that first, I personally like visual and simplistic ones. One of them that I like to use is stuff like the parallel plot. Because it's really a nice overview summarizing where the failures group and which parameters settings and so on. On the modeling and machine learning side, there's a few other options that we can actually use. One that I like because it's very crisp and clear, is the predictor screening. Predictor screening gives us very compact output about what is important and it's very easy to do and it's under analyzed screening, predictor screening. A ll we need to do is say what we want to understand and then specify the parameters that we want to use for this. Click okay, and then it recalculates and we have this output. For me, it's a really nice thing because as I said, crisp and clear and consuming. But we've talked about this before and Russ, when we're working with models particularly, do you have any other suggestion or do you have anything to add to my approach to understanding the factors of which predictors are important. Yes, it is a good thing to try. As I mentioned earlier, you got to be really careful about overfitting. I tend to work with a lot of these wide problems, say from Genomics and other applications, where you might even have many more predictors than you have observations. In such a case, if you were to run predictor screening, say maybe pick the top 10 best and then go turn right around and fit a new model with those 10 only, you've actually just set yourself up for overfitting if you did the predictor screening on the entire data set. That's the scenario I'm concerned about. 
It's an easy trap to fall into, because you think you're just filtering things down, but you've got to reuse the same data twice. The danger would be if you were to plug then apply that model to some new data, it likely won't do nearly as well. If you're in the game where you want to reduce predictors, I tend to like to prefer to do it within each fold of a K-fold. The drawback of that is you'll get a different set every time, but you can't aggregate those things. If you got a certain predictor that's just showing up consistently across folds. It's very good evidence of that. That's a very important one. I expect that's what would happen in this case with, say, torque. Even if you were to do this exercise, say 10 times with 10 different folds, you'd likely get a pretty similar ranking, but it's more of a subtlety but again, a danger that you have to watch out for. Job can make it a little bit easier just because things are so quick and clean, like you mentioned, that you might fall into that trap if you're not careful. Yeah, that's very valuable addition to this approach. Just accompanying to this additional information, there's also the other option that we have, particularly when we have already gone through the process of creating a model where we can then actually again, use the prediction profiler and the variable importance. It's another way where we can assess which of the variables have the higher importance. Russ, do you want to say word on that also in contrast maybe to the predictor screening? Yeah. Honestly, Vogt, I like the importance were it a little better. Just dive right into the modeling. Again, I would prefer with K-fold. Then you can just use the variable importance measures, which are often really informative directly. They're very similar. In fact, predictor screening, I believe, it's just calling bootstrap forest in the background and collecting the most important variables. It's basically the same thing. Then following up with the profiler, which can be excellent for seeing exactly how certain variables are marginally affecting the response, and then drilling that even further with Shapley to be able to break down individual predictions into their components. To me, it's a very compelling and interesting way to dive into a predictive model and understand what's really going on with it, kind of unpacking the black box and letting you see what's really happening. Yeah, thanks. I think that's the whole point, making it understandable and making it consumable besides, of course, actually getting to the results, which is understanding which factors are influencing the outcome. Thanks. Now, I have one more question, and you've already mentioned it. When we score new data, in particular, what can we do to identify which predictors have actually influenced the model outcome? Now, with what we have done so far, we have gained a good understanding of the system and know Which of the factors are the most dominant and we can even derive operating ranges. If the system changes, what if a different factor actually drives a failure? Then as it would be expected in this case and we have talked to Laura beforehand, and Shapley V alues again are a great addition that will help us to interpret. we've seen how we can generate them, and you've learned on which platforms they'll be available. Now, the output that you get when you save out Shapley Values is, for example, also a graph that shows us per actual per row. 
In this case, we have 10,000 rows in the data tab, so we have 10,000 stack bar charts, and then we can already see that besides the common pattern, there's also times when there's actually other influencing factors that drive the decision of the model. It's really a helpful tool to not only raise an individual prediction, but also add on to that what Russ just said, understanding of the system, which factors contribute. When we move a little bit on in this understandable or exploratory path, we can use these Shapley Values in different ways. What I personally liked was the suggestion to actually plot the Shapley value by their actual parameters setting, because that allows us to identify areas of settings. For example, if we take rotational speed here, we can see that there's actually areas of this parameter that tend to contribute a lot in terms of the model outcome, but also in terms of the actual failure. That also helps us in getting more understanding with respect to the actual problem of machine failure and what's causing it, and also with respect to why the model predicts something. Now, finally, I like to answer the question. When we take these graphs of Shapley values, and we have seen it before in several occasions, stock is certainly a dominant factor. But from all of these, I've just picked a few predictions, and we can see that sometimes it stocks, sometimes it's not. With the Shapley values, we have really a great way to interpreting a specific prediction by the model. All right, so those were the things we wanted to show. I hope this gives some great insight into how we can make AI models more explainable, more understandable, more easy to digest and to work with, because that's the intention here. Yeah, I'd like to summarize a little bit. Pete, maybe you want to come in and help me here. I think what we're hoping to show is that as these AI models become more and more prevalent and are relied upon for decision making, that understanding, interpreting, and being able to communicate those models is very important. We hope that with these Shapley v alues, with the variable importance, and with the profiler, we've shown you a couple of ways that you can share share those results and have them easily understandable. That was the take- home there between that and being able to utilize model screening, and things like that, that hopefully, you found a few techniques that will make this more understandable and less of a black box. Yeah, I absolutely agree. Just to summarize, I really like to thank Russ and Laura for contributing here with your expertise. Thanks, Pete. It was a pleasure. Thanks, everybody for listening. We're looking forward to having discussions and questions to answer.
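For readers who want to reproduce the predictor screening step demonstrated in the second case study, here is a minimal JSL sketch. The column names follow the features listed for the synthetic data set and are assumptions about how they appear in the table.

dt = Current Data Table();
dt << Predictor Screening(
	Y( :Machine Failure ),
	X( :Product Type, :Air Temperature, :Process Temperature,
	   :Rotational Speed, :Torque, :Tool Wear )
);

As Russ cautions above, if the screening result is used to pick a reduced predictor set, it should be rerun inside each validation fold rather than once on the full table, to avoid overfitting.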
Labels: Advanced Statistical Modeling, Automation and Scripting, Basic Data Analysis and Modeling, Consumer and Market Research, Design of Experiments, Mass Customization, Predictive Modeling and Machine Learning, Quality and Process Engineering, Reliability Analysis, Sharing and Communicating Results
Using JMP for Everyday Automation (2022-US-30MP-1143)
Monday, September 12, 2022
While JMP is definitely a time-saver in the world of statistical problem solving, with a small amount of up front JSL scripting, JMP also can be a huge time saver for the more mundane aspects of daily work. This presentation demonstrates how to free up time for more meaningful work using JSL to automate daily workflows, send automatic reports as emails, perform automated file management and more. While there will be some scripting involved, this presentation is accessable to all, as JMP will do most of the heavy lifting (scripting). The presentation is live demonstration using JMP only. Hi, this is Jed Campbell. I'm here to present today on maybe a little bit different than what we normally think of as things to go and learn about at Discovery. It turns out that when you really look at it, there's two things that we all do. There are two things that we all do, and really, it comes down to the things that we want to do in life and the things that we don't want to do in life. I think Randall Munroe summed this up. In the last few years, we've all spent a lot of time doing things that we don't really want to do. But that comes down to the same at work as well. I think a lot of the time when we come to JMP Discovery, we want to focus on the things we want to do. We want to focus on learning how to do better and faster analysis, more statistical tools. Today, I'm actually going to focus on the things that we don't want to do, the mundane tasks that make life less worth living, I guess. We're going to focus on ways to make those simpler, better, and faster more. But before we begin, I want to maybe talk a little bit about history. In 1981, I remember my dad came home with our first computer. It was a Commodore VIC- 20, had 4k of Ram, and I had to look this up, but a 1- Megahertz processor, and it was an 8-bit processor. A nything we wanted to do in that, we had to program. I'm presenting today on a basic laptop that you can buy at any store with 16 gigs of RAM, a super fast processor, and 64 bit processor. Really it comes down to what is the difference between a million and a billion? Notice that we have a log scale here. This Commodore V IC-20 was at the beginning of things, and now we're somewhere in this range. Essentially, the difference between a million and a billion is a billion. Or the difference between an old school processor and current processors is almost infinite. What that means is that while elegant code can feel good to write and can execute a little bit faster, it's really hard to decipher. Brute force code, on the other hand, is lazy and easy to read. The good news is there are plenty of CPU cycles for a brute force approach. That's what we're going to be talking about. I don't think we need to do anything wild or crazy to make mundane tasks more easy for us in JMP. Before I begin, I just like to say nobody likes to be watched using a computer. Obviously, if something goes wrong, this journal is already available on the community.jmp.com, and we'll make it work. But anyway. Also, shout out to the Scripting Index. I want to make sure that if you find yourself doing scripting in JMP, that you should probably set a keyboard shortcut for the Scripting Index because it really speeds things up. A way to do that would be to come up to the View menu. View, Customize, Menus and Toolbars, and then you can find Scripting Index. I've already assigned the keyboard shortcut of Control Shift X. I can just reassign that right now. 
What that means is anytime I want to see the Scripting Index, I can hit Ctrl+Shift+X and it pops up. Super useful, and I use it a ton. This little button right here is going to make a folder, because we're going to be doing some file manipulation. I'll hit the OK button, and now I have a folder with an Excel document and a PowerPoint document in it. For those of you familiar with demos in JMP, you may notice this Excel document looks suspiciously like Big Class, because that's what it is, as is the PowerPoint document. We're going to be messing around with some of those things in a little while. That all being said, let's get started. Let's talk about things that we don't necessarily want to do but must do. I work in a quality group, which means there are a lot of requirements that customers and governing bodies have for us that just need to be done. I could spend a lot of time doing those things, or I could make, for example, a few little dashboards. These are dumbed-down versions of those dashboards; they're all functional in here if you're going through this journal on your own later. For example, somebody emails you an Excel file and you have to do something with it, or you have to send emails to people, or you have to do some work in some weird random network folder that you can never remember. We'll go through some of these examples together, but just know that they are there to interact with later. First, let's talk about the Web command in JSL. It turns out it's not just for websites. Sure, you can open a website with it, but you can also open a file from SharePoint, or a Google Doc, or whatever cloud service your organization uses. You can open a folder on your computer. You can also open a non-JMP document in its native application because, remember, it's more like saying, "Computer, do something for me the way you're used to doing it." For example, we can open an Excel file in Excel, if you ever need to do that. You can also use it to compose an editable email. In fact, looking on Wikipedia, a URI, or Uniform Resource Identifier, is a standard way to tell your computer to do all sorts of things. Scrolling through all of these, apparently there's an ed2k scheme, if you know what that is. If you're on an Apple computer and you want to FaceTime someone, you can use a URI scheme to script that; on Microsoft computers you can do the same thing. There are a whole bunch of config settings; you can use them to take pictures with the camera or change notifications. The list is pretty much endless. Let's actually dive into it now, with this button here. Again, we've seen that the Web command works, of course, for opening web addresses. If I use the Enter key on my numeric keypad, that runs that line of the script, and it does what I expect. We can also copy and paste a URL from a SharePoint site, like I mentioned before, and make a link to that with JMP. You can use it to open a folder; here is the shortcut to open the folder we're going to be working with. You can open a non-JMP document with it: instead of using the Open command, you would use Web, and here is a PowerPoint document. You can open an Excel file in JMP or in Excel. Using the Open command, if I run this line, it will open the Excel file that's in our folder in JMP.
But if that's not the behavior I want, I can use the Web command instead, and it will open that file in Excel, if for some reason I need a feature that only Excel has, or I'm working with people who don't have JMP. This is where it gets more fun. I just chose the mailto URI scheme as something to experiment with. This is just an online mailto URL generator. We can open this and say, to test@example.com, subject: JMP is great, lots of exclamation marks. Then it generates that URL for me. All I have to do is Ctrl+C to copy it to my clipboard. Then I can come in here, make a new line: Web, open quotes, paste that in, put my semicolon at the end. Now I get an email pop-up that I can edit before I send it. Sometimes that's really nice. JMP's Mail command doesn't allow you to modify an email before you send it. For example, if you have a long list of people who need to be emailed, but you want to personalize each message before you send it, this is a way to do that. I'm not going to send that, because I don't want those people to get bombarded. Again, here's another list of URI schemes; there are tons of ways to do this, we just chose mailto as an example. That being said, we've learned one way to do mail. If you're not familiar with JMP's built-in Mail command, let's go over that. Say, for example, you have a report or a dashboard that you output and you need to email a bunch of people, but they don't all necessarily have JMP. Let's walk through opening a data table, running a script on it, saving the result as interactive HTML, and then emailing that. In this case, I'm just going to tell JMP to open the Big Class data table from its sample data folder and store that in something called dtbc, for Big Class data table. Then I'm going to tell that Big Class table to run the script that's already saved to the table, called Bivariate, and do that in a new window called nw. Then all I need to do is tell nw to Save Interactive HTML. I probably could have scripted this to close automatically for me, but I need to get to that folder now that I've closed it, so I'm going to cheat a little bit and open the folder. Now it has that report as an .htm file, and back to our regular program here. Now I can run this line, and JMP will automatically open it in my default web browser. So I've got that file, and now I want to email it to people. It should be relatively simple, but it turns out it's maybe a little more complicated than you'd expect. If you have a static list of people you want to email, like test and test2 at example.com, "Hey, here's your monthly reminder to do the thing," and I want to attach this file, I can select those lines and run them. My organization pops up a little Allow or Deny prompt for each person; your security systems may be a little different. It's not perfect yet, and I haven't found a way around this, but that just sent two emails. Now, you might think that since this is a list here, I could put that list into a variable and then mail it, but for some reason it doesn't work. We'll go ahead and run it and it gives an error message. I tried a couple of different ways of evaluating this list, Eval, Parse, all those things, and then I just thought, "Well, no, I'm more in favor of brute force here."
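As a rough, self-contained JSL sketch of the commands discussed above (Web, Open, Save Interactive HTML, and Mail): the folder paths and email addresses are placeholders, the talk actually invokes the table's saved Bivariate script rather than launching Bivariate directly, and the Mail attachment argument is worth confirming in the Scripting Index for your JMP version.

    // Open a web page, a folder, and a non-JMP file in its native application
    Web( "https://community.jmp.com" );
    Web( "C:\Demo Folder" );                     // placeholder folder path
    Web( "C:\Demo Folder\Big Class.xlsx" );      // opens in Excel, not JMP

    // Compose an editable email through the mailto URI scheme
    Web( "mailto:test@example.com?subject=JMP%20is%20great!!!" );

    // Build a report from Big Class and save it as interactive HTML
    dtbc = Open( "$SAMPLE_DATA/Big Class.jmp" );
    biv = Bivariate( Y( :weight ), X( :height ) );
    Report( biv ) << Save Interactive HTML( "C:\Demo Folder\report.htm" );

    // Send a finished (non-editable) email with JMP's Mail command
    Mail( "test@example.com", "Monthly reminder",
          "Here's your monthly report.", "C:\Demo Folder\report.htm" );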
One way to do it is to throw all my email addresses into a list, say there are 47 of them, and then just iterate through it with a For loop. When we run that, it works just fine, and it pops up that message my organization requires. That's another way to do it: you can either use the URI scheme to pull up an editable email, or you can have JMP do the whole thing for you. It works pretty well. I'm sure somebody in the comments or in the group will know what I'm doing wrong here, and I'm looking forward to learning about that. But back to this. We know we can email files. We can also do other things with files. For example, we can automate creating a folder structure. At the beginning of each year, there are a couple of different places where I need to make a folder with a subfolder in it for each month, so we can figure out where things are being filed. Now, I could walk through Explorer, create a new folder, give it a name, then go into that new folder, create another new folder, give it a name, and, oh my word, that just seems too mundane. I'd almost rather die. Or I can just do it right here. I can create a variable that is the current year, and then iterate through months 1 to 12 and create a directory with the year and then that month. Let's go ahead and run that, and when I pop into this, I see a directory that wasn't there before, 2022, and when I go into it, it's already done for me. I'm so happy I could cry, because I don't have to waste my life on this. That's just one example. Another example, probably our biggest example today, is reviewing a list of files. Part of my job is that each month, different people do different audits, and those audits are stored in folders. But they're not stored in just one set of folders; it's lots of folders. I could comb through all of them to learn which ones were done so that I could review each of them, or I can just make a script to do it. Here's the script, which is a little more complicated, but it's really still just brute force, piecing things together; a sketch of the key pieces appears below. The first thing we're going to do is execute this code right here, which gets a date, the beginning of last month; this is being recorded in July, so that's the date there. Then I want to go into the folder we've been working with and get a list of all the files in this directory. There's a recursive option, which just means also look through the subfolders, because I don't want to. I can run through that and get this list of files, and I can hover over it and see that it has the documents we've been looking at. From that list of all the files, I need to know which ones are the new files. To do that, I'll create an empty list. Then I'm going to look at the creation date of each of the files in the full list; if that creation date is later than the date I set earlier, I'll keep it and add it to the new list. That's all nice and fine and dandy; let's run that, and nothing appears to happen. I want to actually be able to see it, so I can tell JMP to show me a new window with a list box of those new files. There I have the list. That's nice, but I can't really do anything with it yet.
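Here is a minimal JSL sketch of the two file-management ideas just described, the month-folder structure and the new-file filter. The base path and folder naming are placeholders, and the exact spelling of the recursive option, the Date Increment alignment argument, and the Creation Date function are worth confirming in the Scripting Index.

    // Make a year folder with one subfolder per month (path and naming are placeholders)
    base = "C:\Demo Folder\";
    yr = Char( Year( Today() ) );
    Create Directory( base || yr );
    For( m = 1, m <= 12, m++,
        Create Directory( base || yr || "\" || yr || "-" || Char( m ) )
    );

    // Collect files created since the start of last month
    cutoff = Date Increment( Today(), "Month", -1, "start" );
    allFiles = Files In Directory( base, recursive );   // include subfolders
    newFiles = {};
    For( i = 1, i <= N Items( allFiles ), i++,
        If( Creation Date( base || allFiles[i] ) > cutoff,
            Insert Into( newFiles, allFiles[i] )
        )
    );
    Show( newFiles );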
If I just copy and paste this down below and reformat it a little, it's the same script as before, but now I append a little button box at the bottom that says Open, and I tell JMP that when I click that Open button, it should find what I have selected and then, going back to the beginning here, use that Web function to open the file. Then I get something that presents me with a list, lets me select one or more of the files, and hit the Open button. Honestly, this one little thing saves me so much time each month and so much hassle combing through lots of different folders to find the things I need to review. I think the lesson here is not so much that you have to do the exact same things I've done, but to start thinking about what you can do with this. Now, I'm not saying you could hack your job and get paid to do nothing for five years; that's mostly there for a laugh. But what I am saying is, if we go back to that notion that we all do two things, the things we want to do and the things we have to do, maybe we can challenge ourselves to find a way to save, I don't know, 30 minutes a week or 30 minutes a month by automating the tasks we don't really want to do. That way we can focus a little more on the tasks we do want to do. I'd love to hear your comments and how you've succeeded with this. Thanks.
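To round out the example, here is a hedged JSL sketch of the review window described at the end of the talk, assuming the newFiles list and base path from the sketch above; display-box options such as width and nlines can be adjusted or dropped.

    // Show the new files with an Open button; selected items open via Web()
    nw = New Window( "New files to review",
        V List Box(
            lb = List Box( newFiles, width( 400 ), nlines( 15 ) ),
            Button Box( "Open",
                sel = lb << Get Selected;
                For( i = 1, i <= N Items( sel ), i++,
                    Web( base || sel[i] )   // open each selection in its native application
                );
            )
        )
    );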
Labels:
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Exploration and Visualization
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Comparing Predictive Model Performance with Confidence Curves (2022-US-45MP-1141)
Monday, September 12, 2022
Repeated k-fold cross-validation is commonly used to evaluate the performance of predictive models. The problem is, how do you know when a difference in performance is sufficiently large to declare one model better than another? Typically, null hypothesis significance testing (NHST) is used to determine whether the differences between predictive models are "significant", although the usefulness of NHST has been debated extensively in the statistics literature in recent years. In this paper, we discuss problems associated with NHST and present an alternative known as confidence curves, which has been developed as a new JMP add-in that operates directly on the results generated from JMP Pro's Model Screening platform. Hello. My name is Bryan Fricke. I'm a product manager at JMP focused on the JMP user experience. Previously, I was a software developer working on exporting reports to standalone HTML files, JMP Live, and JMP Public. In this presentation, I'm going to talk about using confidence curves as an alternative to null hypothesis significance testing in the context of predictive model screening. Additional material on this subject can be found on the JMP Community website in the paper associated with this presentation. Dr. Russ Wolfinger is a Distinguished Research Fellow at JMP and a co-author, and I would like to thank him for his contributions. The Model Screening platform, introduced in JMP Pro 16, allows you to evaluate the performance of multiple predictive models using cross-validation. To show you how the Model Screening platform works, I'm going to use the Diabetes data table, which is available in the JMP sample data library. I'll choose Model Screening from the Analyze > Predictive Modeling menu. JMP responds by displaying the Model Screening dialog. The first three columns in the data table represent disease progression in continuous, binary, and ordinal forms. I'll use the continuous column named Y as the response variable. I'll use the columns from Age to Glucose in the X, factor role. I'll type 1234 in the Set Random Seed input box for reproducibility. I'll select the checkbox next to K-Fold Crossvalidation and leave K set to five. I'll type 3 into the input box next to Repeated K-Fold. In the method list, I'll deselect Neural. Now I'll click OK. JMP responds by training and validating models for each of the selected methods using their default parameter settings and cross-validation. After completing the training and validating process, JMP displays the results in a new window. For each modeling method, the Model Screening platform provides performance measures in the form of point estimates for the coefficient of determination, also known as R squared, the root average squared error, and the standard deviation of the root average squared error. Now I'll click Select Dominant. JMP responds by highlighting the method that performs best across the performance measures. What's missing here is a graphic to show the size of the differences between the dominant method and the other methods, along with a visualization of the uncertainty associated with those differences. But why not just show p-values indicating whether the differences are significant? Shouldn't a decision about whether one model is superior to another be based on significance? First, since the p-value provides a probability based on a standardized difference, a p-value by itself loses information about the raw difference. A significant difference doesn't imply a meaningful difference.
Is that really a problem? I mean, isn't it pointless to be concerned with the size of the difference between two models before using significance testing to determine whether the difference is real? The problem with that line of thinking is that it's power, or one minus beta, that determines our ability to correctly reject a null hypothesis. Authors such as Jacob Cohen and Frank Schmidt have suggested that typical studies have the power to detect differences in the range of .4 to .6. So let's suppose we have a difference where the power to detect a true difference is .5 at an alpha level of .05. That suggests we would detect the true difference, on average, 50% of the time. In that case, significance testing would identify real differences no better than flipping an unbiased coin. If all other things are equal, type 1 and type 2 errors are equivalent, but significance tests that use an alpha value of .05 often implicitly assume type 2 errors are preferable to type 1 errors, particularly if the power is as low as .5. A common suggestion to address these and other issues with significance testing is to show the point estimate along with a confidence interval. One objection to doing so is that a point estimate along with a 95% confidence interval is effectively the same thing as significance testing. Even if we assume that is true, a point estimate and confidence interval still put the magnitude of the difference and the range of the uncertainty front and center, whereas a lone p-value conceals them both. So various authors, including Cohen and Schmidt, have recommended replacing significance testing with point estimates and confidence intervals. Even so, the recommendation to use confidence intervals raises the question: which ones do we show? Showing only the 95% confidence interval would likely encourage you to interpret it as another form of significance testing. The solution provided by confidence curves is to literally show all confidence intervals up to an arbitrarily high confidence level. How do I show confidence curves in JMP? To conveniently create confidence curves in JMP, install the Confidence Curves add-in by visiting the JMP Community home page. Type Confidence Curves into the search input field. Click the Confidence Curves result. Now click the download icon next to the .jmpaddin file. Now click the downloaded file. JMP responds by asking if you want to install the add-in. You would click Install; however, I'll click Cancel, as I've already installed the add-in. So how do you use the add-in? First, to generate confidence curves for this report, select Save Results Table from the top red triangle menu located on the Model Screening report window. JMP responds by creating a new table containing, among others, the following columns: Trial, which contains the identifiers for the three sets of cross-validation results; Fold, which contains the identifiers for the five distinct sets of subsamples used for validation in each trial; Method, which contains the methods used to create models from the test data; and N, which contains the number of data points used in the validation folds. Note that the Trial column will be missing if the number of repeats is exactly one, in which case the Trial column is neither created nor needed. Save for that exception, these columns are essential for the Confidence Curves add-in to function properly. In addition to these columns, you need one column that provides the metric to compare between methods.
I'll be using R squared as the metric of interest in this presentation. Once you have the model screening results table, click Add-Ins from JMP's main menu bar and then select Confidence Curves. The logic that follows would be better placed in a wizard, and I hope to add that functionality in a future release of this add-in. As it is, the first dialog that appears asks you to select the name of the table that was generated when you chose Save Results Table from the Model Screening report's red triangle menu. The name of the table in this case is Model Screen Statistics Validation Set. Next, a dialog is displayed that requests the name of the method that will serve as the baseline from which all the other performance metrics are measured. I suggest starting with the method that was selected when you clicked the Select Dominant option in the Model Screening report window, which in this case is Fit Stepwise. Finally, a dialog is displayed that requests you to select the metric to be compared between the various methods. As mentioned earlier, I'll use R squared as the metric for comparison. JMP responds by creating a confidence curve table that contains p-values and corresponding confidence levels for the mean difference between the chosen baseline method and each of the other methods. More specifically, the generated table has columns for the following: Model, in which each row contains the name of the modeling method whose performance is evaluated relative to the baseline method; P-Value, in which each row contains the probability associated with a performance difference at least as extreme as the value shown in the Difference in R Square column; Confidence Interval, in which each row contains the confidence level we have that the true mean is contained in the associated interval; and finally, Difference in R Square, in which each row is the maximum or minimum of the expected difference in R squared associated with the confidence level shown in the Confidence Interval column. From this table, confidence curves are created and shown in a Graph Builder graph. So what are confidence curves? To clarify the key attributes of a confidence curve, I'll hide all but the support vector machines confidence curve using the local data filter by clicking on Support Vector Machines. By default, a confidence curve only shows the lines that connect the extremes of each confidence interval. To see the points, select Show Control Panel from the red triangle menu located next to the text that reads Graph Builder in the title bar. Now I'll shift-click the Points icon. JMP responds by displaying the endpoints of the confidence intervals that make up the confidence curve. Now I'll zoom in and examine a point. If you hover the mouse pointer over any of these points, a hover label shows the p-value, confidence interval, difference in the metric, and the method used to generate the model being compared to the reference model. Now we'll turn off the points by shift-clicking the Points icon and clicking the Done button. Even though the individual points are no longer shown, you can still view the associated hover label by placing the mouse pointer over the confidence curve. The point estimate for the mean difference in performance between the support vector machines and Fit Stepwise models is shown at the 0% confidence level, which is the mean value of the differences computed using cross-validation.
A confidence curve plots the extent of each confidence interval from the generated table, between zero and the 99.99% confidence level, along the left Y axis. P-values associated with the confidence intervals are shown along the right Y axis, and the confidence level associated with each confidence interval is shown on the left. The Y axis uses a log scale so that more resolution is shown at higher confidence levels. By default, two reference lines are plotted alongside a confidence curve. The vertical line represents the traditional null hypothesis of no difference in effect. Note that you can change the vertical line position, and thereby the implicit null hypothesis, in the X axis settings. The horizontal line passes through the conventional 95% confidence interval. As with the vertical reference line, you can change the horizontal line position, and thereby the implicit level of significance, by changing the Y axis settings. If a confidence curve crosses the vertical line above the horizontal line, you cannot reject the null hypothesis using significance testing; for example, we cannot reject the null hypothesis for support vector machines. On the other hand, if a confidence curve crosses the vertical line below the horizontal line, you can reject the null hypothesis using significance testing; for example, we can reject the null hypothesis for boosted tree. How are confidence curves computed? The current implementation of confidence curves assumes the differences are computed using R-times repeated K-fold cross-validation. The extent of each confidence interval is computed using what is known as a variance-corrected resampled t-test. Note that the authors Claude Nadeau and Yoshua Bengio point out that a corrected resampled t-test is typically used in cases where training sets are five or ten times larger than validation sets. For more details, please see the paper associated with this presentation. So how are confidence curves interpreted? First, the confidence curve graphically depicts the mean difference in the metric of interest between a given method and a reference method at the 0% confidence level, so we can evaluate whether the mean difference between the methods is meaningful. If the mean difference isn't meaningful, there's little point in further analysis of a given method versus the reference method with respect to the chosen metric. What constitutes a meaningful difference depends on the metric of interest as well as the intended scientific or engineering application. For example, you can see the model developed with the decision tree method is on average about 14% worse than Fit Stepwise, which arguably is a meaningful difference. If the difference is meaningful, we can evaluate how precisely the difference has been measured by evaluating how much the confidence curve width changes across the confidence levels. For any confidence interval not crossing the default vertical line, we have at least that level of confidence that the mean difference is nonzero. For example, the decision tree confidence curve doesn't cross the Y axis until about the 99.98% confidence level, so we are nearly 99.98% confident the mean difference isn't equal to zero. In fact, with this data, it turns out that we can be about 81% confident that Fit Stepwise is at least as good, if not better, than every method other than generalized regression Lasso. Now let's consider the relationship between confidence curves.
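For readers who want to see roughly what a variance-corrected resampled t interval looks like, here is a small JSL sketch using the Nadeau-Bengio style correction, in which the usual variance of the mean difference is inflated by the validation-to-training size ratio. The numbers are made up, and the add-in's exact computation may differ, so see the accompanying paper for the authoritative formula.

    // Hypothetical per-fold differences in R square (3 repeats x 5 folds = 15 values)
    d = [0.021, 0.008, 0.034, 0.002, 0.018,
         0.041, 0.012, 0.022, 0.029, 0.011,
         0.017, 0.003, 0.026, 0.019, 0.009];
    n = N Rows( d );                                   // number of resampled differences
    ratio = 1 / 4;                                     // n_validation / n_training for 5-fold CV
    dbar = Mean( d );
    se = Sqrt( (1 / n + ratio) * Std Dev( d ) ^ 2 );   // corrected standard error
    tcrit = t Quantile( 0.975, n - 1 );
    Show( dbar - tcrit * se, dbar + tcrit * se );      // corrected 95% interval for the mean difference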
If two or more confidence curves overlap substantially and the mean difference of each is not meaningfully different from the other, the data suggest each method performs about the same as the other with respect to the reference model. For example, we see that on average the support vector machines model performs within .5% of Bootstrap Forest, which is arguably not a meaningful difference. The confidence intervals do not overlap until about the 4% confidence level, which suggests these values would be expected if both methods really do have about the same difference in performance with respect to the reference. If the average difference in performance is about the same for two confidence curves, but the confidence intervals don't overlap much, the data suggest the models perform about the same as each other with respect to the reference model, and moreover we are confident of a non-meaningful difference. This particular case is rarer than the others, and I don't have an example to show with this data set. On the other hand, if the average difference in performance between a pair of confidence curves is meaningfully different and the confidence curves have little overlap, the data suggest the models perform differently from one another with respect to the reference. For example, the generalized regression Lasso model predicts about 13.8% more of the variation in the response than does the decision tree model. Moreover, the confidence curves don't overlap until about the 99.9% confidence level, which suggests these results would be quite unusual if the methods actually performed about the same with respect to the reference. Finally, if the average difference in performance between a pair of confidence curves is meaningfully different and the curves have considerable overlap, the data suggest that while the methods perform differently from one another with respect to the reference, it wouldn't be surprising if the difference is spurious. For example, we can see that on average support vector machines predicted about 1.4% more of the variance in the response than did [inaudible 00:19:18] nearest neighbors. However, the confidence intervals begin to overlap at about the 17% confidence level, which suggests it wouldn't be surprising if the difference in performance between each method and the reference is actually smaller than suggested by the point estimates. Simultaneously, it wouldn't be surprising if the actual difference is larger than measured, or if the direction of the difference is actually reversed. In other words, the difference in performance is uncertain. Note that it isn't possible to assess the variability in performance between two models relative to one another when the differences are relative to a third model. To compare the variability in performance between two methods relative to one another, one of the two methods must be the reference method from which the differences are measured. But what about multiple comparisons? Don't we need to adjust the p-values to control the family-wise type 1 error rate? In his paper about confidence curves, Daniel Berrar suggests that adjustments are needed in confirmatory studies where a goal is prespecified, but not in exploratory studies. This suggests using unadjusted p-values for multiple confidence curves in an exploratory fashion, and only a single confidence curve, generated from different data, to confirm a finding of a significant difference between two methods when using significance testing.
That said, please keep in mind the dangers of cherry-picking and p-hacking when conducting exploratory studies. In summary, the Model Screening platform introduced in JMP Pro 16 provides a means to simultaneously compare the performance of predictive models created using different methodologies. JMP has a long-standing goal to provide a graph with every statistic, and confidence curves help to fill that gap for the Model Screening platform. You might naturally expect to use significance testing to differentiate between the performance of the various methods being compared. However, p-values have come under increased scrutiny in recent years for obscuring the size of performance differences. In addition, p-values are often misinterpreted as the probability that the null hypothesis is true. Instead, a p-value is the probability of observing a difference as or more extreme, assuming the null hypothesis is true. The probability of correctly rejecting the null hypothesis when it is false is determined by power, or one minus beta. I have argued that it is not uncommon to have only a 50% chance of correctly rejecting the null hypothesis with an alpha value of .05. As an alternative, a confidence interval could be shown instead of a lone p-value; however, that would leave open the question of which confidence level to show. Confidence curves address these concerns by showing all confidence intervals up to an arbitrarily high level of confidence. The mean difference in performance is clearly visible at the 0% confidence level, and that acts as a point estimate. All other things being equal, type 1 and type 2 errors are equivalent, and confidence curves don't embed a bias toward trading type 1 errors for type 2 errors. Even so, by default, a vertical line is shown in the confidence curve graph for the standard null hypothesis of no difference. In addition, a horizontal line is shown that delineates the 95% confidence interval, which readily affords a typical significance testing analysis if desired. The defaults for these lines are easily modified should a different null hypothesis or confidence level be desired. Even so, given the rather broad and sometimes emphatic suggestion to replace significance testing with point estimates and confidence intervals, it may be best to view a confidence curve as a point estimate along with a nearly comprehensive view of its associated uncertainty. If you have feedback about the Confidence Curves add-in, please leave a comment on the JMP Community site. And don't forget to vote for this presentation if you found it interesting or useful. Thank you for watching this presentation, and I hope you have a great day.
Labels:
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
JMP Pro 17 Remedies for Practical Struggles with Mixture Experiments (2022-US-45MP-1140)
Monday, September 12, 2022
In mixture experiments, the factors are constrained to sum to a constant. Whether measured as a proportion of total volume or as a molar ratio, increasing the amount of one factor necessarily leads to a decrease in the total amount of the other factors. Sometimes also considering unconstrained process factors, these experiments require modifications of the typical design and analysis methods. Power is no longer a useful metric to compare designs, and analyzing results is far more challenging. Framed in the setting of lipid nanoparticle formulation optimization for in vivo gene therapy, we use a modular JSL simulation to explore combinations of design and analysis options in JMP Pro, highlighting the ease and performance of the new SVEM options in Generalized Regression. Focusing on the quality of the candidate optima (measured as a percentage of the maximum of the generating function) in addition to the prediction variance at these locations, we quantify the marginal impact of common choices facing scientists, including run size, space-filling vs. optimal designs, the inclusion of replicates, and analysis approaches (full model, backward AICc, SVEM-FS, SVEM-Lasso, and a neural network used in the SVEM framework). Hello, everyone, and welcome to JMP Discovery. Andrew Karl, Heath Rushing, and I have a presentation today that's going to highlight some of the features in JMP 17 Pro that will help you out in the world of mixture models. Many of you are involved in formulations, and to be honest, that's what we've been doing a lot of lately. We're a lot like ambulance chasers, in that we'll just go after the latest thing that customers are interested in. What we're seeing a lot lately is folks doing mixture models that are actually quite complex, more so than we'd ever known. We decided we would do some deeper investigation with some of the new techniques that are out and that JMP 17 performs for us. Andrew, would you like to get started and maybe give a little background on the whole idea of what a mixture model is, and some of the other techniques? Okay, so let's start out with a nice, easy graph. Let's take a look at the plot on the left. We're in an experimental setting, so suppose we've got two factors, Ingredient A and Ingredient B, and they range from 0 to 1. If there are no mixture constraints, then everything in the black square is a feasible point in this factor space, and our design routine is going to give us points somewhere in this space. However, if there's a mixture constraint where these have to add up to one, then only the red line is feasible. We want to get a conditional optimum given that constraint, and we want to end up somewhere on that line for both our design and our analysis. If we move up to three mixture ingredients, A, B, and C, all able to vary from 0 to 1, then we get a cube for that 0-1 constraint on each of them. But the mixture constraint takes the form of a plane intersecting that cube, which gives us this triangle, so only this red triangle is relevant out of the entire space. If we have four mixture factors, then the allowable factor space is actually a three-dimensional subset, a pyramid within the cube. Looking back at the three-mixture setting, see this triangle? That's the allowed region. That's why JMP gives us these ternary plots.
For these ternary plots, what JMP will do if you have more than three mixture factors is show two factors at a time, and the third axis will be the sum of all the other mixture factors. We can look at these ternary plots rather than having to look throughout a pyramid. We have to decide: do we want a space-filling design or an optimal design? Normally, in a non-mixture setting, we'd use an optimal design, and for the most part we wouldn't consider a space-filling design. There are a few reasons we want to consider a space-filling design in mixture settings. Often in the formulations world, there's more sensitivity to going too far in your factor space; if you make it too wide, your entire process fails. Suppose that happens over here with X2: suppose everything below 0.1 fails. You're going to lose a good chunk of your runs, because the optimal design tends to put most of your runs on the boundary of the factor space, so you're going to lose these. You're not going to be able to run your full model with the remaining points, and you're not going to have any good information about where that failure boundary is. For the space-filling design, if you have some kind of failure below 0.1, you're losing a smaller number of points. Your remaining points still give you a space-filling design in the remaining space that you can use to fit the model effects, and now we're going to be able to model that failure boundary. Also, in the mixture world we often see more higher-order effects active (interactions, curvature, polynomials) than we might see in the non-mixture setting. Optimal designs are conditional on the target model we give, so if we don't specify an effect up front for the optimal design, we might not be able to fit it after the fact, because it might be aliased with other effects that we have. Space-filling runs act as something of a catch-all of possible runs. So there are a couple of reasons we might want to consider space-filling runs, but we want to take a look analytically: what's the difference in performance between these after we run through model reduction, not just at the full model, because we're not just going to be using the full model. That's the design-phase question. When we get to the analysis, whenever you're looking at these mixture experiments in JMP, JMP automatically turns off the intercept for you, because if you want to fit the full model with all of your mixture main effects, you can't include the intercept. You'll get this warning because the mixture main effects are constrained to sum to one, and they add up to what the intercept is, so they're aliased. Also, if we want to look at higher-order effects, we can't do it like a response surface where we have pure quadratic terms. We have to look at these Scheffé Cubic terms, because if we try to fit the interactions plus the pure quadratic terms, we get other singularities; a sketch of these model forms follows below. Those are a couple of wrinkles in the analysis. However, going forward with the model selection methods, Forward Selection or Lasso, which are the base methods of the SVEM methods we're going to be looking at, we want to consider sometimes turning off this default 'no intercept' option. What we find is that for the Lasso method we actually have to do that in order to get reasonable results. After we fit our model, now we want to do model reduction to kick out irrelevant factors.
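As a quick reference for what is being described, here is the standard Scheffé mixture-model notation for q mixture components. This is textbook notation rather than anything specific to this talk; JMP's Scheffé Cubic option corresponds to the last form (the full cubic model would also include three-way terms x_i x_j x_k, omitted here).

    % Mixture constraint: an intercept is aliased with the sum of the components
    \sum_{i=1}^{q} x_i = 1

    % Scheffé linear model (fit without an intercept)
    E(y) = \sum_{i=1}^{q} \beta_i x_i

    % Scheffé quadratic model (cross products replace pure quadratic terms)
    E(y) = \sum_{i=1}^{q} \beta_i x_i + \sum_{i<j} \beta_{ij} x_i x_j

    % Scheffé cubic terms avoid the singularities that x_i^2 would introduce
    E(y) = \sum_{i=1}^{q} \beta_i x_i + \sum_{i<j} \beta_{ij} x_i x_j
           + \sum_{i<j} \gamma_{ij}\, x_i x_j (x_i - x_j)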
We've got a couple of different ways of doing that in base JMP. Probably what people do most frequently is use the Effect Summary to go backwards on the p-values, kicking out the terms with large p-values. But this is pretty unstable because of the multicollinearity from the mixture constraint, where kicking out one effect can drastically change the p-values of the remaining effects in the design. What this plot shows is, if we go backwards on p-values, what the largest p-value kicked out at each step is, and we see some jumping around here, and that's from that effect. Given that kind of volatility in the fitting process, you can imagine that small changes in your observed responses, maybe even from assay variability or any other variability, can lead to large changes in the reduced model that comes out of this process. That high model variability is something it would be nice to average over in some way, in the same way that with Bootstrap Forest we average over the variability of the CART or partition methods, and that's what the SVEM methods are looking at doing. In a loose sense, they're the analog of that for the linear methods we're looking at. We can also go to Stepwise and look at minimum AICc, which is maybe the preferred method. In our last slide today we'll take a look, for base JMP users, at AICc versus BIC versus p-value selection with our simulation diagnostics. I want to give credit to a lot of the existing work leading up to the SVEM approach; these are some great references that we've used over the years. I also want to thank Chris Gotwalt for his patience in answering questions and sharing information as they've discovered things along the way. That's really helped set us up to be able to put this to good effect in practice. Speaking of practice, where we have been using this quite a bit over the years is the setting of lipid nanoparticle formulation optimization. What is lipid nanoparticle formulation? Well, if you've gotten any of the mRNA COVID vaccines, Pfizer or Moderna, then you've gotten lipid nanoparticles. What these do is take a mixture of four different lipids, summarized here, and they form these little bitty bubbles. Those are electrically charged, and they carry along with them a genetic payload, mRNA, DNA, or other payloads, that either acts as a vaccine or can target cancer cells. The electric charge between the genetic payload and the opposite charge in the nanoparticle is what binds them together. Then we want it to get through the cell and release the payload inside. The formulation changes depending on what the payload is. Also, sometimes we might change the type of ionizable lipid or the type of helper lipid to see which one does better, so we have to redo this process over and over. For the most part, the scientists have found that ranges of maybe 10-60% for the lipid settings, and a narrower range of 1-5%, are the feasible ranges for this process. That's been explored out, and that's the geometry we want to match here in our application. We want to say, given that structure that we're doing over and over, do we have an ideal analysis and design method for it? Also, we want to set up a simulation so that if we're looking at other structures, other geometries for the factor space, maybe we can generalize to those, but that's going to be our focus for right now.
Given that background, I'm going to let Jim summarize the SVEM approach and talk about that. Yes, thank you. This particular Discovery presentation is, unfortunately, a little more centered on PowerPoint, because the results of the simulations we've done are really what this is all about. But in this session I will show you some of the new capability we have in JMP 17. Say I want to set up a space-filling design. Previously, we weren't able to do mixture models with space-filling designs right out of the box, if you will. We certainly could put constraints in there, but now what we want to do is show you how you can do a space-filling design with these mixture factors. What's new is that these now come in as mixture factors, which is good because each carries all of its column properties with it as well. One thing worth mentioning right now is that the default is 10 runs per factor, so 40 runs. In a typical DOE that is good and we are happy, because our power is well above 90% or whatever our criterion is. But that's not the case [inaudible 00:09:54] these mixture models, because there are so many constraints inherent in them. What that is telling me, unfortunately, is that even if I were to have 40 runs, I'd still only have 5% power for the Scheffé Cubic model, and even if it's main effects only, there's only 20% power. Power is not really a design criterion we're going to look to when we do these mixture models. Now, typically in our applications, unfortunately, we don't have the luxury of 40 runs. In this case we'll do 12 runs and see how that comes out. We'll go ahead and make that space-filling design, and you can see that it's more or less evenly spread throughout the surface. Of course, we do know that we're bounded on some of these; we can only go from 1-5% on the polyethylene glycol. What I want to do now is fast-forward. Let's say I've run this design and I'm ready to do the analysis. This is where SVEM has really made huge headway, and if you listen to some of Chris and Phil Ramsey's work out there on the JMP Community, you'll see this is a step change, a game change, in terms of your analytical capability. How would we do this in JMP 16? In JMP 16, we'd have to come through and make the autovalidation table, and it's actually worth going through that step because it gives you a little insight, even though the primary mission of this talk is not SVEM itself; it will give us an idea of what's going on. This is the JMP 16 methodology. What you'll note is that we've gone from 12 runs to 24; we just doubled them, and you see the validation set. The training set may be the first 12, the validation set the next 12. That's what's added, and then we have this weight. This is the bootstrap weight, and it's fractionally weighted. What happens is we go ahead and run this model and come up with a predicted value, but then we need to change these weights and keep doing this over and over for our bootstrap, much like the random forest idea, for SVEM. Now, it is useful to take a quick look at the geometry of these weights. We can see they're anti-correlated, meaning that if I'm low in the training set, I'm probably going to be high in the validation set. This is a quick visual of that relationship; a rough sketch of one such weighting scheme appears below. Now I'm ready to go do my analysis in JMP 16.
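As a rough illustration of the anti-correlated fractional weights just described, here is a hedged JSL sketch using one weighting scheme described in the SVEM literature (exponential weights drawn from a shared uniform value); the exact scheme JMP uses when building its autovalidation table may differ.

    // One draw of anti-correlated fractional bootstrap weights for n = 12 rows
    n = 12;
    trainWt = J( n, 1, 0 );
    validWt = J( n, 1, 0 );
    For( i = 1, i <= n, i++,
        u = Random Uniform();
        trainWt[i] = -Log( u );       // small u gives a large training weight...
        validWt[i] = -Log( 1 - u );   // ...and a small validation weight (anti-correlated)
    );
    // Sample correlation between the two weight vectors (should be strongly negative)
    r = Sum( (trainWt - Mean( trainWt )) :* (validWt - Mean( validWt )) ) /
        ((n - 1) * Std Dev( trainWt ) * Std Dev( validWt ));
    Show( r );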
We'd go to Analyze and just do our Fit Model. Of course, we want Generalized Regression, and we'll go through and do a Scheffé Cubic here, because it's a mixture. But here's where we have to add in the extra step: we put the validation set in as the validation column, and then this validation weight is going to be that frequency. Now I can run this. By the way, in many of our instances we're not normal, we're lognormal, and we could put that in right there. Here we have our Generalized Regression ability to go ahead and run this model, and voila, there are the estimates. What we would do then is come here under Save Prediction Formula. Then here is one run. Okay, so we got one run. You can see that the top is 15.17, and we actually saw 15.23, so not bad for this model, but we would do this over and over. We used to do it about 50 times or so. But with JMP 17, this whole process is now automated for us. We don't have to do this 50 times and then take the average of those prediction formulas; we're able to go directly at it. If I come back to my original design here with the response, I can get right at it. By the way, this is showing that I have another constraint put in here. A lot of times the chemists and biochemists like to see that, to make sure that the ratios based on molecular weights are within reason. Not only do we have the mixture constraints, we also have a lot of other constraints. I'm working with a group where we have maybe 15 different ingredients and probably 30 constraints in addition to the mixture constraints, so these methods work and scale up pretty well, probably is the best way to say it. Now, this is JMP 17, so in 17 I can get right at it. I'm going to go into Fit Model, and I'll do a Scheffé Cubic. From here, what we're able to do is come into Generalized Regression. In this case, we don't need to worry about these settings in here; we can change it to lognormal if we so desire. One of my choices for the estimation method, instead of Forward, is in fact SVEM Forward, so I choose SVEM Forward and I'm going to do 200. You'll see how quickly they have this tuned. Really, the only thing you can do in the advanced controls is check whether or not you want to force a term in. I hit Go, and almost instantaneously I've done 200 bootstrap samples for this problem. Of course, I now can look at the profiler, and that is the average of the 200 runs. That's my end model, if you will. Of course, with the Prediction Profiler there are hundreds of things you can do from here, and Andrew will touch on a couple more of those. But two other things are worth noting here: I'll save the prediction formula as well and take a look at it. When I look at the prediction formula, I'll note that it is in fact already averaged for me here; this is the average of the 200 different samples that are out there. With that, that is the demo, and we'll go back to the charts to ask, well, what is it that we're seeing in terms of the results of SVEM? Andrew, if you want to pull up that slide. This is a quick visual. You can see, if I look at those first three, that in this case red is bad. What we're looking at here is the nominal coverage for a mixture model at the predicted optimum. We can see that the standard stepwise approaches are not doing too well; that's the backward and forward AICc. These are the coverage rates.
We'd like a nominal 5% error rate, the rate at which the true response is not actually inside the prediction interval, or actually the confidence interval. In other words, the profiler we just saw gives us a prediction or confidence interval, and we know the true value because we're playing the simulation game, right? So what percentage of the time was the true value in that interval? We can see that we don't do as well with our classical methods. The full model, putting all the terms in, and the Lasso do pretty well, at a 10% rate or so, but it's not until we get to the SVEM methods that we start seeing that we're truly capturing the true value and getting good coverage. That's a good picture to keep in mind: we are way outperforming some of the other methods out there when it comes to this capability. Now, what we're focusing on in this simulation is a little different from what you might usually think of, where we're looking at a model and asking, "How well did we do with this particular method?" We could measure that by how the actual versus predicted looks, and then we'd get some sort of mean squared error. We do track that value, but we find in our work that we're much more concerned about finding the optimal mixture, if you will, the optimal settings that achieve a maximum potency, or minimize some side effect, or help us with this [inaudible 00:18:58]. That's going to be called the "percent of max" that we're looking for, and we're going to use it as our primary response in evaluating which methods outperform others. It's not really going to be, how far away am I from the location of the optimal value? It's, how far is the response at the settings I predicted as optimal from the actual optimum response? That's going to be our measure of success. The way this will work is that I'll be asking Andrew about a few questions that typically come up in practice. I saw the geometry he showed early on, where the optimal design always hits the boundaries. What if I like things more in the middle, more mixed, space filling; which is better? If I do use an optimal design, it defaults to D, maybe, but what about I and A? Then how about the age-old DOE question of adding center points? Is that smart? Is one center point enough? How about replicates? We've already discussed how power is not helpful here, so what is a helpful measure of a good design? That's the design piece. On the analysis piece: is there a particular method that outperforms all the others, or are there certain areas where we should focus on using Lasso and others where we should just use SVEM forward selection? These are practical questions that come up from all of our customers, and we'd like to share some of the results we get from the simulation. Andrew, do you want to give us a little more insight into our simulation process? Yeah, thanks, Jim. Before I do that, I just want to point out one tool that we've made heavy use of in the analysis of our results. Unfortunately, we don't have time to delve into a demo, but it has been so useful: within the profiler, using Output Random Table for these mixture designs and looking at the responses. We frequently have potencies along with side effects; we have multiple responses that we want to balance with the desirability function, and then we're going to look at the individual responses themselves.
When we output a random table, we get a space-filling set of points, basically not a design, but we fill up the entire factor space, and we're able to look at the marginal impact of each of the factors over the entire factor space. For example, for the ionizable lipid type, what we'll frequently see is that maybe one has lower marginal behavior over the entire space. But since we want to optimize, we care about what the max of each of these is, and one of them will clearly be better or worse. We're looking at the reduced model; after it's fit, we go to the profiler and do this. We can still get the analytic optimum from the profiler, but in addition, this gives us more information beyond just that optimum. What we might do here for candidate runs, because we're always running confirmation runs after these formulation optimization problems, is run the global optimum here for H102 and also pick out the conditional optimum for H101, and see which one does better in practice. Also, looking at the ternary plots, if we color those by desirability or by the response, we can see the more and less desirable regions of the mixture space, and that can help us as we either augment the design to include additional areas of the factor space or exclude areas. I can't do much more with that right now, but I wanted to point it out because it's a very important part of our analysis process. How do we evaluate some of these options within this type of factor-space geometry? We built a simulation script, which we have shared on the JMP website, and it allows us to plug and play different total sample sizes: how many runs are in the design? We have a true-form choice that gives us the true generating function behind the process, and a design type, either space filling or optimal. The optimal design is going to be of a certain minimum size based on the number of effects we're targeting: do we have a second-order model, a third-order model, a Scheffé Cubic model? Normally, whenever you build a custom design in JMP, it writes a model script out to your table and then you use that to analyze it. Something we've explored is allowing a richer model than what we targeted: are we able to use these methods with SVEM and get improved results, even though we didn't originally target those effects in the design? The short answer is yes. That's something else we want to consider, so we allow ourselves, with the effects choice, to include additional effects. We can look at the impact of adding replicates or center points, using the custom DOE dialog to enforce those. How does that affect our response? Because the summaries you get out of the design diagnostics all target the full model, whether with respect to prediction variance or, for a D-optimal design, the standard errors of the parameters. But what we really care about is how good the optimum we get out of this is, so that's what we're going to take a look at with these simulations. For the most part, in these LNP optimization scenarios, we'll come across two situations. The scientists will say, "I've got about 12 runs available, and maybe it's not that important a process, or the material is very expensive, and I just need to do the best I can with 12 runs. That's what I've got."
Or it might be a situation where they've got a budget for 40 runs, they can fit a full second-order model plus third-order mixture effects, and we want to characterize the entire factor space and see what the response surface looks like over the whole thing. Those are the two scenarios we're going to target in the simulation. Jim, I think you had some questions about performance under different scenarios. What was your first question?

I did. When I think about a 12-run scenario, if I just go with the default, I'd get a D-optimal design with main effects only. I recognize I could do the space filling design like I just did, but my question is: if I use the default, which analysis method would be preferred? Is there one?

Okay, taking a look at that. As a general rule, the D-optimal design puts almost all of its runs along the boundary of the factor space, with no interior runs unless you have quadratic terms or something else that requires them. With a 12-run design, there are nine degrees of freedom required to fit all the main effects, so we have a few degrees of freedom for error, but we're mostly targeting main effects only. How do the analysis methods do? Is there any difference between them? In all of these summaries we show the distribution of percent of max across the simulations for each analysis method, here for the 12-run D-optimal design targeting the default effects. We also show which differences are significant, using Student's t with no Tukey-style multiple-comparison adjustment, so keep that in mind when you look at the significance values. The winner here is our homemade SVEM neural approach, because it isn't restricted to main effects only; it can allow some additional complexity in the model, so it wins here. Don't get too excited about that, though, because this is about the best we've seen SVEM neural do, in these small settings. If we're running more than one candidate optimum going forward, maybe we include a SVEM neural candidate, but in general we wouldn't recommend relying only on SVEM neural, because it tends to be more variable and have heavier lower tails. What about the other results? The losers in this application are the single-shot model-reduction methods, because all of these effects are truly in the model, and any time we pull one out we get a suboptimal representation of the process. That's why the full model does better than those here. What's interesting is that the SVEM linear approaches at least match the full-model performance, so we're not losing anything by using SVEM in this application; that's a nice aspect, since we don't have to worry about hurting ourselves with AICc in the smaller setting.

Now, something else we tried: given the same 12 runs, targeting only the main effects with the D-optimal criterion in custom DOE, what if we allow the fit model to consider second-order effects plus the third-order mixture effects, more than the design was targeted for? What happens, and we see the jump here, is that the SVEM linear methods are able to use that information and give us better percent of max for the candidate optima, so now those SVEM linear methods are our winners.
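Since SVEM comes up repeatedly from here on, a brief aside for readers without JMP Pro: below is a minimal Python sketch of the self-validated ensemble idea as it is commonly described, a fractionally weighted bootstrap with anti-correlated training and validation weights, using a Lasso base learner. This illustrates the general technique, not the presenters' script or JMP's implementation; the data and the grid of penalty values are made up.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Made-up data: n runs, p candidate model terms (already expanded), one response.
n, p = 17, 12
X = rng.uniform(size=(n, p))
y = 3 + X @ rng.normal(size=p) + rng.normal(scale=0.2, size=n)

alphas = np.logspace(-3, 0, 20)   # candidate Lasso penalties
n_boot = 200
coef_sum = np.zeros(p)
intercept_sum = 0.0

for _ in range(n_boot):
    u = rng.uniform(size=n)
    w_train = -np.log(u)          # fractional bootstrap weights for fitting
    w_valid = -np.log(1.0 - u)    # anti-correlated weights for self-validation

    best_sse, best_model = np.inf, None
    for a in alphas:
        m = Lasso(alpha=a, fit_intercept=True, max_iter=10000)
        m.fit(X, y, sample_weight=w_train)
        resid = y - m.predict(X)
        sse = np.sum(w_valid * resid ** 2)   # validate on the same runs, reweighted
        if sse < best_sse:
            best_sse, best_model = sse, m

    coef_sum += best_model.coef_
    intercept_sum += best_model.intercept_

# Ensemble model: average the selected models across bootstrap draws.
print(np.round(coef_sum / n_boot, 2), round(intercept_sum / n_boot, 2))
```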
What we see is that, interestingly, the base methods underneath these SVEM approaches, single-shot forward selection or the Lasso, are not able to make use of that richer model; only SVEM is, which is a nice property. They actually beat out neural, which is nice because these are native to JMP 17 and don't require as much computation time or manual setup as the neural approach does. And here we start to see the theme that runs through the rest of the talk: any of these Lasso approaches with no intercept give suboptimal results, because without the intercept the penalization doesn't work right in the Lasso. You want to turn off the default no-intercept option if you're going to use SVEM Lasso, or even plain Lasso.

Okay, so it looks like SVEM neural did well there. But again, that's not built in; we can't do that directly in JMP 17 Pro, it has to be scripted manually. Yeah, it's not a menu option. Okay, this is good, but I'm also a fan of the space filling design, so how does that play out with the analysis methods?

For the space filling design, rather than having all the points along the boundary, we now fill the interior of the space for both the mixture factors and the process factors. Sometimes in practice we'll take the process factors and round them to the nearest 0.25 or 0.5, or whatever granularity works best for us, but this is what it looks like. In terms of results, we now compare the combination of design approach and analysis method, still allowing the richer second- and third-order model for the selection methods, and see which does best. The winners are the SVEM linear approaches, forward selection and Lasso only with the intercept, not without, paired with the D-optimal design. Behind the scenes, remember that with the D-optimal approach you're assuming your posited model is true over the entire factor space and that you have constant variance over that space. If you're worried about failures along the boundary, that's something else to take into account; it isn't built in, so you have to consider it. But if you are confident, maybe you've run this before and are only making minor changes, then the way to go is D-optimal with the SVEM approaches. Down here, the losers are the Lasso with no intercept; we're going to avoid those, and you can see the heavy tails. Not the SVEM Lasso, just the Lasso. Actually, here's the SVEM Lasso with no intercept down here too. Yeah, they all get those Fs, so they all fail. -Conveniently. [crosstalk 00:30:48] -Yeah.

Okay. What often comes up, whether it's the design up front where we've done our 12 runs, is that the boss has more questions and we have more runs available. If we're going to do five more runs, how does that affect these results? When you say five more runs, not a follow-up study, but building the study as either 12 or 17 runs in a single shot, is that what you're considering? Yeah, exactly. Okay, so we can look at the marginal impact, because there's a cost to those extra five runs. What's the benefit of those five extra runs?
Using the design evaluation tools, you could look at the FDS plot; with the extra runs the FDS curve is lower, reflecting smaller prediction variance. Power is not that useful for these mixture designs; we don't care about the parameters, we want to know how well we would do with optimization, and that's where the simulation is handy. How does the distribution of percent of max change as you go from 12 to 17 runs? Interestingly, there's no benefit for the single-shot forward AICc approach in having 17 versus 12 runs. Now, we're looking at percent of max here; if you look at your error variance, your prediction variance will be smaller, and there might be some other [inaudible 00:32:09], but we don't care that much about prediction variance. We want to know where that optimum point is, because after this we're going to run confirmation runs, and maybe replicates at that point, to get an idea of the process and assay variance. Right now we're just scouting the response surface to find the optimal formulation, and with that goal in mind there's no benefit for forward AICc. For the SVEM methods, we do see a significant difference: a significant improvement in the average percent of max we obtain, and maybe not as heavy a lower tail. But you need to decide whether that's practically significant. Do you want to move from a 90% to a 92% mean percent of max in this first shot at the cost of five extra runs? You have to do your own marginal cost-versus-benefit analysis as a scientist and decide if it's worth it. Just looking at it here, and since you have to run confirmation runs anyway, what I think might be useful is to run the 12-run design, then run a candidate optimum or two based on the results, plus a couple of additional runs in a region that looks promising, or even augment out the factor space a little. You're still running a total of 17 runs, but now you'll have an even better sense of the good region, so that's something to consider.

Something else we can see from the 17-run simulation: if we look at the performance of each fitting method within each iteration, there's a surprisingly low correlation between methods within an iteration. We can use that to our benefit, because we're going to run confirmation runs afterward. Rather than taking one method and one candidate optimal point, if we take the candidate optimum from each of, say, four methods and carry all of them forward, we get to keep whichever one does best. Looking at the maximum across methods, instead of a mean of 92% to 94%, we're now looking at a mean of about 97% with a smaller tail.

Okay, very useful. Let's now turn to the 40-run designs. That was very good information for my smaller-run designs; with 40 runs, how does it play out for these analysis methods? Are we going to see behavior consistent with the 12-run design? And how about space filling versus the D-optimal design? -I'd be interested in that. -Okay.
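Before moving on to the 40-run designs, a small editorial sketch of that "carry several candidate optima forward" idea, assuming we already have a matrix of simulated percent-of-max values (rows are simulation iterations, columns are analysis methods); the numbers here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented percent-of-max results: 1000 simulated iterations x 4 analysis methods,
# treated as weakly correlated across methods within an iteration.
n_iter, n_methods = 1000, 4
pom = np.clip(rng.normal(loc=[92, 93, 94, 92], scale=5, size=(n_iter, n_methods)), 0, 100)

per_method_mean = pom.mean(axis=0)
best_of_all = pom.max(axis=1)   # run each method's candidate optimum, keep the best

print("per-method means:", np.round(per_method_mean, 1))
print("mean of best-of-four:", round(best_of_all.mean(), 1))
```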
Well, first take a look at the D-optimal design with 40 runs. Now we're targeting all of the second-order effects and the third-order mixture effects, all the effects that are truly present in the generating function, and we still see that the runs are loaded up on the boundary of the factor space. With the space filling design, we fill the interior of the factor space for the mixtures and for the other continuous process factors. Let's see the performance difference. Focusing first on the space filling design, which analysis methods do best? The same as in the 12-run example: SVEM linear, forward selection with the intercept, and Lasso with the intercept do best. The worst you can do is keep the full model, or try SVEM or single-shot Lasso with no intercept. In the D-optimal setting we get the same winners, which is reassuring, because we don't have to worry that changing the design type means changing the analysis type; it's good to see that consistency among the winning analysis methods. The full model doesn't do as poorly with the optimal design, I think because the optimal design is targeting that model, and the losers are still the Lasso with no intercept. Neural really falls behind the other methods here.

Now let's compare the space filling to the D-optimal designs, and the biggest difference is with the full model: the space filling designs are much worse than the D-optimal design. Any time you do design diagnostics, that's all within the context of the full model, your D-optimality criterion, your average prediction variance, and when you run those comparisons you'll see a stark difference, which is what you're seeing here. However, in real life we're going to run a model-reduction technique. Even the single-shot methods improve on the full model, but especially with SVEM the gap between the space filling and the optimal design really closes: the medians are pretty close, with a slightly heavier tail for space filling. Now you can look at this and say, "Okay, I lose a little bit with the space filling design. But if I have any concerns at all about the boundary of the factor space, or if I'm limited in how many confirmation points I can run and want something that won't be too far from the candidate optimum I carry forward, those are the benefits of the space filling design." Now we can weigh those out; we're not stuck with a drastic difference between the two.

Again, that's space filling versus the D-optimal design only. In a lot of our DOE work we like the I-optimality criterion, and even A-optimality has done really well for us. It certainly isn't space filling, but at least it spreads the runs out a bit more than D-optimal. Do we have any idea how the I- and A-optimal designs work here? Yeah, we can swap those into the simulations. One thing we've noticed: I love the A-optimal designs in the non-mixture setting. They're almost my default now; I really like them.
But in the mixture setting, whenever we try them, even before the simulations, the design diagnostics for the A-optimal design never look as good as the D- or the I-optimal, and that bears out here in the simulations: the A-optimal, in blue, gives inferior results. The rule of thumb is, don't bother with A-optimal designs for mixture designs. For D versus I-optimal, in this application, for this generating function, we don't see any difference between them. However, a reason to slightly prefer the D-optimal is that there tend to be convergence issues in these LNP settings, where you've got PEG down around 1.5% and you're trying to target a Scheffé cubic model in JMP; we've noticed occasional convergence problems for the I-optimal designs, and they take longer to build. If there's not much benefit, the D-optimal seems the safer bet. We weren't able to test that in the simulations, because right now in JMP you can't script Scheffé cubic terms into the DOE dialog to build an optimal design; you have to do it through the GUI. So we couldn't measure how often that happens, but that's why we've carried the D-optimal forward in these simulations and stick with it. In your own applications you can try both D and I and compare them graphically and with the diagnostics, but the D-optimal seems to be performing well.

Okay, pulling the thread a little further: a lot of times we'll try some type of hybrid design. Why not start with, say, 25 space filling runs and then augment with a D-optimal criterion to make sure we can target the specific parameters of interest? Does that work out well? Yeah, we can simulate that and take a look. With the same generating function we've been using, we can run the D-optimal, the space filling, or a hybrid where we start with 25 space filling runs and then augment, building 15 additional runs targeting the third-order model. What we see is that there's no significant difference in the optimization between the 40-run D-optimal and the hybrid design, but with the hybrid we get the benefit of those 25 space filling runs: some interior runs, protection to fit additional effects, and protection against failures along the boundary. It's a bit more work to set up, so we do it only for high-priority projects because of the extra cost and time, but it does appear to be a promising approach. Right, and practically, if you think about where your optimum is going to be, there's a good chance it's in that interior space that the D-optimal design, sitting along the boundaries, doesn't fill.

I guess going back and revisiting the earlier ideas: what if I had a center point, or a point I could replicate, say on the 40-run design? Any other little nuggets we learned along the way? Well, this comes up a lot, because the textbook will tell you to add five to seven replicate runs, and the scientists are going to kick you out if you try to do that.
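A brief aside on the Scheffé cubic model mentioned above: the sketch below expands a three-component mixture into full Scheffé cubic model terms (linear blending, pairwise cross products, the x_i*x_j*(x_i - x_j) terms, and the three-way product). It is a generic construction in Python, not JMP's internal code, and the column names are made up.

```python
import numpy as np
from itertools import combinations

def scheffe_cubic_terms(X):
    """Expand mixture proportions X (n x q, rows summing to 1) into full Scheffe
    cubic model terms. No intercept is used in Scheffe mixture models."""
    n, q = X.shape
    cols = [X[:, i] for i in range(q)]
    names = [f"x{i+1}" for i in range(q)]
    for i, j in combinations(range(q), 2):
        cols.append(X[:, i] * X[:, j])
        names.append(f"x{i+1}*x{j+1}")
        cols.append(X[:, i] * X[:, j] * (X[:, i] - X[:, j]))
        names.append(f"x{i+1}*x{j+1}*(x{i+1}-x{j+1})")
    for i, j, k in combinations(range(q), 3):
        cols.append(X[:, i] * X[:, j] * X[:, k])
        names.append(f"x{i+1}*x{j+1}*x{k+1}")
    return np.column_stack(cols), names

X = np.random.default_rng(3).dirichlet(np.ones(3), size=5)
terms, names = scheffe_cubic_terms(X)
print(names)        # 3 linear + 3 quadratic + 3 cubic-difference + 1 three-way = 10 terms
print(terms.shape)  # (5, 10)
```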
A lot of times we have to make the argument to add even a single replicate run, because it has advantages outside of the fitting: you get a model [inaudible 00:41:09], and graphically we can use it as a diagnostic, comparing that error variance to the total variance from the entire experiment. It's very useful to have, so it's nice to be able to say, "We're not hurting your optimization by including even a single replicate run." That's what we see here for the 40-run example, forcing one of the runs to be a replicate within custom design: no significant difference at all in terms of optimization, neither helping nor hurting. So let's go ahead and do it, and have that extra piece of information going forward. I don't have the graphs for the center point case because they're boring; it's the same thing in this particular application, and forcing one run to be a center point makes no difference. Part of that may be that the D-optimal design was already giving us a center point, or something close to one, so we weren't changing the design much. You might see a bigger difference if you go back to the 12-run design and force a center point. But that's the advantage of having a simulation framework built up: you can take a look and see what the practical impact of including it will be.

Okay, now how about constraints? I mentioned I have this big project with lots of constraints. Would a constraint change any of these results? We can include the constraints, and they change the allowed region; graphically you'll see the change in the allowed region, and we can simulate that. I've actually done it. I don't have the graph with me right now, but there isn't much of an impact, and SVEM still does well. One difference we did note when constraining the region is that the space filling design improved, because it has a smaller space to fill and less wasted space, while the D-optimal performed about the same with or without the constraint. That was interesting to see. But all of this applies just as well with constraints, and there's nothing of note in terms of differences between analysis methods under the constraints, at least the relatively simple ones we applied.

Right, okay, we're running short on time, Andrew, but I do have one more concern, a misspecified model, and then we'd like to wrap up and leave the folks with a few key takeaways. Here's an example where the true functional form does not match any of the effects we're considering, and the surface is relatively flat along the perimeter, which is where a lot of those optimal design runs are going to sit, so let's see how that works out. Also note the [inaudible 00:43:52] cholesterol is set to a coefficient of zero in the true generating function. Taking that true function into the profiler and outputting the random table, you can see how nice it is to be able to plot these things and see the response surface using that output. Here's the true response surface, and what's interesting is that there's an illusion here: it looks like cholesterol is impactful, it looks like it affects the response, but in reality its coefficient is zero. The reason it looks that way is the mixture constraint.
That's why it's hard to parse out which individual mixture effects really affect the response. But we're not as concerned about that as we are about finding a good formulation going forward. In this setting we add a little bit of noise, a 4% CV, which is used frequently in the pharma world; the mean we're using is the mean at the maximum, which in this case is one. We also try a much more extreme 40% CV. That one looks more like sonar trying to find the Titanic; hopefully none of your pharma applications look like this, but we want to see how things work out in that extreme case. What we see in the small 12-run example, with relatively small process-plus-assay variation, is that with these baseline designs, SVEM and the other methods all perform about the same. If we go up to 40 runs, the space filling design isn't able to keep up as well, and the D-optimal really does better now; even though the surface is relatively flat out at the edges, where most of its runs are, it's able to pick up the signature of the surface. Here's the difference between the full model for the space filling and the D-optimal; there's not as big a difference for the SVEM methods, though you still have a few tail points down here, and none of them performs as well as the SVEM linear, even though SVEM linear is only approximating the curvature of that response surface. In the super-noisy case, nobody does a really good job, but your only chance is with the space filling approaches. Then, at the larger sample size, even in the face of all that process noise, the optimal design is able to average out over the noise better and make better use of those runs than the space filling design. So a couple of considerations: What's your run size? How saturated is your model? How much process variation do you have relative to your process mean? All of that goes into the balance of space filling versus optimal.

If we look at the candidate optimal points we get out of the space filling design, we're on target for the ionizable and helper components for all of our approaches except the Lassos with no intercept. Those are never on target; they always push you off somewhere else, and you can see graphically what that lack of an intercept does. If we allow the intercept, then we're on target. It really is important to uncheck that no-intercept option for the Lasso.

For the people who are not using JMP Pro and don't have SVEM, you might ask: in our simulations, what is better, AICc versus BIC versus a p-value rule? Unfortunately, with the number of simulations we've run, there isn't as consistent an answer as there is with SVEM. If you've got a large number of runs, whether the model is correctly specified or misspecified, forward or backward AICc do well and the full model does worse, whereas in the smaller setting the full model does better because all those terms are relevant. The same goes for the p-values: the same threshold can do the worst in one setting and the best in the large setting. There's no consistency in which p-value to use, 0.01, 0.05, or 0.1. The p-value, from this point of view, is an untuned optimization parameter, so it's probably best to avoid it and stick with AICc if you're in base JMP.
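Returning to the noise model described at the start of this passage, here is a small sketch that adds response noise at a specified coefficient of variation, with the standard deviation anchored to the response mean at the maximum, as the speakers describe; the true responses are a made-up stand-in.

```python
import numpy as np

rng = np.random.default_rng(4)

def add_cv_noise(true_values, cv, anchor_mean):
    """Add Gaussian noise whose standard deviation is cv * anchor_mean,
    where anchor_mean is the true response at the optimum (here, 1.0)."""
    sd = cv * anchor_mean
    return true_values + rng.normal(scale=sd, size=len(true_values))

true_values = np.array([0.55, 0.72, 0.90, 1.00, 0.81])   # invented true responses
print(np.round(add_cv_noise(true_values, cv=0.04, anchor_mean=1.0), 3))   # 4% CV
print(np.round(add_cv_noise(true_values, cv=0.40, anchor_mean=1.0), 3))   # 40% CV
```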
However, we have seen that for these optimization problems, the SVEM approaches give you almost universally better solutions than the single-shot methods. You can get better solutions with JMP Pro, with SVEM. Great. I guess we want to wrap up with some of the key findings, Andrew. Yeah, and Jim, any other comments about these optimization problems, or interesting things we've seen recently? We have; we're up against time for sure, but we've done some pretty amazing things. We've come up with new engineered lumber that's better than it's ever been, and propellants with physical properties and performance we haven't seen before. We've taken a real leap in the capabilities we've seen from our mixture models. Can we summarize with the highlighted bullet down there: SVEM seems to be the way to go, and if you only had one method, SVEM forward selection would cover you pretty well? Yes, that's right. Even though the Lasso with the intercept sometimes looks slightly better, maybe in one or two cases significantly better, it's always neck and neck with forward selection, and I'm always scared I'll forget to turn off the no-intercept option and give myself something as bad as, or worse than, the full model. SVEM forward selection with the default settings seems like a good, safe way to go. Perfect. Well, with that, we stand ready to take your questions.
Labels:
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Sharing and Communicating Results
Using JMP to Analyze and Speed-Up a Milling Operation During Process Scale-Up (2022-US-30MP-1139)
Monday, September 12, 2022
Many industries (particularly pharmaceuticals) use milling processes to reduce the particle size of key raw materials. The aim of the milling process is to reduce the average particle size and achieve a targeted particle size distribution (PSD). While there are many types of milling techniques, the key performance indicators (KPIs) are typically mill time, PSD, average particle size, and other industry-specific targets. Bringing a milling process from lab scale to manufacturing scale can present many challenges. Process factors, such as heat transfer, mass transfer, milling times, and additive amounts, can all differ substantially from the small-scale process. Thus, a thorough understanding of the milling process is required to maximize a successful scale-up. This talk begins with a description of typical challenges seen in milling scale-up operations. We follow this with an analysis of a sample data set that demonstrates how JMP can be used to quickly and efficiently resolve these problems. We use data visualization, definitive screening designs, augmented DOEs, and functional data exploration to help with the scale-up process.

-Hi, I'm Jerry Fish. I'm a technical engineer with JMP, covering several Midwestern states in the US. I'm joined today by one of my colleagues, Scott Allen, who is also a technical engineer for JMP and supports several other Midwestern states. Hi, Scott. -Hey, Jerry, good morning. -Good morning. Today we want to talk about a relatively new way to analyze data, specifically from milling operations, where we want to learn about milling to help with scaling up a milling process. -We both have a strong interest in process optimization using DOE and in modeling response curves using the Functional Data Explorer in JMP Pro, and we wanted to bring those together for today's presentation. One of the first examples I saw of this sort of analysis was the milling DOE in the sample data library, where the goal is to optimize an average particle size. As we were talking about possible topics for today, we thought it would be interesting to extend that: instead of optimizing just the average particle size, could we actually optimize the particle size distribution response curve? -Milling has many different applications. You'll find it in everything from mining and food processing to making toner in the printing industry and making pharmaceutical powders. At the most basic level, a milling process is used when we want to reduce the particle size of a substance and produce uniform particle shapes and a uniform size distribution from a starting material. Often some type of grinding medium is added to accelerate the process or to control the size and shape distribution of the resulting particles. In each application the desire is to get the right particle size, say the median or the mean particle size, with a controlled, predictable particle size distribution. In the scenario we discuss today, we have an existing manufacturing milling operation that produces a pharmaceutical powder. It has good performance today, creating the right median particle size and a narrow particle size distribution. Full disclosure: this scenario and the resulting data are invented; Scott and I didn't have access to non-confidential data to present in this paper. However, even though the data are fabricated, the techniques we're about to show are applicable to real-world problems. The picture on the left shows a typical agitated ball mill.
The material to be milled enters at the top and continuously flows into the vessel, where an agitator rotates the material and the grinding medium to affect particle size. The resulting particles are then vacuumed out of the vessel as they're milled. Management has said they need to increase production output, something I'm sure we're all familiar with. They are considering building a new milling line, but before investing all that capital, they'd like to investigate whether we can simply increase the throughput of the existing equipment. Manufacturing made some attempts at doing that by adjusting the process, and while they can affect the median particle size, the new output has odd particle size distributions. So manufacturing came back to R&D, where Scott and I are, and asked us to go to the pilot lab and see if there's any combination of settings that might improve throughput. Scott, what parameters did we look at? -In this case, we're going to use six different factors for the DOE. There are four continuous factors: agitation speed, the flow rate of carrier gas, the media loading as a percentage of the total, and the temperature of the system. And there are two categorical factors: the media type and the mesh size of a pre-screen process. Determining those factors is fairly straightforward; they are known to affect particle size distributions. But the response is still a challenge, and I'm not sure how you would go about this if you couldn't model the response curve like we're going to do. So Jerry, in your experience, how would you have done this before? -This next slide shows a typical particle size distribution, or particle size density plot, plotted as percent per micron versus size. You can think of this as a particle count on the y-axis, or a mass distribution; it's just a histogram representing the distribution of particles. What you're seeing on screen might be a good particle size distribution: it has a nice narrow shape with a peak at the desired median particle size. But how do we characterize this distribution if we want to run a test to adjust it? We might characterize the location with the mean, median, or mode of the distribution, and the width with a standard deviation or a width at half peak height; those are typical measures. But when manufacturing turned various production knobs in their process to speed up throughput, they saw varying degrees of asymmetry in the distribution. Maybe this was due to incomplete milling of the pharmaceutical material, or perhaps there were temperature effects that caused particles to agglomerate; we don't really know. But now the half width isn't really representing the shape of the new curve. So we might turn to calculating percentiles along the curve: the 10th percentile of particles falls below this point, 20% below this point, 90% below this point, and so forth. It gets even tougher when we're trying to describe a shape with two very pronounced peaks, or this shape, which I tend to call a haystack, where it's very broad and doesn't have tails. What do we do with that? This parameterization technique doesn't seem to be the best way to approach the problem. Scott, I know we have to do some experimentation, but how are we going to approach this in today's analysis?
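As an aside, for readers who want to see what that percentile-based parameterization looks like in practice, here is a minimal sketch that computes distribution percentiles (for example D10, D50, D90) from a particle size density curve by accumulating it into a cumulative distribution. The curve here is invented; it is not the presenters' data.

```python
import numpy as np

# Invented particle size density: percent-per-micron values on a size grid.
size = np.linspace(1, 200, 400)                     # microns
step = size[1] - size[0]
density = np.exp(-0.5 * ((size - 60) / 15) ** 2)    # a single-peaked PSD shape
density /= density.sum() * step                     # normalize so the area under the curve is 1

cdf = np.cumsum(density) * step                     # fraction of particles below each size

def percentile(p):
    """Size below which a fraction p of the distribution falls (D50 at p=0.5)."""
    return float(np.interp(p, cdf, size))

for p, name in [(0.10, "D10"), (0.50, "D50"), (0.90, "D90")]:
    print(name, "~", round(percentile(p), 1), "microns")
```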
-So that's what we're going to do: we're going to use the entire shape. Our response in this case is going to be that curve. We're not going to try to co-optimize all of those different parameters you talked about; we could co-optimize two, three, or four parameters, but instead we're going to use the entire curve as our response. We're going to use all the data, and our target is going to be a hypothetical curve that we want to achieve. So once again, we're not going to target all the individual parameters of the curve; we're going to try to match shapes. We want our experimental shape to match the shape of our target. That's how we're going to get started with the analysis today, and we'll take you through the workflow. So let me go ahead and get into JMP. What we first see here is the DOE that we ran: a definitive screening design with those six factors, although you could use any design you wanted. We've got 18 experiments in this case, and here are all the factors and the factor settings that we used. -That looks pretty standard to me, Scott, for the DOEs I've run in the past. But you don't have a response column. -Well, that's one of the unique things about the response curve analysis; in some cases you set it up a little differently. We don't have a response column and we're not optimizing a single value. Instead, our responses are in this other table. In this case, we've got a very tall data set where the X value is the size and the Y value is the percent per micron, and this is what we're going to plot and optimize. But we do need to get our DOE factors in there, so we took a little shortcut and did a virtual join between the two tables. In our design, the run number is the link ID, and in the response table the run number is the link reference, and that lets us bring in all of the DOE factors. They're all here in the table, just virtually joined, which keeps our response table nice and clean: if there are any modifications, we don't have to worry about copying, pasting, or adjusting all those DOE factors. So that's how we set up our tables, with the DOE factors in one table and the DOE response curves in the other, and you can see all 18 runs here. Before we start the analysis, let's take a look at those curves. I plotted them in Graph Builder, and we've got our target curve here; this is a hypothetical target curve that we want to achieve, labeled experiment number zero. Then we've got experiments one through 18 and the response curve for each. -So you've run 18 experiments and not a single one of them looks exactly like that target. What do we do with that? -That's a good observation. What we can see are some different features among these: some are definitely more broad (I like how you called that one a haystack), some are more narrow, maybe with small shoulders, and the peak shifts left and right a little in some of them. Hopefully we can find settings of those factors that take the best of all of these and give us something narrow, without a shoulder or a bimodal peak. To do that, we need to go into the Functional Data Explorer.
With a traditional DOE, we would go to Analyze and then Fit Model. In this case, we go down to Specialized Modeling and then Functional Data Explorer. When we launch it, we add our Y values, which are the percent per micron; the X values, which are the micron sizes; we identify each function with the run number; and then we add all of the DOE factors as supplementary information. So I'll take all of my DOE factors, add them as supplementary information, and click OK. When you launch the Functional Data Explorer, the first thing you get is a data processing window, where we take a quick look at all of our data. In this initial data plot, all of the curves are overlaid, and you can see our green target curve hiding in there. This just shows how all of the data line up. Over on the right there is a set of functions to help clean up and process the data, and one of the really nice things about this platform is that you don't have to do that data processing in the data table. If you needed to remove zeros, make some adjustment, standardize the range, things like that, you can do all of it here in the cleanup options. In our case the data are pretty clean, so we don't need any processing. What we do need to do is take the green curve, our target curve, out of the analysis. This is the target we're trying to match, so we don't want it to be part of the modeling. To take it out, we go to the target function, click Load, select that zero curve, and click OK. Now it's gone, and it won't be included in the models. We can scroll down and see how each of our individual experiments is plotted. Now that the data are cleaned up, we can go on to the analysis. To do the modeling, we go to the red triangle, where there are several different models to choose from. In a typical workflow you might not know at the beginning which model is best, whether a B-spline, a P-spline, or something else. In the interest of time, we've done all of that already, and we know the P-spline gives us a pretty good model, so we'll fit that. I select P-spline, and JMP creates the models. The first thing we see is something similar to the initial data window: all of our curves are still plotted and overlaid, but now there's a red line representing the mean curve of all of those curves. We can also scroll down to each individual experiment, each with a line of fit, and this is the first indication of how well the model is fitting. If the red lines overlay your experimental data well, you're probably on the right track; if there were a lot of deviation, you might consider a different model. The other thing you'll notice is that there are different fitting functions the spline model can use. In this case there were two that JMP rated as pretty good, a linear model and a step function model, and by default all of the analysis below uses the model with the lowest BIC value.
So in this case all of the analysis is using the linear model, but if you wanted a different one you would just select it (I don't know if you can see it easily, but it's highlighted now), or you can click on the background to go back to the default. That's the modeling side, but we need to check how well the model is fitting, and to do that we go down to the functional PCA window, the functional principal component analysis. This looks a little complicated, but what we want to do is really look at how well this model has been built. We start with this mean curve, the same mean curve calculated in the section above, and what JMP has done is to add shapes to it: it adds some positive or negative portion of each eigenfunction to the mean curve, and over here you can see how much of the variance each one explains. If we had just the mean curve and the first shape, we would explain about 50% of the variance; adding a second function gets us to nearly 79%, and a third gets us to 88%. You can see how much of the variation is explained as more shapes are added. Depending on the type of curve you have, you might need only one function or you might need dozens. It looks like we can explain a lot of the variance here, but it takes about nine functions to get there. Now we want to see how well those combinations of shapes with the mean curve represent our experimental data. To do that, we go down to the score plot and the FPC profiler (I'll make this a little smaller). The FPC profiler shows the combination of all those different shapes; we really just want to pay attention to the top left cell of the grid. This is our modeled curve, built from the mean curve plus all those shapes. Right now, with all of the FPCs set to zero, we just get the mean curve, but if I make the first FPC more positive I'm adding that shape, and I can see how the modeled shape changes; if I go lower, I can see how it changes the other way. By adding and subtracting each of these shapes, I can recreate, or get close to, all of the different experimental curves. Doing that manually would take a while, but there's a nice shortcut. What I like to do is go into the score plot; say I want to see experiment number six, I can hover over six and pin it here. There are nine different functions but we only see two at a time, and I can see that component one is 0.08 and component two is minus 0.03, so I could set these to 0.08 and minus 0.03 and start to reproduce this curve, but I'd need to adjust all of them. The shortcut is just to click on the point: by clicking on six, all of the FPC components are set to the best representation of that run, and we can look at the two shapes and see how close they are.
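To make the functional PCA idea concrete, here is a minimal sketch of decomposing a set of curves into a mean curve plus weighted shape functions using a singular value decomposition. This mirrors the concept described above, not JMP's spline-based implementation, and the curves are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
grid = np.linspace(1, 200, 120)                         # micron size grid

# Invented set of 18 response curves sampled on a common grid.
curves = np.array([
    np.exp(-0.5 * ((grid - rng.uniform(50, 80)) / rng.uniform(10, 25)) ** 2)
    for _ in range(18)
])

mean_curve = curves.mean(axis=0)
centered = curves - mean_curve

# The SVD gives orthogonal "shape functions" (rows of Vt) and per-curve scores.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)
scores = U * s                                          # FPC scores for each curve

print("variance explained by first 3 shapes:", np.round(explained[:3], 3))

# Reconstruct curve 5 from the mean plus its first 3 shape contributions.
recon = mean_curve + scores[5, :3] @ Vt[:3]
print("max reconstruction error:", round(np.abs(recon - curves[5]).max(), 4))
```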
In this case they look pretty good: maybe not quite as much definition, and this part isn't very straight, but it looks pretty good. We can go to another curve, like number seven, click on it, and see how it changes; that one isn't looking quite as good, and we can go to eight. This is what I really like about this platform: it lets you explore the data. It's the Functional Data Explorer, and we're seeing how well the model fits, fairly visually. And right here, if we're really interested in understanding the bimodal nature of the curve, we're not getting that resolution. That tells us maybe this isn't the best model and there may be a better one. If we go back to the top, the linear model was selected initially because it had the minimum BIC, but maybe we want to use the step function instead, so I click on the step function and all those FPC curves are recalculated. The first thing we notice is that we get much more of the variance explained, so we don't necessarily need all of those functions; I can take the slider and use, say, six functions to explain 99.7% of the variance, which simplifies the model a bit. Now we can go back down, take another look, and spot-check the curves. I'll hover over six again, pin it, and this is the curve we'll be looking at; I'll make it a little bigger, and when I click on six... what do you think, Jerry? Is this one looking a little better? -That's a much better reproduction of your experimental data. Yeah, I like that. -Good, I think this is looking better. We can go to seven, and that one looks a lot better as well. You don't need to check them all, but it's good to check a few, so let's look at eight. Now we're getting much better resolution of its bimodal nature. All right, I think this is telling us that this model is pretty good. -Yeah, so what do we do now? It's great that you can reproduce the experimental results, but how do you get to the optimum? -Yeah, at this stage it's still a little abstract. We've got all these shapes that we combine in different ways to reproduce the curves, but we haven't done what we set out to do, which is to relate those shapes to our DOE factors. That's what we do next. We go back up to the model and select Functional DOE Analysis. When we do that, we get a profiler that may look more familiar if you're used to traditional DOE. Once again, the response curve is here, with percent per micron on the Y axis and the micron (particle) size on the X. But now, instead of those FPC shapes, we have our DOE factors: agitation speed, flow rate, media loading, et cetera. I can move the agitation speed and see how it influences the curve, and I can see from the slope of these lines whether or not a factor is important. Changing this one doesn't really change the shape, so flow rate doesn't matter a whole lot, but temperature certainly makes the curve broader or narrower. And because we loaded that target curve, just like in a standard DOE, we can go to the red triangle and choose Maximize Desirability.
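An aside on the idea behind matching a target curve: the sketch below scores candidate factor settings by how far their predicted curve is from the target (an integrated squared deviation) and keeps the best. The predicted-curve function and candidate settings are hypothetical stand-ins, not the functional DOE model JMP fits.

```python
import numpy as np

grid = np.linspace(1, 200, 200)                     # micron sizes
step = grid[1] - grid[0]
target = np.exp(-0.5 * ((grid - 60) / 12) ** 2)     # hypothetical target PSD shape

# Hypothetical predicted-curve function of two factor settings
# (a stand-in for the fitted functional DOE model; coefficients are invented).
def predicted_curve(agitation, temperature):
    center = 40 + 0.2 * agitation
    width = max(30 - 0.08 * temperature, 5)
    return np.exp(-0.5 * ((grid - center) / width) ** 2)

def mismatch(curve):
    """Integrated squared deviation from the target (smaller is better)."""
    return np.sum((curve - target) ** 2) * step

best = None
for agitation in np.linspace(50, 200, 31):
    for temperature in np.linspace(20, 250, 24):
        m = mismatch(predicted_curve(agitation, temperature))
        if best is None or m < best[0]:
            best = (m, agitation, temperature)

print("best match ~ agitation:", round(best[1], 1), "temperature:", round(best[2], 1))
```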
Typically, in a traditional DOE, this would optimize a single parameter, but in this case it's going to find the settings that match the target curve we loaded earlier. When I click it, there we go: it looks like there are settings that give us a curve that is fairly narrow, hits the peak we wanted, and doesn't have any of those features we're trying to avoid. -Very cool. So are we done? We've got the settings we need, we throw them over the fence to manufacturing, and we're done? -Well, that's one way to do it, Jerry, but the folks in manufacturing I've worked with might not like that. These settings aren't points in our design: this one is in the center of its range, this one isn't at an edge or the center. So we probably want to run some confirmation runs, maybe look at sensitivities, and make sure we have some robustness around these settings. -Very cool. Okay. -Let's go back, and I think we can sum up. -Yeah. Scott, that was great, thank you. In summary, we've tried to demonstrate how to perform a DOE using these particle size density curves, with the curves themselves as the response, rather than parameterizing the PSDs with summary statistics like the median, standard deviation, et cetera. We were then able to optimize the factor settings of the process, at least at the pilot scale, to find an optimal curve shape that was very close to our desired particle size distribution. Along the way we discovered how some of those parameters, agitation speed and so forth, affect the particle size distribution, leading to multiple peaks or broad peaks, so we have an understanding of that, and we have a model. Bringing this back to our original scenario: R&D took the results back to manufacturing, where confirmation runs were attempted. Scale-up perhaps wasn't completely successful; that's typical of scale-ups, and sometimes the pilot runs don't map directly to manufacturing. But we now have a model that indicates which knobs to turn if we do see a shoulder on the peak or something like that. So we were able to go back to manufacturing and give them the assistance they needed to increase their throughput, and everyone was happy. [crosstalk 00:26:18] That concludes our paper. Scott, thanks for all the hard work. -Yeah, it was great working with you on this, Jerry. -Likewise. Scott has been kind enough to save the modeling script in the data table, which we're going to attach to this presentation. If you have any questions about the video or any of the techniques, please post your comments below the video; there will be a space for you to do that, and we'd be happy to get back to you. Thank you for joining us. -Yep, thanks.
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Latin Squares, Youden Squares, BIBDs and Extensions for Industrial Application (2022-US-45MP-1138)
Monday, September 12, 2022
Latin Squares are beautifully symmetric designs employing a given number of symbols in the same number of rows and columns. Each row and each column contains all the symbols. A Youden Square, attributed to Jack Youden, is a Latin Square from which a number of rows have been removed. The Youden Square is also a special case of a balanced incomplete block design (BIBD), which is one of the design tools in JMP. All these designs have been around for decades. They support a single categorical factor and one or two blocking factors. They are commonly used in agriculture, where the rows and columns are rows and columns of plants and it is desirable to remove any fertility gradients in a field in the analysis. In industrial settings, it is unusual for experiments to be limited to only categorical factors and blocking factors. However, it could be very useful to use these designs as building blocks for creating experiments with more factors. This talk shows how this can be done in JMP using a combination of the BIBD designer and the Custom Designer.

Hi, my name is Bradley Jones. I am the leader of the DOE group at JMP statistical software. I'm going to talk to you today about Latin Squares, Youden Squares, Balanced Incomplete Block Designs (also known as BIBDs), and some extensions for industrial application. Let's get started. What you see here is a window at Cambridge University that depicts a Latin Square. The window has seven rows of colored panes, seven columns of colored panes, and seven different colors. You can see the yellow and the blue; yellow appears in every row and in every column, and all the other colors likewise appear in every row and every column. There's a beautiful symmetry to the Latin Square that we see here. As a designed experiment, the Latin Square is primarily an agricultural design. You can think of the rows and columns as blocking factors: if you were running a Latin Square design in agriculture, the rows and columns would be rows and columns of plants, and making them blocking factors rules out any effect of a fertility gradient across rows or across columns. The entry in each row and column is one level of a categorical factor; in the case of this Latin Square design, the categorical factor has seven levels, corresponding to the seven different colors. So each of the two blocking factors has seven levels, and the categorical factor also has seven levels: three factors, each with seven levels. One might say it's rare in an industrial experiment to have three seven-level factors, or even one seven-level categorical factor along with two seven-level blocking factors; that makes much more sense in agriculture. Here is an example of a Latin Square design that I created using the Balanced Incomplete Block Design tool in the DOE menu. You can see that there are seven blocks and seven levels, A through G, in each row and also in each column. The Youden Square is a Latin Square with some rows removed. What you're seeing here is a transposed Youden Square: if you turned it on its side, you would see seven blocks, each with four levels. Basically, this Youden Square is created by removing three rows from a Latin Square. This is a little more like something you might want to do in an industrial setting.
Imagine that you're doing an experiment that you're going to run over seven days, and each day you can do four runs, so 4 times 7, or 28 runs in all. On each day you would be running four of the seven levels of some treatment factor. The Youden Square isn't really a square; it's more like a rectangle. I don't know exactly how it came to have that name, but it is also a special case of a Balanced Incomplete Block design, or BIBD. I mention the Youden Square mainly because I've been asked to give the Youden lecture at the Fall Technical Conference this year, in October, and I wanted to show something about Youden since I'm giving that lecture. But I really want to talk more about Balanced Incomplete Block Designs, or BIBDs, because they are a more general type of design. In this case, we're thinking about a seven-level categorical factor; you can only do four runs a day, and you worry there might be a day-to-day effect, so the day is a blocking factor and you have a seven-level categorical factor that you're interested in. This is the same scenario as the Youden Square, except that there are many more possibilities for creating Balanced Incomplete Block Designs than Youden Squares. Here's an example: a BIBD with seven blocks, each block containing four treatments. Here's the incidence matrix, which shows a 1 if a treatment appears in a block. In the first block, A, C, F, and G appear, and you can see A, C, F, and G. In this design, each of the seven levels of the categorical factor appears four times, and you can see in the pairwise treatment frequencies that each level appears together with every other level exactly twice. Level A appears with level B twice, for example: once in block 2 and once in block 5. Every pair of levels appears together in some block exactly twice. There's one more nice property of this design, which isn't always guaranteed to happen but does here: each treatment appears once in each possible position. For instance, level A appears in block 2 in the first position, in block 5 in the second position, in block 6 in the third position, and in block 1 in the fourth position, and all the other treatment levels likewise appear once in each position. That means that if you wanted to make position a variable, it would be orthogonal to the blocking factor and also orthogonal to the treatment factor. So in this case you can have a seven-level treatment factor, a seven-level blocking factor, and a four-level position factor. Imagine again that you're running this experiment over seven days with blocks of size 4 each day; within each day you would control the position in which each treatment appears, so that the position effect wouldn't bias any other main effect of the design. To recap this BIBD: there is the seven-level categorical treatment factor, with seven possible treatments; you could imagine seven different lots of material, for instance, with each lot being a different treatment. The blocking factor is day.
You're going to run the experiment over seven days, with four runs each day. Then within each day there's a position, and the time order — the position of the run — isn't going to affect any other estimate of either day or treatment. Now, in industrial experiments, having only a categorical factor and two blocking factors is a rare thing. So I'm thinking: what if I wanted to add some factors to this experiment, say, four continuous factors? I can make a design with four continuous factors, a seven-level categorical factor, a seven-level blocking factor, and a four-level position factor using the Custom Designer. But I wouldn't necessarily get that beautiful symmetric structure of the BIBD on the categorical factor and the blocking factors. Suppose I want to keep that beautiful structure and just add the four continuous factors. That is an extension of the BIBD that might be more appropriate for an industrial experiment. Here's an example of that. I have four continuous factors, 6 degrees of freedom for blocks, 6 degrees of freedom for treatment, and 3 degrees of freedom for order, because a categorical factor has one fewer degree of freedom than it has levels. You can see that the main effects of the continuous factors are all orthogonal to each other. They're orthogonal to blocks, orthogonal to treatments, and only slightly correlated with the order variable. Let me leave the slideshow and move to JMP. Here is the JMP BIBD capability. You can find it under DOE, then Special Purpose, then Balanced Incomplete Block Design. I chose that and defined a treatment variable that has seven treatments, A through G. I set the block size here — let's suppose I want blocks of size 4 and seven blocks. That's my design. Now I have the picture that I showed you before in the slideshow. Here's the blocking factor with seven blocks, each of which has four elements. This is the Incidence Matrix, which shows which treatment is applied in which block: a 1 if it's applied and a 0 if it's not. You can see that each treatment appears four times in the design, and each pair of levels appears together twice in some block of the design. Finally, we have the Positional Frequencies, which show that each treatment appears once in each position in the design. Here's the table of the Balanced Incomplete Block Design. Now what I want to do is create a designed experiment that forces this set of factors into the design. I can do that in the Custom Designer, but I have a script that does it automatically. I'm going to run the script, and it's going to do 10,000 random starts of the Custom Designer behind the scenes. The resulting design will pop up as soon as it's finished with all 10,000 random starts. Here is the design, and you can see that the factors are my four continuous factors plus the block, the treatment, and the order effect. These are the covariate factors that came from this table here; there are 28 rows in the table. I'm calling them covariate factors because I'm forcing them into the design as they are. I've already created the design, and it has matched up the four continuous factors and all of their rows with the Balanced Incomplete Block Design that holds the treatment, block, and order variables.
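(For reference, the degrees-of-freedom bookkeeping for that 28-run design, assuming a main-effects model: 1 for the intercept, 4 for the continuous main effects, 6 for blocks, 6 for treatment, and 3 for order, which is 20 model degrees of freedom and leaves 8 for estimating error.)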
Now I can show you the table of the design. What I've done is sort this table by the order column, because within the first block I want the order to go 1, 2, 3, 4, and in the second block again 1, 2, 3, 4, and so forth. I'm controlling the order of the runs in a non-random way, but I've now made order orthogonal to treatment and orthogonal to block. When I evaluate this design, one thing I want to show you is how well I can estimate the continuous factors compared to a completely orthogonal design. For a completely orthogonal design, the fractional increase in confidence interval length would be 0. What we see here are numbers around 0.01 or 0.011, which is to say that a confidence interval for the main effect of factor 1 is about 1% longer than it would be if you could make a completely orthogonal design for this case. I'm going to select and remove these terms so that I can show you the correlation cell plot without a lot of noise. This is the correlation cell plot for the design, showing the orthogonality of the main effects of the four continuous factors. The block variable is orthogonal to them, the treatment variable is orthogonal to them, and the only thing that's not orthogonal to the four continuous factors is the order effect. But the order effect is orthogonal to the blocks and the treatments. There's very minimal correlation, and that correlation leads to almost no loss of information or increase in variance for the continuous factors in the design. The result of doing it this way is a much simpler design structure, so that the analysis of this design will be easier even for a novice in design of experiments. That is all I have for you today. Thanks for your attention.
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Improved Heart Failure Prediction Using Model Screening Platform (2022-US-EPO-1137)
Monday, September 12, 2022
Cardiovascular disease is the number one cause of death globally, claiming an estimated 17.9 million lives in 2019 and accounting for 32% of all deaths worldwide that year. Heart failure is a common consequence of cardiovascular disease, and this dataset contains 11 features that can be used to predict likely heart disease. The prediction results can help people with cardiovascular disease or high cardiovascular risk (due to the presence of one or more risk factors, such as hypertension, diabetes, hyperlipidemia, or established disease) to recognize early symptoms and detect disease risk in a timely manner. The data set included 918 participants from different countries and 11 factors associated with heart failure, such as age, sex, blood pressure, and blood glucose. This study uses different analysis models in JMP software for statistical analysis of the data set, such as neural networks, logistic classifiers, and random forests. The optimal prediction model is selected by comparing model performance. The model output helps people understand the importance of different factors leading to heart disease and the probability of developing heart disease under certain conditions, to encourage more attention to managing physical health in daily life and predicting disease risk. Hello. Good morning, everyone. This is Saijac Lami, and I have my teammate, Zhe Diao. We are business analytics graduate students from the University of Connecticut, Stamford campus. A little about our exposure to JMP: we used JMP extensively in our Prediction Modeling course in our first semester. We felt it's a very easy and very powerful tool, and there is a lot we can do with it; we are still exploring many of JMP's features. Today, we are here to present the work we did during the summer: heart failure prediction using the Model Screening platform. We call it "improved" because we use several JMP platforms to leverage the predictions. Coming to the agenda: this is just an overview slide giving the gist of what we are doing, followed by three slides where we talk about pre-processing, some of the EDA we did, and the modeling. Coming to the introduction, we know that cardiovascular disease is the number one cause of death globally, claiming an estimated 17.9 million lives in 2019 and accounting for around 32% of deaths worldwide. For our problem, we took a gathered data set and developed a classification model for classifying heart disease. We also leveraged these predictions using the Model Screening, Model Comparison, and dashboard features in JMP 16. The model output helps in understanding the importance of factors leading to heart disease, and we also estimate the probability of developing heart disease under certain conditions. Summarizing, our objective is to build the best model and find the factors that lead to heart failure using the JMP 16 platform. Coming to the methodology and a little about our data set: it included around 918 participants from different countries, with 11 factors associated with heart failure, such as age, sex, blood pressure, and blood glucose. For our predictions, we first performed pre-processing of the data by exploring whether there were any missing values or outliers.
We then performed EDA to understand the importance of each feature and its relationship to heart failure. To build the model, we incorporated the following JMP 16 capabilities in our methodology. The first is Model Screening, an efficient platform for simultaneously fitting, comparing, exploring, selecting, and then deploying the best predictive model. Next comes Model Comparison, an easy platform for comparing and selecting the best-performing predictive model. Next comes the dashboard, an efficient way to represent our EDA concisely that we can re-run any time new data is available. Coming to our results, this is just an overview. Using Model Screening, we identified the Boosted Tree as our best model. We did not choose on accuracy alone; we also focused on the model with the lowest false negative rate, because we do not want patients with heart failure to go undetected. Based on this, we chose the Boosted Tree as our best model. Coming to the column contributions: when we tried to identify the important factors for predicting heart failure using the Boosted Tree, we found that ExerciseAngina (whether a person has chest pain induced by exercise), FastingBS (fasting blood glucose level), RestingECG, ST_Slope, and ChestPainType — a few of the 11 parameters — together contribute around 75% of the column contributions. Next, Zhe Diao will take us through a more in-depth analysis. Okay, after screening the basic information of the data, such as the target feature, predictor variables, and data types, our analysis work starts with data cleaning. We need to deal with missing values and outliers first to get clean data, and JMP provides a variety of ways to explore and handle them. For missing values, JMP can display the details in a summary table or display them visually in a cell plot or treemap. Here we show the statistics table, which is also a way to get the information we want. We can see that there are no missing values in our data. But when you further explore the data distributions, you will find that some indicators use the number zero in place of a missing value. We treated those zeros with deletion or median replacement, because the true value of those indicators cannot be zero. For outliers, box plots and the Explore Outliers utility are common methods. Here, we use the outlier analysis in the Multivariate platform, which reflects distance in a multi-dimensional space in this Mahalanobis distance graph. We retained these outliers in the analysis because we consider them a common phenomenon in medical test results. After completing these steps, we have clean data, and we enter the data exploration stage. In this step, we built some commonly used charts to show the information contained in the data. JMP provides many choices here, such as treemaps, ring charts, bar charts, and so on. From these graphs, we know that the proportion of males suffering from heart failure is twice that of females, 80% of patients with heart failure have diabetes, and 77% have no symptoms of chest pain, which reveals how imperceptible the disease can be. After drawing these useful conclusions, we come to the modeling stage to further explore the relationships in the data.
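As a rough analog of this workflow outside of JMP, here is a sketch in Python with scikit-learn covering the same steps: treat implausible zeros as missing and impute the median, then screen a few candidate classifiers and compare them on misclassification and false negative rate. The file and column names follow the public heart failure dataset and are assumptions of this sketch, not output from the presenters' analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")            # assumed file: 918 rows, 11 features + target

# Zeros in Cholesterol/RestingBP are physiologically implausible: treat as missing, impute median.
for col in ["Cholesterol", "RestingBP"]:
    cleaned = df[col].replace(0, np.nan)
    df[col] = cleaned.fillna(cleaned.median())

X = pd.get_dummies(df.drop(columns="HeartDisease"), drop_first=True)
y = df["HeartDisease"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=1)

# A small "model screening" pass: fit several candidates and compare error rates.
models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=1),
    "Boosted Tree": GradientBoostingClassifier(random_state=1),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    misclass = np.mean(pred != y_test)
    false_neg = np.mean((pred == 0) & (y_test == 1)) / np.mean(y_test == 1)
    print(f"{name}: misclassification {misclass:.3f}, false negative rate {false_neg:.3f}")
```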
When you are doing data analysis, you may ask yourself what model you want to build, or which model performs best. The Model Screening function in JMP helps solve this problem in a very intuitive way. You just drag the target variable and the predictor variables to the corresponding positions, and JMP will run all the appropriate models. In this analysis, JMP ran nine models automatically, including a regression model, Boosted Tree, Neural network, and so on, and you get a detailed and clear output. If you only care about the results, the summary table can help you choose the best model, whether you consider residuals or goodness of fit. If you want to know the details — the parameters and results of each model — you just click the model you want to view in the details section, and you can understand the performance of the model from all aspects. Here we show the parameter estimates, the confusion matrix, and the profiler. In the profiler, you can enter new data to observe the trend for each variable and get the predicted result. We see that the influence of age is not significant, which may be counter to our intuition, whereas gender, diabetes, and ST_Slope are the main influencing factors. Moreover, in these results we pay attention to the misclassification rate, especially the false negative rate, because a false negative means the patient has heart failure and we predict that he does not, which may lead to very serious consequences. The best-performing model we selected in this analysis is the Boosted Tree, which has the lowest misclassification rate and the highest sensitivity. Then we can save all the prediction formulas and results for use in Model Comparison. Model Comparison provides a more concise and intuitive format for showing model performance indicators, which makes it convenient to make the final choice. Now, I'll take you through the last part of our presentation, which is the dashboard. Using the dashboard feature, we created a utility where we added several important features, discussed earlier, that critically affect heart disease risk. Here we can interact by providing inputs: I can choose male or female, the chest pain type, the ST_Slope pattern, and the exercise angina status. Based on those inputs, the utility displays the probability of heart failure, which is a pretty useful feature. That brings us to the last part. In conclusion, to summarize: we used the Model Screening platform to explore the best predictive model for heart failure prediction, and we also leveraged the JMP 16 dashboard to develop an interactive utility that outputs the probability of heart failure based on the input parameters. That's all we have for today. Thank you.
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Custom Heat Maps Provide New Insights in Pet Treating (2022-US-30MP-1133)
Monday, September 12, 2022
As the pet food industry continues to expand, one of the product categories that continues to gain momentum is pet treating. There are various products available, including ones that provide a cleaning benefit, where the texture of the product promotes chewing behavior and "scrubs" the pet's teeth of plaque as it is consumed. Various methods are used to measure the dental efficacy of these treating products, from which the data is prepped, modeled, and reviewed for further insights. To assist in the consumption of the data, I utilized JMP's mapping capability to create a custom heat map that provided an unrealized layer of insight into the performance of the products beyond what current models and analyses were showing. This additional layer of analysis drove various discussions and investigations to measure, test, and enhance the claimed benefits of various products, including new product development. By utilizing JMP's custom mapping capabilities, analytical professionals can provide additional layers of insight into data that can lead teams into further innovation channels. Welcome, everyone. Today, I will be going through custom heat maps and how they provided some new insights into pet treating. First of all, I'm Jared Shaw and I work for Mars. I've been with Mars for eight years. Prior to Mars, I worked in semiconductors at Intel and IM Flash Technologies for about 13 years. My background is statistics and education. I've done a lot of consulting over the years, as well as teaching others various statistical methods. I'm married and I have three kids; they're all adopted. I play games on the side, build models, and periodically go camping. I also tinker around with construction — on the bottom right is a room over my garage that I finished. Today, I'll go through the abstract I submitted, then C&T, which stands for Care & Treat — I'll give some background on that for Mars. I'll go through measuring efficacy and the research protocols we have for these types of studies, and then I'll get into the JMP portion. This is not a live demonstration of JMP; I'll just be showing some images from the program and talking through some approaches that we used in looking at this data, ending with the new approach that I introduced. First of all, the abstract: I have these custom heat maps, and they provide a lot of insight into pet treating. Overall, our intention is to improve the cleaning of pets' teeth with a new treat. We do this by changing texture and changing ingredients. We want something that will impact the teeth, but also be safe, delicious, and fun for the pet as well as the pet owner. The results provide a lot of insight, including patterns that we see across the mouth: how does the product affect the teeth? The current graphs and methodologies — the modeling is pretty good, but how we showcase the data is not very good. It doesn't offer very clear insight unless there's a remarkable difference between the products being tested. I found a great opportunity to utilize JMP mapping to create some custom maps. These new images spawned a large investigation, brought in some new associates, and produced some great insights and learning — a simple image brought great rewards. Maybe to ground us at a baseline, let's first consider what Care & Treat means. Pet care consists of dry diets, wet diets, and then the Care & Treat components.
This is split up into two pieces: the treat and the care. Treating products have high palatability — they excite pets. They may add some extra nutrition and supplement the pet's diet. They're used for training: positive reward in training, getting the pet to respond to your voice, and so on. In some cases, they're long-lasting to help relieve boredom, reduce destructive behavior, and so forth. The care products, on the other hand, also have high palatability to encourage consumption, but these promote healthy teeth — we concentrate a lot on oral care and reducing bad breath — and they can also make giving medication to your pet more appealing. You may have heard of a pill pocket, et cetera. Now, one of the main drivers of these treats is texture, and that really promotes the consumption benefits. Does it become fun to chew the product? For those of you who have dogs, you may notice that in some cases when your dog starts eating a product, it seems to inhale it more than chew it. The texture piece is definitely something you want so that they enjoy the experience of biting into the product. This is just an image to help us understand what we're talking about when we try to clean our pets' teeth. These are a couple of images that show the breakdown of the teeth in pets and how we want to understand how a product impacts these different teeth. How do we measure efficacy? Efficacy is basically how well the product is working. Periodontal disease is the most widespread oral disease in pets. Companies everywhere, when they get into the care space, are looking at how they can take texture and shapes and build a chew that really affects oral care. They want to measure the efficacy of the treats, and different approaches have been used over the years. I'm not going to go too deeply into these; this is more informational. Years ago, there was the Logan & Boyce method — a visual measure on the teeth. It was invasive in how it was done with the pets, and you can read more about that on your own. There's also GCPI, which was less invasive to the pets but still a manual approach. Then recently, probably within the last 4 or 5 years, came the QOLF technique, which essentially takes an image of the teeth before and after. We find that this is much more informative; it gives us much clearer results on what is happening when the pets consume these products. For the efficacy formula itself: first, there's what's called an ITS, and you'll see this in the data later on. It stands for Individual Tooth Score. Basically, it looks at how much plaque exists on the tooth; for the data I'm using, it's based on the GCPI approach and the length of the tooth, which gives us an idea of how much plaque is on each tooth. The Chew X — you'll see them called different names — is essentially the treat being tested. The overall efficacy is, again, the ability to produce the intended result from the product. The calculation takes the No Chew score, subtracts the Chew score, and then divides by the No Chew score, and that gives us the efficacy.
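To make that efficacy calculation concrete, here is a tiny sketch (Python, with made-up illustrative numbers rather than study data):

```python
def chew_efficacy(no_chew_score: float, chew_score: float) -> float:
    """Efficacy = (No Chew plaque score - Chew plaque score) / No Chew plaque score."""
    return (no_chew_score - chew_score) / no_chew_score

# Hypothetical whole-mouth plaque scores for one dog in one phase:
print(chew_efficacy(no_chew_score=2.0, chew_score=1.4))   # 0.30 -> 30% less plaque than No Chew
```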
The research protocol: the background image here is actually one of our feeding centers here in Tennessee. The round sections are dog pods — we have several dogs within each of those pods — and the center building has the cat rooms. What we do is prepare the pets by cleaning their teeth; they get a professional cleaning. We try to get all of the plaque off the teeth to give them a score of zero. We run a crossover design; essentially, this means that every Chew is administered to every pet, though not at the same time. We break it up into different phases, and in each phase the Chews are rotated across different dogs (or cats). Scoring is essentially done between phases: during a phase of the study, the pets have their standard diets and maybe get a treat product at the end of the day, and at the end of the prescribed time frame, we measure the amount of plaque on the teeth. Teeth that keep a score of zero across all the treatments are removed from the study — the pet is consuming the product, but for whatever reason that tooth didn't get impacted by it. Typically, we'll see this with the front teeth, the incisors; they're used more for cutting, while the products are generally more about chewing behavior. For the whole mouth, in some cases we have individual tooth scores where the No Chew is zero, so it's essentially missing, or where the No Chew result is less than the Chew result for that tooth. Basically, No Chew means that for that phase of the study, the dog or cat did not receive a treat product to consume — they just had a standard diet. Chew X means that they had some Care & Treat product at the end of the day. We summarize the data across the whole mouth, and sometimes we break it into regions to see how the product is performing. For the analysis protocol itself, these are done with linear mixed models. We have fixed effects, such as the treatments or the regions depending on the study, and random effects focused around the pet ID. The intent is that the effects apply to all pets, regardless of the specific pets in the study. We also run specific contrasts, where we look at different sides of the mouth, different regions, et cetera. Over on the right-hand side is a table with examples of some of those contrasts, and depending on the number of contrasts we run, we of course control the family-wise error rate (FWER) to guard against inflated error. Then we communicate these results: we take the analysis results and images, sit down with the stakeholders, and show them which of these Chews was better. Initially, when I started getting involved in these studies, it was very much "this Chew did better than that Chew for the whole mouth." But as we started bringing in different regions of the mouth, we started seeing different results and had much more fruitful discussions.
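The linear mixed model in that analysis protocol could be expressed, for example, with statsmodels in Python — a conceptual sketch only; the actual analyses were run in JMP, and the file and column names here ('efficacy', 'chew', 'region', 'pet_id') are placeholders:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one row per mouth-region efficacy summary per pet and phase.
df = pd.read_csv("chew_efficacy.csv")

# Fixed effects for treatment (Chew) and mouth region plus their interaction;
# a random intercept per pet mirrors the random pet-ID effect described above.
model = smf.mixedlm("efficacy ~ C(chew) * C(region)", data=df, groups=df["pet_id"])
result = model.fit()
print(result.summary())
```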
This gets us into the analysis. What I'm going to do here is concentrate on the data visualization component; I'm not going to go too deeply into the statistics of the modeling piece. This is just about visualization, and this first part is specifically about the ways we initially tried to visualize the teeth. These are results from JMP, from running the mixed effects model, and at the end here we are running those contrasts. In these types of results, as I have marked in the center, the Chews by themselves would show no difference, but areas of the mouth would, particularly the molars. You can see here on the left, I have just the Chews by themselves being compared, and on the right-hand side I have different areas of the mouth. Different areas of the mouth were showing interesting differences, but comparing the Chews themselves to No Chew, maybe we're not seeing much for one of them but some for another. Then we would group them into different sections using the variability chart. This shows my different Chews with the No Chew and, again, the areas of the mouth. Here, the mean of the data is on the left and the standard deviation is on the right. Definitely, one area of the mouth is behaving differently from another, as I can see here on the right-hand side. Let me turn on my pointer. Over here, we can see that the variability for the lower molars versus the upper molars differs across Chews. Another way we tried to portray this is using Graph Builder with the area of the mouth over here. I forgot to mention this earlier, but IUL is the incisors and upper and lower canine teeth, and then these are the molars. We can see a definite pattern when I compare across the different phases — phase 1, phase 2, and phase 3. In phase 3, it looks like there's a linear effect of sorts occurring between the Chews. That's just how it shows up visually; it doesn't reflect the order in which they were given, it's just what the data shows. Looking further into the variability chart, bringing in the phases, we asked, "Do we see differences per phase?" Here we really see that for Chew W, the variability was much lower than for No Chew and Chew P. That really starts raising questions: what's going on here? Why is this happening specifically for this Chew, and what could we do to understand it better? Another visual we generated for this study summarizes by area of the mouth and the Chew efficacy for each of the Chews. We can definitely see some differences between the Chews, but overall they might seem similar even though we're seeing differences between areas of the mouth. One of the things I started asking is: why do we see these differences between areas of the mouth but not as much across the Chews — what's going on? Here I generated a plot where, down on the x-axis, I have the different dogs, and then the Chew efficacy for the W and P Chews and the areas of the mouth. What's interesting here is that particular animals show the difference and other animals do not. We would expect this, given the randomness of the study, that the Chews are going to behave differently depending on how the pet consumes the product. This really made me think much more deeply about the data: what's really going on here? Do the pets chew the product differently? That led me into the data visualization for the second part, because it made me start thinking about how I could look at this differently to bring out the individual component of the pets. I was working with a research scientist, and as we were going through one of the studies, they happened to have this card — as you can see here on the right-hand side, this is just a picture of the card. They had it sitting on their desk, and I was sitting there staring at it. I had the idea: what if I created a tooth map in JMP?
I could then color each of these individual teeth and maybe get further clarity in these studies beyond what we were looking at. I contacted our Waltham scientists. Waltham is a site within Pet Care in the UK, and they concentrate on research into pet nutrition. I talked to them, and one of their scientists drew me some teeth. Here, for this first slide, we have the dog teeth: the map they drew is on the left, and on the right is just a visual to give you an idea of the different types of teeth in a dog's mouth. Then we see something similar for cat teeth; again, on the left is the one drawn up for this study. What I did then is use the Custom Map Creator. This is a script that's available on the JMP Community. It's an older script — it's been a while since it was updated — but I found it very helpful for this scenario, so I downloaded the add-in, and it creates an add-in pull-down menu. You go to the add-in pull-down, as you see up here on the upper left, click on Map Shapes, and then you can run the Custom Map Creator. When it pops up, you get this screen here, initially without the teeth, and a couple of empty tables. I dragged the image file onto the map and gave it a name. Then I went over to the next section and basically started tracing the teeth. For every single tooth, I would trace it and then give it a name. This is an example after doing all of that work. As you can see, these are all of the individual plot points — me just clicking around that tooth to capture the entire shape so it would show up as accurately as possible on the screen when we look at the plots. Then we have our different files here. There's the XY file, which gives you the coordinates. Down here on the graph on the lower left, you see this is at zero, zero — it's essentially an X, Y coordinate system, telling me where on the graph that particular point is for that particular Shape ID. Then I have a name file that gives me the Shape ID and the name of the tooth. In this case, I created a separate file for dog teeth and, of course, for cat teeth, since they have different shapes of teeth and different shapes of jaw. D just stands for dog, followed by the ID number for that tooth. One thing I found interesting is that when I first created this a few years ago, I was able to just create the maps, and I created a custom script. People would run the script, it would save the maps to their C drive, and everything worked fine. But after a couple of years, it no longer worked, because Mars put in some security protocols that wouldn't allow us to save maps to the C drive — it basically locked it up. I had to figure out how I could still do this, because we wanted to see and create these maps. Then I found another community forum post that talked about putting the files in the AppData Roaming folder for your username; you see the path here. You put your maps there, and it works just as if they were on the C drive. Once you have those files saved to the proper location, you go into JMP Graph Builder. If you've never used it, down in the lower left-hand corner of the screen it doesn't spell it out — it just says "Ma" — but that is the map shape zone.
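The two files the map creator produces — a coordinates table and a names table — have a simple structure. Here is a hedged sketch of that structure, built with pandas purely for illustration; the actual files are JMP data tables saved with "-XY" and "-Name" style suffixes, and the column names and tooth labels below are assumptions based on the description above, not the presenter's files:

```python
import pandas as pd

# Coordinates table ("-XY"): one row per traced vertex of each tooth outline.
xy = pd.DataFrame({
    "Shape ID": [1, 1, 1, 2, 2, 2],
    "X": [0.0, 1.2, 0.8, 3.0, 3.9, 3.4],
    "Y": [0.0, 0.3, 1.1, 0.0, 0.4, 1.0],
})

# Names table ("-Name"): one row per tooth, linking the Shape ID to a label
# such as "D" (for dog) plus the tooth ID, matching the analysis table column.
names = pd.DataFrame({
    "Shape ID": [1, 2],
    "Tooth": ["D101", "D409"],
})

print(xy)
print(names)
```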
Since I gave these tooth IDs as the map name component, I take that tooth ID and drag it down to that section. When you do, you can see the teeth showing up in the background. Not all of the teeth show up, because for this particular study I did not look at every single tooth — you can see the incisors are missing. You can right-click on the image, go to Map Shapes, and show the missing shapes; when you do that, all of the teeth show up. In addition, I took the different Chews that were investigated and dragged them up to Group X at the top, so we see three different maps, one for each of the Chews and the No Chew. I then took the ITS and pulled it over to Color. Again, ITS is the individual tooth score — how much plaque is on the tooth. At this point, this is giving me the average amount of plaque on the teeth of all the dogs in this particular study. I can start to see where that plaque is showing up for the No Chew: definitely on these molars, especially the bottom molars on the right-hand side. Then for the different Chews, I can see how some of the canine teeth, on average, still had some plaque remaining, but there is definitely some cleaning of the molars. Clicking Done gives me the bigger image so I can see it in more detail. This leads me into the data discovery. I created these maps and they looked great — people said, "Hey, this is a really interesting way of looking at the data." But I wasn't done there. We started discovering something when we looked at the maps differently. In this case, this is back to that same map, but I put a local data filter on by the dog names. Here I have dog names on the left; with the local data filter on, I can filter on each of these dogs and look at them individually. We don't have time to go through all of them, but I wanted to show a few. Here's Aura. What was interesting for Aura is that for Chew P, the lower molars on the right-hand side weren't really getting cleaned very well — according to the score, they weren't getting cleaned at all. As I looked at this, it told me, "Man, this particular pet, Aura, would only chew the product on the left side." For Chew W, we actually saw a different signal: she seemed to chew the product on both sides of her mouth. Very interesting results. Again, these are just two different textures being looked at for this product. Going down to another dog, Gretchen: she showed something different. For Chew P, she did very well with that Chew, but for Chew W, she preferred to chew it on the right side of her mouth rather than the left. Again, these are two completely different animals, and they chew the product differently depending on the texture and their preference for the texture. Not all dogs respond to the same texture the same way; they're very individual. When I was doing this, I started asking friends of mine, "Do you chew with one side of your mouth for particular foods?" Sure enough, as we started collecting that data, we found — like for myself, I like to chew nuts, but only on the left side of my mouth. For others, when they chewed nuts, it didn't matter which side, et cetera. As we started talking about it and keeping a record of it, we started to see that these pets consume a product much like an individual human does, with preferences in how they approach it.
Just a couple more dogs to look at here. Bagel: for this one, the top teeth cleaned better than the bottom teeth. Just fascinating results — how is that possible? When they're chewing the product, they're biting down into it; the top teeth and the bottom teeth both sink into the material. Why would we see only certain teeth showing up here? Basically, what it means for this particular dog is that Bagel was using just the rear molars to chew into the product; the front molars weren't being used at all. Then Muck — this is a great example of a dog that either isn't chewing the treat at all or is just inhaling it. Those are some of the customer calls we sometimes get: "Hey, my dog isn't even chewing this product; they just take it and swallow it whole." Very interesting results. As we looked into this data more and more, it really led us to believe that we need a new product — something that will really deliver a whole-mouth clean, a chew experience where the animal likes to chew, really gets into the product and consumes it, and we get that efficacy result where the product helps clean the teeth. In summary, what I've learned from this experience is that the historical studies for this type of work were basically giving an average across the whole mouth, and that wasn't sufficient to give us a good idea of what was happening with the product. Viewing these tooth maps by individual pet started showing some very interesting results that we couldn't see even by looking at regions of the mouth averaged across all of the pets. We actually need to look at it by individual and start classifying it: this particular treat impacted these teeth only, and that particular treat impacted those teeth only. Classifying it that way lets us learn a lot more about the texture of the product and how it was consumed. We started applying these findings across all historical studies, pulling them into a large analysis to dig in and learn more from the history of what we've done and how it affects things moving forward. Of course, this led into some new product development. Here's an example of what that is — unfortunately, I can't show it to you; the image is protected and it is not yet released, but it is something we're investigating. Thank you.
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Getting the Most from Repairable Systems Simulation (RSS): Going Beyond the Documentation (2022-US-45MP-1132)
Monday, September 12, 2022
Repairable systems can be complex. The RSS platform offers several lesser-known or undocumented features that can help users build simulations more closely aligned with how they actually run a system. Did you know block replacement can be contingent on a block's age or the number of times it has failed? Or that maintenance schedules can be based on system state, making it possible to skip maintenance if it would bring the system down? Using an example-based approach, this session will illustrate useful features and functionality that are not easily found or are missing from the documentation and JMP User Community. Topics include getting, setting, and employing system information, as well as applications for less commonly used Events and Actions. A JMP Journal with all examples will be available. Thanks for taking some time out to watch this video. I'll assume that you're here because you've used, or tried to use, the Repairable System Simulation platform within JMP. You might want to learn a little more, or you might be frustrated because there are things you're trying to do that you don't think you can do. What I'm going to cover today are things that aren't explained deeply in the documentation, things I've learned through trial and error that aren't in the documentation, and even things I've learned from talking to the developer of the platform — things you wouldn't be able to find any other way. I'm going to assume that you've had some exposure to this platform and that you know a little about reliability analysis, so I don't have to go into detail on what a parallel setup is, what K out of N means, and so on. I've got a number of examples I want to cover and a lot of material, so let's get started. Before we actually look at the Repairable System Simulation platform, let me give you an overview. Hopefully you've used it before and you're familiar with it; we're going to talk about three different components and aspects of those components that aren't in the documentation, or aren't well explained there. The first thing we're going to talk about is blocks. Blocks are what I use to build my Repairable System Simulation — things like Parallel, K out of N, Standby. For each block, I can associate an event with that block — those are the little orange blocks we see — and for each event, I can associate one or more outcomes or actions — those are the blue dots — and that's how I build up my simulation. For example, I might want to know when a block fails, and when it fails, I might want to repair it, and that repair might take, say, 40 hours. That's the idea behind building these RSS diagrams. Let's start with some of the general properties of the blocks. There are six different blocks and a knot, which allows me to tie things together within the diagram. Five of those six blocks are composites: they were built to be made of two or more subunits. JMP allows you to use them with one subunit, but really the use case is two or more subunits. Events and Actions are applied to the entire block. This is an important point: the components within the block can't be treated individually, and they're assigned a single distribution (the Standby block is a partial exception). Taken together, these two points mean that things like repair or maintenance on individual subunits is not possible within these composite blocks.
I've got to wait for the entire block to fail before I can actually do anything with it — before I can repair the block. If I want different distributions within the block, I can't do that either, with the exception of the Standby, and we'll talk about that in an example a little later on. Subunits for all the composite blocks, except for the Series block, are in parallel. What differentiates the composite blocks is the number of components that are running, when the block fails, and how the block ages. I put together a little table that goes over all of the composite blocks and how they differ on those characteristics. For all the composite blocks, with the exception of Standby, all the components are running simultaneously. With a Standby block, I can pick K of my N units to be running and the remaining units to be backups. So when do these blocks fail? Obviously, if you have any experience in reliability, a Series block fails when one unit fails, a Parallel block fails when all the units fail, a K out of N block fails when fewer than K units are running, and so on — things you would expect naturally in reliability analysis. Standby and Stress Sharing have another method of failure. With the Standby block, there is a switching mechanism. There could be one switch for the entire block or one switch for each backup. The Standby can fail if I have a single switch and that switch fails, or if I attempt to use each of my backup units and the switches on those backups fail as well. So switch failure can be another cause of the block failing. The same goes for Stress Sharing; however, with Stress Sharing, I only have a single switch. The final characteristic that differs between these composite blocks is how they age. For the first three blocks, all the subunits within the block age equally. For Standby and Stress Sharing, the subunits can age differently — for Standby, they can age differently; for Stress Sharing, they definitely age differently. For Standby, I pick a single distribution for my operating units. For my backup units, I can choose that they don't age — that's cold standby — or I can give them a different distribution for how they age while they're in standby mode, acting as backups. That could be the same distribution as my operating units or something different, so I get to pick two distributions. The thing is, as with all these composite blocks, I get one distribution for all my operating units and one distribution for all my backups — I can't mix and match. Finally, Stress Sharing works a little differently: the stress is distributed among the subunits equally, but when one of the subunits fails, additional stress is placed on the remaining operating units, so the stress changes from failure to failure. We'll see an example of that in a little bit. Now, one thing you might have realized is that sometimes you want a little more flexibility, and the way I can build flexibility is to arrange Basic blocks the way I would a Parallel or K out of N composite block. Basic blocks can be easily arranged like Series, Parallel, or K out of N; it's harder to do with the Standby and Stress Sharing blocks when you've got more than two units. I've got an example in a little bit that shows a Standby setup with two Basic blocks — relatively straightforward to do.
But if I were to try to extend that to three blocks, it becomes much more difficult — not impossible, just very difficult — because I'd have to keep track of what's operating, what's not operating, what happens when a failure occurs, and so on. The same goes for Stress Sharing. Let's go to our first example. We're going to look at a basic K out of N arrangement with five subunits, three of which need to be operational. They're all going to have a Weibull distribution with an alpha of 2,000 and a beta of 1. We're going to assume that when a block fails, it takes 40 hours to replace. We're going to put that arrangement in series with a K out of N composite block, and we'll also assume that the K out of N composite block takes 40 hours to replace, despite the fact that it has five subunits that have failed. We'll take a look at what difference it makes whether I treat these as individual items or as a composite item. This is what my setup looks like in RSS. I've kept it very simple. Here are my subunits relative to my K out of N. What makes it a K out of N is in the knot: I can specify how many of the incoming items need to be operating, and here I've said I need three operating. That's what makes this arrangement a K out of N. I'm going to keep it very simple: the only event I'm going to look at is whether or not the block fails, and the only action I'm going to take is to replace as new. Let's go ahead and run this — it's relatively fast. I want to spend a little time on the output, because what gets sent to the output, and the states it reports, will help us when we start building diagrams later on. Let's look at the order of operations for this very first failure. We see that at 766 hours, unit Sub1 has failed. The block is removed — that's important to note, because that is one of my potential events. So the block is removed, I start the process, and I replace the block. After the block is replaced — after the 40 hours I specified in my setup — the system is turned on automatically. That's another important point. We're going to run into situations where, for some actions, the block doesn't get turned on automatically, and I'm going to talk about which ones those are. In those cases, we'll have to turn the system on, or turn the block on, manually. Here we've got Sub3 failing, and you see a similar series of events; Sub2 fails, and so on. Let's look at what happens when the system goes down, and we know that happens for sure when the K out of N composite fails. Here's the first case where the system fails. Again, it's very similar in most respects, but you'll notice that it turns the system down — I get a turn-system-down entry — and it turns off all of the blocks in the system. That's important to keep in mind, because there will be cases where we need to know whether a block is on or off, and whether we need to turn a block on manually. Then we go to the replacement. Once the replacement is finished, all the blocks are turned on automatically. With Replace with New, the turning off and turning on all happens automatically, so I don't have to worry about it. Finally, the K out of N replacement finishes and I'm done until the next failure. Relatively straightforward.
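For intuition about this first example, here is a rough Monte Carlo sketch (Python, not RSS) of the time to the first system outage for the 3-out-of-5 arrangement, ignoring repairs: with three of five units required, the arrangement goes down at the third subunit failure.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2000.0, 1.0          # Weibull scale and shape from the example
n_sims, n_units, k_required = 100_000, 5, 3

# Weibull(alpha, beta) failure times for each subunit; with beta = 1 this is
# an exponential with mean 2,000 hours.
t = alpha * rng.weibull(beta, size=(n_sims, n_units))

# With repairs switched off, the 3-out-of-5 arrangement first goes down when the
# third subunit fails, i.e., at the 3rd order statistic of the five failure times.
failures_to_down = n_units - k_required + 1            # the 3rd failure
t_system_down = np.sort(t, axis=1)[:, failures_to_down - 1]

print(round(t_system_down.mean(), 1))                  # average time to first outage, hours
```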
If I want to take a look at how my system has performed, I just launch the Result Explorer. By default, I am not going to see the component-level information. If I want it, I need to go back up to the hot spot and pick the downtime by component. The thing to keep in mind with this organization is that it's just telling me the downtime of the system associated with my components — how many times the K out of N took the system down, and, for each subunit, the times when that subunit took the system down. If you think about it, those correspond to the times when that subunit's failure was the one that dropped the arrangement below three operating units, because that's the only time the system goes down. Now, this only tells me when a component brings the system down. Say I'm interested in how often Sub1 itself was down, how often Sub2 was down, and so on. From the hot spot, I can look at the Point Estimation for Component Availability, and if I scroll down — for example, for Sub1 — this gives me the distributional information for the component itself. It includes not only the times when Sub1 took the system down, but the times when Sub1 went down itself. That's the easiest way to dive into the individual distributions of all my subcomponents. Let's move on to the actions and events and some of their particular properties. We'll start with the events. It's not mentioned in the documentation, but Inservice is only available for Basic blocks. To remind you, Inservice is based on the age of the block, not the elapsed time of the simulation. For example, if I want to service something after 100 hours of use, I would use Inservice; otherwise, I could just use a scheduled event — a scheduled event would be service once a week. But Inservice is only available for Basic blocks, which tells me that if I want to use Inservice, I'm going to have to build things up from Basic blocks. Initialization, Block Failure, and System is Down can only be used once for each block; you'll see that when you use them, they disappear from your palette of events and won't show up anymore. Let's talk about actions. There are 13 actions I can use. Install Used, Change Distribution, and If can only be used with Basic blocks. Minimal Repair can only be used with Basic and Series blocks. That leaves nine actions that any block can use: Turn On and Turn Off a Block, Turn On and Turn Off the System, Replace with New, Install New, Remove a Block, Inspect Failure, and Schedule. There is no limit to how many times an action can be used with a given block. I have an example where I used Turn Off Block more than once; the reason is that what I want to do when I turn the block off (or back on) might differ depending on when that happens. Again, we'll see an example of that in a little bit. Actions can only be connected to other actions — and, as a matter of fact, events can only be connected to actions. So I cannot connect an event or an action to a block or to another event. However, I'm unlimited in how many actions I can chain together. A few specific properties: Initialization only occurs once, at the very start of the simulation run.
We're going to have an example where that makes a difference: I turn off a block at initialization, but when the system comes back online, that block is turned on again, so I'm going to have to make sure I turn it off manually. So Initialization only happens at the beginning. System is Down occurs for every block, regardless of which block brought the system down, so I can use it as a listener. For example, if the system goes down, I might want to perform maintenance on a block even though that block was still running — maybe to bring it back up to near-new, or as-new, condition. A System is Down event will occur regardless of which block brought the system down. Then for the actions: Turn On System turns on every available block. By every block, I mean everything that's not already down because of failure or scheduled to be down — I could schedule something to be down either with the scheduled event or with the Schedule action. Turn On System is going to turn on all of my blocks. Blocks have to be turned on manually after Install New, Install Used, or Change Distribution; those are the cases where the block does not get turned on automatically. This is why I might want to use, for example, Install New instead of Replace with New: Install New lets me do things before I turn the system back on. It gives me the chance to perform other operations or actions before I turn on my system. Turn On Block does not turn on the system if the block was turned off manually. As you might have noticed in the output, regardless of the system already being on, you might see a Turn On System event show up; it does not hurt to turn on a system that's already on. What I say is: if in doubt, add a Turn On System to your chain of events to make sure the system is on. By the way, it doesn't matter whether the block is turned on first or the system is turned on first — both work the same way. All right, let's move on to our second example. We're going to build Basic blocks in a Standby arrangement: one main unit, one standby unit. We're going to use cold standby, meaning that the standby (backup) unit does not age. However, it's going to take 30 minutes to start up the backup — something I can't do with the composite Standby block. Eight hours of maintenance on the backup is performed after control is passed back to the main unit. We're going to look at two switch situations: an infallible switch and a switch with 90% reliability. For all of my blocks, I'm going to assume an Exponential(500) distribution and eight hours to either repair or maintain; failing blocks will be replaced with new. Like the previous example, we're going to look at the Basic blocks in this arrangement in series with a composite Standby block with the same setup — obviously with the exception of that 30-minute startup time, which I can't do there. Let's take a look. Here is our basic arrangement: my backup unit, my main unit, and my Standby block, and for each of these I've associated a failure event and a Replace with New — the bare bones of what I might start with. I don't know if I pointed it out previously when we were looking at the output, but it is helpful when I add an event or an action to give it a unique name, because those are the names that appear in my output.
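Before walking through the diagram build, here is a deliberately simplified Monte Carlo analog of this standby scenario (Python, not RSS; it ignores backup failures and backup maintenance and accounts only for main-unit failures, the switch, the 30-minute startup, and the 8-hour repair), just to give a feel for how switch reliability drives downtime:

```python
import numpy as np

rng = np.random.default_rng(1)
horizon_h = 8760.0        # one year
mean_life = 500.0         # Exponential(500) main-unit life
repair_h = 8.0            # repair time for the main unit
startup_h = 0.5           # 30-minute backup startup

def simulate_downtime(p_switch: float, n_sims: int = 20_000) -> float:
    downtime = np.zeros(n_sims)
    for i in range(n_sims):
        t = 0.0
        while True:
            t += rng.exponential(mean_life)          # main unit runs until it fails
            if t >= horizon_h:
                break
            if rng.random() < p_switch:
                downtime[i] += startup_h             # backup covers the repair after startup
            else:
                downtime[i] += repair_h              # switch failed: down for the whole repair
            t += repair_h                            # main unit is back after the repair
    return downtime.mean()

print(simulate_downtime(1.0))      # infallible switch: average downtime per year, hours
print(simulate_downtime(0.9))      # 90%-reliable switch
```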
All right, let's start with the backup unit. When my system comes online at time zero, everything gets turned on, so I want to make sure that at time zero the backup unit gets turned off. The way I'll do that is with Initialization; you'll notice that when I used Initialization, it disappeared from the palette. The next thing I want to do is turn off the block. If you have used RSS before, you'll remember that when I try to add an action, it stacks these actions on top of one another, and sometimes that's not the best location for them. What you can do, any time you have one action to work from, is select the action, click and hold the plus sign, and drag off the plus; now I can add my action, in this case Turn Off Block. I don't want this Turn Off Block associated with Replace with New, so to avoid that I just delete the arrow; now I can connect it to the event I want. The advantage of doing it that way is that I can now move that action anywhere I want, so I have much more flexibility. The other thing I'm going to do is label these, so when I look at my result table I'll know what happened. This Turn Off Block we'll call "turn off backup at start." I'm going to give each of my events and actions a unique name so I know when they occur. That turns my backup off at the start. The next thing I want is that when the main unit fails, the backup unit gets turned on. Again, rather than selecting the backup unit and saying Turn On Block, I'm going to add my Turn On Block interactively, delete the connection so I can move it, and move things around a bit; I'll move this right over here. Again, I want to name it so that if I come back to this diagram in a couple of days I know what this action is associated with; we'll call it "turn on backup from main." Another thing that is mentioned in the documentation, though I don't know how strongly: I can connect an event from one block to an action in another block, which is exactly what I'm doing here. Now, as I mentioned earlier, this will turn on the block, but because that block was turned off manually, I'm going to have to turn on the system as well. So, Turn On System, and again I'll label it, say, "turn on system from backup," and so on. The only other thing I need to add to my diagram is that when my main comes back online, I want to turn off my backup and do maintenance on it. Let's start with turning off the backup: Turn Off Block, delete that connection, and we'll call this... I must have picked Turn On by mistake; I wanted Turn Off Block, so let's try again. Turn Off Block, and label it "turn off backup from main." There is where my main comes back online, and there I've turned off my backup. Now I want to perform maintenance on it, and the way I'll do that is simply Replace with New; we'll call this "preventive maintenance." I think I said that preventive maintenance is going to take eight hours. Now, again, I have to remember that Replace with New automatically starts the block, so what I would have to do next is turn off the block. But thinking about it in a little more detail, I could probably have gone directly from the main coming back online to doing the maintenance.
Because once I start the maintenance on my backup, the backup stops operating, and once I am done with the maintenance I turn off the block. The extra Turn Off Block doesn't hurt, but it's redundant, so let's just get rid of it. That would be my setup. I always like to do a run to double-check that things operate in the proper order, so I'll generate my output. In this case I've only got it set up to do one simulation; I might have to run a couple of simulations depending on the reliability of my units. Here the main fails, the system goes down, the standby block is turned off, a new main is put in, and the backup is turned on from the main. If we look at the predecessor column, there is my "turn on backup from main," then "turn on system from backup," then Turn On Block by system. Again, I like to step through these once I build them to make sure things happen in the right order. The system gets turned on, the new main is put in with Replace with New, I turn on my system again (it's already on, but that doesn't matter, it doesn't hurt), and there is the backup being turned off and getting its PM, and so on. There should be the backup being turned off. I like to use these runs just to double-check that things are working properly. That's the setup without the switch. Adding the switch is relatively straightforward: we would put it between the main failing and the backup starting, and that's what the If action is made for. Let me get rid of that connection. It doesn't matter which of my blocks the If comes from, and I didn't want to put it there, so let me fix that; I've got a backup copy of the diagram just in case something like that happens. Here we go: there's my backup being turned on, there is my main failing, and we'll get rid of that connection. I want to add my If action so I can float it anywhere, so I'll just grab any one of these, add an If, delete the connection, and connect it where I want it. Now, the way If works is that it evaluates what's in the If condition box. If it's true, if it evaluates to one, the chain continues; if it evaluates to zero, it stops. I don't need an If() wrapper in the box itself; I'm simply going to say Random Uniform() is less than 0.9. When that's true the chain continues; otherwise it stops there. It's as simple as that to put in the switch (there is a small sketch of this condition below). Again, I probably want to label it too, something like "if switch is success." At this point, the only drawback is: what would happen if the switch failed? Well, if the switch fails, the system has to wait until the main comes back online. When the main comes back online, it tries to turn off my backup, and it doesn't hurt to turn off something that's already off. But the thing I might not want to do is the maintenance, if the backup was never used. In a later example, we'll see how to get around that, how to look into the system and ask: is that backup running? If it is running, do the maintenance; if it isn't, you don't need the maintenance. Let's move on to the next example. In this case, I've got two pumps operating in parallel, with preventive maintenance performed every two weeks and taking eight hours. The PMs are staggered by a week to keep both pumps from being down at the same time. If a pump fails, a minimal repair taking four hours is performed, and the simulation is run for a year. Let's start with the bare bones: here are my pumps with the pump failures and replacements.
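As an aside on that If action: the condition box just holds a JSL expression, and its last statement must evaluate to 1 (continue the chain) or 0 (stop it). A minimal sketch of the 90%-reliable switch condition, exactly as described above, could be:

    // Condition for the "if switch is success" action:
    // continue the chain with 90% probability, stop otherwise.
    Random Uniform() < 0.9;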
How do I set up the "wait a week before you start the PMs"? Like the last example, I'm going to start with Initialization, and what I want to say is: wait a week. How do I say wait a week? I'll use the Schedule action. I'll do what I've been doing: set up the Schedule and delete my connection. I've got a number of different options with Schedule. I only want this to happen once, and I want it to happen at the very beginning, so it's connected to my Initialization event. We'll call it "wait one week," set it to happen only once, and set the completion time to 168 hours, because my simulation is set up in hours. What that does is, on start, wait a week and then perform whatever action I connect here. What is that action? I'm going to use Schedule again, and this one will be the PM on pump B. I'll leave max occurrences blank so it repeats until the end of the simulation, and the completion time is going to be 168 times 2, so 336. Now, after that first week, every 336 hours I'll be able to perform another operation. There are a couple of different ways to do that operation; we're going to assume it's a PM [inaudible 00:37:03], so we'll say Replace with New, and I'll label it "replace pump B." Now, one thing you might be thinking is: what happens if I go to do my preventive maintenance and, even though everything is staggered, for whatever reason pump A is down, maybe because it had a failure? I wouldn't necessarily want to take down pump B for preventive maintenance right then and take down the whole system. What we're going to use is a hidden variable called Simulation Context. Simulation Context is the way I can look into my system. We'll put that check between my Schedule and my actual preventive maintenance. So let's get rid of that connection and implement it with an If. Let me connect it, and we'll call it "if pump A is working." Now the question is, how do I tell whether pump A is working? In my If condition, I need to reference the hidden variable associated with whether pump A is active or not. The way I do that is with Simulation Context (case does not matter). The system variables in Simulation Context are stored as an associative array, and this one is called "Status of Pump A." Now, what I have found is that this particular part of the dialog box is not quite as flexible as a script window; if I've got something more complex to build, I'll usually write it in a script window, copy it, and paste it into this box. But this one is relatively small, so we'll just type it in: we're going to test whether the status is "Active." Now, how did I know that's what I have to use? As it turns out, this If condition, and a little later on the custom stress sharing, act like scripts. What I can do, and what I've often done, is put in a Print statement to see what is available to me. So let's put in a Print statement and say Print Simulation Context. The only thing we need to be sure of is that the last statement in the If condition evaluates to true. Just to show you what happens, let's go ahead and run this; I'm going to run it just to give you an idea of what gets printed to the log.
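Before we look at the log, here is the gist of that condition written out as plain JSL. The key name "Status of Pump A" and the value "Active" are the ones used in this demo; on your own diagram the available keys are whatever Print( Simulation Context ) writes to the log, so treat the names below as placeholders.

    // Optional: dump every variable RSS exposes for this diagram to the log.
    Print( Simulation Context );

    // The last statement must evaluate to 1 (continue) or 0 (stop):
    // only run the PM on pump B if pump A is currently up.
    Simulation Context["Status of Pump A"] == "Active";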
Let me go ahead and grab my log; it's over on one of my other screens… Here we go, there's my log, and this is what got printed. It gives me information on all of the variables that are available in Simulation Context for this particular diagram. If I had put more blocks, more actions, more events in there, information on those would show up too. This is a good way to tell what's in my system and which variables I have access to. The important thing to remember is that I can read this, but I cannot write back to these variables; I don't have the option to change things in my simulation this way, but I can use the information. Going back to our last example, I could have put an If statement in there and said: if my backup was active, then turn it off and do the maintenance. Okay, now for the sake of time, I'm going to move on to the last bit of information I want to share with you. By the way, I will have a journal and a write-up available with more examples that I didn't have time to cover; you'll see a couple more of the actions we didn't get to, how they get implemented, and some of the other things that might not be terribly straightforward. Custom stress sharing is a little unusual in the sense that, by default, stress is shared equally among the subunits, and it's implemented by altering how the block ages, not by explicitly altering the distribution. With no stress sharing, say I have n subunits in parallel, the first one fails at time t1, the second at t2, and so on. With stress sharing, the way the block ages is that the first failure ages the block by n times t1, n being the number of working units; the second failure ages the block an additional (n minus 1) times the difference between the first and second failure times; the third by (n minus 2) times the difference between the second and third failure times; and so on. That is how the block ages, and this n is the information that gets passed to the custom function. When I build this into my own stress sharing... let me share this last example of how it's set up. Let me find my example. Here I've got three different stress-sharing blocks. There's my basic stress sharing, set up as Basic. Here's my custom one, and this n is the n I'm talking about. You'll notice it's embedded in this Log function; it's done that way so that the aging of the block matches up with the distributions I can use for my custom stress-sharing block. What I do is think of things in terms of my earlier explanation: the first unit ages the block at some multiplier times the time, the second at some multiplier times the difference between the first and second failure times, and so on. Before I go on, let me briefly say that, as with the If block, I can put Print statements in here. If I were to put in a Print statement here, you'll notice it gets executed before the system starts, so it's going to print five, because in this case the block has got five units. Then when the first one fails, it goes back into the function and prints four, and so on and so forth.
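Written out, the default aging rule just described is simply (with n working units, ordered failure times t1 < t2 < ... and t0 = 0):

    $$\text{age after the }k\text{th failure} \;=\; \sum_{j=1}^{k} (n - j + 1)\,(t_j - t_{j-1}) \;=\; n\,t_1 + (n-1)(t_2 - t_1) + (n-2)(t_3 - t_2) + \cdots$$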
Let's say I wanted to set this up as a custom stress sharing where, while all five units are working, the block ages at 10 times the rate rather than 5; after the first failure at 5; then 2.5; then 1; and finally 0.5 when only one unit is left. What I could do is, again, put this here; let me add a little space and make my box a bit bigger. I might write something like this: again, that Log has to be in there, so Log of something like Choose. What Choose does is, based on n, return the corresponding value: with one working unit it returns a half, 1 with two units, 2.5 with three units, 5 with four units, and finally 10 with all five. That's how I would create my custom stress sharing. I'm not limited to the Choose function; in this particular example I'm using a function of n. Again, the important thing is that this all gets embedded in the Log function so that the aging of the block and the distributions match up. Okay, unfortunately, that's all I have time to cover today. I wish I had more time; there are certainly a lot of other things we could talk about, and those will be in the materials I've provided for you. I'd love to hear your feedback, questions, and comments. Again, thanks for taking the time to watch this. I hope it's beneficial to you. Thanks.
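For reference, here is a minimal JSL sketch of that custom stress-sharing expression, under my reading of the demo: the expression receives n, the number of currently working subunits, and the result must be wrapped in Log() so the aging lines up with the distributions RSS uses.

    // Custom stress-sharing multiplier, chosen by the number of working subunits.
    // Choose( n, a1, a2, ... ) returns its n-th value argument, so with one unit
    // left the multiplier is 0.5, with two it is 1, ..., with all five it is 10.
    Log( Choose( n, 0.5, 1, 2.5, 5, 10 ) )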
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
“Trust Me, I Researched It Online”: Exploring the Bias in Search Engine Results (2022-US-30MP-1131)
Monday, September 12, 2022
Peter Hersh, JMP Senior Systems Engineer, SAS; Hadley Myers, Sr. Systems Engineer, JMP. When collecting data for an analysis, we are all very cognizant of the need for a sample that is unbiased and truly representative of a greater population. Great efforts, often at great expense, are taken to ensure that is the case. However, this standard is not always applied to other forms of data collection. For many, research into topics of interest starts and ends with online searches. Using a designed experiment and the visualization and analytic capabilities of JMP 17, we sought to investigate how different search engines in different parts of the world are potentially biasing search results and, therefore, the conclusions we respectively reach on these topics. Join us for this amusing and thought-provoking presentation that you should totally rate five stars prior to viewing to save time. Thank you all for clicking on this talk and coming to watch it. This is really about bias in data. Every analyst who works on a project understands the importance of ensuring that the data they collect is unbiased. Steps are taken to avoid bias at the start of data collection, before the project has even really begun; there are numerous checks at points along the analysis; and at the end, any conclusion reached is taken in the context of potential additional sources of bias. But this same level of care isn't applied to online searches on topics of interest. Search engines use algorithms that are designed to deliver personalized content that is relevant for us as individuals. This has advantages: it means that returned search hits are more likely to be relevant and interesting. But it also has disadvantages: by definition, these results are not unbiased. We have an example. Yeah, Hadley brought up a great point there: in science and engineering, we take great care to make sure that our samples are unbiased. But let's think of a library. Two people walk into the library interested in informing themselves on vaccine safety, and each asks a librarian for books on vaccines. The first person receives three books, and these are actual books: Smallpox: A Vaccine Success; Anti-vaxxers: How to Challenge the Misinformed Movement; and Stuck: How Vaccine Rumors Start and Why They Don't Go Away. Now let's say a different person walks in and receives three completely different books: The COVID Vaccine: And Silencing of Our Doctors and Scientists; Jabbed: How the Vaccine Industry, Medical Establishment, and Government Stick It to You and Your Family; and Anyone Who Tells You Vaccines Are Safe and Effective Is Lying. These are actual book titles. Let's say that who you are, where you live, how old you are, your gender, maybe even your browser history, determines which of these sets of books you get. That is essentially the problem with bias when you go to search for things online: it may be that before we even start looking at our search results, there is already bias in there, and we want to understand whether that's the case or not. That's what motivated this. Any thoughts on that, Hadley? Well, the thought I'd like to express right now is that the purpose of this presentation isn't to judge or to opine on the advantages or disadvantages of the search algorithms that may or may not be used.
The purpose here is really just to take an example of complex, unstructured data (complex because it is unstructured), in this case search results, and then to use some of the exploratory, visual, and analytic capabilities found in JMP Pro 17 to try to understand what we were seeing and to present it in a way that helps you understand it too. The purpose of this presentation is to inspire you to try these techniques, and others like them, on your own data. Let's go through the methodology briefly. Pete and I came up with a few search terms we thought would lead to interesting results; you can see those terms here. We defined some potential input variables which may or may not be affecting the search results. We know there are very likely others we didn't include; that is true of any designed experiment, you can't capture every variable, but we took a few and we'll see whether they are significant or not. We developed a data collection procedure whereby we used the MSA design in the DOE menu. This is a convenient way to create tables that we can then send out to JMP SEs and friends of SEs. Right away, this isn't an unbiased, random assortment of people: they all work for the same company and have the same job title. As we said, the purpose is really to show the techniques and methods we used to try to understand the data, and then to think about how you can apply them yourself. We explored the results, which we'll show you, and finally we presented the findings at JMP Discovery Summit Americas 2022, which is what you are watching right now. Without further ado, let's jump into the data. I'll start by talking just briefly about the MSA design you see here. What we've done is add the factors of interest and the search terms we were looking at. The nice thing about this is that when we make the table, we can press this button, send the worksheets out to everybody who needed to complete the results for us, get them back, concatenate them, and then we're ready to begin our analysis. But as any analyst who has ever collected data and tried to analyze it knows, the data very often isn't in a format where you can immediately start the analysis; some cleaning needs to be done. I'll pass things over to Pete to talk about that. Pete? Yes, great point. I think everybody has gone through this: even with a well-designed DOE, you oftentimes have to make some adjustments before the analysis. Hadley showed those operator worksheets, and here is one that I filled out; I'm not going to keep myself anonymous, but I didn't want to share someone else's results. Just to give you an idea, we had folks answer a few demographic questions that hopefully weren't too revealing: basically where you were located, how old you are, and then the search term you used. Like Hadley showed, there were three responses: we had people do the search and then record the top three responses that the search engine recommended. To do the analysis, the first thing we had to do was take these three and bring them together. A nice, easy way to do this is to go under the Cols menu, Utilities, Combine Columns. I just called the result "Responses," set a delimiter, and unchecked Multiple Response, because we're going to do text analytics on this.
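If you prefer to script that step, one equivalent way to build the combined text column (not necessarily what was clicked through in the demo, and with hypothetical column names) is a small concatenation formula:

    // Hypothetical column names; adjust to match your worksheet.
    dt = Current Data Table();
    dt << New Column( "Responses", Character, "Nominal",
        // Join the three recorded search results with a comma delimiter.
        Formula( :Response 1 || ", " || :Response 2 || ", " || :Response 3 )
    );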
Then you get this, which is the column we're using for Text Explorer. I did this and then brought in all of the results from all the different people who took the survey, and tweaked it a little more: combining whether you were in the US and which state you were in, and then summarizing that into a region, Americas versus Europe, because we didn't have enough respondents to break it up by state. In the end, we end up with a table that looks like this. We had to do a little recoding, a little filling in, and then anonymize the search engine; the folks who got the survey knew which search engine to use, but we're not sharing that here. Hadley is now going to talk about some of the results we saw once we had the data in this form. Let's open up this dashboard. What you're seeing are the most popular terms, in descending order of popularity from left to right, for the first response, second response, and third response, across every one of the search terms, every gender and age, and all of the other factors. We can use the hierarchical filtering on the dashboard to explore this a little closer and see if we can learn anything. One thing I happened to notice: if we look at "the world is" and we click on male, you'll see that for many people the first result was "the world is not enough"; if you're female, you're equally likely to find "the world is your oyster." Interestingly, if you're less than 40 years old, "the world is not enough" suddenly becomes "the world is yours." I think we could probably agree that's true for people under 40, isn't it? What else have we got here? If I look at climate change, another hot topic of interest these days, as well it should be: if I look at people over 50, apparently a huge concern for them is whether climate change is changing babies in the womb, which, interestingly, isn't a concern for people below 40. I wondered whether this is a valid concern for people over 50, whether they're more likely to have their babies changed in the womb or not. But aside from that, let's take a step back and see how to create this dashboard. It's quite simple. The first thing we need to do is create our filter variables; I've done that here. Here are our search terms and our distributions. I'll go through how to create the Graph Builder report, because that's something you may not be familiar with and might be interested in doing. I take my first response, put it here, and simply choose the count of occurrences. Then I can right-click and order by count, descending. That's it. I've done the same thing for my second and third responses. Now we can put together the pieces of the dashboard: click New Dashboard, choose the hierarchical filter plus one template, take my distribution results and put them there, my input parameters there, and then my graphs. Let's see, is this one first? Well, I can't tell; we'll just put them in this order, and I can always change it later. All right, I'll run the dashboard, and there we have it. It really is as simple as that, and then I can save it to the table. That was one use of a dashboard. I'll show you another use of a dashboard, which is to combine it with a Text Explorer word cloud.
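Stepping back to that first count chart for a moment: the same bar-of-counts view of one response column can also be produced in script form. A minimal sketch (the descending-count ordering was done interactively in the demo by right-clicking, and the column name is hypothetical):

    // One bar per distinct first response; bar length is the number of occurrences.
    Graph Builder(
        Variables( X( :First Response ) ),
        Elements( Bar( X ) )
    );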
These are the most common individual words, not entire phrases or entries. You can see the word "design" seems to be used a lot. If I look at "statistics is," for example, it looks like everybody can agree that statistics is a science. Interestingly, if you're in Europe, apparently you find it harder than you do in America, where that word doesn't come up; something I happened to notice there. To create this dashboard, it's very much the same as the other one: we add our distribution items, here's the first one, here's the second one, we add our Text Explorer word cloud, and then we simply put it together just as we did the previous one. With that, I'd like to thank you for this part of the presentation about the exploratory visual analysis; I've shown you how you can go about doing this using the hierarchical dashboards. Now I'll turn things back over to Pete, who will take us through some more in-depth use of the Text Explorer. Perfect, thanks, Hadley. Like Hadley mentioned, this is a different way to display it, but this is the end result of using the Text Explorer and looking just at the word cloud. He made this a dashboard and used filters that were graphical in nature, which is great; you could also do this with a local data filter. But this is basically the end result we're going for. Let's now back up and talk about how we got here. With our data set over here, we just launch the Text Explorer under the Analyze menu and put in the column we're interested in, in this case all three responses combined into one column. We have a bunch of options we can use to tweak this, including language and how we tokenize the words, but we're going to go ahead and just use the defaults. Here you can see that, since we have different responses to different search terms, the overall term and phrase list by itself is not super informative. What we want to do is apply that local data filter, and the first thing we'll filter on is the search term. Now we can look at something like the economy, or coronavirus, or climate change, and go from there. Let's focus on climate change. One thing I wanted to do was add some sentiment analysis. First, I'm going to turn on the word cloud so it looks like it did before. We can display it this way, where you can see the most common terms are "climate" and "change." We know we're searching for that, so I could go in and add those as stop words and now see which words come up most frequently when climate change is mentioned. This is one way to display the word cloud; I can also change it to something a little more appealing to the eye, but maybe less useful from a quantitative standpoint, and you can always add some arbitrary colors if you like that as well. All right, so I've gotten to this point, but now I want to add some sentiment analysis: are people saying climate change is natural, a good thing, or a bad thing? You can see some things in here that maybe indicate that, but I wasn't quite sure where to find sentiment analysis. With JMP 17, we have a new feature called Search JMP. If you're ever looking for an analysis in JMP, this is a great way to find it. If I just start typing "sentiment," you can see right here that it tells me how to find it; I can open the help, but I can also just hit Show Me and it launches it right there.
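The corresponding launch in script form is short. A minimal sketch, assuming the combined column is called Responses and keeping the default tokenizing used in the demo:

    dt = Current Data Table();
    // Launch Text Explorer on the combined response column.
    te = dt << Text Explorer(
        Text Columns( :Responses ),
        Language( "English" )
    );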
If I'm ever wondering how to do something, this gives me the option. Now, a couple of things you see here: it has identified some default terms that carry sentiment, things like "good." If I click on "good," I get a little summary, and it looks like when people say "good," that is indeed a positive sentiment. Now, what about "greatest"? Oh boy, almost everything that says "greatest" is "greatest threat." Maybe that's not actually a positive sentiment; we might need to do a little tweaking. First, let's note that "greatest threat" is a phrase we're seeing commonly, so I'm going to add it as a phrase; you would do this in your curation process, and you see that it goes away from the term list. But I think "greatest threat" is actually a negative thing, so let's look at the sentiment terms. You can see JMP has identified it as something that may carry sentiment, and I'm going to say: that's a really negative sentiment. Now when we go down here, you can see it has flagged those seven occurrences where "greatest threat" was mentioned and scored them as very negative. That changes our overall impression of whether most of these search results treat the topic as negative or positive. That's just an example of how you can walk through that flow and come up with the final sentiment analysis. I'm going to pass it back over to Hadley to wrap things up. What I'd like to say is that we showed you, first of all, how we used the MSA design to help with the data collection. We used Recode and other items in the Tables menu to help with the data cleanup. We then used Distribution, Graph Builder, Text Explorer, and combinations of all of them together to help with the data exploration, to see if we could uncover anything interesting. Then Pete used sentiment analysis, together with Search JMP in JMP 17, to see what else we could learn about the data as a whole. With that, I hope you found this useful and that it has given you some ideas about how you can do this on your own data. I'd like to thank you all for listening, and I hope you enjoy the rest of the JMP Discovery Conference. Thank you.
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
How to use JMP in pharma process validation and continuous process verification (2022-US-45MP-1130)
Monday, September 12, 2022
In the pharmaceutical industry, the three-PQ-batch concept is being replaced by continued process verification. However, there is still a validation event (stage 2) prior to going into commercial manufacturing (stage 3). In stage 2 you are supposed to prove that future batches will be good, and in stage 3 that the validated state is maintained. JMP has the ideal toolbox for both stages. If the process can be described with a model, prediction intervals with batch as a random factor can be used to predict the performance of future batches; from version 16, prediction intervals also work with random factors. To live up to the assumptions behind the model, JMP has an excellent toolbox for the sanity check. The prediction limits obtained in stage 2 can be used as the first set of control limits in stage 3. A JMP script that combines the necessary tools described above to calculate control limits, prediction limits, and process capability in stages 2 and 3 will be demonstrated. The script has been heavily used by many customers; it only requires that the data can be described with a model in JMP. My name is Per Vase from NNE, and together with my colleague Louie Meyer I will present how you can use JMP in pharma process validation and, leading on from that, how you can do continuous process verification afterwards. We both come from a company called NNE; we do consultancy for the pharmaceutical industry and we work with data science, so we are actually trying to create value from data at our customers. Of course, we are extremely happy that we can use JMP for that, and this is what we are going to demonstrate today. The background for the presentation is process validation. This is a very important issue in the pharmaceutical industry: before you launch a new product, you have to prove that you are capable of producing it. Classically this has been done by just making three batches; if those three batches were okay, you had proved that you could manufacture consistently and you were ready to launch the product. But of course everyone, including the [inaudible 00:01:06], has found out that being able to make three batches is not the same as all future batches being good. What is expected from a validation today is not only to show that you could make three batches; you are supposed to make a set of batches and, based on that set, prove with confidence that all future batches will be good. So instead of predicting the past, we now have to predict the future. This is really a challenging thing in the pharma industry, because for many years they have been used to just making these three batches. So how to do it now? We are helping many customers with this, and what we strongly recommend is simply: predict the future. How can you predict the future? Well, you can do that in JMP with prediction intervals; that's what they are for. You collect some data and, based on the data you analyze, you can predict how the rest will look. If you can build a model that describes your validation data set, and you put your batches into this model as a random factor, the prediction limits actually cover not only the batches you have made but also the batches you haven't made yet, and thereby you can predict the future. At the bottom of the graph you have stage 1; this is where you design the new process for making the new product. Stage 2 is the classical, old-fashioned validation.
Where we previously made three batches, now we are making a set of batches, and based on that we will predict that future batches will be good. If you have a lot of batch-to-batch variation, it can be hard to predict with just a few batches, due to the width of the t distribution with few degrees of freedom, so it might take more than three. But we still strongly recommend making three batches in stage 2, because otherwise patients have to wait too long for the new product; if three is not enough, you can make the rest in what we call stage 3a. At a certain point, hopefully, your prediction limits will be inside the specification limits, and then you have proven that all future batches will be good and you can go into what we call stage 3b. That means you don't have to do as much end-product testing any longer; you could measure on the process instead of measuring on the product, because you have proved that all future batches will be good, and you only have to prove that you can maintain the validated state. That can give you a heavy reduction in your future costs; some customers get up to 70% reductions in future costs after they have proved that all future batches are going to be good. Today we will demonstrate how easily this can be done in JMP. So what is validation all about? It is the prediction of future batches. Previously we just made three batches; now we have to predict the future batches. How can you predict that with confidence? Because when you do validation, you have to prove with confidence that things are fine. Well, you can just use prediction intervals; they are also called individual confidence intervals in JMP for the same reason. Or you can go for tolerance intervals, if you want. How many batches in stage 2? We actually recommend going with what people are used to, which is three. But you can only pass with three batches if your control limits are inside specification; control limits are actually just prediction limits without confidence. I will show you how to calculate them, and if they are inside, the best guess is that the process is fine. If your prediction limits are outside, you might need more batches, but you can make these after you have gone to the market with your product. How many batches should you make in 3a? Very simple: until your prediction limits are inside the specification limits, or, if you prefer, until the corresponding Ppk is high enough. I will show how you can convert these limits to a Ppk. When that is passed, you can go to stage 3b, because now you have proved that all future batches will be good; you just have to prove that you can maintain the validated state. Typically that can be done by measuring on the process, which is easier to do in real time than end-product testing, so that's a huge benefit a lot of people harvest by doing it this way. Here is a small flow chart for how it works. You start out with your validation runs and calculate your prediction limits. Just be aware that until version 15 they were not right: the degrees of freedom came out too low, giving too-wide prediction limits. It was fixed in version 16, so please use JMP version 16 or newer for this. Then, if these prediction limits are inside the spec limits, you have passed both stage 2 and 3a, and you are ready for continuous process verification, measuring on the process instead of measuring on the product. If it turns out that the prediction limits are too wide, then look at your control limits.
If the control limits are within the spec limits but the prediction limits are outside, the most common course is simply to collect more data, but you do that in stage 3a, after going to the market. Just recalculate the prediction limits every time you have a new batch, and hopefully, after maybe three or four extra batches, they are inside your spec limits and you are ready to go. If your control limits are also outside specification, then you are actually not ready for validation, because the best guess is that the process is not capable; you have failed your validation, you need to improve the process, and you do everything again. Hopefully we don't end there, but of course it can happen. Just briefly about the methods we use to calculate these control limits, prediction limits, and tolerance limits. For control limits, you can of course get control limits in JMP, but that is really just for one known mean and one known standard deviation. Often you have more complex systems with many means, different cavities in the process for example, and several variance components between sampling points; typically you have more than one mean and one standard deviation. But then you can just build a model that describes your data set, take the prediction formula, save the standard error of the prediction formula and the standard error of the residuals, and then, instead of multiplying by a t quantile, just use the normal quantile, which corresponds to taking the confidence back out. This is how we calculate control limits based on a model: simply from the prediction formula, the standard error of the prediction, and the standard error of the residuals, which we can all get from JMP; then it's pretty easy to calculate the control limits. It's even easier with the prediction limits, because they are ready to go in JMP; they're already there, and we just use the limits calculated by JMP. As I said before, be careful: before version 16 the limits simply got too wide, so please use version 16 or newer for this. For tolerance limits, we have a little bit of the same issue as with the control limits: JMP has tolerance limits, but only for one mean and one standard deviation, not for a model. But it's actually pretty easy to convert the classical formula for a tolerance interval to take a model: just replace the mean with the prediction formula, and replace N minus 1 with the degrees of freedom for the total variation, which we calculate from the width of the prediction interval. Enter this into the classical formula and suddenly we also have a tolerance interval in JMP that can handle within- and between-batch variation. Then we are ready to go with the mathematics. Some customers prefer not just to see the limits inside specification; they want them deep inside specification, corresponding to a Ppk bigger than one. But it is very easy to take the limits and the prediction formula and convert them into a Ppk with the classical Ppk formula. If you put in the prediction limits, you get a Ppk with confidence (the same applies with the tolerance limits), and if you put in the control limits, you get a Ppk without confidence. We do it both without confidence and with confidence.
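Written out, one way to read the calculations just described (a schematic sketch consistent with the talk, not a quote of the script's exact formulas): with $\hat{y}$ the model's prediction formula and the total standard deviation built from the two saved standard errors, the prediction limits and model-based control limits are

    $$\hat{y} \pm t_{1-\alpha/2,\,\nu}\,\sqrt{\operatorname{SE}(\hat{y})^2 + \hat\sigma^2_{\text{resid}}} \qquad\text{and}\qquad \hat{y} \pm z_{1-\alpha/2}\,\sqrt{\operatorname{SE}(\hat{y})^2 + \hat\sigma^2_{\text{resid}}},$$

where the control limits drop the confidence by swapping the t quantile (with effective degrees of freedom $\nu$) for a normal quantile. And, assuming the limits stand in for the usual $\pm 3\sigma$ in the classical capability formula, the conversion of limits to Ppk is

    $$P_{pk} = \min\!\left(\frac{USL - \hat{y}}{UL - \hat{y}},\ \frac{\hat{y} - LSL}{\hat{y} - LL}\right),$$

with UL and LL taken as the prediction (or tolerance) limits for a Ppk with confidence, or as the control limits for a Ppk without confidence.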
Without confidence is the one we recommend using in stage 2, and the one with confidence is clearly for stage 3a, where we have to prove it with confidence. So life is easy when you have JMP and a model that can describe your validation data set. But we have to be aware that behind every model there are assumptions that need to be fulfilled, or at least justified, before you can rely on the model. You need to do a sanity check of your data before you can use the model to calculate limits. As you know, JMP has very good possibilities for this, and I will now go into JMP and show how this works. Here is a very well-known data set in the pharma industry, published by the International Society for Pharmaceutical Engineering (ISPE), where they put out a data set from the real world and said, "This is a typical validation set; please try to calculate on this and see how you would evaluate it as a process validation data set." If you start looking at the data, you can see we have three batches, the classical three batches of the old-fashioned way. We are making tablets here, and we are blending the powder. When we blend the powder, we take out samples at 15 systematic positions in the blender in all three batches, and every time we sample, we take more than one; we actually take four. So we have three batches, 15 locations, and four samples each time. This is a data set from real life that was used for validation. If you look at the data, it's fairly easy to see that for batch B the within-location variation is much bigger than for batches A and C, and if you put a control limit on your standard deviations, B is clearly higher. You can also do a heterogeneity-of-variances test, and it will be fairly obvious that B has significantly bigger variance than A and C. So we cannot just pool this, because you are only supposed to pool variances that can be assumed to be the same. So what to do here? The great thing about JMP is that you can do a log-variance model: you can go in and put batch in as a variance effect on top of having factors to predict the mean. Again, you can clearly see that batch B has bigger variance than A and C, and we need to correct for that. So how do you correct for that in a model? Because I would now like to put in random factors, but the log-variance model in JMP does not support random factors. What I do instead is go up here and save my variance formula, so I get a column with the estimated variances. Then I can just do weighted regression with the inverse variance as the weight, the classical way of weighting when you are doing linear regression. So it's very easy to correct for batch B having higher variance than A and C, just by doing weighted regression. Being aware of that, I can then start looking at whether I need to transform the data; you have the Box-Cox transformation in JMP for that. It's very easy: just make an ID that describes the combination of batch and location, look at the pure within-group variation, pool this across the groups, and you get a model like this. You can see I weighted it, because I had to, to correct for batch B versus A and C. You can then see that there are no outliers, and you can also look at Box-Cox: there is no need for transformation. So that is also easily justified working this way. Then I am ready to make my model, and when I make it, I will put in my batch as a random factor.
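For readers who want to reproduce this kind of model in script form, here is a minimal sketch. The column names are placeholders for an ISPE-style table, and the effect syntax is the standard Fit Model JSL form; the exact options your own data needs (emphasis, report contents, and so on) may differ.

    dt = Current Data Table();
    // REML mixed model: Location fixed; Batch and Location*Batch random;
    // inverse-variance weights saved from the log-variance step.
    fit = dt << Fit Model(
        Y( :Assay ),
        Weight( :Weight ),
        Effects(
            :Location,
            :Batch & Random,
            :Location * :Batch & Random
        ),
        Personality( "Standard Least Squares" ),
        Method( "REML" ),
        Run
    );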
I might also put in location times batch as a random factor; it should be random because I would like to predict future batches. There is another assumption here, because when you put in random factors you are assuming their influence is on average zero and normally distributed. How do you verify, or at least justify, this? Just go and look at the BLUPs, the predicted random effects. You can make them into a data table and test them, which is what I have here: I have saved my batch BLUPs and my interaction BLUPs and fit distributions to them. I can see that the batch effects can easily be assumed normally distributed, and the same goes for the interaction effects. So, in a very short time, you have all the sanity checks you need to justify that you are ready for the analysis. Now I have my model, which I am ready to go with, where I have batch and location-by-batch as my random factors and location as my fixed factor. Now I will hand over to Louie, who will show how to go on with the script on this data file. As Per has already shown, JMP offers a lot of tools to deal with many of the problems you face when validating, for example, PPQ batches. One of the things he has shown is that most processes cannot be described by just one mean and one standard deviation. For this we really like the Fit Model platform, which allows us to do exactly that by including systematic factors for multiple means and/or random factors for multiple variances. Then we often see that the data requires transformation; here the Fit Model platform already has the Box-Cox transformation. Furthermore, data sets almost always contain some kind of outliers; for this, the Fit Model platform also has the studentized residual plot, which is a great way of dealing with outliers. Then, in cases where we see a lack of variance homogeneity, JMP also has log-variance modelling, which we use to find the variances and convert them into a weight in a regression model. And out of the box, JMP of course also has the prediction intervals, the individual confidence intervals, directly from Fit Model. There were also some problems we faced where JMP didn't take us all the way but did the groundwork, and this is what we are trying to address with our script: making it much easier. One of the primary reasons we developed the script is that, with the customers we see, these calculations and visualizations are often done manually by employees, and whenever humans are involved there is a high risk of human error. That is what we are trying to battle by automating all of these subsequent steps, both the analysis and the visualizations. Then, of course, there are the control limits, where JMP can only handle one mean and one variance component. This is also solved in the script from the Fit Model platform, using the prediction formula instead of the mean, and using the total variance from the model, calculated from the standard error of the prediction formula and the standard error of the residuals. Then we also often meet customers who prefer tolerance intervals, due to their separate settings of both confidence and coverage. However, as briefly described, JMP's tolerance intervals can only handle one mean and one variance component.
So in the script we also calculate tolerance intervals from the Fit Model platform, using the same number of effective degrees of freedom as used in the prediction intervals. Lastly, the script also includes the calculation of capability values, more specifically Ppk values, as many customers require these. We calculate them with confidence, using the prediction limits as input to the calculation, and we also calculate them without confidence, using the control limits as input. So, to go to the script: we have packaged most of these subsequent steps, after the sanity check, into a single script. The script itself takes three inputs: the original data file, the model window, and a template document. What the script does is feed information from the data file and the model window into the template document. From the data file we take the metadata and the data itself, and from the model window we take the model parameters and model results. This is also why it is important to stress that the model used as input should be a completely finished model. This means you have already done all the sanity checks Per just introduced us to, and you have done your model reduction; the sanity check includes things like outlier exclusion and data transformation as well. One could argue that we could do this in the script too, obeying some hard statistical rules, but we also think working with the model yourself is a very useful exercise: you get much better insight into the data, and you have to apply some process knowledge to get a better model. The template document itself is inspired by the app functionality in JMP; we really like scripts where each user can actually interact with the script itself. The template document is essentially a JMP data table with no data in it, but with a lot of columns with predefined formulas referencing each other. This allows the user to go in and backtrack all our calculations through the columns in this template file, leveraging, for example, the Formula Editor in JMP, and potentially add new columns if they want to see new calculations, or edit some of the calculations we are doing. Furthermore, all the visualizations we provide to the user are also stored as table scripts in the template document, so users can go in and edit, for example, the layout of specific graphs if they want new colors or something like that. I think we should take a look at how it actually looks in reality. We will jump into JMP here, and I will find the same model Per just finished. We are again looking at the ISPE case as Per showed, and I will find the model he ended up with. Now that we have our model, we have done the sanity check and everything, and we have our original data file behind it, we simply run the script, and it presents us with three things. First of all, we have the template file, now filled with all the data we need, with the two graphs we are showing right here. Then we have the Ppk graph, giving us a Ppk for each location in the blender; we have a total of 15 up here, with the Ppk based on control limits in blue, based on prediction limits in red, and based on tolerance limits in green.
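To illustrate the idea of a template table whose columns reference each other by formula (the actual column names in the NNE script are not shown in the talk, so the ones below are purely hypothetical), one such column might be defined like this; the user could then open the Formula Editor on it to backtrack or change the calculation:

    dt = Current Data Table();
    // Hypothetical template column: an upper control limit built from two
    // other columns that the script fills in (prediction and total sigma).
    dt << New Column( "Control Upper Limit", Numeric, "Continuous",
        Formula( :Prediction + 3 * :Total Sigma )
    );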
Lastly, we have the graph here. This graph shows our data along with all the limits we have calculated in the script. What we see is that batch A and batch C are performing well for all locations, meaning their prediction limits, tolerance limits, and control limits for that matter are inside the specification for all locations. However, if we look at batch B, where Per also found the variance to be much larger than for the two other batches, we see that it is in fact going outside specification for all locations when we look at the prediction limits; not necessarily both specification limits, but either the lower or the upper. We see many customers here just computing some average limits across A, B, and C, but we don't see that as an option, because for us this tells us that if future batches behave like batch A or C, we are okay: we can say with confidence that observations of future batches will fall inside specifications. However, if future batches behave like batch B, we cannot guarantee that all observations of future batches will fall inside the specification limits. I will also use this time to go through an example where we take the whole approach from the flow chart into consideration and show how we would actually do a PPQ validation using the script. We have brought a customer example here; this is an actual customer who has been through PPQ validation and has now produced all the batches. We will quickly use a local data filter to look at just the first three batches they produced in stage 2. Here we have our model, and we run the script. What we see is that we find ourselves in a situation where both the prediction limits and the tolerance limits are outside spec, and in fact quite far outside. But the best guess, represented by the control limits, is that we are inside specification. What this tells us is that we feel safe enough concluding that we have passed stage 2 and can now go into stage 3a, which actually enables the customer to put their product on the market. It's important to stress that even though we go from stage 2 to stage 3a, we do not reduce the [inaudible 00:23:22] in the production setup we have; we are still fairly certain, really certain actually, that no bad products should enter the market. What the customer would do next is simply produce the next batch, batch four here, include it in the model, and run the script again; of course you have to do the sanity check of the model again, including the new data. What we see now, including the fourth batch, is that we are in fact in the same situation as before; however, both the tolerance limits and the prediction limits are moving much closer to the control limits, and, even more important, they are moving in towards the spec limits, which is what we really want to see in this situation. The reason is that we actually know the within-batch variation very well; however, with only three or four batches, we don't know the between-batch variation that well, and this is what gives us these wide limits. So this would be an iterative approach: produce one batch, analyze the results, and keep going until you find the limits inside. Now let's include the two last batches and run it again. What we see now is that we have all our limits inside specification.
Looking at a graph like the one shown here, you could still be concerned that your prediction limits are this broad and that the tolerance limits are very close to spec. So if we just go back to the situation where we have three batches, one question could be: are we actually safe when we have prediction limits this broad? Could we end up sending bad products to the market? Here there are two very important points. First of all, at this stage we still have a very heavy product QC, which will ensure bad products do not go to the market. Furthermore, we have to remember that we are not trying to say something about the specific batches we produced; we are trying to say something about future batches.

If we want to assess the performance of these individual batches, batch by batch, what we have to do is go back to our model. We go to the model dialog here, and we do not include batch as a random effect; we use it as a fixed effect instead. I'll just apply the local data filter, and if we run the script like this, we will actually see how well we know the individual batches. Here we see, compared to the other graph, which I will just try to put up here, that we now have much, much narrower limits for each batch, telling us that we don't have to fear that any observations within either of these batches are actually outside specification. So it is the combination of this and the heavy product QC that makes us confident we can let the products from these three batches go to the market, as long as our control limits are still within specification.

So just to conclude, what we have been looking at here is how we, in validation, can use JMP to predict that future batches will be okay. We have seen that if you can describe your validation data set with a model, you can actually predict the future with confidence. This can be done either by using the prediction intervals or the tolerance intervals. What we have made here is a script that automates the visualization of prediction intervals and also the Ppk values. It calculates and visualizes tolerance intervals using the same number of effective degrees of freedom as used when calculating the prediction intervals, and it also calculates and visualizes the control limits.

But it is important to remember that before using the script, you have to go through the sanity checks Per presented. Here JMP already offers a lot of possibilities to justify the assumptions behind the model. This includes variance heterogeneity: if variances are not equal between the batches, as we saw here, we can use the log-variance model to find a weight factor and, through this weight factor, make a weighted regression model instead. We can ensure that our residuals are normally distributed, and if they're not, we can use the Box-Cox transformation to transform our data. We can, through the Fit Model platform, also evaluate potential outliers through the studentized residuals; if there are any outliers, we always exclude these from the model itself.
Then, at last, we can justify whether or not our random factors are normally distributed through the BLUPs. Again, if we find a level here that does not match, we will, as for the outliers in the studentized residuals, exclude these from the model as well. Thank you.
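As a rough companion to the sanity checks listed above, here is a Python sketch done outside JMP with statsmodels and scipy, using made-up data and hypothetical column names: a weighted regression standing in for the log-variance approach, a Box-Cox transformation, externally studentized residuals for outlier screening, and BLUPs from a random-intercept model (batch as a random rather than fixed effect) for the random-effect check.

```python
# Illustrative sketch only (not the presenters' JSL script): the same kinds of
# sanity checks done outside JMP with statsmodels/scipy on made-up data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "batch": np.repeat(["A", "B", "C"], 15),
    "assay": np.concatenate([
        rng.normal(100, 1.0, 15),   # batch A
        rng.normal(100, 3.0, 15),   # batch B: clearly larger variance
        rng.normal(100, 1.0, 15),   # batch C
    ]),
})

# Fixed-effect view of batch: dummy-coded batch columns in an ordinary fit.
X = sm.add_constant(pd.get_dummies(df["batch"], drop_first=True).astype(float))

# 1) Variance heterogeneity: weight each observation by the inverse of its
#    batch variance (a simple stand-in for the log-variance model in JMP).
batch_var = df.groupby("batch")["assay"].var()
weights = 1.0 / df["batch"].map(batch_var)
wls = sm.WLS(df["assay"], X, weights=weights).fit()

# 2) Non-normal residuals: Box-Cox transformation of the (positive) response.
transformed, lam = stats.boxcox(df["assay"].to_numpy())

# 3) Outliers: externally studentized residuals; values beyond roughly +/-3
#    would be candidates for exclusion from the model.
ols = sm.OLS(df["assay"], X).fit()
studentized = ols.get_influence().resid_studentized_external

# 4) Random-effect view of batch: a random-intercept model whose BLUPs can be
#    inspected for normality (with only three batches this is illustrative only).
mixed = sm.MixedLM(df["assay"], np.ones((len(df), 1)), groups=df["batch"]).fit()
blups = [float(v.iloc[0]) for v in mixed.random_effects.values()]

print(wls.params)
print("Box-Cox lambda:", round(lam, 2))
print("max |studentized residual|:", round(float(np.abs(studentized).max()), 2))
print("batch BLUPs:", blups)
```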
Labels (11): Advanced Statistical Modeling, Automation and Scripting, Basic Data Analysis and Modeling, Consumer and Market Research, Content Organization, Data Exploration and Visualization, Design of Experiments, Predictive Modeling and Machine Learning, Quality and Process Engineering, Reliability Analysis, Sharing and Communicating Results