Laura Lancaster, JMP Principal Research Statistician Developer, SAS
Jianfeng Ding, JMP Senior Research Statistician Developer, SAS
Annie Zangi, JMP Senior Research Statistician Developer, SAS

JMP has several new quality platforms and features – modernized process capability in Distribution, CUSUM Control Chart and Model Driven Multivariate Control Chart – that make quality analysis easier and more effective than ever. The long-standing Distribution platform has been updated in JMP 15 with a more modern and feature-rich process capability report that now matches the capability reports in Process Capability and Control Chart Builder. We will demonstrate how the new process capability features in Distribution make capability analysis easier with an integrated process improvement approach. The CUSUM Control Chart platform is designed to help users detect small shifts in their process over time, such as gradual drift, where Shewhart charts can be less effective. We will demonstrate how to use the CUSUM Control Chart platform and how to use average run length to assess chart performance. The Model Driven Multivariate Control Chart (MDMCC) platform, new in JMP 15, is designed for users who monitor large numbers of highly correlated process variables. We will demonstrate how MDMCC can be used in conjunction with the PCA and PLS platforms to monitor multivariate process variation over time, give advance warning of process shifts, and suggest probable causes of process changes.
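To make the CUSUM idea above concrete, here is a minimal sketch (in Python, outside JMP) of a one-sided tabular CUSUM for an upward mean shift. The target, sigma, reference value k and decision interval h are hypothetical choices; JMP's CUSUM Control Chart platform computes these statistics and the associated average run length for you, so this only illustrates how small, sustained drifts accumulate into a signal.

```python
import numpy as np

def cusum_high(x, target, sigma, k=0.5):
    """One-sided upper CUSUM: C[i] = max(0, x[i] - (target + k*sigma) + C[i-1])."""
    c = np.zeros(len(x))
    for i, xi in enumerate(x):
        prev = c[i - 1] if i > 0 else 0.0
        c[i] = max(0.0, xi - (target + k * sigma) + prev)
    return c

# Hypothetical process: in control at mean 10, then a small upward drift.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(10, 1, 50), rng.normal(10.8, 1, 30)])
c_plus = cusum_high(x, target=10, sigma=1)
h = 5 * 1  # decision interval of 5 sigma is a common textbook default
signals = np.flatnonzero(c_plus > h)
print("first signal at sample:", signals[0] if signals.size else "none")
```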
The purpose of this poster presentation is to display COVID-19 morbidity and mortality data available online from Our World in Data, whose contributors ask the key question: “How many tests to find one COVID-19 case?” We use SAS JMP Analyze to help answer the question. Smoothing test data from Our World in Data yields seven-day moving average, or SMA(7), total tests per thousand in five countries for which coronavirus test data are reported: Belgium, Italy, South Korea, the United Kingdom and the United States. Similarly, seven-day moving average, or SMA(7), total cases per million were derived using the Time Series Smoothing option. Coronavirus tests per case were calculated by dividing smoothed total tests by smoothed total cases and multiplying by a factor of 1,000. These ratios of smoothed tests to smoothed cases were themselves smoothed. Additionally, Box-Jenkins ARIMA(1,1,1) time series models were fitted to smoothed total deaths per million to graphically compare smoothed case-fatality rates with smoothed tests-per-case ratios.   Auto-generated transcript...   Speaker Transcript Douglas Okamoto In our poster presentation we display COVID-19 data available from Our World in Data, whose database sponsors ask the question: why is data on testing important? We use JMP to help us answer the question. Seven-day moving averages are calculated from January 21 to July 21 for daily per capita COVID-19 tests and coronavirus cases in seven countries: the United States, Italy, Spain, Germany, Great Britain, Belgium and South Korea. Coronavirus tests per case were calculated by dividing smoothed tests by smoothed cases and multiplying by a factor of 1,000. Daily COVID-19 test data yield smoothed tests per thousand in Figure 1. Testing in the United States, in blue, trends upward, with two tests per thousand daily on July 21st, 10 times more than South Korea, in red, which trends downward. The x axis in Figure 1 is normalized to days since moving averages numbered one or more tests per thousand. In Figure 2, smoothed coronavirus cases per million in Europe and South Korea trend downward after peaking months earlier than the US, in blue, which averaged 2,200 cases per million on July 21st, with no end in sight. The x axis is normalized to the number of days since moving averages of 10 or more cases per million. Combining tabular results from Figure 1 and Figure 2, smoothed COVID-19 tests per case in Figure 3 show South Korean testing, in red, peaking at 685 tests per case in May, 38 times US performance, in blue, of 22 tests per case in June. Since the x axis is dated, Figure 3 represents a time series. The reciprocal of tests per case, cases per test, is a measure of positivity: one in 22, or 4.5%, positivity in the US compares with 0.15% positivity in South Korea and 0.5 to 1.0% in Europe. At a March 30 WHO press briefing, Dr. Michael Ryan suggested a positive rate of less than 10%, or even better, less than 3%, as a general benchmark of adequate testing. JMP Analyze was used to fit Box-Jenkins time series models to smoothed tests per case in the US from March 13 to April 25; predicted values from April 26 to May 9 were forecast from a fitted autoregressive integrated moving average, or ARIMA(1,1,1), model. In Figure 4, a time series of smoothed tests per case from mid-March to April shows a rise in the number of US tests per case, not a decline as predicted, during the 14-day forecast period. 
In summary, 10 or more tests were performed per case to provide adequate testing in the United States. COVID-19 testing in Europe and South Korea was more than adequate, with hundreds of tests per case. Equivalently, the positive rate, or number of cases per test, was less than 10% in the US, whereas positivity in Europe and South Korea was well under 3%. When our poster was submitted, the US totaled 4 million coronavirus cases, more than the European countries and South Korea combined. The US continues to be plagued by state-by-state disease outbreaks. Thank you.  
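As a rough illustration of the calculations described in this poster, the sketch below reproduces the SMA(7) smoothing, the tests-per-case ratio (with the factor of 1,000 that reconciles per-thousand tests with per-million cases) and the implied positivity outside JMP. The file and column names follow the Our World in Data download but should be treated as assumptions.

```python
import pandas as pd

# Hypothetical extract of the Our World in Data file; column names are assumptions.
df = pd.read_csv("owid-covid-data.csv", parse_dates=["date"])
us = df[df["location"] == "United States"].sort_values("date").set_index("date")

sma_tests = us["total_tests_per_thousand"].rolling(7).mean()   # SMA(7)
sma_cases = us["total_cases_per_million"].rolling(7).mean()    # SMA(7)

# per-thousand divided by per-million, so multiply by 1,000 to get tests per case
tests_per_case = 1000 * sma_tests / sma_cases
positivity = 1 / tests_per_case                                 # cases per test

print(tests_per_case.tail())
print("latest positivity: {:.2%}".format(positivity.iloc[-1]))
```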
Pranjal Taskar, Formulation Scientist II, Thermo Fisher Scientific
Brian Greco, Formulation Scientist I, Thermo Fisher Scientific
Sabrina Zojwala, Formulation Scientist I, Thermo Fisher Scientific
Kat Brookhart, Manager, Formulation & Process Development, Thermo Fisher Scientific
Sanjay Konagurthu, Sr. Director, Science and Innovation, Drug Product NA Division Support, Thermo Fisher Scientific

Pharmaceutical tableting is a process in which an active moiety is blended with inert excipients to achieve a compressible mixture. This mixture is consolidated into the final dosage form: a tablet. The process of tableting considers different composition-related and process variables that impact the quality attributes of the final product. This work focuses on using JMP software to identify main effects. An I-optimal, 19-run custom design was outlined with the factors being type and ratio of filler used (microcrystalline cellulose, mannitol vs. lactose; categorical), percentage active spray-dried dispersion loading (continuous), order and amount of addition (intragranular vs. extragranular; continuous), and ribbon solid fraction (continuous). The responses were outlined as bulk density, Hausner ratio, percentage fines, blend compressibility and tablet disintegration. The model was evaluated with main effects and second-degree interaction terms, and the data were analyzed using Standard Least Squares in the Fit Model function. Results determined that lactose provided the blend with a higher initial bulk density; however, mannitol maintained bulk density post-compression. Microcrystalline cellulose improved flow properties of the blend, and a high percentage of intragranular addition provided material with higher bulk density and improved material flow.     Auto-generated transcript...   Speaker Transcript Pranjal Taskar All right. Thank you, Peter. So I'm going to get started now. Hello everyone. Today I'm going to talk about my poster. This poster covers a systematic analysis of tableting, including the effect of formulation and process variables on the final quality attributes of my product. Before delving into the statistical analysis, I wanted to give some background about what exactly we're talking about. What is tableting? Tableting is a pharmaceutical process in which your active ingredient, or active moiety (API), is blended with other excipients to form a free-flowing blend, and this blend is compressed into our final dosage form, which is a tablet. In a lot of situations, there are some active moieties, or APIs as we would call them, that have a low bioavailability, and that could be due to their crystalline nature. They're just too stable, too rigid in their ways. So our site specializes in making this crystalline API into a little bit more soluble, a little bit more reactive amorphous form, which makes it more bioavailable. And when we do that, we fortify this API with a polymer. The intermediate that we form is a tablet intermediate called a spray-dried intermediate, SDI. And this is what we basically use in our tablets as our active intermediate. But when you look at it, it has poor flowability and it's extremely fluffy. So when you have to incorporate this API into your tablet, you need other pharmaceutical processes involved to make it more streamlined, to make the blend more flowable. So this is what we're going to do. 
In this study, we are going to identify our critical quality attributes, the variables that matter, or our dependent variables and then we are going to identify variables that impact our critical quality attributes, which are the composition of that tablet of that blend and then different process processing parameters that we used in us in tableting. Which of these are main effects? Are there any interactions? And then we'll use JMP to identify all of these main effect and interaction variables and try to catch out the tableting process basically. So this was the introduction. Moving on to the methods and objectives. So how do we do this? For this study we looked at a placebo formulation. There is no active product or actor moiety and we used a commonly used spray-dried polymer which is hypromellose acetate succinate. We spray dried it and made it into the fluffy blend that it usually is. And Figure 2 talks about our usual granulation tabulating process. So, what, what we do is basically have our spray-dried intermediate (SDI) blended along with other excipients using this blender. We move on to roller compaction, which is densification of this blend using these there are rollers right here and these rollers move slowly to densify the blend which goes into this hopper and you get ribbons out of the roller compactor. Now what you have done is you have made that fluffy material into densified ribbons and you mill it down using a comil. And you get granules. These granules are more dense and they are a lot better flowing than your API or your SDI. So looking at this entire process, there are a lot of variables that go in there that you need to change and look out for. So what are those variables? This diagram over here will identify different kinds of variables, the independent variable variables that go into the formulation and process. so The first variable would be a bit more base formulation related than the...rather than the process related. So it would talk about different types of ??? excipients that are used. And the ratio of these excipients that I used the percent of SDI loading, or active loading, and in our case, the placebo loading. And then the order of addition and the point of addition at where the SDI, or other excipients are loaded into the formulation. And then sorting process related parameters such as ribbon solid fraction, which basically talks about this equipment, the roller compactor and the speed at which the rollers and the spools move. We have also identified independent variables of our critical quality attributes that we look out for, which is bulk density of our blend, Which we look at before and after granulation and you have labeled it bulk density 1 and 2. Hausner ratio, which is again a ratio that depicts the flow of your blend and we also identify that before and after granulation, labeled as Hausner ratio 1 and 2. And the percent of fines that collect...are collected in the roller compaction process. And this is usually monitored after granulation. So all of these points out to talk about basically our method and why we chose our variables. What we did was we had an I optimal, 19-run custom design looking at all of these independent variables impacting on the dependent variables. And the way we analyze this model or the way we constructed effects, was that we looked at the main effect and the second degree interactions and we analyzed the data using the standard least squares personality in the fit model function. 
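A minimal stand-in for the Standard Least Squares fit just described (main effects plus second-degree interactions from the I-optimal, 19-run design) is sketched below outside JMP. The file and column names are assumptions, and in practice this model is fit in JMP's Fit Model platform.

```python
import pandas as pd
import statsmodels.formula.api as smf

# 19-run custom design exported from JMP; file and column names are assumptions.
runs = pd.read_csv("tableting_doe.csv")

# Main effects plus all two-factor interactions for one response (Bulk Density 2).
formula = ("bulk_density_2 ~ (C(filler_type) + pct_sdi + pct_intragranular"
           " + filler_to_mcc_ratio + ribbon_solid_fraction) ** 2")
fit = smf.ols(formula, data=runs).fit()
print(fit.summary())
```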
So, Identifying the process and the objectives, we will move on to results, but before doing that really quickly, I wanted to look at the JMP window which I have pulled up right now. These different columns are my independent and dependent variables and I'm going to highlight right here, these are the different independent variables that we are going to be looking at. So type of filler, which is the type of inert excipient and we have looked at mannitol and lactose. percent SDI, which is the active or in our case placebo loading, looking at highs and lows away here; and amount intragranular, so the amount of our excipients that we add before the roller compaction versus after the roller compaction and outline here are 75 and 95; and mannitol and lactose, which is a filler to MCC, which is micro course design cellulose ratio. Mannitol lactose are, I would say a little bit more excipient and MCC is more ???, gives more strength to the blend. So we have looked at a ratio of this to see how it impacts our tableting blend overall. And on the right are our responses. Bulk density 1, Hausner ratio 1, which is before granulation. Bulk density 2 and Hausner ratio 2, which is after granulation, and percent fines. So I'm gonna go over here quickly into this window and look at how we created our model, our response variables y, that I just talked about. And then our model effects which are secondary interactions and main effects. Standard least squares. That's what we used and I run the model. This is my effect summary right here and based on this data that we're looking at and prior experience, I'm going to take off the last two effects. Just remove that extra noise and then over here, I have my responses and how the data kind of impacts these responses. It would be just easier if we go down and look at the prediction profiler over here. And how all of these dependent variables are impacted by this. So I think it might just be easier if I pull up my poster and... Alright, so looking at the results over here, what we found out from Figure 3 was that, look, the two fillers lactose had higher bulk density initially, but post ruler compaction, the bulk density two of these fillers dropped and you can see a corresponding increase in the fines. So what we think would have happened is that lactose is more brittle in comparison to mannitol. And this generated all of that attrition and that fines and that impacted the flow, making it less bulky, drop in the bulk density. And the Hausner ratio, a little bit higher with the lactose. So basically, what we're doing is targeting a higher bulk density and we want a lower Hausner because a lower Hausner indicates a better flowing blend. So looking at the data, mannitol had a slight edge over lactose as a filler. And the, the second point would be talking about the solid fraction and overall we saw that there was a slight plateauing effect at around .6 solid fraction. Overall, we see that .7 has the least number of fines, which is why we see a recommended .7 with a maximum and desirability, but the plateau effect in terms of your flow properties (bulk and Hausner) start bottoming out at around .6 and onwards. that having lower SDI in general in the formulation had overall better flow properties. Just because the SDI, it's fluffy and it causes the blend to flow a lot worse. So the design just suggested us to have lower SDI loading. 
a higher amount of that ingredient of that excipient added in an intragranular fashion than an extragranular, just because it improves your bulk, it has a lower Hausner which means that your blend is flowing smoother. We also observed that mannitol to lactose ratio having more of that critical component was more desirable and I see that because overall, the fines have dropped in the presence of having a little bit more of the mannitol lactose component. And that could be the reason why we are seeing this. We also have in the Figure 4, a couple of surface plots of a few interesting trends that I saw. And in Figure 4A, you can see that having a lower SDI loading and having more amount intragranularly resulted in this hotspot right here of a very high Hausner ratio. So when you add a lot of...when you have a low....I'm sorry...have a higher SDI and higher intragranular had an extremely high Hausner ratio. So what this says is basically when you have more of that fluffy material intragranularly, your flow is going to be bad, but you correspond that after granulation, when you again have more more of your excipient intragranularly and you're targeting a solid fraction of about .6 and about, your bulk density improves. So you're basically post granulation, your blend is getting more denser and this is what these two diagrams talk about. So all of the result points basically talk about these things that I discussed right now. Overall, we conclude from our study that in order to optimize this process and maximize desirability for formulations, 1, a higher ratio intragranularly and a lower SDI loading would be a preferable formulation and targeting a solid fraction of around 0.6 would also be beneficial to the formulation. Thank you very much. I would welcome your questions.  
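For reference, the Hausner ratio discussed throughout this talk is the tapped density divided by the bulk density, with Carr's index as the closely related compressibility measure; a tiny sketch with hypothetical densities follows.

```python
def hausner_ratio(bulk_density, tapped_density):
    """Tapped density divided by bulk density; values near 1 indicate better flow."""
    return tapped_density / bulk_density

def carr_index(bulk_density, tapped_density):
    """Carr's compressibility index, in percent."""
    return 100 * (tapped_density - bulk_density) / tapped_density

# Hypothetical densities in g/mL for a pre-granulation blend.
print(round(hausner_ratio(0.45, 0.58), 2), round(carr_index(0.45, 0.58), 1))
```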
Kelci Miclaus, Senior Manager, Advanced Analytics R&D, JMP Life Sciences, SAS

Reporting, tracking and analyzing adverse events occurring to patients is critical in the safety assessment of a clinical trial. More and more, pharmaceutical companies and the regulatory agencies to whom they submit new drug applications are using JMP Clinical to help in this assessment. Typical biometric analysis programming teams may create pages and pages of static tables, listings and figures for medical monitors and reviewers. This leads to inefficiencies when the doctors who understand the medical impact of the occurrence of certain events cannot directly interact with adverse event summaries. Yet even simple count and frequency distributions of adverse events are not always so simple to create. In this presentation we focus on key reports in JMP Clinical that compute adverse event counts, frequencies, incidence, incidence rates and time to event occurrence. The out-of-the-box reports in JMP Clinical make fully dynamic adverse event analysis look easy even while performing complex computations that rely heavily on JMP formulas, data filters, custom-scripted column switchers and virtually joined tables.      Auto-generated transcript...   Speaker Transcript Kelci J. Miclaus Hello and welcome to JMP Discovery Online. Today I'll be talking about summarizing adverse events in clinical trial analysis. I am the Senior Manager of the advanced analytics group for the JMP Life Sciences division here at SAS, and we work heavily with customers using genomic and clinical data in their research. So before I go through the details around using JMP for adverse event analyses, I want to introduce the JMP Clinical software, which our team creates. JMP Clinical is one of a family of products that now includes five official products as well as add-ins, which can extend JMP to really allow you to have as many types of vertical applications or extensions of JMP as you want. My development team supports JMP Genomics and JMP Clinical. JMP Genomics and JMP Clinical are vertical applications, customized and built on top of JMP, that are used for genomic research and clinical trial research, respectively. And today I'll be talking about how we've created reviews and analyses in JMP Clinical for pharmaceutical companies that are doing clinical trial safety and early efficacy analysis. The original purpose of JMP Clinical, and the instigation of this product, actually came through assistance to the FDA, which is a heavy JMP user, and their CDER group, the Center for Drug Evaluation and Research. Their medical reviewers were commonly using JMP to help review drug submissions. And they love it. They're very accomplished with it. One of the things they found, though, is that certain repetitive actions, especially on very standard clinical data, could be pretty painful. An example here is the idea of something called a shift plot, for laboratory measurements, where you compare the trial average of a laboratory value versus the baseline across treatment groups. In order to create this, it took at least eight to 10 steps within the JMP interface: opening up the data, normalizing the data, subsetting it out into baseline versus trial, doing statistics for those groups, merging it back in, then splitting that data by lab test so you could make this type of plot for each lab. And that's not even counting the number of steps within Graph Builder to build it. 
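For readers outside JMP Clinical, the data preparation behind a lab shift plot can be sketched roughly as below: average the on-trial visits per subject and lab test, pair that with the baseline value, and attach the treatment arm from demography. CDISC-style names (USUBJID, LBTEST, LBSTRESN, VISIT, ARM) are used, but the files and columns here are assumptions.

```python
import pandas as pd

# Stacked lab results (one row per subject/test/visit) and one-row-per-subject
# demography; CDISC-style names, treated as assumptions here.
lb = pd.read_csv("lb.csv")
dm = pd.read_csv("dm.csv")

baseline = (lb[lb["VISIT"] == "BASELINE"]
            .groupby(["USUBJID", "LBTEST"])["LBSTRESN"].mean().rename("baseline"))
trial = (lb[lb["VISIT"] != "BASELINE"]
         .groupby(["USUBJID", "LBTEST"])["LBSTRESN"].mean().rename("trial_mean"))

shift = (pd.concat([baseline, trial], axis=1).reset_index()
           .merge(dm[["USUBJID", "ARM"]], on="USUBJID"))
# One scatter of trial_mean vs. baseline per LBTEST, colored by ARM, gives the shift plot.
print(shift.head())
```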
So JMP clearly can do it, but what we wanted to do is solve their pain at this very standard type of clinical data with a one-click lab shift plots, for example. In fact, we wanted to create clinical reviews in our infrastructure that we call the review builder that are one-click standardized reproducible reviews for many of the highly common standard analyses and visualizations that are required or expected in clinical trial research to evaluate drug safety and efficacy. So JMP Clinical has evolved since that first instigation of creating a custom application for a shift plot into a full-service clinical...clinical trial analysis software that covers medical monitoring and clinical data science, medical writing teams, biometrics and biostatistics, as well as data management around the study data involved with clinical trial collection. This goes for both safety and efficacy but also operational integrity or operational anomalies that might be found in the collection of clinical data as well. Some of the key features around JMP Clinical that we find to be especially useful for those that are using the JMP interface for any types of analyses are things like virtual joins. So we have an idea of a global review subject filter, which I'll show you during the demonstrations for adverse events, that really allow you to integrate and link the demography information or the demographics about our subjects on a clinical trial to all of the clinical domain data that's collected. And this architecture, which is enabled by virtual joins within the JMP interface with row state synchronization, allow you to really have instantaneous interactive reviews with very little to no data manipulation across all the types of analyses you might be doing in a clinical trial data analysis. Another new feature we've added to the software that also leverages some of the power of the JMP data filter, as well as creation of JMP indicator columns, is this ability to, while you're interactively reviewing clinical trial data, find interesting signals that say, in this example, the screenshot shown is subjects that had a serious adverse event while on the clinical trial, find those interesting signals, and quite immediately, create an indicator flag that is stored in metadata with your study in JMP Clinical that's available for all other types of analyses you might do. So you can say, I want to look now at my laboratory results for patients that had a serious adverse event versus those that didn't to see if there's also anomalies that might be related to an adverse event severity occurrence. Another feature that I'll also be showing with JMP Cclinical and the demonstration around adverse event analysis is the JMP Clinical API that we've built into the system. One of the most difficult things of providing and creating and developing a vertical application that has out-of-the box one-click reports is that you get 90% of the way there and then the customer might say, oh, well, I really wanted to tweak it, or I really wanted to look at it this way, or I need to change the way the data view shows up. So one of the things we've been working hard on in our development team is using JMP scripting JSL to surface an API into the clinical review, to have control over the objects and the displays and the dashboards and the analyses and even the data sets that go into our clinical reviews. So I'll also be showing some of that in the adverse event analysis. 
So let's back up a little bit and go into the meat of adverse events and clinical trials now that we have an overview of JMP Clinical. There's really two kind of key ways of thinking of this. There's that safety review aspect of a clinical trial where that's typically counts and percentages of the adverse events that might occur. And a lot of the medical doctors, monitors, or reviewers often use this data to understand medical anomalies, you know, a certain adverse event starts showing up more commonly, with one of the treatments that could have medical implications. There's also the statistical signal detection, the idea of statistically assessing our adverse events occurring at an unusual rate in one of the treatment groups versus the other. So here, for example, is a traditional static table that you see in many of the types of research or submissions or communications around a clinical trial adverse event analysis. Basically it's a static table with counts percents and if it is more statistically oriented, you'll see things like confidence intervals and p values as well around things like odds ratios or a relative risks or rate differences. Another way of viewing this can also be visually instead of with a tabular format so signal detection, looking at say odds ratio or the, the risk difference might use the Graph Builder in this case to show the results of a statistical analysis of the incidence of certain adverse events and how they differ between treatment groups, for example. So those are two examples. And in fact, from the work we've done and the customers we've worked with around how they view and have to analyze adverse events, the JMP Clinical system now offers several common adverse event analyses from simple counts and percentages to incidence rates or occurrences into statistical metrics such as risk difference, relative risk, odds ratio, including some exposure adjusted time to event analyses. We can also get a lot more complex with the types of models we fit and really go into mixed or Bayesian models as well in finding certain signals with our adverse event differences. And also we use this data heavily in reviewing just the medical data in either a medical writing narrative or patient profile. So now I'm going to jump right into JMP Clinical with a review that I've built around many of these common analyses. So one of the things you'll notice about JMP Clinical is it doesn't exactly look like JMP, but it is. It's a combined integrated solution that has a lot of custom JSL scripting to build our own types of interfaces. So our starter window here lays out studies, reviews, and settings, for example. And I already have a review built here that is using our example nicardapine data. This is data that's shipped with the product. It's also available in the JMP sample library. It's a real clinical trial, looking at subarachnoid hemorrhage. It was with about 900 patients. And so what this first tab of our review is looking at is just the distribution of demographic features of those patients, how many were males versus females, their race breakdowns, what treatment group they were given, their sites that the data was taken from, etc. So this is very common, just as the first step of understanding your clinical data for a clinical trial. You'll notice here we have a report navigator that shows the rest of the types of analyses that are available to us in this built review. 
I'm going to walk through each of these tabs, just quickly to show you all the different flavors of ways we can look at adverse events with the clinical trial data set. Now, the typical way data is collected with clinical trials is an international standard called CDISC format, which typically means that we have a very stacked data set format. Here we can see it, where we have multiple records for each subject indicating the different adverse events that might have occurred over time. This data is going to be paired with the demography data, which is one row per each subject as seen here in this demographic. So we have about 900 patients and you'll see in this first report, we have about 5,000 or 5,500 records of different adverse events that occurred. So this is probably the most commonly used reports by many of the medical monitors and medical reviewers that are assessing adverse event signals. What we have here is basically a dashboard that combines a Graph Builder counts plot with an accompanying table, as they are used to seeing these kind of tables. Now the real value of JMP is its interactivity and that dynamic link directly to your data so that you can select anywhere in the data and see it in both places. Or more powerfully, you can control your views with column switchers. Now here we can actually switch from looking at distribution of treatments to sex versus race. You'll notice with race, if we remember, we had quite a few that were white in this study, so this isn't a great plot when we look at it by percent or by counts, so we might normalize and show percents instead. And we can also just decide to look at the overall holistic counts of adverse events as well. Another part of using this as this column switcher is the ability to you know categorize what kind of events those were. Was it a serious adverse event? What was the severity of it? Was the outcome that they are when they recovered from it or not? What was causing it? Was it related to study drug? All of these are questions that medical reviews will often ask to find interesting or anomalous signals with adverse events in their occurrences. Now one of the things you might have already noticed in this dashboard is that I have a control group as column switcher here that's actually controlling both my graph and my table. So when I switched to severity, this table switches as well. This was done with a lot of custom JSL scripting specifically to our purposes, but I'll tell you a secret, in 16 the developer for column switcher is going to allow us to have this type of flexibility so you can tie multiple platform objects into the same columns switcher to drive a complex analysis. I'm going to come back to this occurrence plot, even though it looks simple. Here's another instance of it that's actually looking at overall occurrence where certain adverse events might have occurred multiple times to the same subject. I'm going to come back to these but kind of quickly go through the rest of the analyses and these reviews before coming back to some of the complexities of the simple graph builder and tabulate distribution reports. The next section in our review here is an adverse event incident screen. So here we're making that progression from just looking at counts and frequencies or possibly incidence rates into more statistical framework of testing for the difference in incidence of certain adverse events in one treatment group for another. And here we are representing that with a volcano plot. 
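The count-and-percent dashboard just described hinges on the numerator coming from the stacked adverse event records while the denominator comes from the one-row-per-subject demography table. A rough sketch of that calculation outside JMP, with assumed CDISC-style column names:

```python
import pandas as pd

# Stacked adverse events (one row per event) and demography (one row per subject);
# CDISC-style column names, treated as assumptions.
ae = pd.read_csv("ae.csv")
dm = pd.read_csv("dm.csv")

n_per_arm = dm.groupby("ARM")["USUBJID"].nunique()              # denominator per treatment

counts = (ae.merge(dm[["USUBJID", "ARM"]], on="USUBJID")
            .groupby(["AEDECOD", "ARM"])["USUBJID"].nunique()   # subjects, not events
            .unstack(fill_value=0))
percents = 100 * counts.div(n_per_arm, axis=1)
print(percents.sort_values(percents.columns[0], ascending=False).head(10))
```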
So we can see actually that phlebitis, hypotension and isothenuria occur much more often in our treatment group, those that were treated with nicardipine, versus those on placebo. So we can actually select those and drill into a very common view for adverse events, which is our relative risk for a cell plot as well, which is lots of lot of times still easier to read when you're only looking at those interesting signals that have possibly clinical or statistical significant differences. Sometimes clinical trials take a long time. Sometimes they're on them for a few weeks, like this study was only a few weeks, but sometimes they're on them for years. So sometimes it's interesting to think of adverse event incidents differences as the trial progresses. We have this capability as well within the incidence screen report where you can actually chunk up the study day, study days into sections to see how the incidents of adverse events change over time. And a good way to demonstrate that might be with an exploding volcano plot here that shows how those signals change across the progression of the study. So another powerful idea with this, especially as you have longer clinical trials or more complex clinical trials, is instead of looking at just direct incidence among subjects you can consider their time to event or their exposure adjusted rate at which those adverse events are occurring. And that's what we offer within our time to event analyses, which once again, shown in a volcano plot looking here using a Kaplan Meier test at differences in the time to event of certain events that occur on a clinical trial. One of the nice things here is that you can select these events and drill down into the JMP survival platform to get the full details for each of the adverse events that had perhaps different time to event outcomes between the treatment groups. Another flavor of time to event is often called an incidence density ratio, which is the idea of exposure adjusted incidence density. Basically the difference here is instead of using some of the more traditional proportional hazards or Kaplan Meier analyses, this is more like a a poisson style distribution that's adjusted for how long they've actually been exposed to a drug. And once again here we can look at those top signals and drill down to the analogous report within JMP using a generalized linear model for that specific type of model with an adverse event signal detection. And we actually even offer some really complex Bayesian analyses. So one of the things with with this type of data is typically adverse events exist within certain body systems or classes...organ classes. And so there is a lot of posts...or prior knowledge that we can impose into these models. And so some of our customers, their biometrics teams decide to use pretty sophisticated models when looking at their adverse events. So, so far we've walked from what I would say consider pretty simplistic distribution views of the data into distributions and just count plots of adverse events into very complex statistical analyses. I'm going to come back now, back to what is that considered simple count and frequency information and I want to spend some time here showing the power of JMP interactivity that we have. As you recall one of the differences here is that this table is a stacked table that has all of the occurrences of our adverse events for each subject, and our demography table, which we know we have 900 subjects, is separate. 
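A conceptual sketch of the incidence-screen idea (not JMP Clinical's implementation): for each adverse event term, compare the proportion of subjects with the event between two arms and attach a simple Fisher exact p-value, which is what a volcano plot of risk difference versus -log10(p) summarizes. Column names are assumptions.

```python
import pandas as pd
from scipy.stats import fisher_exact

def incidence_screen(ae, dm, arm_a, arm_b):
    """Risk difference, relative risk and Fisher p-value per AE term (assumed columns)."""
    n = dm.groupby("ARM")["USUBJID"].nunique()
    merged = ae.merge(dm[["USUBJID", "ARM"]], on="USUBJID")
    rows = []
    for term, grp in merged.groupby("AEDECOD"):
        a = grp.loc[grp["ARM"] == arm_a, "USUBJID"].nunique()
        b = grp.loc[grp["ARM"] == arm_b, "USUBJID"].nunique()
        rd = a / n[arm_a] - b / n[arm_b]
        rr = (a / n[arm_a]) / (b / n[arm_b]) if b else float("inf")
        _, p = fisher_exact([[a, n[arm_a] - a], [b, n[arm_b] - b]])
        rows.append({"AEDECOD": term, "risk_diff": rd, "rel_risk": rr, "p_value": p})
    return pd.DataFrame(rows)

# Plotting -log10(p_value) against risk_diff (or log2 rel_risk) gives the volcano view.
```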
So what we wanted was not a static graph, like we have here, or what we would have in a typical report in a PDF form, but we wanted to be able to interactively explore our data and look at subgroups of our data and see how those percentages would change. Now, the difficulty is that the percent calculation needs to come from the subject count in a different table. So we've actually done this by formula...like creating column formulas to dynamically control recalculation of percents upon selection, either within categorizing events or, more powerfully, using our review subject filter tool. So here for example, we're looking at all subjects by treatment. Perhaps serious versus not serious adverse events, but we can use this global data filter which affects each of the subject level reports in our review and instantaneously change our demography groups and change our percentages to be interactive to this type of subgroup exploration. So here, now we can actually subgroup down to white females and see what their adverse event percentage and talents are, or perhaps you want to go more granular and understand for each site, how their data is changing for different sites. So what we really have here is instead of a submission package or a clinical analysis where the biometrics team hands 70 different plots and tables to the medical reviewer to go through, sift through, they have the power to create hundreds of different tables and different subsets and different graphics, all in one interface. In fact, you can really filter down into those interesting categories. So if they were looking say at serious adverse events and they wanted to know serious adverse events that were related to drug treatment very quickly, now we got down to a very small subset from our 900 patients to about nine patients that experienced serious adverse events that were considered related to the treatment. So as a medical reviewer this is a place where Ithen might want to understand all of the clinical details about these patients. And very quickly, I can use one of our action buttons from the report to drill down to what's called a kind of a complete patient profile. So here we see all of the information now, instead of at a summary level, at a subject individual level of everything that occurred to this patient over time, including when they had serious adverse events occur and their laboratory or vital measurements that were taken alongside of that. One of the other main uses of our JMP Clinical system along with this medical review, medical monitor is medical writing teams. So another way of looking at this instead of visually in a graphic or even in a table which these are patient profile tables, you can actually go up here and generate an automated narrative. So here we're going to actually launch to our adverse event narrative generation. Again, one of the benefits and values of our JMP Clinical being a vertical application relying on standard data is that we get to know all the data and the way it is formatted up up up front, just by being pointed to the study. So what we can do here is actually run this narrative that is going to write us the actual story of each of those adverse events that occurred. And this is going to open up a Word doc that has all of the details for this subject, their demography, their medical history, and then each of the adverse events and the outcomes or other issues around those adverse events. 
And we can do this for one patient at a time or we can actually even do this for all 900 patients at a time and include more complex details like laboratory measurements, vitals, either a baseline or before. And so, medical reviewers find this incredibly valuable be able to standardly take data sources and not make errors in a data transfer from a numeric table to an actual narrative. So I think just with that you can really see some of the power of these distribution views, these count plots that allow you to drill into very granular levels of the data. This ability to use subject filters to look either within the entire population of your patients on a clinical trial or within relevant subgroups that you may have found. Now one thing about the way our global filter works through our virtual joins is this is only information that's typically showing the information about the demography. One of the other custom tools that we've scripted into this system is that ability to say, select all subjects with a serious adverse event. And we can either derive a population flag and then use that in further analyses or we can even throw that subject's filter set to our global filter and now we're only looking at serious...at a subject who had a serious adverse event, which was about...almost 300 patients on the clinical trial had a serious adverse event. Now, even this report, you'll see is actually filtered. So the second report is a different type of aspect of a distribution of adverse events that was new in our latest version which is incidence rates. And here, the idea is instead of normalizing or dividing to calculate a percent by the number of subjects who had an event. If you are going with ongoing trials or long trials or study trials across different countries that have different timing startup times, you might want to actually look at the rate at which adverse events occur. And so that's what this is calculating. So in this case, we're actually subset down to any subjects that had a serious adverse event. And we can see the rate of occurrence in patient years. So for example, this very first one, see, has about a rate of 86 occurrences in every 10 patient years on placebo versus 71 occurrences In nicardipine. So this was actually one which this was to treat subarachnoid hemorrhage, intracranial pressure increasing likely would happen if you're not being treated with an active drug. These percents are also completely dynamic, these these incidence rates. So once again, these are all being done by JMP formulas that feed into the table automatically that respect different populations as they're selected by this global filter. So we can look just within say the USA and see the rates and how they change, including the normalized patient years based on the patients that are from just the USA, for example. So even though these reports look pretty simple, the complexity of JSL coding that goes beyond building this into a dashboard is basically what our team does all day. We try to do this so that you have a dashboard that helps you explore the data as you know, easily without all of these manipulations that could get very complex. Now the last thing I wanted to show is the idea of this custom report or customized report. So this is a great place to show it too, because we're looking here at adverse events incidence rates. And so we're looking by each event. And we have the count, or you can also change that to that incidence rate of how often it occurs by patient year. 
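The exposure-adjusted rate described here (occurrences per 10 patient-years) can be sketched as below; the exposure-days column and the adverse event term are hypothetical, and JMP Clinical's report keeps these rates dynamic as the subject filter changes.

```python
import pandas as pd

def rate_per_10_patient_years(n_events, exposure_days):
    """Events per 10 patient-years of total exposure."""
    return 10 * n_events / (exposure_days.sum() / 365.25)

ae = pd.read_csv("ae.csv")
dm = pd.read_csv("dm.csv")          # assumed to carry days on study per subject

events = ae[ae["AEDECOD"] == "INTRACRANIAL PRESSURE INCREASED"]
for arm, grp in dm.groupby("ARM"):
    n = len(events.merge(grp[["USUBJID"]], on="USUBJID"))   # occurrences in this arm
    print(arm, round(rate_per_10_patient_years(n, grp["EXPOSURE_DAYS"]), 1))
```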
And then an alternative view might be really wanting to see these occurrences of adverse events across time. And so I want to show that really quick with our clinical API. So the data table here is fully available to you. One of the things I need to do first off is just create a numeric date variable, which we have a little widget for doing that in the data table, and I'm going to turn that into a numeric date. Now you'll notice now this has a new column at the end of the numeric start date time of the adverse event. You'll also notice here is where all that power comes from the formulas. These are all actually formulas that are dynamically regenerated based on populations for creating these views. So now that we have a numeric date with this data, now we might want to augment this analysis to include a new type of plot. And I have a script to do that. One of the things I'm going to do right off the bat is just create a couple extra columns in our data set for month and year. And then this next bit of JSL is our clinical API calls. And I'm not going to go into the details of this except for that it's a way of hooking ourselves into the clinical review and gaining access to the sections. So when I run this code, it's actually going to insert a new section into my clinical review. And here now, I have a new view of looking at the adverse events as they occurred across year by month for all of the subjects in my clinical trial. So one of the powers, again, even with this custom view is that this table by being still virtually joined to our main group can still fully respond to that virtual join global subject filter. And so just with a little bit of custom API JSL code, we can take these very standard out-of-the-box reports and customize them with our own types of analyses as well. So I know that was quite a lot of an overview of both JMP Clinical but, as well as the types of clinical adverse event analyses that the system can do and that are common for those working in the drug industry or pharma industry for clinical trials, but I hope you found this section valuable and interesting even if you don't work in the pharma area. One of the best examples of what JMP Clinical is is just an extreme extension and the power of JSL to create an incredibly custom applications. So maybe you aren't working with adverse events, but you see some things here that can inspire you to create custom dashboards or custom add ins for your own types of analyses within JMP. Thank you.  
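Outside the JMP Clinical API, the same month-by-year view of adverse events can be sketched by parsing the event start date and counting events per calendar month; AESTDTC is the usual CDISC start-date variable, but the file and columns here are assumptions.

```python
import pandas as pd

ae = pd.read_csv("ae.csv")                                    # assumed AE table
ae["start"] = pd.to_datetime(ae["AESTDTC"], errors="coerce")  # AE start date, CDISC-style
ae = ae.dropna(subset=["start"])
ae["month"] = ae["start"].dt.to_period("M")

by_month = ae.groupby("month").size()
print(by_month)   # bar-chart this for the year-by-month view added to the review
```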
Bill Worley, JMP Senior Global Enablement Engineer, SAS

In the recent past, partial least squares has been used to build predictive models for spectral data. A newer approach using Functional Data Explorer and covariate design of experiments will be shown that allows fewer spectra to be used in the development of a good predictive model. This method uses one-fourth to one-third of the data that would otherwise be used to build a predictive model based on spectral data. Newer multivariate platforms like Model Driven Multivariate Control Chart (MDMCC) will also be shown as ways to enhance spectral data analysis.     Auto-generated transcript...   Speaker Transcript Bill Worley Hello everyone, my name is Bill Worley and today we're going to be talking about analyzing spectral data. I'm going to talk about a few different ways to do it. One is using Functional Data Explorer and design of experiments to help build better predictive models for your spectral data. The data set I'm going to be using is actually out of a JMP book, Discovering Partial Least Squares. I will post this on our Discovery website, or community page, so everything will be out there for you to use. First and foremost, I'll talk about the different things we're going to look at. Traditionally, when you're looking at spectral data, you're going to use partial least squares to analyze it, and that's fine; it really works very well. But there are some newer approaches to try out. One is using principal components, then a covariate design of experiments, and then partial least squares to analyze the data. And an even newer, more novel approach uses Functional Data Explorer, then the covariate design of experiments, partial least squares, and an opportunity to use something like generalized regression or neural networks. Okay, so I'm going to go through a PowerPoint first to give you a little bit of background. And again, we're going to be talking about using Functional Data Explorer and design of experiments to build better predictive models for your spectral data. A little bit of history: the spectral data approach is based on a QSAR-like material selection approach that was developed previously by gentlemen named Silvio Michio and Cy Wegman. So I took it and looked for opportunities to apply this approach to other highly correlated data. The first thing that really came out was spectral data, which is truly highly correlated, almost autocorrelated data, where we can use this approach. The data that I've got is, again, continuous response data for octane rating, but I've since added mass spectral data and near-IR data for categorical responses as well. This is where we're going to go: we're going to build these models and compare them. This is the traditional PLS approach, this is the newer approach using principal components, and then the final approach here is using Functional Data Explorer, and you can see that for the most part we really don't lose anything with these models as we build them. As a matter of fact, the slide's a little bit older; the models that I've built more recently are actually a little bit better. We'll show you that when we get there. So again, this is a twist on analyzing your spectral data; partial least squares has been used in the past. 
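The traditional approach named in the abstract, partial least squares on the full spectra, can be sketched briefly with scikit-learn; the CSV and column names stand in for the gasoline NIR table that ships with JMP and are assumptions here.

```python
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

gas = pd.read_csv("gasoline_nir.csv")      # file and column names are assumptions
X = gas.filter(like="NIR")                 # the ~400 wavelength columns
y = gas["octane"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
pls = PLSRegression(n_components=5).fit(X_train, y_train)
print("holdout R^2:", round(pls.score(X_test, y_test), 3))
```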
We're going to be applying several multivariate techniques to show how to build good predictive models with far fewer conditions. So when I say far fewer conditions, in this case, I mean less spectra. So you'll see where that comes from. And why would you want to do this analysis differently? Well, first and foremost, there's a huge time savings. You get as good or better predictive model with 25% of the data or less. It's your choice, and then you can use penalize regression to help determine the important factors, making for simpler models. And when I say important factors, I mean important way wavelengths and again, you'll see that when we get there. This is looking at 60 gasoline near IR spectra overlay and might, as we all know, that would be pretty hard to build a predictive model to determine the difference between the different spectra for their octane rating. So what we're going to do is use JMP to help get us there. And most of this kit that I'm going to be showing can be done in regular JMP but what I'm going to be showing today is almost all JMP Pro. Okay, just so you know. So how it's done. I'm not going to read you the different steps. I'll let you do that if you so choose. But there are two important ones. First is number two, when you want to identify these prominent dimensions in the spectra. And that's where we're going to use functional principal components from the functional data explorer. It's not used in the traditional sense, because we're not going to build models using these things. We're going to use these functional principal components to help us pick which spectra we're going to analyze and then we're going to use those in a custom design to help us select those different spectra. And last but not least, this number seven here is use this sustainable model to determine the outcome for all future samples. This that's a little bit of it. I'm a chemist by training by education and an analytical chemist at that and overall I don't know how well a calibrated or instruments hold their calibration anymore. So this holds true, the model that you build will hold true as long as the instrument is calibrated and good to go from that respect. Okay. Bill Worley So some important concepts. And again, I'm not going to read these. I just want you to know that will be looking at partial least squares, principal components, and functional data analysis. Functional data analysis, really this is something newer in JMP that a lot of our, you won't see in other places. It helps you analyze data, providing information about curved surfaces or anything over a continuum. And that's taken from Wikipedia. Okay. A newer platform that I'm going to show that is in regular JMP is multivariate control, model driven multivariate control charts. And this allows you to see where the differences in some of the spectra and how you can pull those apart and maybe dive a little deeper into where you're seeing these differences in the spectra of what they really are. So with that, let's go to the demo. And you go to my home window. So, The data set. This is the again the gasoline data set where we're looking at octane rating as, you know, how do we determine octane rating for different gasolines. Right. So where do they come from, how do we determine that. You really don't need this value until you do this preprocessing or setup that I'm going to be showing you. So we'll get there, we know those numbers are there. We'll get there as we need to. But let's look at the data first. 
And that's always something you want to do as you want to do anyway. Whenever you get a new data set, you want to look at the data and see where things fall out. So let's go to Graph Builder. We're going to pull in octane and the wavelengths. Alright, so we're going to drop those down on our x axis. And before I do that, let me close that out. I want to color and mark by the octane reading and I'm going to change the scale. The green to black to red Say, Okay, let's see that colors in the data set. Let's go back to Graph Builder. And we'll pull this back in. Drop those. Little more colorful there. Now we've got these there, it's really hard to tell anything at all. I mean, we wouldn't know anything, you know. What we saw before the overlay was bad enough, but we're looking at, you know, more jumbled grouping of points, but let's turn on the parallel plots. Alright, so again, that kind of pulls things in and we can see again a jumbled mess, but we've got another tool that will help us investigate the data a little further. And that's the local data filter. So we're going to go there and we're going to pull in sample number and octane rating. We'll add those you transfer this out a little bit so it's not so See that. So now we could actually go into the single spectra. See over here in the green, so we can dive into those separately. I'm going to take that back off. Alright, so that's grouping and that we could actually pull this in and start looking at the different octane ratings right and see which spectra associated with the higher octane ratings are the lower. It's your choice. It's just gives you a tool to investigate the data. Do you need to do any more pre rocessing to get the spectra in line with each other or setup where you get you can see the different groupings better. Okay, so that's looking at Graph Builder. And I'm going to clear out the row states here. From here, we want to better understand what's going on with the data. Like I said, this is We're looking at spectral data and it's very highly colinear or multi colinear and this is something you may want to prove to yourself. So let's go to analyze multivariate methods, multi variant. And we're going to select all our wavelengths Right. And fairly quickly, we get back that You know, we get these Pearson correlation coefficients. And they're all closer to one, right, for the most part in these early early wavelengths. And that's just telling us that things are very highly correlated. So, you know, they'll figure that's that's the way it is. And, you know, we need to deal with that as we go forward. Okay, so we're looking at the data, we're set up and now we can look at another piece of information and this is newer in JMP 15 It's also in regular JMP. So we're going to go to Analyze Quality and process, and model driven multivariate control chart. Okay, so again we pull in all our wavelengths Okay, say Okay. And now we're looking at the data in a different way. This is basically for every spectra, all 400 wavelengths. And now we can see where some of these are little bit out of what would be considered control. All right, for all 400 wavelengths. That's the red line. So if we highlight those Right, I know if I right click in there and say contribution plots are selected samples. Now I can see Differences in the spectra compared to some, you know, as they're compared to the other spectra in the overall data set, we can see which parts of the spectra that are considered more or less out of control. 
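A rough stand-in for what the Model Driven Multivariate Control Chart is doing with these spectra: fit a few principal components and compute a Hotelling-style T² per sample from the variance-scaled scores, flagging spectra that sit far from the bulk of the data. This is a conceptual sketch, not JMP's implementation, and it reuses the assumed gasoline file from the earlier sketch.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

gas = pd.read_csv("gasoline_nir.csv")          # same assumed file as above
X = gas.filter(like="NIR").to_numpy()

pca = PCA(n_components=4).fit(X)
scores = pca.transform(X)

# Hotelling-style T^2: sum of squared scores scaled by each component's variance.
t2 = np.sum(scores**2 / pca.explained_variance_, axis=1)
print("spectra with the largest T^2:", np.argsort(t2)[-3:] + 1)   # 1-based sample numbers
```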
And if I can get this to work. We can get there and then for that particular wavelength, we can see those three samples, you know, are out of spec, more or less out of spec, based on this control chart for the compared to the other samples. Alright, so again, that allows you to dive deeper and you know, tells you what group it is, again, this is all about learning more about your data, which ones are good, which ones are bad or which ones may be different. Right. So it's an added tool to help you better understand where you may be seeing differences. Okay. So with that, we've got things pretty much set up right and we want to go into the the analysis part. So as we go into Analyze, we have to set things up so we get what we want, when we want and how we want to analyze it. So we're going to go to Analyze and this is where we're going to select the samples. This is where we're going to use functional data explorer to help us select the samples. Alright, so go to Analyze, functional data explorer. And this is a JMP Pro 15 thing. So we're going to use instead of stack data, we're going to use rows as functions. And again, we're going to use select all of our wavelengths We're going to use octane as our supplementaries variable, right. And then the sample number is our ID function. Right, so we've got it set up, ready to go. And now for looking at the data. Remember how we had everything lined up and we were looking at it before. So this is all the data overlaid again. And if we needed to do some more preprocessing, we can do that over here in the transform area where we could actually center the data and standardize it For the most part, this data is fairly clean. We don't have to do that. And we're going to go ahead and do the analysis from here. Okay, so b-splines, p-splines and fourier basis. These clients will give you a decent model and a fairly simple model. Spectral data is again so highly correlated and the data, the wavelengths are so close, we want to understand where we're seeing differences on a much closer basis as opposed to something like a b-spline, which would spread the knots out. All right. We want to cap the knots as close together as possible to help better understand what's going on. So this takes a few seconds to run, but I'm going to click p-splines And it gives you an idea of, you know, so it's going to take, I don't know, 15 or 20 seconds to run, but it's going to fit all those models. And it's almost done. Alright, so now we fit those models. Now if I had run a b-spline, it would probably would have been around 20 knots and most We're looking at 200 knots. So it's basically taking that those 400 wavelengths and split them into virtually groups of two, right, so it's looking at individual, like individual groups of two And this is the best overall model based on these AICc BIC scores negative log likelihood. We could go backwards. It's a linear function, we could go backwards and use a simpler model, if we want. We could also go forward and see how many more knots would take to get you an even better model. I can tell you from experience. That around 203 to 204 knots is as good as it gets. And there's no reason to really go that far, you know, for that little bit of effort, or a little bit of improvement that we would get SO fit those now. The, the, you can see we fit all the models are all the spectra and let's go down to our functional data explorer or functional principal components. This is the mean spectra right here. 
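As a loose analogue of fitting basis functions to each spectrum before extracting functional principal components, the sketch below smooths one spectrum with a smoothing B-spline from SciPy; the smoothing factor plays a role only loosely comparable to the knot count chosen in FDE, and the file name is again an assumption.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import splrep, splev

gas = pd.read_csv("gasoline_nir.csv")                 # assumed file, as above
spectra = gas.filter(like="NIR").to_numpy()
x = np.arange(spectra.shape[1])                       # wavelength index as the argument

# Smoothing B-spline fit to the first spectrum; s is an arbitrary smoothing factor here.
tck = splrep(x, spectra[0], s=1e-4)
smoothed = splev(x, tck)
print("max absolute fit error:", np.abs(smoothed - spectra[0]).max())
```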
These are the functional principal components from that data. Each one of these eigenvalues, or functional principal components, explains some portion of the variation that we're seeing. You can see the first functional principal component explains about 50% of the variation, and so on, and it's additive: by the second row you're at about 72%, and by number four we get to our rule of thumb or cutoff point, which is that if we can explain 85% of the variation, that's our cutoff for the number of principal components we want to grab and build our DOE from. So we're going to go with four. Some other things you can look at are the score plots. It looks like spectrum number five is kind of out there, and if you really wanted to look at that one you could, so you can pop that out or pin it to the graph. But you get an idea of which spectra are out there and what they might look like; in this case we see some differences for 15 and 5, and remember 41 was kind of out there too. But we can see some other things. The functional principal component profiler is down here. Now if you wanted to make changes or better understand things, you would ask, as I move my functional principal components around, what do I do, how do I change my data? Well, it's hard to really visualize that. So something that's newer in JMP Pro 15 is this functional DOE analysis, and that's why I added that octane rating as our supplementary variable. So I'm going to go down here and minimize some of these things a little bit. Down here in this FDOE profiler, we've actually done some generalized regression; it's built in. As we look through these different wavelengths, we can see what happens with the octane as we get to the different wavelengths, so a particular wavelength may be different for the different octane ratings, and that's what you're looking for. You want to see differences. So where can we see the biggest differences? I don't know if you saw that happen out here, but right here on the end, at the higher wavelengths, we're seeing some significant changes. So I'm going to go out here, and as you can see the curve is bowed a little bit there, and as I go back to the lower wavelengths this curve starts to flatten out, or actually gets a little steeper; it's not as flat as it was at the higher octane ratings. OK. So again, this is all about investigating the data. But what we're going to do is go ahead and save those functional principal components, and we'll do that through our function summaries right here. We need to customize that: I'm going to deselect all the summaries, and I'm going to put four in there because that's the number I want to save. And just as a watch-out, make sure you say OK and save; if you just say OK, it's fine, it just won't get you where you want to be. So we're going to say OK and save, and we get a new table with our functional principal component scores in there, all four of them, for the different samples and the different octane ratings. Now what we have to do is get that information back to our main data table. You could do this through a virtual join; what I actually did is copy these over, and there's a way to do that fairly simply. So I need to go back over to my main data table.
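The 85% rule of thumb is easy to mimic outside JMP: run a PCA on the smoothed curves and keep the smallest number of components whose cumulative explained variance reaches 85%. This sketch uses random stand-in data in place of the smoothed spectra and is not the FDE algorithm itself.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
smoothed = rng.normal(size=(60, 400))          # stand-in for 60 smoothed spectra x 400 wavelengths

pca = PCA().fit(smoothed)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_fpc = int(np.searchsorted(cumvar, 0.85) + 1)         # smallest count reaching 85%
print(f"keeping {n_fpc} components ({cumvar[n_fpc - 1]:.1%} of variation)")

fpc_scores = PCA(n_components=n_fpc).fit_transform(smoothed)   # one row of scores per sample
# These scores play the role of the saved FPC columns used as covariates in the DOE step.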
And if this works for me, I don't need to keep this table; I just want to get these scores over to the main table. There we go. You just grab them and drag them over to your main data table and drop them. I've already done it, so I'm not going to drop them in there, but that's one way to get the data over there quickly. Let me minimize that for now. So this is the data; it's right there. I've copied it over, so we've got the scores. Now we're going to do what I consider the most important step: we're going to pick the samples that we're going to analyze. This is going to get you down to a much smaller number of samples to build your models on, and this is where we're going to use design of experiments to get us there. So we select DOE, Custom Design. Don't worry about the response right now; we're going to add factors, and specifically covariate factors, and you'll see in a minute why we're doing this. So I add covariates, and you have to select what your covariate factors are; we're going to choose the functional principal components and say OK. We're going to look in this functional principal component space to figure out which samples to analyze to build our model. So I select Continue. Right now it's saying select all 60 and build the model from there. Well, we want to take that down to a much smaller number; we're going to use 15. That allows us to select a smaller number of spectra. We don't have to run as many, but you have them all, and you can select from them. So I say Make Design, and while this is building... alright, we don't need this, I'm going to get rid of it; that's just some information. What we see now is that in our data table 15 rows have been selected, highlighted in blue. I'm going to right-click on that blue area and put some markers on them, a star, and I'm actually going to color those as well. So let's take those blue. Okay. And before I forget, what you do now is take these and do a table subset. So Tables, Subset, with selected rows and all columns, say OK, and this is where we're going to be doing our modeling. But before I go there, let's go back to our main data table and go to Analyze, Multivariate Methods, Multivariate, and instead of using the wavelengths, I'm going to use the functional principal components. Put those in as our Y, say OK, and now look at this: before, we had almost complete correlation for a lot of the wavelengths, and we've taken that out of play. And if you look at the space now, at the markers, the stars, we're pushing things out to the corners of our four-dimensional space, but we're also looking through the center of the space as well. So this is more or less a space-filling design, but it's spreading the points out to the point where we're hopefully going to get a decent model out of it. Okay. So we've got that, and I need to pull up my data table again. Pull this one up. These are the samples we're interested in, the ones we're going to build our model on, and I'm going to slide back over here to the beginning.
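JMP's custom designer handles the covariate-factor selection for you; as a rough stand-in for the idea, here is a greedy D-optimality-style exchange in Python that picks 15 rows whose FPC scores spread well through the score space. The score matrix is a random stand-in and the algorithm is only illustrative, not what JMP uses.

import numpy as np

def greedy_d_optimal(scores, n_pick=15, n_pass=20, seed=0):
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    X = np.column_stack([np.ones(n), scores])        # intercept plus FPC scores

    def logdet(idx):
        sign, val = np.linalg.slogdet(X[idx].T @ X[idx])
        return val if sign > 0 else -np.inf

    chosen = list(rng.choice(n, n_pick, replace=False))
    best = logdet(chosen)
    for _ in range(n_pass):
        improved = False
        for i in range(n_pick):
            for cand in range(n):
                if cand in chosen:
                    continue
                trial = chosen.copy()
                trial[i] = cand                      # try swapping one selected row
                val = logdet(trial)
                if val > best:
                    chosen, best, improved = trial, val, True
        if not improved:
            break
    return sorted(chosen)

fpc_scores = np.random.default_rng(3).normal(size=(60, 4))   # stand-in for the saved FPC scores
picked_rows = greedy_d_optimal(fpc_scores)
print("rows selected for the subset table:", picked_rows)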
So these are the rows that were selected, and now we're going to go to Analyze, Fit Model. Octane is what we're looking for, so that's our response, and we're going to use all of our wavelengths. The next thing I'm going to show you is a JMP Pro feature, where you select partial least squares in the Fit Model platform. You can also do the same partial least squares analysis in regular JMP, just so you know, but we're setting it up here in case you wanted to look for interactions. We're not going to worry about that in this model. Select Run, and we've got to make a few modifications. Here you can choose which method you want, NIPALS or SIMPLS. SIMPLS is probably a little more statistically rigorous, but NIPALS works for our purposes. For the validation method, we do want validation, but we don't have very many samples, so we're going to use leave-one-out: each row in turn is pulled out and used as the validation. So we're going to start and just say Go. As you can see up here in the percent variation explained for the X and the Y, we're doing very well; the model is explaining quite a bit of the variation for both X and Y, from 90-some percent to almost 100%. That's great, but it's using nine latent factors. Remember, we only had four functional principal components, so let's see what happens when we change to four. Select Go, and we do lose something in our model, but it's not bad; we're still getting a decent overall fit, and that's the one we're going to go with. We're going to use that model instead of the more complicated model with nine latent factors, so I'm actually going to remove this fit, and then we're going to look at this four-factor partial least squares fit. What we're looking for down here is that the data isn't spread out in some wild fashion; in the score plots, the data sits somewhere close to the fit line, and we're okay with that. Looking at other parts of this, again at how much of the variation is being explained, we're at about 97% there and almost 99% here for Y, and that's good. Let's look at a couple of other things while we're here. Look at the percent variation plots, which give us an idea of how these spectra are different, and we can see that latent factor one explains a fair amount of the differences, but latent factor two explains the more important part. So that's where we're dialing in; three and four are still part of the model but they're not as important. Something else we can look at is this variable importance plot. There is a cutoff value here, 0.8, that dotted red line. If you wanted to do variable reduction, you could do it here; you could actually lower the number of wavelengths you're looking at, but we're going to leave that as is. And the way to actually make that change, to do the variable reduction, would be through this variable importance table of coefficients on centered and scaled data; you could make a model based on just the important variables. Again, that dotted line is the cutoff line, and a fair number of those wavelengths would be cut out of the model.
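Outside JMP, the same comparison of four versus nine latent factors can be sketched with scikit-learn's PLS (NIPALS-based) and leave-one-out cross validation; the spectra and octane values below are random stand-ins, so only the mechanics carry over.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(4)
X_sub = rng.normal(size=(15, 400))           # stand-in for the 15 selected spectra
y_sub = rng.normal(loc=88, size=15)          # stand-in octane ratings

for n_comp in (4, 9):
    pls = PLSRegression(n_components=n_comp, scale=True)
    pred = cross_val_predict(pls, X_sub, y_sub, cv=LeaveOneOut()).ravel()
    press = float(np.sum((y_sub - pred) ** 2))
    print(f"{n_comp} latent factors: leave-one-out PRESS = {press:.3f}")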
But again, we're going to keep that, and we're going to go up here to the red hotspot, go to Save Columns, and save the prediction formula back to the data table. I'll minimize that. Now we've got this formula out here; that's our new formula. If I go to Analyze, Fit Y by X, use octane, grab that formula and say OK, great, we fit the model and our R squares are around .99. That's really great, but the question is, how does that work for the rest of the data? I'll show you that in a minute. Before we get there, I want to show you a separate method, another option, and I'll show you the setup and the model. You would go to Analyze, Fit Model, and do Recall, and this time, instead of partial least squares, we're going to use generalized regression. Select that, we've got a normal distribution, and we can go ahead and select Run. We're going to change a few things here. Instead of using the lasso, we're going to use the elastic net, and then under the advanced controls we'll make a change in a second. For the validation method, remember we used leave-one-out, so we'll change to that, and we're going to select the early stopping rule. And we're also going to make this change here under the advanced controls. This is what really drives why I use generalized regression at all: it helps make a simpler model. If you blank out that elastic net alpha and then run your model, when I click Go it steps through the lasso, the elastic net, and ridge regression, all those steps, so it fits all those models, or tries to, and then gives you the best elastic net alpha. Doing that takes a little bit of time, because you're building all those different models, so I'll show you the outcome in this fit right here, which I had done earlier. This is the actual output I got from that model, again with leave-one-out, and it gave me 41 nonzero parameters. If I show you the other model, the partial least squares model uses 400 wavelengths. So we've basically reduced the number of active factors by a factor of 10 with this elastic net model. We can look at the solution path and change things, reduce the number of factors or add more, but for the most part we'll just leave the model as is. We would save this model back to our data table; I've already done that. Now let's compare those. That's this model right here, the information over here on the left. I went past it too fast. This highlighted column: if I right-click there and go to the formula, I can look at it, and these are the important wavelengths, the important wavelengths for that model for predicting octane. If I look at the partial least squares model and go to the formula there, this is the partial least squares model, and again it uses all 400 wavelengths. So it's a much more complicated model. You're more than welcome to use it; it's actually a very good model, so there's no reason not to, but if you can build a simpler model, that's always a good thing. Alright, so we've got these formulas in our new subset table and we want to transfer them back to the original data table.
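A comparable elastic net fit, with the mixing parameter chosen by cross validation and a count of the nonzero wavelength coefficients, can be sketched with scikit-learn as below; this is only an analogy to the generalized regression step, not JMP's implementation, and the data are the same kind of random stand-ins as in the PLS sketch.

import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X_sub = rng.normal(size=(15, 400))           # stand-in spectra
y_sub = rng.normal(loc=88, size=15)          # stand-in octane ratings

# Let cross validation pick the mixing parameter (l1_ratio here, "elastic net alpha" in JMP terms)
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=LeaveOneOut(), max_iter=50000)
make_pipeline(StandardScaler(), enet).fit(X_sub, y_sub)

n_active = int(np.sum(enet.coef_ != 0))
print(f"chosen l1_ratio = {enet.l1_ratio_}, nonzero wavelength coefficients = {n_active}")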
So again, right click formula, copy this formula, right, and then you would go back to your data table, make a new formula column over here. Right click Go to formula and paste that formula in your data set. All right. Well, I've already done that to save some time. Okay. And we've got, I've got both models there. I've got the partial least squares model in there, and really, what we're going to come down to, is we're going to go to Analyze fit Y by X. And we're going to go to octane rating, right, and I've previously done the PLS analysis. Now this model was built with a...48 samples were the training set and 12 for the validation side. Alright, so that's there. I've got my generalized regression formula, and I've got my octane prediction formula. Actually, this is the other PLS approach right here. And this one. And we're going to add those, and we're gonna say okay and compare those. And now you can see in here where we're doing very well overall. The models are doing well. We're still doing about 97% for our generalized regression model, in the end, which is still good. The PLS model beats it out a little bit, but then, remember, that's a much more complicated model. And overall, you know, we've built this nice predictive model that we can share with others. And as you get new spectra entries, analyze new spectra, all you have to do is drop those wavelengths into your data table and see what the octane rating is. All right. So you've made that analysis, you've made that comparison. And if nothing changes, in the day or over a period, of course, of a couple of days with your calibration, this should be a good model. It should be a sustainable model for you. So with that, I believe, I'm going to go back to, well, let me show you one more thing. I'm going to go to another data set that I wanted to share with you. This is the... as I'm trying to find it...I go to my home window... And this is a mass spec study for, actually it's a prostate cancer study and there's some unusual data with that. Right. And I'll want to show you... there's a couple of different ways, but what I want to show you is, pull this in here. Right. So instead of...this is abnormal versus normal status and... Showing you the power of the tools for...let's go to Analyze and then all the process, model driven multivariate...before I go there, let me color on status. Alright, so we'll do that, we'll give them some markers here. Okay. We're gonna go to Analyze... quality and process, model driven multivariate control chart. All of our wavelengths. Right. It takes a second to output, but I've got all the, right now, looks like I've got all the normal data selected, right, so that's what you're seeing there, if I click there. The red circles are the abnormal data, and for the most part, we see that there's a lot more of those out of control, compared to the normal data, right. The nice thing about this is if I could pull up one of those, I can start seeing which portion of the data is different than what we're seeing with the so called in control or normal data, right, and... Oh, Back to that. There. Gonna show you something else. Go to...we want to monitor the process again or look at the process a little deeper. So let's go to the right hotspot, monitor the process and then we're going to go to the score plot, right. So now we can compare these two groups. Well, we have to do a little bit of selection here. So let's go back to the data table. Right click, say select matching cells. 
And we're gonna go back over here and that's all selected, so that we're going to make that abnormal group, our group A, right. Go back to the data table. And scroll down. Select normal. Select matching cells and now that's going to be our Group B so now we can compare those. And now we can see where there's differences in the spectra, like, so this is maybe on the more normal side that you won't see in the abnormal side, right. But you're gonna...there are a lot more differences that you're going to see in the abnormal side that you would not see in the normal side, right. So this allows you to, again, dig deeper and better understand that. And finally, if I do this analysis for the functional data explorer with this grouping... Again rows as functions. Right. Y output. Status is our supplementary variable. Sample IDs, ID function. Say okay. And we'll fit this again with a P-spline model. This will take a second. While we're waiting for this to happen, I'm just going to show you, at the end, the generalized regression portion of this will be done, but I just want to show you what it's like looking at a categorical data set with the functional data explorer. Using that functional DOE capability. It ends up being, could be very valuable. And when you're looking for differences in spectra. And again, this is mass spec data. This isn't your IR data. This is mass spec data. We fit it, we've looked at our different spetra, how it's fit and we're happy with that. We can look at the functional principal components. Can look at the score plots. Let's look at the functional DOE. And again, where do we see differences? If I go over here and we're looking at abnormal spectra. It doesn't have this peak that the normal does, right, so now we can look at that and see, you know, again, help us better understand what differences we might see. All right. And in closing, let's go to back to the PowerPoint. Alright so what this process allows you to do is compress the available information with respect to wavelengths or mass or whatever it happens to be. Use this covariate DOE to help you select the so called corners of the box for getting a good representative sample of data to analyze. Model that data with a partial least squares, generalized regression. You can also use more sophisticated techniques like neural nets. And as new spectra comes in, you put the data into the data table and you see where it falls out. So this is highly efficient or helps you be more highly efficient with your experimentation and your analysis. And again, build that sustainable empirical model. Looking forward, the data that I've used is fairly clean and we're looking at working with the our developers and looking at how we can preprocess the spectral data and get even better analysis and better predictive models.  
Christian Stopp, JMP Systems Engineer, SAS Don McCormack, Principal Systems Engineer, SAS   Generations of fans have argued as to who the best Major League Baseball (MLB) players have been and why, oft citing whichever performance measures best supported their case. Whether the measures were statistics of a particular season (e.g., most home runs) or cumulative of their career (e.g., lifetime batting average), such statistics do not fully relate a player’s performance trajectory. As the arguments progress, it would be beneficial to capture the inherent growth and decay of player performance over one’s career and distill that information with minimal loss. JMP’s Functional Data Explorer (FDE) has opened doors to new ways of analyzing series data and capturing ‘traces’ associated with these functional data for analysis. We will explore FDE’S application in examining player career performance based on historical MLB data. With the derived scores we will see how well we can predict whether a player deserves their plaque in the Hall of Fame…or is deserving and has been overlooked, as well as compare these predictions with those based solely on the statistics of yore. We’ll confirm Ted Williams really was the greatest MLB hitter of all time. What, you disagree?! Must be a Yankees fan…     Auto-generated transcript...   Speaker Transcript Christian So thank you, folks, for joining us here today at the JMP Discovery Summit, the virtual version. My name is Christian Stopp. I am a JMP systems engineer. And I'm joined today by my colleague Don McCormack, who's a principal systems engineer for JMP as well. And you probably got here because you saw the title of the talk. And you saw this was...you're a baseball fan about Major League Baseball players and wanted in or you saw it was about functional data explorer and you wanted to learn a little bit more about how to employ functional data explorer in different environments. So we're going to marry those two topics today. Don and I and I'm going to gear my conversation a little more for the baseball fans first. Just as we're having kind of common conversations among baseball players and baseball fans, you might think about how your favorite player does relative to other players and you might have with your friends, these conversations and hopefully they're kept, you know, polite about about who your favorite player is and why. And so that's kind of how I imagined this infamous conversation between Alex Rodriguez and Varitek going was just about who...comparing notes about who their favorite player was. And so for me, my origin started off, and like Don's, with respect to just be having a love for baseball and being interested in the baseball statistics that you'll find in the back of the bubble gum cards we used to collect. And so as you have these conversations about who your favorite player is, you might note that players differ with respect to how good they are, but also different things like when they age... as they age, where they peak, like where the performance starts to go off over time. And so as you're thinking about maybe like me the career trajectories of these players, you might want to question, Well, how do I capture or model that performance over time? Now, if you're oddly like me, you decide that you want to pursue statistics so that you can do exactly that. 
But I would encourage you to skip that route and be smarter than me and just use a tool like functional data explorer to help you turn those statistics...statistical curves into numbers to use for your endeavors. So for those of you who are a little less familiar with baseball, but what we'll be seeing is data reflecting things that are measures of baseball performance. So I'm going to be speaking about position players and position players bat. And so one of the metrics of their batting prowess is on-base percentage plus slugging percentage or OPS. And so on the Y axis, I've got that that measure for a couple of different players as they age. And the blue is Babe Ruth and the red is Ted Williams. And as you can see, you get a sense from these trajectories that they both appear to have about the same quality of performance over most of their careers. But you might know that where they peak might seem to be a little at an older age for Ted Williams, as opposed to maybe Babe Ruth. And Babe Ruth, it looks like he maybe needed just a little bit of time to just get up to speed to get to that measure if you're just looking at this plot without any other knowledge. So there's a lot of...this is just two players in the thousands of players or tens of thousands that you might be considering and just look at comparing, you can imagine there's a lot of variability about these characteristics of their career trajectories. So there's also clearly variability within a player's trajectory, too. So I might use the smoothing function of the Graph Builder here and just smooth out the noise associated with those curves a little bit, to get a better sense of the signal about that player's trajectory. And it turns out that that smoothing is is very similar to what's going on in that process that functional data explorer employs. So here I've got functional data explorer and again I'm...my metric here is on-base percentage plus slugging percentage, OPS. And I'm looking just to see...like we're comparing these these player trajectories, now, in FDE is, functional data explorer, is smoothing out those player curves, as you can see, and then extracting information about what's common across those curves. And so for every player now, what you get in return for doing that is, are scores that are associated with that player's performance. And so these scores describe the player's career trajectory in a nice little quantitative way for us to take away and use another analyses like we'll be doing. So it's just, you can see that a little bit, these are Hank Aaron scores. And in the profiler that you'll...that you can access in the functional data explorer, you can actually change...you can look at that trajectory here for that player's OPS over age and then change those values to reflect what that player's scores are and get a better...replicate their their career trajectory with those scores. Right, so that's a little bit about FDE and and how to employ it here. So you'll see Don and I talking about these statistics that we're now equipped with, these player scores that we get out from the functional data explorer, that gets it from those curves that we started off with. And so we're going to use that...some what we're doing is predicting like maybe Hall of Fame status. 
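For reference, OPS is simply on-base percentage plus slugging percentage; a minimal helper with the standard formulas is shown below, using hypothetical Lahman-style counting-stat inputs.

def ops(h, bb, hbp, ab, sf, singles, doubles, triples, hr):
    # On-base percentage and slugging percentage from standard counting stats
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    slg = (singles + 2 * doubles + 3 * triples + 4 * hr) / ab
    return obp + slg

# Example: 600 AB, 180 H (110 1B, 30 2B, 5 3B, 35 HR), 80 BB, 5 HBP, 5 SF -> about .926
print(round(ops(180, 80, 5, 600, 5, 110, 30, 5, 35), 3))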
And not only who's in the Hall of Fame that they belong there, or more more interestingly, like maybe, who are the players who are in the Hall of Fame that maybe shouldn't be because the stats don't support it or maybe identify players who the Hall of Fame committee seems to have snubbed. So we'll talk a little bit about just the different metrics that we used and how we kind of revised them. And then taking those career trajectories using FDE and then getting the scores out and doing the prediction, like we normally would with other things. So if you haven't followed baseball, the Hall of Fame eligibility...eligibility requirements are that a player had to play at least 10 years, so 10 seasons, and had to wait...you have to wait five years before you're eligible. And then you have 10 years during which you're eligible and folks can vote you in. So there's a couple of players we'll see that are still have...that are still waiting for the call. The hall uses a different selection criteria are primarily around how well the player performed, but also take into account these other things that the data source we're using, Lahman Database, doesn't include, so it's hard to measure. So we're just stick with analyses that reflect their statistical prowess on the on the field. And of course after, you know, 150 years of baseball players playing baseball, you might recognize that they're playing in different eras. And so we want to make sure that we're comparing the players to their peers. And so we're going to take that, you know, maybe the year that they played into account, or the position that they played since different requirements are associated...would typically be associated with different positions. And then different leagues have different rules; we'll weigh that in, too. That's where I'm gonna stop. Don's gonna kick over to pitching and then I'll come back and talk about position players. donmccormack So like Christian said, I'm going to talk a little bit about pitching but while I'm doing that, before that I'm doing that, what I would like to do is, I would like to illustrate some of those initial points that that Christian mentioned. The things that are good data analytic techniques, things that really need to be done, regardless of what modeling technique that you use, however, it turns out that they are good things to do before you model your data using FDE. I'm going to talk specifically about cleaning the data, about normalizing the data, so you can compare people equally, and then finally modeling the data. So as an illustration, what I've got...what you see on the screen right now, we are looking at three very different pitchers that are all in the Hall of Fame. The red line is Nolan Ryan, a very long career, about a 27-year career. The green line, the middle line, that's Hoyt Wilhelm. For some of you younger folks, you might not know who Hoyt Wilhelm; he pitched starting in the early 50s through 72, I believe. Fairly long career; spanned multiple eras. He was mostly a reliever but not a reliever like you might know of the relievers today. He's a guy who when he went out to relieve, yYou know, he might pitch six innings. Okay, so very, very atypical from the relievers today. The blue line is Trevor Hoffman, great closer for the San Diego Padres. But again, very different pitcher. So question is, I mean, what do we do, how do we get this data ready and set up in such a way where we can compare all three of these people equally? 
So first thing I mentioned is we want to clean up your data. And by the way, I'm going to use four different metrics. I'm going to use WHIP (walks and hits per innings pitched), strikeouts per nine, home runs per nine and a metric I've easily created called percent batters faced over the minimum, where I've just taken the number of batters a pitcher's faced divided by the total outs that they've gotten and subtracted one. The idea here is that if every batter that was faced made it out, then that would be a perfect one. Okay, I'm going to look at those four metrics. I've got different criteria in terms of how I define my normalization, in terms of how I am screening outliers and I'm going to include a PowerPoint deck for you to look at to get the details, but I'm not going to talk about them here for the sake of time. So first thing I'm gonna do is going to clean up the data. So you'll notice that, for example, that very first year Nolan Ryan pitched three innings pitched; very, very high WHIP. As a couple of seasons in here, I think that Trevor Hoffman pitched a low amount. So, so I'm going to start by excluding the data. That's nice. It's shrunk the range and it's always good to get out, get the outliers out of the data before you do the analysis. One other step that that I want to mention is that when I did FDE, when I used FDE on this data within the platform, it allows you to do some additional outlier screening where, even if you have multiple of columns that you're using, you only are screen...you're not screening out the entire row; you're only screening out the values for that given input, which is a very, very nice feature. So I use that as well because there were still, even with the my initial screening, there was still a few anomalies that I needed to get rid of. clean the data. Normalize it is the second. So by normalization, what I've done is, I basically normalized on the X axis. And I've normalized on the Y axis. So, what we're looking at here is the number of seasons. So each one of these seasons is taken as a separate whole entity, but we all know that in some seasons, some pitchers throw more innings than other seasons. So rather than looking at seasons as my entities, I'm going to look at the cumulative percent of career outs. So I know that, I know that at the end of the season pitchers made so many cumulative career outs, and that's a certain proportion out there, whole or total career outs. So I'm going to use that to scale my data. Now the great thing about that is, you'll notice that now all three pitchers are on the same x scale. Everything, everything is scaled from zero to one. So, so, really nice... from the standpoint of FDE analysis, a really nice thing to have. And then finally, I want to scale on the Y axis as well. And all I've done is I've divided the WHIP by the average WHIP for the pitcher type and for the era that they pitched in. So I have a relative WHIP. Now the other nice thing about about using these relative values is that I know where my line in the sand is. I know that a pitcher that has a relative WHIP of one is is an average pitcher. So in this case, I'm going to be looking for those guys that throw with WHIPs under one. So you'll notice that all three of these pitchers for the most of their career, they were under that that line at one. Now the final thing I'm going to do, is I want to use my FDE to model that trajectory, the trajectory. Now, one of the problems with using the data as is, the two problems with using the data, as is. 
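The metrics and the two normalizations described here are straightforward to reproduce; the sketch below uses a few made-up seasons for a generic pitcher (IPouts = outs recorded, BFP = batters faced) and a stand-in era average, just to show the arithmetic.

import pandas as pd

seasons = pd.DataFrame({
    "player": ["p1", "p1", "p1"],
    "H":      [200, 180, 160],
    "BB":     [90, 80, 70],
    "SO":     [150, 180, 210],
    "HR":     [15, 12, 10],
    "IPouts": [600, 630, 660],
    "BFP":    [900, 910, 920],
})

ip = seasons["IPouts"] / 3                                   # innings pitched
seasons["whip"] = (seasons["BB"] + seasons["H"]) / ip
seasons["so9"] = 9 * seasons["SO"] / ip
seasons["hr9"] = 9 * seasons["HR"] / ip
seasons["pct_bf_over_min"] = seasons["BFP"] / seasons["IPouts"] - 1

# x-axis normalization: cumulative percent of career outs
seasons["cum_pct_outs"] = (seasons.groupby("player")["IPouts"].cumsum()
                           / seasons.groupby("player")["IPouts"].transform("sum"))

# y-axis normalization: divide by the average WHIP for that era and pitcher type (stand-in value)
era_avg_whip = 1.35
seasons["relative_whip"] = seasons["whip"] / era_avg_whip
print(seasons[["whip", "so9", "hr9", "pct_bf_over_min", "cum_pct_outs", "relative_whip"]])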
One is that it's pretty bumpy, and it would be really hard to to estimate what the career trajectory is with all of these ups and downs. Second thing is, eventually what I want to do is, I want to use that metric that I've generated from the FDE, this trajectory to come up with some overall career estimate. So rather than looking at my seasons or at my cumulative percent as discrete entities, I want to be able to model that over entire continuous career. And we'll see that a little bit later on. So I am going to replace my percent my...I'm sorry about...conditional...my WHIP, my my relative WHIP with this conditional FDE estimate. Now, you might have seen me flip back to those two, you might say, oh boy, that is what a...what a what a huge difference between the two, is that really doing a good job? Kind of hard to tell from that graph. So, so what I, what I want to do is I'm going to actually show you what that looks like. So here what I've done is I pulled up the, the, the, the discrete values. This is Nolan Ryan, by the way. The discrete measurements for Nolan Ryan, along with his curve for his for his conditional FDE estimate, you'll see that it doesn't follow the same jagged path or bumpy path, but it does a good job estimating what his career trajectory is. And in general with his WHIP high, at first, he walked a lot of people, was a very, very wild pitcher, much more wild in the beginning part of his career, believe it or not. But as his career went on, that got better. And this is you'll see this in in any of the pitchers that I that I picked. So for example, if I go to, let's go to Hoyt Wilhelm. Here's Wilhelm. Again it doesn't capture the absolute highs and lows, but it does a good job at modeling the general direction of where, of where his career went. Okay, so let's let's use that to ask. I only have a limited amount of time. I wish I had more time because there's just some neat things I can show you. But I'm...I'm going to start with what I call the snubbed. Okay so these are the players that...so I used FDE on those four metrics I'd mentioned. I use those as inputs, along with the pitcher type and I tried a whole bunch of predictive modeling techniques. The two that that worked the best for me were naive Bayes and discriminate analysis. And I use those two modeling techniques to tell me who got in...who should be in and and and and who shouldn't be in and and that's what...what we're looking at here is, we're looking at those pitchers where both the naive Bayes and the discriminate analysis said yes, but the Hall of Fame said no. So these are my...this snubbed. So you'll notice that in this case...and let me switch to this. This is the apps. This is the relative WHIP. Let's go with the conditional WHIP. And let me go ahead and put that reference line back in there at one and you'll see, for the most part, these are pitchers, who spent the top...the bulk of their career under that one line. Now the other thing that you might might think of, looking at this data, is that wow, it would be really hard to tell these players apart. How do I compare these now, if I if I were to put, let's say, a few pitchers that were in the hall in this list, too. I mean, they would be...it'd be hard to separate them just by eyeballing them, because some of their career, they would be better than others, and they would switch on other parts of their careers. How do I, how do I deal with this on a career level? 
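The classification step itself is standard; as a sketch, the FPC scores (random stand-ins below) can be fed to naive Bayes and linear discriminant analysis in scikit-learn and compared by cross validation, with the "snubbed" list being the rows both models call yes while the actual indicator is no.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(5)
scores = rng.normal(size=(1400, 12))          # stand-in FPC scores (several metrics x a few FPCs)
in_hof = rng.random(1400) < 0.06              # stand-in Hall of Fame indicator

for name, model in [("naive Bayes", GaussianNB()),
                    ("discriminant analysis", LinearDiscriminantAnalysis())]:
    acc = cross_val_score(model, scores, in_hof, cv=5).mean()
    print(f"{name}: 5-fold accuracy = {acc:.2f}")
# "Snubbed" = rows where both fitted models predict True but in_hof is False.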
So as I mentioned earlier, one of the nice things about functional data explorer is that I can take that data, and I can I can I can create a career trajectory. Estimate a whole bunch of data points along that career trajectory. And I did that I actually broke up careers into 100 units and I summed over all those hundred units for each one of my curves. So basically, what I did is I got something like an area under the curve. If it were above that one line, I'd subtract, if it were below that one line, I would...I'm sorry...if I were above that line, I would add; below the line I would subtract. And if we look at total career trajectories...this is a, this is actually...this is a larger list. This is approximately 1300 or 1400 pitchers, so absolutely everyone who was... absolutely everyone who was Hall eligible, 10 years or more. So let's let's really quickly go into a couple of things we can do with this. Let's start...let me start out by by looking at the players that were snubbed. So these are...these are our player...this these are my players that were snubbed. So okay, so these are 100 values. So, so, so the line in the sand here would be 100 because I've got 100 different values I've measured. So you'll notice that for the most part, these players were above 100. Here's the list of, of the, of the players that didn't make the list. And if you take a look at these players, you'll notice that there are a couple of guys in here that are obvious. People like Curt Schilling and the and the and the Roger Clemens for non non non career reasons for the for the for the for the for the...that some of the other criteria that Christian mentioned,are not in there. But there's some guys, for example, Clayton Kershaw who's still not done with his career. But there certainly are other people who you might consider..that are Hall eligible. So let's actually, let's look at that, too. So let's look at those folks who are who are Hall eligible, who have not been in the hall... BJ Ryan; again Curt Schilling is in there; Johan Santana, not sure why he didn't make it in the hall; Smokey Joe Wood, pitcher from the early part of the 1900s; and so on. So, the ability for FDE to allow me to extract values from anywhere along their career trajectory is is is an extra tool for me to be able to estimate some additional criteria, in terms of who belongs in the hall and who doesn't belong in the hall. So, enough said about the pitchers. Let's...I'm gonna turn it back to Christian so we can talk a little bit more about the position players. Christian Excellent. Thank you, Don. Right. Okay so Don was talking about the pitching...the pitchers and so I'm looking, I'll be looking at the position donmccormack players, and so there's two different components that go into that. Christian You have your, your batting prowess, as well as your fielding prowess and I took a little different take than than Don did, with respect to just looking at the statistics and then building models. I ended up starting off with just four of the more common batting statistics, and those are the first four on the list here, some of what you'd find the backs of baseball cards. And then as I was progressing, as we'll see, I needed something to capture stolen bases, because the first four don't really...don't do that at all. And so I created a metric I call the base unit average that brings into other base runner movements that...to give credit to the batter for those things. 
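The career-summary idea can be sketched as follows: evaluate the smoothed relative metric at 100 evenly spaced points along the scaled career and sum them, so an exactly average pitcher totals about 100; the curve below is made up purely for illustration.

import numpy as np

def career_total(curve, n_points=100):
    # Sum of the relative metric at 100 points along the scaled career (0 to 1).
    # A total of 100 is "average"; for WHIP-type metrics lower is better,
    # for strikeouts per nine higher is better.
    grid = np.linspace(0, 1, n_points)
    return float(np.sum([curve(t) for t in grid]))

example_curve = lambda t: 0.9 + 0.15 * (t - 0.5) ** 2     # made-up relative-WHIP trajectory
print(round(career_total(example_curve), 1))               # a bit under 100, i.e., better than average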
And then the fielding, of course, is a factor as well as we'll see, so I included a couple of metrics for fielding. And so like Don like just mentioned earlier, I wanted to make sure I compared apples to apples, so I'm looking at with reference to position and league and year for this those statistics I mentioned. And then when like Don, I wanted to make sure I I weighted those smaller sample sizes appropriately so they weren't gumming up the system. And so I ended up weighting players' performance relative to the number of plate appearances relative to kind of the average for that league year at a particular lineup slot on how many plate appearances that slot should get over the course of the season. So that's how they're weighted. Right. So let's, let's see what that looks like. We're going to go back and visit Ted Williams again here. So we've got Tim Williams' career on the left here, we saw, and these are the raw scores. And then it looks like he had a really poor season here. But if, once you take a relative component of that, you can see it's actually an average season like Don, it's still above that average line of one. And so it was just a kind of a poor season for Ted Williams on his own standards. And then we saw earlier that these two peaks for Ted Williams might have resembled were his peak performance, but it turns out that those are seasons where he had smaller numbers of samples...of played appearances due to his being...going off to the Korean War. So he ended up having that impact his scores. I weighted accordingly back toward the average again because of the smaller sample sizes. So, that's how we, the types of data. I'm going to focus on just the relative statistics in my conversation here and just focus on some of the things that caught my eye. There we go. And we'll do that. Need the table of numbers here to feed in from the FDE. So here's the scores that we're going to be looking at, the relative FPC scores from the FDE. And what the first thing I saw, I included a four variables in my model, those first four batting statistics, and I wanted to just make sure I had the right components in my, in my analysis. So on the left axis here is the model-driven probability of being in the Hall of Fame. Now what...excuse me, that's the y axis, on the x axis is whether or not the person actually is in the Hall of Fame. And so my misclassification areas are these two sections here. And I noted that there were some players down here more than I was kind of expecting. So I was exploring and we might explore variables that I didn't yet include, like stolen bases. And so I'll pop those in for color and size. And as you can see, it seemed pretty clear to me that stolen bases is definitely a factor that the Hall of Fame voters were taking into account. These are... so the color and size are relative to the number of stolen bases over their career. And this is what drove me to create that base unit average statistic that I then used. So adding in...as I was exploring those models I as I described, I started off with four statistics and then added in that BUA statistic. This is my x axis now. And then I added in fielding statistics and we what we have here is a parallel plot, where the y axis is again...is a probability of the model suggesting the player should be in the Hall of Fame and each of the lines now is a player. And so the color represents their Hall of Fame status. Red is yes, there were already admitted, and blue is no. 
And so I like this plot because it allows me to look at to see who's moving. If I can see the impact of those additional variables in the model. And of course the first thing that caught my eye was this guy here, that how it popped up from being a not really to adding the stolen base component, and we can see that he's a high probability being elected to Hall of Fame and so belonging, depending on how you look at it. And it's Ricky Henderson, who happens to be the career leader of stolen bases. Now another player, and just looking at the defensive side of things, is Kirby Puckett, whose initial statistics suggest that, based on the initial model, that he makes it; he's qualified sufficiently just across the line. But then, you know, back if you add in the stolen base component, yeah, he actually doesn't seem to qualify any longer. And then finally, we put in the fact that he's, he's a really good fielder, he won a number of golden gloves playing the center field for the twins, we see that he's back in the good graces of the Hall of Fame committee, and rightfully voted in. This is kind of a messy model. Not messy model and you did a lot of stuff going on here. So, I ended up adding in my local data filter so I could kind of look at each position individually. And here for first base, it's a lot easier to see that we have the, the folks in red and then in blue. Now we've got somebody here, this is Todd Helton who, at least in all the models that we were looking at suggest that he should be admitted to Hall of Fame and he's still eligible. So he's still waiting to the call. But someone like Dick Allen, there's also blue, not in. His numbers, at least based on the the summary stats, the FDE statistics that we're using and the models suggest he shouldn't...he belongs in the Hall. And there are other folks who are red, down in the bottom, like Jim Thome, who the models suggest he doesn't really belong, but he was voted in. So, different ways of exploring those different relationships among, as we add in those predictors. Now, like Don, I wanted to get a sense of, well, who's, who was snubbed and who might have been gifted or at least had, you know, non statistically oriented components to his consideration. And so I, like Don, running a number of models and settled on four models that I was, I liked and did the best job of... predictive job, and like Don, rather than just using age in my FDE as the x axis, I also based it on a cumulative percent played appearances. And so that would...having these two different variants gave me a number of models to look at. And so I drilled down to just the folks who across all eight models, are in the Hall of Fame, but none of the models suggest they should be. And that's this line here. There's 31 of those. And the reverse side I have in green here, the folks who the models in either of the buckets...the majority the models in either bucket of age versus Kimball diff percent of plate appearances suggest they do belong in the Hall of Fame, but they're not. So I pulled all these folks out and just like Don wanted to, just compare what what are their trajectories look like and is there...are they close at least, or is there something else going on here? And so you can see from the this is the on on base percentage plus slugging percentage, OPS, again. 
It certainly looks like, in red and the plus signs, that the folks who were snubbed performed a lot better on this metric, and as it turns out, every other offensive stat metric better than the gifted folks, the folks who are in, but the model suggests shouldn't be. And that made me think, Well, is it, is it just the offensive stats that are and maybe the fielding is where the, the, the folks who were in already shine? And based on what at least fielding percentage, it actually suggests that there that still is the case, where... actually this is this snubbed folks. The, the gifted folks still look like they were... they don't necessarily belong as much as the these snubbed folks do. It was only on the range factor component where the tide reversed. And so you end up seeing the gifted folks outweigh the snubbed folks who performed better. That's another different take, much like Don's, that you can use to evaluate just what the components are included in your model. A lot of different ways we can look at the data here. So just wrapping up because I'm sure some of you are just burning to know who is snubbed and who is gifted among those folks. These are some of the folks that were snubbed, at least among the position players and, like Don mentioned for some of his pitchers, there's a few of these folks who are banned from baseball, so they're not exactly snubbed, so. you probably recognize some of these. And then these are some of the players who were gifted, or at least it the criteria of their statistics alone is...it may not have been what got them in the Hall of Fame. Right, so just wrapping up where we've been, we've been able to take those player career trajectories of their performance on...pick a metric and put that into the functional data explorer and get out numerical summaries that capture the essence of those curves. And then, in turn, use those statistics those scores that we get to be able to put those in our traditional statistic techniques that we're familiar with. And so now we can change that question from how you model or quantify career trajectory and revise it to a question of what do I want to explore with these FPC scores I've got? So we hope you enjoyed talking about baseball and just that interaction to baseball and JMP and FDE. And hope you feel empowered to go and take the FDE tool that's available in JMP Pro to address questions with data like who your favorite player is and why, and have the means of backing it up. Thanks for joining us. Take care. donmccormack Okay, so how do we deal with these cases where we need to look at somebody's career trajectory? Are there other metrics where we can make these comparisons, so that we could tell these really fine gradations apart? So as I as I alluded to earlier, what we could do is we could we could certainly we could we could look at absolutely any point along the along the person's career trajectory with any amount of gradation that we want to. And I did that. I took 100 data points, 100 values between zero and one, start of the career, end of the career, and I summed up over all those values. And I did this...the nice thing about this technique is that I can do it for multiple metrics. So, so now what we're looking at here is we are looking at, we're looking at a plot of all four metrics. We can plot them all on one graph. We're going to go back again to that group of folks that were that were snubbed, these folks here. 
So that's so...so if we take a look at these folks, we see that they had a low...by the way, 100 in this case because there were 100 observations. hundred home runs per nine, you want that low; percent batters faced over the minimum, low; and then the strikeouts over nine innings, you want on the high side. You'll notice that that's kind of the trajectory that folks follow. Now then, the interesting thing about this point is, that what I can do is, I can use any criteria that I want to. So for example, let's say I'm going to look at...I'm going to consider all my players and I only want to consider those people who had A WHIP that was below, in this case, 100...so better than that...that's actually that's...even make it better than that. Let's say 90 or below. Okay, so let's look at those folks who, you know, at least have the average number of strikeouts per nine innings, and maybe their batters per...percent batters faced over 100 is at a minimum. And so, and I'll disregard home runs for nine here. I also, you could also standardize and normalize by the number of seasons and I've done that exactly. So what I want to do is I want to look at those players that maybe only have 10 season equivalents, where a season equivalent is based on what was the average player season like. All right. And then finally, what kind of workload they had over their, their entire career. And let's say we want somebody who had at least 80%, let's make a little bit more stricter, let's say, let's say, about the same workload. And again, we can use different criteria to weed out those folks who we don't think we should consider and those folks who we do think we consider and then using those criteria... I also want to say let's let's take a look at those folks that are not in the Hall of Fame. So here we go. Now we have a list of people who are worth considering. And you'll notice that they're they're quite a few folks folks that probably shouldn't surprise you. These are folks that are either not in the hall yet because they're still playing or just have been disregarded know, Chris Sale, for example, is still pitching. Curt Schilling, for obvious reasons is not the hall. Johan Santana, why, why isn't he in the hall? He was actually part of that group that that that were snubbed. So the nice thing about using these FDEs is that you can take them, turn them into your career trajectories, and then use an additional metric to be able to determine hall worthiness and non Hall worthiness.  
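The final screening step amounts to filtering a table of career totals on several criteria at once; a pandas sketch with made-up totals (column names hypothetical) is below.

import pandas as pd

totals = pd.DataFrame({
    "player":             ["p1", "p2", "p3"],
    "whip_total":         [88.0, 104.0, 86.0],
    "so9_total":          [120.0, 95.0, 101.0],
    "bf_over_min_total":  [97.0, 106.0, 99.0],
    "season_equivalents": [12.0, 9.0, 15.0],
    "in_hall_of_fame":    [False, False, True],
})

candidates = totals[(totals["whip_total"] <= 90)
                    & (totals["so9_total"] >= 100)
                    & (totals["bf_over_min_total"] <= 100)
                    & (totals["season_equivalents"] >= 10)
                    & (~totals["in_hall_of_fame"])]
print(candidates["player"].tolist())          # -> ['p1']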
Wenjun Bao, Chief Scientist, Sr. Manager, JMP Life Sciences, SAS Institute Inc Fang Hong, Dr., National Center for Toxicological Research, FDA Zhichao Liu, Dr., National Center for Toxicological Research, FDA Weida Tong, Dr., National Center for Toxicological Research, FDA Russ Wolfinger, Director of Scientific Discovery and Genomics, JMP Life Sciences, SAS   Monitoring the post-marketing safety of drug and therapeutic biologic products is very important to the protection of public health. To help facilitate the safety monitoring process, the FDA has established several database systems including the FDA Online Label Repository (FOLP). FOLP collects the most recent drug listing information companies have submitted to the FDA. However, navigating through hundreds of drug labels and extracting meaningful information is a challenge; an easy-to-use software solution could help.   The most frequent single cause of safety-related drug withdrawals from the market during the past 50 years has been drug-induced liver injury (DILI). In this presentation we analyze 462 drug labels with DILI indicators using JMP Text Explorer. Terms and phrases from the Warnings and Precautions section of the drug labels are matched to DILI keywords and MedDRA terms. The XGBoost add-in for JMP Pro is utilized to predict DILI indicators through cross validation of XGBoost predictive models by the term matrix. The results demonstrate that a similar approach can be readily used to analyze other drug safety concerns.   Auto-generated transcript...   Speaker Transcript wenjba It's my pleasure to talk about obtaining high quality information from the FDA drug labeling system here at JMP Discovery. Today I'm going to cover four parts. First, I'll give some background on drug post-marketing monitoring and the efforts from the FDA, other regulatory agencies and industry. Then I'm going to use a drug label data set to analyze the text with Text Explorer in JMP, use the XGBoost add-in for JMP to analyze the DILI information, and then give the conclusions. The XGBoost tutorial by Dr. Russ Wolfinger is also presented at this JMP Discovery Summit, so please go to his tutorial if you're interested in XGBoost. According to the FDA's description of the drug development process, it can be divided into five stages. The first two stages, discovery and research and preclinical, mainly involve animal studies and chemical screening, and the later three stages involve humans. JMP has several products, including JMP Genomics, JMP Clinical, JMP and JMP Pro, that cover every stage of drug development. JMP Genomics is the omics system that can be used for omics and clinical biomarker selection, JMP Clinical is specific to clinical trials and post-marketing monitoring of drug safety and efficacy, and JMP Pro can be used for drug data cleaning, mining, target identification, formulation development, DOE, QbD, bioassay and so on. So JMP can be used at every stage of drug development. In drug development there is a most frequent single cause, called DILI, that can stop a clinical trial; a drug can be rejected for approval by the FDA or another regulatory agency, or be recalled once it is on the market. This is the most frequent single cause, DILI, and you can find the information in the FDA guidance and other scientific publications. So what is DILI?
This actually is drug-induced liver injury, called DILI and you have FDA, back almost more than 10 years ago in 2009, they published a guide for DILI, how to evaluation and follow up, FDA offers multiple years of the DILI training for the clinical investigator and those information can still find online today. And they have the conferences, also organized by FDA, just last year. And of course for the DILI, how you define the subject or patient could have a DILI case, they have Hy's Law that's included in the FDA guidance. So here's an example for the DILI evaluation for the clinical trial, here in the clinical trial, in the JMP Clinical by Hy's Law. So the Hy's Law is the combination condition for the several liver enzymes when they elevate to the certain level, then you would think it would be the possible Hy's Law cases. So you have potentially that liver damages. So here we use the color to identify the possible Hy's Law cases, the red one is a yes, blue one is a no. And also the different round and the triangle were from different treatment groups. We also use a JMP bubble plot to to show the the enzymes elevations through the time...timing...during the clinical trial period time. So this is typical. This is 15 days. Then you have the subject, starting was pretty normal. Then they go kind of crazy high level of the liver enzyme indicate they are potentially DILI possible cases. So, the FDA has two major databases, actually can deal with the post-marketing monitoring for the drug safety. One is a drug label and which we will get the data from this database. Another one is FDA Adverse Event Reporting System, they then they have from the NIH and and NCBI, they have very actively built this LiverTox and have lots of information, deal with the DILI. And the FDA have another database called Liver Toxic Knowledge Base and there was a leading by Dr. Tong, who is our co are so in this presentation. They have a lot of knowledge about the DILI and built this specific database for public information. So drug label. Everybody have probably seen this when you get prescription drug. You got those wordy thing paper come with your drug. So they come with also come with a teeny tiny font words in that even though it's sometimes it's too small to read, but they do contain many useful scientific information about this drug Then two potions will be related to my presentation today, would be the sections called warnings and precautions. So basically, all the information about the drug adverse event and anything need be be warned in these two sections. And this this drug actually have over 2000 words describe about the warnings and precautions. And fortunately, not every drug has that many side effects or adverse events. Some drugs like this, one of the Metformin, actually have a small section for the warning and precautions. So the older version of the drug label has warnings and precautions in the separate sections, and new version has them put together. So this one is in the new version they put...they have those two sections together. But this one has much less side effects. So JMP and the JMP clinical have made use by the FDA to perform the safety analysis and we actually help to finalize every adverse event listed in the drug labels. So this is data that got to present today. So we are using the warning and precaution section in the 462 drug labels that extracted by the FDA researchers and I just got from them. And the DILI indicator was assigned to each drug. 1 is yes and the zero is no. 
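As a rough illustration of the Hy's Law screen described above: JMP Clinical provides this as a built-in report, but the idea can be sketched in Python. The column names below are hypothetical, and the thresholds are the commonly cited reading of Hy's Law (aminotransferase at least 3x the upper limit of normal together with bilirubin at least 2x, without a cholestatic explanation), not the exact logic of the JMP Clinical report.

```python
import pandas as pd

# Hypothetical lab data: one row per subject, values already expressed
# as multiples of the upper limit of normal (xULN).
labs = pd.DataFrame({
    "subject": ["S01", "S02", "S03"],
    "alt_xuln": [5.2, 1.1, 3.4],   # alanine aminotransferase
    "ast_xuln": [4.8, 0.9, 2.1],   # aspartate aminotransferase
    "bili_xuln": [2.5, 1.0, 0.8],  # total bilirubin
    "alp_xuln": [1.2, 1.1, 2.5],   # alkaline phosphatase
})

# Flag possible Hy's Law cases: ALT or AST >= 3xULN, bilirubin >= 2xULN,
# and ALP below 2xULN (no cholestatic pattern).
labs["possible_hys_law"] = (
    (labs[["alt_xuln", "ast_xuln"]].max(axis=1) >= 3)
    & (labs["bili_xuln"] >= 2)
    & (labs["alp_xuln"] < 2)
)
print(labs[["subject", "possible_hys_law"]])
```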
So from this distribution, you can see that about 164 drugs have potential DILI cases and 298 do not. The original format of the drug label data is XML, and JMP can import many of those files at once. The DILI keyword list was a multi-year effort by the FDA: experts read hundreds of drug labels and decided which terms could indicate potential DILI cases, and they arrived at about 44 words or terms to serve as keywords. You may also have heard of MedDRA, the Medical Dictionary for Regulatory Activities. It has several levels of standardized terms, and the most popular level, the preferred terms, is what I'm going to use today. If we pull everything in the Warnings and Precautions sections together, we have over 12,000 terms. You can see that "patients" and "may" are dominant, and they should not be related to the medical information in this case. Once we remove those, no other word is quite so dominant in the word cloud, but there are still many medically unrelated words, like "use" and "reported," that we could remove from the analysis list. In Text Explorer we can put them into the stop words, and we would normally also use the other Text Explorer techniques — stemming, tokenizing, regex, recoding, and manual deletion — to clean up the list. But with 12,000 terms that could be very time consuming. Since we already have a list of the terms we are interested in, we want to take advantage of it. So what we're going to do, and what I'll show in the demo, is use only the DILI keywords plus the MedDRA preferred terms to generate the terms and phrases for the prediction. Here is the example using only the DILI keywords. In the list you can see a count next to each term showing how many times it appears in the Warnings and Precautions sections, and the word cloud gives a more colorful, graphical view for recognizing patterns. Then we add the medically related MedDRA terms, which brings us from 12,000 terms down to 1,190 terms covering the DILI keywords and the medical preferred terms. We think this is a good term list to start the analysis with. Next, in Text Explorer, we can save the document term matrix: a 1 means the document contains the term, and a 0 means it does not. Then, for XGBoost, we make k-fold columns — three k-fold columns, each with five folds. We use the XGBoost tree model, which is an add-in for JMP Pro, with the DILI indicator as the target variable and, as predictors, the DILI keywords plus the MedDRA preferred terms that appear more than 20 times. Then we run cross-validated XGBoost with 300 iterations.
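A minimal sketch of the restricted document-term matrix idea — matching label text only against a fixed list of DILI keywords and MedDRA preferred terms — using Python's scikit-learn rather than JMP's Text Explorer. The two label texts and the four-term vocabulary below are hypothetical stand-ins for the 462 labels and the 1,190-term list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical Warnings and Precautions texts.
warnings_text = [
    "Hepatotoxicity and jaundice have been reported in patients ...",
    "May cause dizziness; no hepatic failure events were observed ...",
]
# Tiny stand-in for the DILI keyword + MedDRA preferred term list.
vocabulary = ["hepatotoxicity", "jaundice", "hepatic failure", "liver injury"]

# binary=True yields the 0/1 document-term matrix described above;
# ngram_range=(1, 2) lets two-word preferred terms match as phrases.
vec = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2),
                      lowercase=True, binary=True)
dtm = vec.fit_transform(warnings_text)
print(vec.get_feature_names_out())
print(dtm.toarray())
```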
Now we got statistical performance metrics, we get term importance to DILI, and we get, we can use the prediction profiler for interactions and also we can generate and the save the prediction formula for new drug prediction. So I'm going to the demo. So this is a sample table we got in the in JMP. So you have a three columns. Basically you have the index, which is a drug ID. Then you have the warnign and precaution, it could have contain much more words that it's appeared, so basically have all the information for each drug. Now you have a DILI indicator. So we do the Text Explorer first. We have analysis, you go to the Text Explorer, you can use this input, which is a warning and precaution text and you would you...normally you can do different things over here, you can minimize characters, normally people go to 2 or do other things. Or you could use the stemming or you could use the regex and to do all kind of formula and in our limitation can be limited. For example, you can use a customize regex to get the all the numbers removed. That's if only number, you can remove those, but since we're going to use a list, we'll not touch any of those, we can just go here simply say, okay, So it come up the whole list of this, everything. So now I'm going to say, I only care about oh, for this one, you can do...you can show the word cloud. And we want to say I want to center it and also I want to the color. So you see this one, you see the patient is so dominant, then you can say, okay this definitely...not the... should not be in the in analysis. So I just select and right click add stop word. So you will see those being removed and no longer showed in your list and no longer show in the word cloud. So now I want to show you something I think that would speed up the clean up, because there's so many other words that could be in the system that I don't need. So I actually select and put everything into the stop word. So I removed everything, except I don't know why the "action" cannot be removed. And but it's fine if there's only one. So what I do is I go here. I said manage phrase, I want to import my keywords. Keyword just have a... very simple. The title, one column data just have all the name list. So I import that, I paste that into local. This will be my local library. And I said, Okay. So now I got only the keyword I have. OK, so now this one will be...I want to do the analysis later. And I want to use all of them to be included in my analysis because they are the keywords. So I go here, the red triangle, everything in the Text Explorer, hidden functions, hidden in this red triangle. So I say save matrix. So I want to have one and I want 44 up in my analysis. I say okay. So you will see, everything will get saved to my... to the column, the matrix. So now I want to what I want to add, I want to have the phrase, one more time. I also want to import those preferred terms. into the my database, my local data. Then also, I want to actually, I want to locally to so I say, okay. So now I have the mix, both of the the preferred terms from the MedDRA and also my keywords. So you can see now the phrases have changed. So that I can add them to my list. The same thing to my safe term matrix list and get the, the, all the numbers...all the terms I want to be included. And the one thing I want to point out here is for these terms and they are...we need to change the one model format. This is model type is continuing. I want to change them to nominal. I will tell you why I do that later. 
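A minimal sketch of the cross-validated XGBoost step in Python (the talk itself uses the XGBoost add-in for JMP Pro with three k-fold columns). The data here are a synthetic stand-in for the 462-label term matrix, and the hyperparameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier  # pip install xgboost

# Synthetic stand-in for the binary document-term matrix and DILI indicator.
X, y = make_classification(n_samples=462, n_features=200, n_informative=20,
                           weights=[0.65, 0.35], random_state=1)

model = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
print(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
```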
So now I have, I can go to the XGBoost, which is in the add in. We can make...k fold the columns that make sure I can do the cross validation. I can use just use index and by default is number of k fold column to create is three and the number of folds (k) is within each column is five, we just go with the default. Say, okay, it will generate three columns really quickly. And at the end, you are seeing fold A, B, C, three of them. So we got that, then we have... Another thing I wanted to do is in the... So we can We can create another phrase which has everything...that have have everything in...this phrase have everything, including the keywords and PT, but I want to create one that only have the only have only have the the preferred term, but not have the keyword, so I can add those keywords into the local exception and say, Okay. So those words will be only have preferred terms, but not have the keywords. So this way I can create another list, save another list of the documentation words than this one I want to have. So have 1000, but this term has just 20. So what they will do is they were saved terms either meet... have at least show up more than 20 times or they reach to 1000, which one of them, they will show up in the my list. So now I have table complete, which has the keywords and also have the MedDRA terms which have more than 20, show more than 20 times, now also have ??? column that ready for the analysis for the XGBoost. So now what I can do is go to the XGBoost. I can go for the analysis now. So what I'm going to do show you is I can use this DILI indicator, then the X response is all my terms that I just had for the keyword and the preferred words. Now, I use the three validation then click OK to run. It will take about five minutes to run. So I already got a result I want to show you. So you have... This is what look like. The tuning design. And we'll check this. You have the actual will find a good condition for you to to to do so. You can also, if you have as much as experience like Ross Wolfinger has, he will go in here, manually change some conditions, then you probably get the best result. But for the many people like myself don't have many experienced in XGBoost, I would rather use this tuning design than just have machine to select for me first, then I can go in, we can adjust a little bit, it depend on what I need to do here. So this is a result we got. You can use the...you can see here is different statistic metrics for performance metrics for this models and the default is showed only have accuracy and you can use sorting them by to click the column. You can sorting them and also it has much more other popular performance metrics like MCC, AUC, RMSE, correlation. They all show up if you click them. They will show up here. So whatever you need, whatever measurement you want to do, you can always find here. So now I'm going to use, say I trust the validation accuracy, more than anything else for this case. So I want to do is I want to see just top model, say five models. So what here is I choose five models. Then I go here, say I want to remove all the show models. So you will see the five models over here and then you can see some model, even though the, like this 19 is green, it doesn't the finish to the halfway. So something wrong, something is not appropriate for this model. I definitely don't want to use that one, so others I can choose. Say I want to choose this 19, I want to remove that one. So I can say I want to remove the hidden one. 
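As a rough analogue of the add-in's tuning design — letting the software propose candidate hyperparameter settings and keeping the best by cross-validated accuracy — here is a hedged Python sketch using randomized search. The candidate ranges are assumptions for illustration, not the add-in's actual tuning design.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier  # pip install xgboost

# Synthetic stand-in data; in practice this is the saved term matrix.
X, y = make_classification(n_samples=462, n_features=200, n_informative=20,
                           random_state=1)

# Illustrative candidate ranges for a few common XGBoost settings.
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(2, 8),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}
search = RandomizedSearchCV(XGBClassifier(), param_dist, n_iter=20,
                            scoring="accuracy", cv=5, random_state=1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```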
So you can do whatever you need there. If you compare these metrics, they're actually not very different, so I want to rely on the graphs to help me choose the best-performing model. Once you've chosen a good one — say I like model 12 — you can ask for the profiler. This is a very powerful tool, and I think it's quite unique to JMP; not many tools have this function. It lets you look at individual predictors interactively and see how they change the result. For example, these two terms show up most frequently in the DILI cases, and you can see their slopes are quite steep, which means changing them affects the final prediction quite a bit. When hepatitis and jaundice are both zero, the probability of DILI being 1 is very low — a low chance of a possible DILI case. If you move one of them to 1, the predicted chance goes up, and if you move both, it goes higher still. So you have a way to see which predictors really drive the result. Some of the others, even though they are keywords, are pretty flat, which means changing them does not affect the result much. We also get a list of the most important features for the prediction; you can see that jaundice and a few others are quite important. And what about new data coming in? Here you can say "save prediction formula," and you can see it actively working on that; at the end of the table you will see the prediction columns. Remember, the first and second drugs were predicted to be DILI cases, and the third, fourth, and fifth were close to zero. If we go back to the DILI indicator, we find that the first five were predicted correctly. So if you don't have the indicator when new data come in, you don't have to read the whole label — you run the model, look at the prediction, and you pretty much know whether it is a DILI case or not. That ends my demo, and now I'll give the conclusion. We used Text Explorer to extract the DILI keywords and MedDRA terms using Stop Words and Phrase Management, without manual selection, deletion, and recoding; we used the visualization and created a document term matrix for prediction. For machine learning we used XGBoost modeling: we can quickly run XGBoost to find the best model, examine the prediction profiler, and save the prediction formula to predict new cases. Thank you, and I'll stop here.
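A small Python sketch of the last two steps described above — ranking term importance and scoring new, unlabeled drug labels the way a saved prediction formula would. The data and term names are synthetic stand-ins, not the actual DILI term matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier  # pip install xgboost

# Synthetic stand-in for the labeled term matrix (columns = terms).
X, y = make_classification(n_samples=462, n_features=50, n_informative=10,
                           random_state=1)
terms = [f"term_{i}" for i in range(X.shape[1])]   # hypothetical term names

model = XGBClassifier(n_estimators=300, max_depth=4).fit(X, y)

# Analogue of the term-importance list: rank terms by importance.
top = np.argsort(model.feature_importances_)[::-1][:10]
print([terms[i] for i in top])

# Analogue of "save prediction formula": score new, unlabeled labels.
X_new = X[:5]                              # pretend these are new labels
print(model.predict_proba(X_new)[:, 1])    # predicted probability of DILI = 1
```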
Michael Crotty, JMP Senior Statistical Writer, SAS Marie Gaudard, Statistical Consultant, Statistical Consultant Colleen McKendry, JMP Technical Writer, JMP   The need to model data sets involving imbalanced binary response data arises in many situations. These data sets often require different handling than those where the binary response is more balanced. Approaches include comparing different models and sampling methods using various evaluation techniques. In this talk, we introduce the Imbalanced Binary Response add-in for JMP Pro that facilitates comparing a set of modeling techniques and sampling approaches. The Imbalanced Classification add-in for JMP Pro enables you to quickly generate multiple sampling schemes and to fit a variety of models in JMP Pro. It also enables you to compare the various combinations of sampling methods and model fits on a test set using Precision-Recall ROC, and Gains curves, as well as other measures of model fit. The sampling methods range from relatively simple to complex methods, such as the synthetic minority oversampling technique (SMOTE), Tomek links, and a combination of the two. We discuss the sampling methods and demonstrate the use of the add-in during the talk.   The add-in is available here: Imbalanced Classification Add-In - JMP User Community.     Auto-generated transcript...   Speaker Transcript Michael Crotty Hello. Thank you for tuning into   our talk about the imbalanced classification add in that allows you to compare sampling techniques and models in JMP Pro.   I'm Michael Crotty. I'm one of these statistical writers in the documentation team at JMP and my co-presenter today is Colleen McKendry, also in the stat doc team. And this is work that we've collaborated on with Marie Gaudard.   So here's a quick outline of our talk today. We will look at the purpose of the add in that we created, some background on the imbalanced classification problem and how you obtain a classification model in that situation.   We'll look at some sampling methods that we've included in the add in and that are popular for the imbalanced classification problem.   We'll look at options that are available in the add in   and talk about how to obtain the add in, and then Colleen will show an example and a demo of the add in.   In the slides that are available on the Community, there's also references and an appendix that has additional background information.   So the purpose of our add in, the imbalanced classification add in, it lets you apply a variety of sampling techniques that are designed for imbalanced data.   You can compare the results of applying these techniques, along with various predictive models that are available in JMP Pro.   And you can compare those models and sampling technique fits using precision recall curves, ROC curves, and Gains curves, as well as other measures.   This allows you to choose a threshold for classification using the curves.   And you can also apply the Tomek, SMOTE and SMOTE plus Tomek sampling techniques directly to your data, which enables you to then use existing JMP platforms and   on on that newly sampled data and fine tune the modeling options, if you don't like the mostly default method options that we've chosen.   And just one note, the Tomek, SMOTE and SMOTE plus Tomek sampling techniques can be used with nominal and ordinal, as well as continuous predictor variables.   So some background on the imbalanced data problem.   
So in general, you could have a multinomial response, but we will focus on the response variable being binary, and the key point is that the number of observations at one response level is much greater than the number of observations had the other response level.   And we'll call these response levels the majority and minority class levels, respectively. So the minority level, most of the time, is the level of interest that you're interested in predicting and detecting. This could be like detecting fraud or the presence of a disease or credit risk.   And we want to predict class membership based on regression variables.   So to do that we developed a predictive model that assigns probabilities of membership into the minority class and then we choose a threshold value that optimizes   various criteria. This could be misclassification rate, true positive rate, false positive rate, you name it. And then we classify an observation, who's into the minority class, if the predicted probability of membership to the minority class exceeds the chosen threshold value.   So how do we obtain a classification model?   We have lots of different platforms in JMP that can make a prediction for a binary variable, binary outcome   when in the presence of regression variables, and we need a way to compare those models. Well, there are some traditional measures, like classification accuracy, are not all that appropriate for imbalanced data. And just as a extreme example, you could consider the case of a 2% minority class.   I could give you 98% accuracy, just by classifying all the observations as majority cases. Now this would not be a useful model and you wouldn't want to use it,   because you're not predicting...you're not correctly predicting any of your target cases to minority cases but just overall accuracy, you'd be at 98%, which sounds pretty good.   So this led people to explore other ways to measure classification accuracy in a imbalanced classification model. One of those is the precision recall curve.   They're often used with imbalanced data and they plot the positive predictive value or precision against the true positive rate recall.   And because the precision takes majority instances into account, the PR curve is more sensitive to class imbalance than an ROC curve.   As such, a PR curve is better able to highlight differences in models for the imbalanced data. So the PR curve is what shows up first in our report for our add in.   Another way to handle imbalanced classification data is to use sampling methods that help to model the minority class.   And in general, these are just different ways to impose more balance on the distribution of the response, and in turn, that helps to better delineate the boundaries between the majority and minority class observations. So in in our add in we have seven different sampling techniques.   We won't talk too much about the first four and we'll focus on the last three, but very quickly, no weighting means what it sounds like. We won't do any...won't make any changes and that's   essentially in there to provide a baseline to what you would do if you didn't do any type of sampling method to account for the imbalance.   Weighting will overweight the minority cases so that the sum of the weights of the majority class and the minority class are the same.   Random undersampling will randomly exclude majority cases to get to a balanced case and random oversampling will randomly replicate   minority cases again to get to a balanced state.   
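A quick numerical illustration of the 2% minority example above, and of why a precision-recall-style summary is more informative than raw accuracy for imbalanced data. This is a generic Python sketch, not the add-in's code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, average_precision_score

rng = np.random.default_rng(1)
y_true = (rng.random(5000) < 0.02).astype(int)   # roughly 2% minority class

# "Classify everything as majority": high accuracy, but zero recall.
y_all_majority = np.zeros_like(y_true)
print(accuracy_score(y_true, y_all_majority))    # about 0.98
print(recall_score(y_true, y_all_majority))      # 0.0 - no minority cases found

# A PR-style summary (average precision) works on scores and is far more
# sensitive to how well the minority class is ranked.
y_score = rng.random(5000)                       # a useless random scorer
print(average_precision_score(y_true, y_score))  # near the 2% baseline
```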
And then we'll talk more about the next three more advanced methods in the following slides.   So first of the advanced methods is SMOTE, which stands for synthetic minority oversampling technique.   And this is basically a more sophisticated form of oversampling, because we are adding more minority cases to our data.   We do that by generating new observations that are similar to the existing minority class observations, but we're not simply replicating them like in oversampling.   So we use the Gower distance function and perform K nearest neighbors on each minority class observation and then observations are generated to fill in the space that are defined by those neighbors.   And in this graphic, you can see if we've got this minority case here in red. We've chosen the three nearest neighbors.   And we'll randomly choose one of those. It happens to be this one down here, and then we generate a case, another minority case that is somewhere in this little shaded box. And that's in two dimensions. If you had   n dimensions of your predictors, then that shaded area would be an n dimensional space.   But one key thing to point out is that you can choose the number of nearest neighbors that you   randomly choose between, and you can also choose how many times you'll perform this   this algorithm per minority case.   The next sampling method is Tomek links. And what this method does is it tries to better define the boundary between the minority and majority classes. To do that, it removes observations from the majority class that are close to minority class observations.   Again, we use to Gower distance to find Tomek links and Tomek link is a pair of nearest neighbors that fall into different classes. So one majority and one minority that are nearest neighbors to each other.   And to reduce the overlapping of these instances, one or both members of the pair can be removed. In the main option of our add in, the evaluate models option, we remove only the majority instance. However, in the Tomek option, you can use either form of removal.   And finally, the last sampling method is SMOTE plus Tomek. This combines the previous two sampling methods.   And the way it combines them is it applies this mode algorithm to generate new minority observations and then once you've got your original data, plus a bunch of generated new minority cases,   tt applies to Tomek algorithm to find pairs of nearest neighbors that fall into different classes. And in this method both observations in the Tomek pair are removed.   So the imbalanced classification add in has four options when you install it that all show up as submenu items under the add ins menu.   The first one is the evaluate models option, that allows you to fit a variety of models using a variety of sampling techniques. The next three are just standalone dialogues to just do those three sampling techniques that we just talked about.   So in the evaluate models option of the add in, it provides an imbalanced classification report that facilitates comparison of the model and sampling technique combinations.   It shows the PR curve and ROC curves, as well as the Gains curves, and for the PR and ROC curves, it shows the area under the curve, which generally, the more area under each of those curves, the better a model is fitting.   It provides the plot of predicted probabilities by class that helps you get a picture of how each model is fitting.   
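For readers who want to experiment outside JMP, the three advanced sampling methods have direct counterparts in the Python package imbalanced-learn. Note the differences from the add-in: imbalanced-learn's SMOTE uses Euclidean distance on numeric predictors (SMOTENC handles mixed types), whereas the add-in uses Gower distance to support nominal and ordinal predictors, and TomekLinks removes only the majority member of each link by default. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

# Synthetic imbalanced data (~2% minority).
X, y = make_classification(n_samples=2000, n_features=6,
                           weights=[0.98, 0.02], random_state=1)

# SMOTE: synthesize minority cases among k nearest neighbors.
Xs, ys = SMOTE(k_neighbors=5, random_state=1).fit_resample(X, y)

# Tomek links: drop majority members of cross-class nearest-neighbor pairs.
Xt, yt = TomekLinks().fit_resample(X, y)

# SMOTE + Tomek: oversample first, then clean the class boundary.
Xc, yc = SMOTETomek(random_state=1).fit_resample(X, y)
print(len(y), len(ys), len(yt), len(yc))
```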
And it also provides a techniques and thresholds data table, and that table contains a script that allows you to reproduce the report   that is produced the first time you run the add in. And we want to emphasize that if you run this and you want to save your results without rewriting the entire   modeling and sampling methods algorithm, you can save this techniques and thresholds table and that will allow you to save your results and reproduce the report.   So now we'll look at the dialogue for the evaluating models option. It allows you to choose from a number of models and sampling techniques.   You can put in what your binary class variable is and all your X predictors, and then   we, in order to fit all the models and and   evaluate them on the on a test set, we randomly divide the data into training validation and test sets. You can provide up...you can set the proportions that will go into each of those sets.   There's a random seed option if you'd like to reproduce the results. And then there are SMOTE options   that I alluded to before, where you can choose the number of nearest neighbors, from which you select one to be the nearest neighbor used to generate a new case, and replication of each minority case is how many times you repeat the algorithm for each minority observation.   Again, there are three other sampling option   options in the add in and those correspond to Tomek, SMOTE and SMOTE plus Tomek. In the Tomek sampling option, it's going to add two columns to your data table that can be used as weights for the predict...for any predictive model that you want to do.   The first column removes only the majority nearest neighbor in the link and the other removes both members of the Tomek link, so you have that option.   SMOTE observations will add synthetic observations to your data table.   And it will also it will provide a source column so that you can identify which   observations were added. And SMOTE plus Tomek add synthetic observations and the weighting column that removes both members of the Tomek link.   And the weighting column from the Tomek sampling and SMOTE plus Tomek,   it's just an indicator column that you can use as a weight in a JMP modeling platform. It's just a 1 if it's included, and a 0 if it should be excluded.   Most of the three other sampling option dialogues look basically the same.   One option that's on them and not on the evaluate models option dialogue is show intermediate tables. This option appears for SMOTE and SMOTE plus Tomek.   And basically, it allows you to see data tables that were used in the construction of the SMOTE observations. In general, you don't need to see it, but if you want to better understand how those observations are being generated, you can take a look at those intermediate tables.   And they're all explained in the documentation.   Again, you can obtain the add in through the Community,   through this the page for this talk on the Discovery Summit Americas 2020   part of the Community. And as I mentioned just a second ago, there's documentation available within the add in. Just click the Help button.   And now it is time for Colleen to show an example in a demo of the add in. Colleen McKendry Thanks Michael. I'm going to demo the add in now, and to do the demo, I'm going to use this mammography demo data.   And so the mammography data set is based on a set of digitized film mammograms used in a study of microcalcifications in mammographic images.   And in this data, each record is classified as either a 1 or a 0. 
1 represents that there's calcification seen in the image, and a 0 represents that there is not.   In this data set, the images where you see calcification, those are the ones you're interested in predicting and so the class level one is the class level that you're interested in.   In the full data set, there are six continuous predictors and about 11,000 observations.   But in order to reduce the runtime in this demo, we're only going to use a subset of the full data set. And so it's going to be about half the observations. So about 5500 observations.   And the observations that are classified as 1, the things that you're interested in, they represent about 2.31% of the total observations, both in the full data set and in the demo data set that we're using. And so we have about a 2% minority proportion.   And now I'm going to switch over to JMP   to   So I have the mammography demo data.   And we're going to open and I've already installed the add in. So I have the imbalanced classification add in in my drop down menu and I'm going to use the evaluate models option.   And so here we have the launch window, and we're going to specify the binary class variable, your predictor variables, we're going to select all the models and all the techniques and we're going to specify   a random seed.   And click OK.   And so while this is running, I'm going to explain what's actually happening in the background. So the first thing that the add in does is that it splits the data table into a training data set and a test data set.   And so you have two separate data tables and then within the training data table those observations are further split into training and validation observations and the validation is used in the model fitting.   And so once you have those two data sets,   there are indicator variables...indicator columns that are added to the training data table for each of the sampling techniques that you specify, except for those that have involve SMOTE.   And so those columns are added and are used as weighting columns and they just specify whether the observation is to be included in the analysis or not.   If you specify a sampling technique with SMOTE, then there are additional rows that are added to the data table. Those are the generated observations.   So once your columns and your rows are generated then for each model, each model is fit to each sampling technique. And so if you select all of them   like we just did here, there are a total of 42 different models that are being fit. And so, that's all what's happening right now. In   the current demo, we have 42 models being fit and once the models are fit, then the relevant information is gathered and put together in a results report. And that report,   which will hopefully pop up soon, here it is, that report is shown here. And you also get a techniques and thresholds table and a summary table.   And so we're going to take a look at what you get when you run the add in. So first we have the training set. And you can see that here are the weighting columns, the weight columns that are added. And these are the columns that are added for the predicted probabilities for those observations.   Then we have the test set. This doesn't contain any of those weighting columns, but it does have the predicted probabilities for the test set observations.   We have the results report   and the techniques and thresholds data table. 
And so Michael mentioned this in   the talk earlier, but this is important because this is the thing that you would like to save if you want to save your results and view your results again   without having to rerun the whole script. And so this data table is what you would save and it contains scripts that will reproduce   the results report and the summary table, which is the last thing that I have to show. And so this is just contains summaries for each sampling technique and model combination and their AUC values.   So now to look at the actual results window, at the top we have dialogue specifications. And so this contains the information that you specified in the launch window.   So if you forget anything that you specified, like your random seed or what proportions you assign, you can just open that and take a look.   And we also have the binary class distribution. So, the distribution of the class variable across the training and the test set. And this is designed so that the proportion should be the same, which they are in this case at 2.3.   And then we also have missing threshold. So this isn't super important, but it just gives   an indication of if a value of the class variable has a missing prediction value, then that's shown here.   For the actual results, we have these tabbed graphs. And so we have the precision recall curves, the ROC curves, and the cummulative Gains curves. And for the PR curves and the ROC curves, we have the corresponding AUC values as well.   We also have these graphs of the predicted probabilities by class. And those are more useful when you're only viewing a few of them at a time, which we will later on.   And then we have a data filter that connects all these graphs and results together.   So for our actual results for the analysis, we can take a look now. So first I'm going to sort these.   So you can already see that the ROC curve and the PR curve, there's a lot more differentiation between the curves in the PR curve than there is in the ROC curve.   And if we select the top, say, five, these all actually have an AUC of .97.   And you can see that they're all really close to each other. They're basically on top of each other. It would be really hard to determine which model is actually better, which one you should pick   And so that's where, particularly with imbalanced data, the precision recall curves are really important. So if we switch back over, we can see that these models that had the highest AUC values for the ROC curves,   they're really spread out in the precision recall curve. And they're actually not...they don't have the highest AUC values for the PR curve.   So maybe that there...maybe there's a better model that we can pick.   So now I'm going to look and focus on the top two, which are boosted tree Tomek and SVM Tomek, and I'm going to do that using the data filter.   And then we just want to look at those are going to show and include.   So now we have the curves for just these two models and the blue curve is the boosted tree and the red curve is SVM.   And so you can see in these curves that they kind of overlap each other across different values of the true positive rate. And so you could use these curves   to choose which model you want to use in your analysis, based on maybe what an acceptable true positive rate would be. So we can see this if I add some reference lines. Excuse my hands that you will see as I type this.   Okay, so say that these are some different true positive rates that you might be interested in. 
So if, for example, for whatever data set you have, you wanted a true positive rate of .55.   You could pick your threshold to make that the true positive rate. And then in this case,   for that true positive rate, the boosted tree Tomek model has a higher precision. And so you could you could pick that model.   However, if you wanted your true positive rate to be something like .85, then the SVM model might be a better pick because it has a higher precision for that specific true positive rate.   And then if you had a higher true positive rate of .95, you would flip again and maybe you would want to pick the boosted tree model.   So that's how you can use these curves to pick which model is best for your data.   And now we're going to look at these graphs again, now that there are only a few of them. And this just shows the distribution of predicted probabilities for each class for the models that we selected. So in this particular case, you can see that in SVM there are majority   probabilities throughout the kind of the whole range of predicted probabilities, where boosted tree does kind of a better job of keeping them at the lower end.   And so that's it for this particular demo, but before we're done, I just wanted to show one more thing. And so that was an example of how you would use the evaluate   models option. But say you just wanted to use a particular sampling technique. And you can do that here. So the launch window looks much the same. And you can assign your binary class, your predictors, and click OK.   And this generates a new data table and you have your   indicator column.   Your indicator column, which just shows whether the observation should be included in the analysis or not.   And then because it was SMOTE plus Tomek you also have all these SMOTE generated observations.   So now you have this new data table and you can use any type of model with any type of options that you may want and just use this column as your weight or frequency column and go from there. And that is the end of our demo for the imbalanced classification add in. Thanks for watching.
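The threshold-picking idea demonstrated above — choose the operating point on the precision-recall curve that meets a target true positive rate — can be sketched in Python as follows. The model and data below are synthetic stand-ins, not the mammography study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.977, 0.023],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=1)
scores = GradientBoostingClassifier().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

prec, rec, thr = precision_recall_curve(y_te, scores)
target_tpr = 0.85
# thresholds align with prec[:-1] and rec[:-1]; among the points that meet
# the target recall, keep the one with the best precision.
eligible = np.where(rec[:-1] >= target_tpr)[0]
best = eligible[np.argmax(prec[eligible])]
print("threshold:", thr[best], "precision:", prec[best], "recall:", rec[best])
```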
Kamal Kannan Krishnan, Graduate Student, University of Connecticut Ayush Kumar, Graduate Student, University of Connecticut Namita Singh, Graduate Student, University of Connecticut Jimmy Joseph, Graduate Student, University of Connecticut   Today all service industries, including the telecom face a major challenge with customer churn, as customers switch to alternate providers due to various reasons such as competitors offering lower cost, combo services and marketing promotions. With the power of existing data and previous history of churned customers, if company can predict in advance the likely customers who may churn voluntarily, it can proactively take action to retain them by offering discounts, combo offers etc, as the cost of retaining an existing customer is less than acquiring a new one.  The company can also internally study any possible operational issues and upgrade their technology and service offering. Such actions will prevent the loss of revenue and will improve the ranking among the industry peers in terms of number of active customers. Analysis is done on the available dataset to identify important variables needed to predict customer churn and individual models are built. The different combination of models is ensembled, to average and eliminate the shortcomings of individual models.  The cost of misclassified prediction (for False Positive and False Negative) is estimated by putting a dollar value based on Revenue Per User information and cost of discount provided to retain the customer.     Auto-generated transcript...   Speaker Transcript Namita Hello everyone I'm Namita, and I'm here with my teammates Ayush, Jimmy and Kamal from University of Connecticut to present our analysis on predicting telecom churn using JMP. The data we have chosen is from industry that keeps us all connected, that is the telecom and internet service industry. So let's begin with a brief on the background. The US telecom industry continues to witness intense competition and low customer stickiness due multiple reasons like lower cost, combo promotional offers, and service quality. So to align to the main objective of preventing churn, telecom companies often use customer attrition analysis as their key business insights. This is due to the fact that cost of retaining an existing customer is far less than acquiring a new one. Moving on to the objective, the main goal here is to predict in advance the potential customers who may attrite. And then based on analysis of that data ,recommend customized product strategies to business. We have followed the standard SEMMA approach here. Now let's get an overview of the data set. It consists of total 7,043 rows of customers belonging to different demographics (single, with dependents, and senior) and subscribing to different product offerings like internet service, phone lines, streaming TV, streaming movies and online security. There are about 20 independent variables; out of it, 17 are categorical and three are continuous. The dependent target variable for classification is customer churn. And the churn rate for baseline model is around 26.5%. Goal is now to pre process this data and model it for future analysis. That's it from my end over to you, Ayush. Ayush Kumar Thanks, Namita. I'm Ayush. In this section, I'll be talking about the data exploration and pre processing. In data exploration, we discovered interesting relationships, for instance, variables tenure and monthly charges both were positively correlated to total charges. 
These three variables we analyzed using scatter plot matrix in JMP, which validated the relationship. Moreover, by using explore missing values functionality, we observed that total charges column had 11 missing values. The missing values were taken care of as a total charges column was excluded due to multicollinearity. After observing the histograms of the variables using exclude outlier functionality, we concluded that the data set had no outliers. The variable called Customer ID had 7,043 unique values which would not add any significance to the target variable. So customer ID was excluded. We were also able to find interesting pattern among the variables. Variables such a streaming TV and streaming movies convey the same information about the streaming behavior. These variables were grouped into a single column streaming to by using our formula in JMP. The same course of action was taken for the variables online backup and online security. We ran logistic regression and decision tree in JMP to find out the important variables. From the effects summary, it was observed that tenure, contract type, monthly charges, streaming to, multiple line service, and payment method showed significant log worth and very important variables in determining the target. The effects on ??? also helped us to narrow down a variable count to 12 statistically significant variables, which formed the basis for further modeling. We use value of ??? functionality and moved Yes of our target variable upwards. Finally, the data was split into training validation and test in 16 20 20 ratio using formula random method. Over to you now, Kamal. Kamal Krishnan Sorry, I am Kamal. I will explain more about the different models built in JMP using the data set. We in total built eight different types of model. On each type of model, we tried various input configuration and settings to improve the results of mainly sensitivity. As our target was to reduce the number of false negatives in the classification. JMP is very user friendly to redo the models by changing the configurations. It was easy to store the results whenever a new iteration of the model is done in JMP and then compare outputs in order to select the optimized model from each type. JMP allowed us to even change the cutoff values from default 0.5 to others and observed the prediction results. This slide shows the results of selected model from eight different type of models. First, as our top target variable journeys categorical we built logistic regression. Then we build decision tree, KNN, ensemble models like Bootstrap forest and boosted tree. Then we built machine learning models like neural networks. JMP allowed us to set the random seed in models like neural networks and KNN. This helped us to get the same outputs we needed. Then we built naive Bayes model. JMP allowed us to study the impact of various variables through prediction profiler. We can point and click on to change the values in the range and see how it impacts the target variable. By changing the prediction profiler in naive bayes, we observed that increase in tenure period helps in reducing the churn rate. On the contrary, increase in monthly charges increases the churn rate. Finally, we did ensembel of different combination of models to average and eliminate the shortcomings of individual models. We found that in ensembling neural network and naive bayes has higher sensitivity among ???. This ends the model description. Over to you, Jimmy. JJoseph Thank you, Kamal. 
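A hedged Python sketch of the preprocessing described above — collapsing the two streaming columns into one, dropping the collinear total charges column, and making the 60/20/20 training/validation/test split. The column names follow the familiar telecom churn layout but are assumptions here, and the four rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical slice of the churn table.
df = pd.DataFrame({
    "StreamingTV": ["Yes", "No", "Yes", "No"],
    "StreamingMovies": ["No", "No", "Yes", "Yes"],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75],
    "Churn": ["Yes", "No", "Yes", "No"],
})

# Group the two streaming behaviors into a single column.
df["Streaming"] = np.where((df["StreamingTV"] == "Yes")
                           | (df["StreamingMovies"] == "Yes"), "Yes", "No")

# Drop TotalCharges (collinear with tenure and MonthlyCharges).
df = df.drop(columns=["TotalCharges", "StreamingTV", "StreamingMovies"])

# 60/20/20 train/validation/test split.
train, rest = train_test_split(df, train_size=0.6, random_state=1)
valid, test = train_test_split(rest, train_size=0.5, random_state=1)
print(len(train), len(valid), len(test))
```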
In this section we compare the models and take a deeper dive into each model's details. The major parameters used to compare the models are the cost of misclassification in dollars, the sensitivity versus accuracy chart, the lift ratio, and the area under the curve values. The cost of misclassification data is shown in the top right corner of the slide. The costs of false positives and false negatives were determined using average monthly charges: the cost of a false negative — the model predicts no churn for a customer who actually leaves — works out to about $85, and the cost of a false positive to about $14, reflecting the 20% discount offered to retain a customer who would have stayed. The cost comparison chart clearly indicates that naive Bayes has the lowest cost. Moving on to the total accuracy chart, accuracy ranges between 74% and 81%, with not much variation across most of the models. Lift, a measure of the probability of finding a success record compared to the baseline model, varies between 1.99 and 3.11. The AUC of the ROC curve is another measure of model strength, and as the chart indicates, all the models did about equally well in this category. The sensitivity and accuracy chart measures each model's success at predicting customer churn accurately; it captures two things — how many churners the model can correctly identify, and how often its predictions are accurate. This measure was used as the major parameter for deciding the best-performing model, and naive Bayes did well here. Based on the various metrics, and considering the cost of failed predictions, naive Bayes came out as the best and most parsimonious model for predicting customer churn on this data set: it has the lowest misclassification cost, high sensitivity, and reasonably good total accuracy. Its main drawback is the lack of an underlying statistical model, but it is data driven and easily explainable. Moving on to the conclusions: the most significant variables in the data set are the contract type and the customer's tenure. From the modeling, we observed that churn is high for 1) customers without dependents, 2) customers who pay a high price for their phone services and have low satisfaction with high-end services, and 3) customers who stick to a single original line of service and can easily switch to competitors. Based on those findings, the recommendations are 1) targeted customer promotions focused on income generation, 2) pushing long-term contracts with additional incentives, and 3) building product combos focused on customer needs. In conclusion, we used JMP to do the analysis and build predictive models on a limited data set; it is very effective and powerful for this kind of analysis. Please reach out to us if you have any further questions. Thank you.
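A small sketch of how the dollar cost of misclassification can be computed from a confusion matrix using the per-error costs quoted above. The labels and predictions below are made up for illustration; the $85 and $14 figures are the ones cited in the talk.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical test-set labels and predictions (1 = churn).
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

cost_fn = 85.0   # missed churner: roughly one month of average revenue lost
cost_fp = 14.0   # unneeded retention offer: about a 20% discount
total_cost = fn * cost_fn + fp * cost_fp
print(tn, fp, fn, tp, total_cost)
```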
Steve Hampton, Process Control Manager, PCC Structurals Jordan Hiller, JMP Senior Systems Engineer, JMP   Many manufacturing processes produce streams of sensor data that reflect the health of the process. In our business case, thermocouple curves are key process variables in a manufacturing plant. The process produces a series of sensor measurements over time, forming a functional curve for each manufacturing run. These curves have complex shapes, and blunt univariate summary statistics do not capture key shifts in the process. Traditional SPC methods can only use point measures, missing much of the richness and nuance present in the sensor streams. Forcing functional sensor streams into traditional SPC methods leaves valuable data on the table, reducing the business value of collecting this data in the first place. This discrepancy was the motivator for us to explore new techniques for SPC with sensor stream data. In this presentation, we discuss two tools in JMP — the Functional Data Explorer and the Model Driven Multivariate Control Chart — and how together they can be used to apply SPC methods to the complex functional curves that are produced by sensors over time. Using the business case data, we explore different approaches and suggest best practices, areas for future work and software development.     Auto-generated transcript...   Speaker Transcript Jordan Hiller Hi everybody. I'm Jordan Hiller, senior systems engineer at JMP, and I'm presenting with Steve Hampton, process control manager at PCC Structurals. Today we're talking about statistical process control for process variables that have a functional form.   And that's a nice picture right there on the title   slide. We're talking about statistical process control, when it's not a single number, a point measure, but instead, the thing that we're trying to control has the shape of a functional curve.   Steve's going to talk through the business case, why we're interested in that in a few minutes. I'm just going to say a few words about methodology.   We reviewed the literature in this area for the last 20 years or so. There are many, many papers on this topic. However, there doesn't really appear to be a clear consensus about the best way to approach this statistical   process control   when your variables take the form of a curve. So we were inspired by some recent developments in JMP, specifically the model driven multivariate control chart introduced in JMP 15 and the functional data explorer introduced in JMP 14.   Multivariate control charts are not really a new technique they've been around for a long time. They just got a facelift in JMP recently.   And they use either principal components or partial least squares to reduce data, to model and reduce many, many process variables so that you can look at them with a single chart. We're going to focus on the on the PCA case, we're not really going to talk about partial   the   partial least squares here.   Functional Data Explorer is the method we use in JMP in order to work with data in the shape of a curve, functional   data. And it uses a form of principal components analysis, an extension of principal components analysis for functional data.   So it was a very natural kind of idea to say what if we take our functional curves, reduce and model that using the functional data explorer.   
The result of that is functional principal components and just as you you would add regular principal components and push that through a model driven multivariate control chart,   what if we could do that with a functional principal components? Would that be feasible and would that be useful?   So with that, I'll turn things over to Steve and he will introduce the business case that we're going to discuss today. 1253****529 All right. Thank you very much. Jordan.   Since I do not have video, I decided to let you guys know what I look like.   There's me with my wife Megan and my son Ethan   with last year's pumpkin patch. So I wanted to step into the case study with a little background on   what I do, and so you have an idea of where this information is coming from. I work in investment casting for precision casting...   Investment Casting Division.   Investment casting involves making a wax replicate of what you want to sell, putting it into a pattern assembly,   dipping it multiple times in proprietary concrete until you get enough strength to be able to dewax that mold.   And we fire it to have enough strength to be able to pour metal into it. Then we knock off our concrete, we take off the excessive metal use for the casting process. We do our non destructive testing and we ship the part.   The drive for looking at improved process control methods is the fact that   Steps 7, 8, and 9 take up 75% of the standing costs because of process variability in Steps 1-6. So if we can tighten up 1-6,   most of ??? and cost go there, which is much cheaper, much shorter, then there is a large value add for the company and for our customers in making 7, 8, and 9 much smaller.   So PCC Structurals. My plant, Titanium Plant, makes mostly aerospace components. On the left there you can see a fan ??? that is glowing green from some ??? developer.   And then we have our land based products, which right there's a N155 howitzer stabilizer leg.   And just to kind of get an idea where it goes. Because every single airplane up in the sky basically has a part we make or multiple parts, this is an engine sections ???, it's about six feet in diameter, it's a one piece casting   that goes into the very front of the core of a gas turbine engine. This one in particular is for the Trent XWB that powers the Airbus A350   jets.   So let's get into JMP. So the big driver here is, as you can imagine, with something that is a complex as an investment casting process for a large part, there is tons of   data coming our way. And more and more, it's becoming functional as we increase the number of centers, we have and we increase the number of machines that we use. So in this case study, we are looking at   data that comes with a timestamp. We have 145 batches. We have our variable interest which is X1.   We have our counter, which is a way that I've normalized that timestamp, so it's easier to overlay the run in Graph Builder and also it has a little bit of added   niceness in the FTP platform. We have our period, which allows us to have that historic period and a current period that lines up with the model driven multivariate control chart platform,   so that we can have our FDE   only be looking at the historic so it's not changing as we add more current data. So this is kind of looking at this if you were in using this in practice, and then the test type is my own validation   attempts. And what you'll see here is I've mainly gone in and tagged thing as bad, marginal or good. 
So red is bad, marginal is purple, and green is good and you can see how they overlay.   Off the bat, you can see that we have some curvey   ??? curves from mean. These are obviously what we will call out of control or bad.   This would be what manufacturing called a disaster because, like, that would be discrepant product. So we want to be able to identify those   earlier, so that we can go look at what's going on the process and fix it. This is what it looks like   breaking out so you can see that the bad has some major deviation, sometimes of mean curve and a lot of character towards the end.   The marginal ones are not quite as deviant from the mean curves but have more bouncing towards the tail and then good one is pretty tight. You can see there's still some bouncing. So this is where the   the marginal and the good is really based upon my judgment, and I would probably fail an attribute Gage R&R based on just visually looking at this. So   we have a total of 33 bad curves, 45 marginal and 67. And manually, you can just see about 10 of them are out. So you would have an option if you didn't want to use a point estimate, which I'll show a little bit later that doesn't work that great, of maybe making...   control them by points using the counter. And how you do that would be to split the bad table by counter, put it into an individual moving range control chart through control chart building and then you would get out,   like 3500 control charts in this case, which you can use the awesome ability to make combined data tables to turn that that list summary from each one into its own data table that you can then link back to your main data table and you get a pretty cool looking   analysis that looks like this, where you have control limits based upon the counters and historic data and you can overlay your curves. So if you had an algorithm that would tag whenever it went outside the control limits, you know, that would be an option of trying to   have a control....   a control chart functionality with functional data. But you can see, especially I highlighted 38 here, that you can have some major deviation and stay within the control limits. So that's where this FDE   platform really can shine, in that it can identify an FPC that corresponds with some of these major deviations. And so we can tag the curves based upon those at FPCs.   And we'll see that little later on. So,   using the FDE platform, it's really straightforward. Here for this demonstration, we're going to focus on a step function with 100 knots.   And you can see how the FPCs capture the variability. So the main FPC is saying, you know, beginning of the curve, there's...that's what's driving the most variability, this deviation from the mean.   And setup is X1 and their output, counters. Our input, batch number and then I added test type. So we can use that as some of our validation in FPC table and the model driven multivariate control chart and the period so that only our historic is what's driving the FDE fit.   And so   just looking at the fit is actually a pretty important part of making sure you get correct   control charting later on, is I'm using this P Step   Function 100 knots model. You can see, actually, if I use a B spline and so with Cubic 20 knots, it actually looks pretty close to my P spline.   
But from the BIC you can actually see that I should be going to more knots. So if I do that, now we start to see it overfitting, really focusing on the isolated peaks, and it will cause you to have an FDE model that doesn't look right and causes you to not be as sensitive in your model driven multivariate control chart.
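For readers who want to try the per-counter control chart workaround Steve describes, a minimal JSL sketch is below. The column names (X1, Counter, Period) follow the case study; the table handle, the period values, and the output table name are assumptions for illustration only, not the presenter's actual script.

// Hypothetical sketch of the per-counter IR-chart workaround described above.
dt = Current Data Table();

// Keep only the historic rows for setting limits (period values assumed to be "Historic"/"Current")
dt << Select Where( :Period == "Historic" );
hist = dt << Subset( Output Table( "Historic Curves" ), Selected Rows( 1 ) );

// One chart per counter position via the By role; with a single continuous Y,
// Control Chart Builder defaults to an Individual & Moving Range chart.
hist << Control Chart Builder(
	Variables( Y( :X1 ) ),
	By( :Counter )
);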
Monday, October 12, 2020
Jordan Hiller, JMP Senior Systems Engineer, JMP Mia Stephens, JMP Principal Product Manager, JMP   For most data analysis tasks, a lot of time is spent up front — importing data and preparing it for analysis. Because we often work with data sets that are regularly updated, automating our work using scripted repeatable workflows can be a real time saver. There are three general sections in an automation script: data import, data curation, and analysis/reporting. While the tasks in the first and third sections are relatively straightforward — point-and-click to achieve the desired result and capture the resulting script — data curation can be more challenging for those just starting out with scripting. In this talk we review common data preparation activities, discuss the JSL code necessary to automate the process, and provide advice for generating JSL code for data curation via point-and-click.     The Data Cleaning Script Assistant Add-in discussed in this talk can be found in the JMP File Exchange.     Auto-generated transcript...   Speaker Transcript mistep Welcome to JMP Discovery Summit. I'm Mia Stephens and I'm a JMP product manager, and I'm here with Jordan Hiller, who is a JMP systems engineer. And today we're going to talk about automating the data curation workflow. And we're going to split our talk into two parts. I'm going to kick us off and set the stage by talking about the analytic workflow and where data curation fits into this workflow. And then I'm going to turn it over to Jordan for the meat, the heart of this talk. We're going to talk about the need for reproducible data curation. We're going to see how to do this in JMP 15. And then you're going to get a sneak peek at some new functionality in JMP 16 for recording data curation steps and the actions that you take to prepare your data for analysis. So let's think about the analytic workflow. And here's one popular workflow. And of course, it all starts with defining what your business problem is, understanding the problem that you're trying to solve. Then you need to compile data. And of course, you can compile data from a number of different sources and pull these data into JMP. And at the end, we need to be able to share results and communicate our findings with others. Probably the most time-consuming part of this process is preparing our data for analysis, or curating our data. So what exactly is data curation? Well, data curation is all about ensuring that our data are useful in driving analytic discoveries. Fundamentally, we want to be able to solve a problem with the data that we have. This is largely about data organization, data structure, and cleaning up data quality issues. If you think about common problems with data, they generally fall within four buckets. We might have incorrect formatting, incomplete data, missing data, or dirty or messy data. And to talk about these types of issues and to illustrate how we identify them within our data, we're going to borrow from our course, STIPS. And if you're not familiar with STIPS, STIPS is our free online course, Statistical Thinking for Industrial Problem Solving, and it's set up in seven discrete modules. Module 2 is all about exploratory data analysis. And because of the interactive and iterative nature of exploratory data analysis and data curation, the last lesson in this module is data preparation for analysis. And this is all about identifying quality issues within your data and steps you might take to curate your data.
So let's talk a little bit more about the common issues. Incorrect formatting: what do we mean by incorrect formatting? Well, this is when your data are in the wrong form or the wrong format for analysis. This can apply to your data table as a whole. So, for example, you might have your data in separate columns, but for analysis you need your data stacked in one column. This can apply to individual variables. You might have the wrong modeling type or data type, or you might have data on dates or times that's not formatted that way in JMP. It can also be cosmetic. You might choose to move response variables to the beginning of the data table, rename your variables, or group factors together to make it easier to find them in the data table. Incomplete data is about having a lack of data. And this can be on important variables, so you might not be capturing data on variables that can ultimately help you solve your problem, or on combinations of variables. Or it could mean that you simply don't have enough observations; you don't have enough data in your data table. Missing data is when values for variables are not available. And this can take on a variety of different forms. And then finally, dirty or messy data is when you have issues with observations or variables. So your data might be incorrect; the values are simply wrong. You might have inconsistencies in terms of how people were recording data or entering data into the system. Your data might be inaccurate, you might not have a capable measurement system, there might be errors or typos. The data might be obsolete: you might have collected the information on a facility or machine that is no longer in service. It might be outdated: the process might have changed so much since you collected the data that the data are no longer useful. The data might be censored or truncated. You might have columns that are redundant to one another (they have the same basic information content) or rows that are duplicated. So dirty and messy data can take on a lot of different forms. So how do you identify potential issues? Well, when you take a look at your data, you start to identify issues. And in fact, this process is iterative, and when you start to explore your data graphically and numerically, you start to see things that might be issues that you might want to fix or resolve. So a nice starting point is to start by just scanning the data table. When you scan your data table, you can often see some obvious issues. And for this example, we're going to use some data from the STIPS course called Components, and the scenario is that a company manufactures small components and they're trying to improve yield. And they've collected data on 369 batches of parts with 15 columns. So when we take a look at the data, we can see some pretty obvious issues right off the bat. If we look at the top of the data table at these nice little graphs, we can see the shapes of distributions. We can see the values. So, for example, for batch number you see a histogram. And batch number is something you would think of as being an identifier, rather than something that's continuous. So this can tell us that the data are coded incorrectly. When we look at number scrapped, we can see the shape of the distribution. We can also see that there's a negative value there, which might not be possible. We see a histogram for process with two values, and this can tell us that we need to change the modeling type for process from continuous to nominal.
You can see more when you take a look at the column panel. So, for example, batch number and part number are both coded as continuous. These are probably nominal. And if you look at the data itself, you can see other issues. So, for example, humidity is something we would think of as being continuous, but you see a couple of observations that have the value N/A. And because JMP sees text, the column is coded as nominal, so this is something that you might want to fix. We can see some issues with supplier: there are a couple of missing values and some typographical errors. And notice temperature; all of the dots indicate that we're missing values for temperature in these rows. So this is an issue that we might want to investigate further. So you can identify a lot of issues just by scanning the data table, and you can identify even more potential issues when you visualize the data one variable at a time. A really nice starting point, and I really like this tool, is the column viewer. The column viewer gives you numeric summaries for all of the variables that you've selected. So for example, here I'm missing some values. And you can see for temperature that we're missing 265 of the 369 values. So this is potentially a problem if we think that temperature is an important factor. We can also see potential issues with values that are recorded in the data table. So, for example, scrap rate and number scrapped both have negative values. And if this isn't physically possible, this is something that we might want to investigate back in the system that we collected the data in. Looking at some of the calculated statistics, we can also see other issues. So, for example, batch number and part number really should be categorical. It doesn't make sense to have the average batch number or the average part number. So this tells you you should probably go back to the data table and change your modeling type. Distributions tell us a lot about our data and potential issues. We can see the shapes of distributions, the centering, the spread. We can also see typos. Customer number here: the particular problem is that there are four or five major customers and some smaller customers. If you're going to use customer number in an analysis, you might want to use recode to group some of those smaller customers together into maybe an "other" category. We have a bar chart for humidity, and this is because we have that N/A value in the column. We might not have seen that when we scanned the data table, but we can see it pretty clearly here when we look at the distribution. We can clearly see the typographical errors for supplier. And when we look at continuous variables, again, you can look at the shape, centering, and spread, but you can also see some unusual observations within these variables. So, after looking at the data one variable at a time, a natural progression is to explore the data two or more variables at a time. So for example, if we look at scrap rate versus number scrapped in the Graph Builder, we see an interesting pattern. We see these bands, and it could be that there's something in our data table that helps us to explain why we're seeing this pattern. In fact, if we color by batch size, it makes sense to us. So where we have batches with 5,000 parts, there's more of an opportunity for scrap parts than for batches of only 200. We can also see that there's some strange observations at the bottom.
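The colored scatter view of scrap rate versus number scrapped that Mia describes here can also be launched from a script. This is a rough JSL equivalent written for illustration; the column names (Number Scrapped, Scrap Rate, Batch Size) are assumptions taken from the transcript rather than the actual table.

// A rough JSL version of the Graph Builder view described above
Graph Builder(
	Variables(
		X( :Number Scrapped ),
		Y( :Scrap Rate ),
		Color( :Batch Size )   // coloring by batch size explains the banding
	),
	Elements( Points( X, Y, Legend( 1 ) ) )
);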
In fact, these are the observations that had negative values for the number scrapped, and these really stand out here in this graph. And when you add a column switcher or data filter, you can add some additional dimensionality to these graphs. So I can look at pressure, for example, instead of... Well, I can look at pressure or switch to dwell. What I'm looking for here is a sense of the general relationship between these variables and the response. And I can see that pressure looks like it has a positive relationship with scrap rate. And if I switch to dwell, I can see there's probably not much of a relationship between dwell and scrap rate, or temperature either. So these variables might not be as informative in solving the problem. But look at speed; speed has a negative relationship. And I've also got some unusual observations at the top that I might want to investigate. So you can learn a lot about your data just by looking at it. And of course, there are more advanced tools for exploring outliers and missing values that are really beyond the scope of this discussion. And as you get into the analyze phase, when you start analyzing your data or building models, you'll learn much more about potential issues that you have to deal with. And the key is that as you are taking a look at your data and identifying these issues, you want to make notes of them. Some of them can be resolved as you're going along, so you might be able to reshape and clean your data as you proceed through the process. But you really want to make sure that you capture the steps that you take, so that you can repeat them later if you have to repeat the analysis, or if you want to repeat the analysis on new data or other data. And at this point is where I'm going to turn it over to Jordan to talk about reproducible data curation and what this is all about. Jordan Hiller Alright, thanks, Mia. That was great. And we learned what you do in JMP to accomplish data curation by point and click. Let's talk now about making that reproducible. The reason we worry about reproducibility is that your data sets get updated regularly with new data. If this was a one-time activity, we wouldn't worry too much about the point and click. But when data gets updated over and over, it is too labor-intensive to repeat the data curation by point and click each time. So it's more efficient to generate a script that performs all of your data curation steps, and you can execute that script with one click of a button and do the whole thing at once. So in addition to efficiency, it documents your process. It serves as a record of what you did, so you can refer to it later and remind yourself what you did; and for people who come after you and are responsible for this process, it's a record for them as well. For the rest of this presentation, my goal is to show you how to generate a data curation script with point and click only. We're hoping that you don't need to do any programming in order to get this done. That program code is going to be extracted and saved for you, and we'll talk a little bit about how that happens. So there are two different sections: what you can do now in JMP 15 to obtain a data curation script, and what you'll be doing once we release JMP 16 next year. In JMP 15 there are some data curation tasks that generate their own reusable JSL scripting code. You just execute your point and click, and then there's a technique to grab the code. I'm going to demonstrate that.
So tools like recode, generating a new formula column with a calculation, and reshaping data tables; the reshaping tools are in the Tables menu: there's Stack, Split, Join, Concatenate, and Update. All of these tools in JMP 15 generate their own script after you execute them by point and click. There are other common tasks that do not generate their own JSL script, and to make it easier to accomplish these tasks and make them reproducible, I built the Data Cleaning Script Assistant add-in. It helps with the following tasks, mostly column stuff: changing the data types of columns, the modeling types, changing the display format, renaming, reordering, and deleting columns from your data table, and also setting column properties such as spec limits or value labels. So the Data Cleaning Script Assistant is what you'll use to assist you with those tasks in JMP 15. We are also going to give you a sneak preview of JMP 16, and we're very excited about new features in the log in JMP 16; I think it's going to be called the enhanced log mode. The basic idea is that in JMP 16 you can just point and click your way through your data curation steps as usual. The JSL code that you need is generated and logged automatically. All you need to do is grab it and save it off. So super simple and really useful; excited to show that to you. Here's a cheat sheet for your reference. In JMP 15 these are the tasks on the left, common data curation tasks; it's not an exhaustive list. And the middle column shows how you accomplish them by point and click in JMP. The method for extracting the reusable script is listed on the right. I'm not going to cover everything in here, but this is for your reference later. Let's get into a demo, and I'll show how to address some of those issues that Mia identified with the components data table. I'm going to start in JMP 15. And the first thing that we're going to talk about are some of those column problems: changing the data types, the modeling types, that kind of thing. Now, if you were just concerned with point and click in JMP, what you would ordinarily do for, let's say, humidity (this is the column, you'll remember, that has some text in it and is coming in mistakenly as a character column) is to fix it by right clicking, getting into the column info, and addressing those changes there. This is one of those JMP tasks that doesn't leave behind usable script in JMP 15. So for this, we're going to use the Data Cleaning Script Assistant instead. So here we go. It's in the add-ins menu because I've installed it; you can install it too. Data Cleaning Script Assistant; the tool that we need for this is Victor the Cleaner. This is a graphical user interface for making changes to columns, so we can address data types and modeling types here. We can rename columns, we can change the order of columns and delete columns, and then save off the script. So let's make some changes here. For humidity, that's the one with the N/A values that caused it to come in as text. We're going to change it from a character variable to a numeric variable, and we're going to change it from nominal to continuous. We also identified that batch number needs to get changed to nominal; part number as well needs to get changed to nominal; and the process, which is a number right now, should also be nominal. The facility column has only one value, fab tech, so that's not useful for me. Let's delete the facility column. I'm going to select it here by clicking on its name and click Delete.
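For readers who prefer to see the end result in code, here is a hand-written JSL sketch of the column fixes Victor performs in this step. It is not the exact script the add-in saves; the column names come from the demo, and the syntax is standard column messages.

// Hand-written equivalent of the Victor column cleanup described above
dt = Current Data Table();

// Humidity came in as character because of "N/A" text; make it numeric continuous
Column( dt, "Humidity" ) << Data Type( Numeric ) << Set Modeling Type( "Continuous" );

// Identifiers and the two-level process variable should be nominal
Column( dt, "Batch Number" ) << Set Modeling Type( "Nominal" );
Column( dt, "Part Number" ) << Set Modeling Type( "Nominal" );
Column( dt, "Process" ) << Set Modeling Type( "Nominal" );

// Facility has a single value, so it carries no information
dt << Delete Columns( "Facility" );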
Here are a couple of those cosmetic changes that Mia mentioned. Scrap rate is at the end of my table; I want to move it earlier. I'm going to move it to the fourth position, after customer number. So we select it and use the arrows to move it up in the order to directly after customer number. The last change that I'm going to make is to take the pressure variable and rename it. The engineers in my organization call this column PSI, so that's the name I want to give that column. Alright, so that's all the changes that I want to make here. I have some choices to make. I get to decide whether the script gets saved to the data table itself (that would make a little script section over here in the upper left panel) or to its own window; let's save it to a script window. You can also choose whether or not the cleaning actions you specified are executed when you click OK. Let's keep the execution and click OK. So now you'll see all those changes are made. Things have been rearranged, column properties have changed, etc. And we have a script. We have a script to accomplish that. It's in its own window, and this little program will be the basis; we're going to build our data curation script around it. Let's save this. I'm going to save it to my desktop, and I'm going to call it v15 curation script. Changing modeling types, changing data types, renaming things, reordering things: these all came from Victor. I'm going to document this in my code. It's a good idea to leave little comments in your code so that you can read it later. I'm going to leave a note that says this is from the Victor tool, let's say, from DCSA, for Data Cleaning Script Assistant, Victor. So that's a comment. The two slashes make a comment line in your program; that means the program interpreter won't try to execute it as program code. It's recognized as just a little note, and you can see it in green up there. It's a good idea to leave yourself little comments in your script. All right, let's move on. The next curation task that I'm going to address is this supplier column. Mia told us how there were some problems in here that need to be addressed. We'll use the recode tool for this. Recode is one of the tools in JMP 15 that leaves behind its own script; you just have to know where to get it. So let's do our recode and grab the script: right click, recode. And we're going to fix these data values. I'm going to start from the red triangle. Let's start by converting all of that text to title case; that cleaned up this lowercase hersch value down here. Let's also trim extra white space, extra space characters. That cleaned up the leading space in this Anderson. Okay. And so all the changes that you make in the recode tool are recorded in this list, and you can cycle through and undo them and redo them and cycle through that history if you like. All right, I have just a few more changes to make. I'll make them manually. Let's group together the Hersches, group together the Coxes, group together all the Andersons. Trutna and Worley are already correct. The last thing I'm going to do is address these missing values; we'll assign them to their own category of Missing. That is my recode process. I'm done with what I need to do. If I were just pointing and clicking, I would go ahead and click Recode and I'd be done. But remember, I need to get this script. So to do that, I'm going to go to the red triangle.
Down to the script section, and let's save this script to a script window. Here it is, saved to its own script window, and I'm just going to paste that section to the bottom of my curation script in progress. So let's see. I'm just going to grab everything from here; I don't even really have to look at it, right? I don't have to be a programmer. Control C, and just paste it at the bottom. And let's leave ourselves a note that this is from the recode red triangle. Alright, and I can close this window; I no longer need it. And save these updates to my curation script. So that was recode and the way that you get the code for it. All right, then the next task that we're going to address is calculating a yield. Oh, I'm sorry; what I'm going to do first is actually execute that recode. Now that I've saved the script, let's execute the recode. And there it is, the recoded supplier column. Perfect. All right, let's calculate a yield column. This is a little bit redundant, I realize, since we already have the scrap rate, but for purposes of discussion let me show you how you would calculate a new column and extract its script. This is another place in JMP 15 where you can easily get the script if you know where to look. So, making our yield column: new column, double click up here, rename it from Column 16 to Yield, and let's assign it a formula. To calculate the yield, I need to find how many good units I have in each batch, so that's going to be the batch size minus the number scrapped. That's the number of good units I have in every batch. I'm going to divide that by the total batch size, and here is my yield column. Yes, you can see that yield here is .926 and scrap rate is .074, 1 minus yield. So good, the calculation is correct. Now that I've created that yield column, let's grab its script. And here's the trick: right click, Copy Columns. Back in my curation script I'll leave a note that this came from right click, Copy Columns, and paste. And there it is: add a new column to the data table, it's called Yield, and here's its formula. Now, I said you don't need to know any programming; I guess here's a very small exception. You've probably noticed that there are semicolons at the end of every step in JSL. That separates different JSL expressions, and if you add something new to the bottom of your script, you're going to want to make sure that there's a semicolon in between. So I'm just typing a semicolon. The Copy Columns function did not add the semicolon, so I have to add it manually. All right, good. So that's our yield column. The next thing I'd like to address is this: my processes are labeled 1 and 2. That's not very friendly; I want to give them more descriptive labels. We're going to call Process Number 1 production and Process Number 2 experimental. We'll do that with value labels. Value labels are an example of column properties. There's an entire list of different column properties that you can add to a column. These are things like the units of measurement; if you want to change the order of display in a graph, you can use value ordering; if you want to add control limits or spec limits or a historical sigma for your quality analysis, you can do that here as well. Alright. So all of these are column properties, metadata that we add to the columns. And we're going to need to use the Data Cleaning Script Assistant to access the JSL script for adding these column properties. So here's how we do it. First, we add the column properties, as usual, by point and click. I'm going to add my value labels.
Process Number 1 we're going to call production; add. Process Number 2 we're going to call experimental. And by adding that value label column property, I now get nice labels in my data table: instead of seeing Process 1 and Process 2, I see production and experimental. Then, to get the script, we go to the add-ins menu, Data Cleaning Script Assistant, and choose the property copier. A little message has popped up saying that the column property script has been copied to the clipboard, and then we'll go back to our script in progress, leave a note that this is from the DCSA property copier, and then paste; Control V to paste. There is the script that we need to assign those two value labels. It's done. Very good. Okay, I have one more data curation step to go through, something else that we'll need the Data Cleaning Script Assistant for. We want to consider only, let's say, the rows in this data table where vacuum is off. Right, so there are 313 of those rows, and I just want to get rid of the rows in this data table where vacuum is on. So the way you do it by point and click is selecting those, much as I did right now, and then running the table subset command. In order to get usable code, we're going to have to use the Data Cleaning Script Assistant once again. So here's how to subset this data table to only the rows where vacuum is off. First, under the Rows menu, under the row selection submenu, we'll use this Select Where command in order to get some reusable script for the selection. We're going to select the rows where vacuum is off. And before clicking OK to execute that selection, again I will go to the red triangle, save script to the script window, Control A, Control C to copy that, and let's paste that at the bottom again with a note: from Rows, Select Where. Control V. So there's the JSL code that selects the rows where vacuum is off. Now I need, one more time, to use the Data Cleaning Script Assistant to get the selected rows. Oh, let us first actually execute the selection. There it is. Now with the rows selected, we'll go up again to add-ins, Data Cleaning Script Assistant, subset selected rows. I'm being prompted to name my new data table that has the subset of the data. Let's call it vacuum off; that's my new data table name. Click OK, another message that the subset script has been copied to the clipboard, and so we paste it to the bottom. There it is. And this is now our complete data curation script to use in JMP 15, so let's just run through what it's like to use it in practice. I'm going to close the data table that we've been working on, the one we've been doing our curation on. Let's close it and revert back to the messy state. Make sure I'm in the right version of JMP. All right. Yes, here it is, the messy data. And let's say some new rows have come in, because it's a production environment and new data is coming in all the time. I need to replay my data curation workflow: run script. It performed all of those operations. Note the value labels. Note that humidity is continuous. Note that we've subset to only the rows where vacuum is off. The entire workflow is now reproducible with a JSL script. So that's what you need to keep in mind for JMP 15: some tools you can extract the JSL script from directly; for others, you'll use my add-in, the Data Cleaning Script Assistant. And now we're going to show you just how much fun and how easy this is in JMP 16. I'm not going to work through the entire workflow over again, because it would be somewhat redundant, but let's just go through some of what we went through.
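To make the assembled result concrete, here is a condensed, hand-written sketch of what a v15-style curation script might look like once the pieces above are pasted together. It is an illustration only: the Match() recode is a manual stand-in for the script the Recode red triangle saves, and the column names and values (Supplier spellings, Vacuum == "Off") are assumptions based on the demo.

// From DCSA Victor: column type fixes (see the earlier sketch)
dt = Current Data Table();

// From the Recode red triangle (hand-written equivalent): clean up Supplier
For Each Row(
	:Supplier = Match( Trim( :Supplier ),
		"hersch", "Hersch",
		"anderson", "Anderson",
		"", "Missing",
		Trim( :Supplier )   // leave already-clean values alone
	)
);

// From right click, Copy Columns: new formula column, yield = good units / batch size
dt << New Column( "Yield", Numeric, "Continuous",
	Formula( (:Batch Size - :Number Scrapped) / :Batch Size )
);

// From the DCSA property copier: value labels for Process
Column( dt, "Process" ) << Set Property( "Value Labels", {1 = "Production", 2 = "Experimental"} );

// From Rows, Select Where, plus DCSA subset selected rows: keep rows where vacuum is off
dt << Select Where( :Vacuum == "Off" );
dt << Subset( Output Table( "Vacuum Off" ), Selected Rows( 1 ) );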
Here we are in JMP 16, and I'm going to open the log. The log looks different in JMP 16, and you're going to see some of those differences presently. Let's open the messy components data. Here it is. And you'll notice in the log that it has a section that says I've opened the messy data table, and down here is the JSL script that accomplishes what we just did. So this is like a running log that automatically captures all of the script that you need. It's not complete yet; there are new features still being added to it, and I assume that will be ongoing. But already this enhanced log feature is very, very useful, and it covers most of your data curation activities. I should also mention that, right now, what I'm showing you is the early adopter version of JMP, early adopter version 5, so when we fully release the production version of JMP 16 early next year, it's probably going to look a little bit different from what you're seeing right now. Alright, so let's continue and go through some of those data curation steps again. I won't go through the whole workflow, because it would be redundant; let's just do some of them. I'll go through some of the things we used to need Victor for. In JMP 16 we will not need the Data Cleaning Script Assistant. We just do our point and click as usual. So, humidity: we're going to change it from character to numeric and from nominal to continuous and click OK. Here's what that looks like in the structured log: it has captured that JSL. All right, back to the data table. We are going to change the modeling type of batch number and part number and process from continuous to nominal. That's done; that has also been captured in the log. We're going to delete the facility column, which has only one value: right click, Delete Columns. That's gone. And we rename pressure to PSI. OK, so those were all of the things that we did in Victor in JMP 15. Here in JMP 16, all of those are leaving behind JMP script that we can just copy and reuse down here. Beautiful. All right, just one more step I will show you: the subset to where vacuum is off. Much, much simpler here in JMP 16. All we need to do is select all the off vacuums; I don't even need to use the Rows menu, I can just right click one of those offs and select matching cells, and that selects the 313 rows where vacuum is off. And then, as usual, to subset to only the selected rows: Tables, Subset, and we're going to create a new table called vacuum off that has only our selected rows and keeps all the columns. Here we go. That's it. We just performed all of those data curation steps. Here's what it looks like in the structured log. And now, to make this a reusable, reproducible data curation script, all that we need to do is come up to the red triangle and save the script to a script window. I'm going to save this to my desktop as a v16 curation script. And here it is, the whole script. So let's close all the data in JMP 16 and just show you what it's like to rerun that script. Here I am back in the home window for JMP 16. Here's my curation script. You'll notice that the first line is that open command, so I don't even need to open the data table; it's going to happen in line right here. When there's new data that comes in and this file has been updated, all that I need to do to do my data curation steps is run the script. And there it is.
All the curation steps and the subset to the 313 rows. So that is using the enhanced log in JMP 16 to capture all your data curation work and change it into a reproducible script. Alright, here's that JMP 15 cheat sheet to remind you once again: this is what you need to know in order to extract the reusable code when you're in JMP 15 right now, and you won't have to worry about this so much once we release JMP 16 in early 2021. So to conclude: Mia showed you how you achieve data curation in JMP. It's an exploratory and iterative process where you identify problems and fix them by point and click. When your data gets updated regularly with new data, you need to automate that workflow in order to save time, and also to document your process and leave yourself a trail of breadcrumbs for when you come back later and look at what you did. The process of automation is translating your point and click activities into a reusable JSL script. We discussed how in JMP 15 you're going to use a combination of built-in tools and tools from the Data Cleaning Script Assistant to achieve these ends. And we also gave you a sneak preview of JMP 16 and how you can use the enhanced log to automatically, passively capture your point and click data curation activities and leave behind a beautiful, reusable, reproducible data curation script. All right, that is our presentation. Thanks very much for your time.
Sports analytics tools are becoming more frequently used to help athletes enhance their skills and body strength, perform better, and prevent injury. The ACL tear is one of the most common and dangerous injuries in basketball. This injury occurs most frequently in jumping, landing, and pivoting, due to the rapid change of direction and/or sudden deceleration involved. Recovering from an ACL injury is a brutal process that can take months – even years – and the injury significantly decreases the player's performance after recovery. The goal of this project is to find the relationship between fatigue and different angle measurements in the hips, knees, and back, as well as the force applied to the ground, in order to minimize ACL injury risk. Seven different sensors were attached to a test subject while he performed the countermovement jump for 10 trials on each leg, before and after 2 hours of vigorous exercise. The countermovement jump was chosen for its ability to assess ACL injury risk quite well through force and flexion of different body parts. Several statistical tools, such as Control Chart Builder, multivariate correlation, and variable clustering, were utilized to discover general insights into the differences between the before- and after-fatigue states for each exercise (which relate to an increased ACL injury risk). The JMP Multivariate SPC platform provided further biomechanical, time-specific information about how joint flexions differ before and after fatigue at specific time points, giving a more in-depth understanding of how the different joint contributions change when fatigued. The end-to-end experimental and analysis approach can be extended across different sports to prevent injury.   (view in My Videos)   Auto-generated transcript:
Stanley Siranovich, Principal Analyst, Crucial Connection LLC   Much has been written in both the popular press and in the scientific journals about the safety of modern vaccination programs. To detect possible safety problems in U.S.-licensed vaccines, the CDC and the FDA have established the Vaccine Adverse Event Reporting System (VAERS). This database system now covers 20 years, with several data tables for each year. Moreover, these data tables must be joined to extract useful information from the data. Although a search and filter tool (WONDER) is provided for use with this data set, it is not well suited for modern data exploration and visualization. In this poster session, we will demonstrate how to use JMP Statistical Discovery Software to do Exploratory Data Analysis for the MMR vaccine over a single year using platforms such as Distribution, Tabulate, and Show Header Graphs. We will then show how to use JMP Scripting Language (JSL) to repeat, simply and easily, the analysis for additional years in the VAERS system.     Auto-generated transcript...   Speaker Transcript Stan Siranovich Good morning everyone. Today we're going to do an exploratory data analysis of the VAERS database. Now let's do a little background on what this database is. VAERS, spelled V-A-E-R-S, is an acronym for Vaccine Adverse Event Reporting System. It was created by the FDA and the CDC. It gets about 30,000 updates per year and it's been public since 1990, so there's quite a bit of data in it. And it was designed as an early warning system to look for effects of vaccines that have not previously been reported. Now these are adverse events, not side effects; that is, they haven't been linked to the vaccination yet. It's just something that happened after the vaccination. Now let's talk about the structure: for each year there are several tables, and we'll be working with two of them, VAERS VAX and VAERS DATA. Now there is a tool for examining the online database, and it goes by the acronym of WONDER. It is a traditional search tool where you navigate the different areas of the database, select the type of data that you want, click the drop down, and after you do that a couple of times, or a couple of dozen times, you send in the query and, without too much latency, get a result back. But for doing exploratory data analysis and some visualizations, there's a slight problem with that, and that is that you have to know what you want to get in the first place, or at least have a very good idea. So that's where JMP comes in. And as I mentioned, we're going to do an EDA and some visualization on a specific set of data, that is, data for the MMR vaccine for measles, mumps, and rubella. And we're going to do it for the most recent full year available, which is 2019. So let me move to a new window. Okay, the first thing we did, which I omitted here, was to download the CSVs and open them up in JMP. Now I want to select my data, and JMP makes it very easy. After I get the window open, I simply go to Rows, Row Selection, and Select Where, and down here is a picture showing that I want the VAX_TYPE to equal MMR. Now there are some other options here besides equals, which we'll talk about in a second. And after we click the button and we've selected those rows, the next thing we want to do is decide on which data we want. So I've highlighted some of the columns, and in a minute or so you'll see why. And then when I do that... oh, before we go there, let's note row nine and row 18 right here. Notice we have MMRV and MMR. MMRV is a different vaccine.
And if we wanted to look at that also, we could have selected "contains" here from the drop down. But that's not what we want to do. So we click OK and we get our table. Now what we want to do is join that VAERS VAX table, which contains data about the vaccine, such as the manufacturer, the lot and so forth, with the VAERS DATA table, which contains data on the effects of the vaccine; it's got things like whether or not the patient had allergies, whether or not the patient was hospitalized, number of hospital days, that sort of thing. And it also contains demographic data such as age and sex. So what we want to do is a join: we simply go to Tables, Join, we select the VAERS VAX and VAERS DATA tables, and we want to join them on the VAERS ID. And again, JMP makes it pretty easy. We just click the column in each of the separate tables and put them here in the match window, and after that we go over to the table windows and select the columns that we want. And this is what our results table looks like. Now let me reduce that and open up the JMP table. There we go, and I'll expand that. For the purposes of this demonstration I just selected these columns here. We've got the VAERS ID, which is the identification, obviously; the type, which is all MMR; and it looks like Merck is the manufacturer, with a couple of unknowns scattered through here. And I selected VAX LOT, because that would be important: if there's something the matter with one lot, you want to be able to see that. This looks like cage underscore year, but that is calculated age in years. There are several age columns and I just selected one. And I selected sex because we'd like to know whether males are more affected than females or vice versa. And HOSPDAYS is the number of days in the hospital, if they had an adverse event severe enough to put them into the hospital. And NUMDAYS is the number of days between vaccination and the appearance of the adverse event, and it looks like we have quite a range right here. So let's get started on our analysis, using Show Header Graphs. So I'm going to click on that, show header graphs, and we get some distributions and some other information up here. We'll skip the ID and see that the VAX_TYPE is all MMR; there are no others there. And the vaccine manufacturer, yes, it's either Merck & Co. Inc. or unknown, and one nice feature is that we can click on the bar and it will highlight the rows for us, and click away and it's unhighlighted. Moving on to VAX_LOT, we have quite a bit of information squeezed into this tiny little box here. First of all, we have the top five lots by occurrence in our data table: here they are, and here's how many times they appear. And it also tells us that we have 413 other lots included in the table; plus five, by my calculation that's something like 418 individual lots. Now we go over to the calculated age in years, and we see most of our values are in the zero bin, which makes sense because it is a vaccination, and we'll just make a note of that. And we go over to the sex column, and it looks like we have significantly more females than males. Now, that tells us right away that if we want to do side by side group comparisons, we're going to have to randomly select from the females so that they equal the males. And we also have some unknowns here, quite a few unknowns; we simply note that and move on. And we see hospital days, and we see NUMDAYS.
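The subset-and-join steps Stan walks through here could also be scripted. Below is a rough JSL sketch; the table names and the join options shown are assumptions for illustration, while the column names (VAX_TYPE, VAERS_ID) follow the talk.

// Hypothetical JSL version of the subset and join described above
vax = Data Table( "2019VAERSVAX" );
vdata = Data Table( "2019VAERSDATA" );

// Keep only MMR records (Rows > Row Selection > Select Where, then Subset)
vax << Select Where( :VAX_TYPE == "MMR" );
mmr = vax << Subset( Output Table( "MMR Vax" ), Selected Rows( 1 ) );

// Join the vaccine records to the demographic/outcome table on the report ID
joined = mmr << Join(
	With( vdata ),
	By Matching Columns( :VAERS_ID = :VAERS_ID ),
	Drop Multiples( 0, 0 ),
	Include Nonmatches( 0, 0 )
);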
Now here's another really nice feature. Let's say we'd like more details and we want to do a little bit of exploration to see how the age is distributed. We simply right click and select open in distribution. And here we are in the distribution window, with quite a bit of information here. For our purposes right now, we don't really need much here about the quantiles, so let's click close. It's still taking up some space, so let's go down here, select outline close orientation, and go with vertical. And we're left with a nice, easy to read window. It's got some information in there: we of course see our distribution down here, and we've got a box and whisker plot up here. There's not a whole lot of time to go into that; it just displays the data in a different way. And we see from our summary statistics that the mean happens to be 16.2, with a standard deviation of 20.6. Not an ideal situation. So if we want to do anything more with that, we may want to split the ages into two groups, one for the bulk of the values down here and one for the skewed data along the right, and examine them separately. I will minimize that window. And we can do the same with hospital days and number of days; let me just do that real quick. And here we see the same sorts of data, and I won't bother clicking through that and reducing it. But we might note also that we have a mean of 6.7 and a standard deviation of 13.2, again not a very ideal situation, and we simply make note of that. And I will close that. Now let's say we want to do a little bit more exploratory analysis; something caught our eye and all that. That is simple to do here. We don't have to go back to the online database and select through everything, click the drop downs, or whatever. We can simply come up here to Analyze and Fit Y by X. So let's say that we would like to examine the relationship between, oh, hospital days, the number of days spent in the hospital, and calculated age in years. We simply do that. We have two continuous variables, so we're going to get a bivariate plot out of that. We click OK, and we get another nice display of the data. And yes, we can see that the mean is down around 5 or 6, which is a good thing, better than 10 or 12. We can, for purposes of reference, go up here to the red triangle and select fit mean, and we get the mean right here. And we notice there's quite a few outliers. Let's say we want to examine them right now and decide whether or not we want to delve into them a little bit further. So if we hover over one of our outlier points, or any of the points for that matter, we get this pop up window, and it tells us that this particular data point represents row 868. The calculated age is in the one year bucket, and this patient happened to spend 90 days in the hospital. Now we could right click and color this row or put some sort of marker in there. I won't bother doing that, but I will move the cursor over here into the window, and we see this little symbol up in the right hand corner; click that and it pins it. We can, of course, repeat that and get the detail for further examination. I found this to be quite handy when giving presentations to groups of people, when you'd like to call attention to one particular point. That's a little bit overbearing, so let's right click and select font (not edit). And the font window comes up, and we see we're using a 16 point font.
Let's, I don't know, go down to 9. That's a little bit better, and it gives us more room if we'd like to call attention to some of the other outliers. So in summary, let me bring up the PowerPoint again. In summary, we were able to import and reshape two large data tables from a large, online, government-maintained database. We were able to subset the tables, join the tables, and select our output data, all seamlessly. And we were able to generate summaries and distributions, pointing out areas that may be of interest for more detailed analysis. And of course, that was all seamless and all occurred within the same software platform. Now, I supply some links right over here to the various data sites. This is the main site, which has all the documentation; the government did quite a good job there. And here is the actual data itself in the zip...
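The two drill-in analyses shown above are easy to repeat by script once the joined table is active. A short sketch follows; the column names (CAGE_YR, HOSPDAYS) come from the VAERS files mentioned in the talk, and the exact options are assumptions.

// Distribution of calculated age, as opened from the header graph in the demo
Distribution( Continuous Distribution( Column( :CAGE_YR ) ) );

// Fit Y by X with two continuous variables launches Bivariate; add the mean line for reference
Bivariate( Y( :HOSPDAYS ), X( :CAGE_YR ), Fit Mean() );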
Ruth Hummel, JMP Academic Ambassador, SAS Rob Carver, Professor Emeritus, Stonehill College / Brandeis University   Statistics educators have long recognized the value of projects and case studies as a way to integrate the topics in a course. Whether introducing novice students to statistical reasoning or training employees in analytic techniques, it is valuable for students to learn that analysis occurs within the context of a larger process that should follow a predictable workflow. In this presentation, we’ll demonstrate the JMP Project tool to support each stage of an analysis of Airbnb listings data. Using Journals, Graph Builder, Query Builder and many other JMP tools within the JMP Project environment, students learn to document the process. The process looks like this: Ask a question. Specify the data needs and analysis plan. Get the data. Clean the data. Do the analysis. Tell your story. We do our students a great favor by teaching a reliable workflow, so that they begin to follow the logic of statistical thinking and develop good habits of mind. Without the workflow orientation, a statistics course looks like a series of unconnected and unmotivated techniques. When students adopt a project workflow perspective, the pieces come together in an exciting way.       Auto-generated transcript (portions of the captioning were lost; gaps are marked with ellipses):   Speaker Transcript So welcome everyone. My name is ... Ambassador with JMP. I am now a retired professor of Business ... between a student and a professor working on a project. ... engage students in statistical reasoning, teach that ... to that, current thinking is that students should be learning about reproducible workflows, ... elementary data management. And, again, viewing statistics as ... wanted to join you today on this virtual call. Thanks for having ... and specifically in Manhattan, and you'd asked us so so you ... And we chose to do the Airbnb renter perspective. So we're ... expensive. So we started filling out...you gave us ... separate issue, from your main focus of finding a place in ... you get...if you get through the first three questions, you've ... know, is there a part of Manhattan, you're interested in? ... repository that you sent us to. And we downloaded the really ... thing we found, there were like four columns in this data set ... figured out so that was this one, the host neighborhood. So ... figured out that the first two just have tons of little tiny ... Manhattan. So we selected Manhattan. And then when we had ... that and then that's how we got our Manhattan listings. So ... data is that you run into these issues like why are there four ... restricted it to Manhattan, I'll go back and clean up some ... data will describe everything we did to get the data, we'll talk ... know I'm supposed to combine them based on zip, the zip code, ... columns, it's just hard to find the ... them, so we knew we had to clean that up. All right, we also had ... journal of notes. In order to clean this up, we use the recode ... Exactly. Cool. Okay, so we we did the cleanup ... Manhattan tax data has this zip code. So I have this zip code ... day of class, when we talked about data types.
And notice in the ... the...analyze the distribution of that column, it'll make a funny ... Manhattan doesn't really tell you a thing. But the zip code clean data in ... just a label, an identifier, and more to the point, when you want to join or merge ... important. It's not just an abstract idea. You can't merge ... nominal was the modeling type, we just made sure. ... about the main table is the listings. I want to keep ... to combine it with Manhattan tax data. Yeah. Then what? Then we need to ... tell it that the column called zip clean, zip code clean... Almost. There we go. And the column called zip, which ... Airbnb listing and match it up with anything in ... them in table every row, whether it matches with the other or ... main table, and then only the stuff that overlaps from the second ... another name like, Air BnB IRS or something? Yeah, it's a lot ... do one more thing because I noticed these are just data tables scattered around ... running. Okay. So I'll save this data table. Now what? And really, this is the data ... anything else, before we lose track of where we are, let's ... or Oak Team? And then part of the idea of a project ... thing. So if you grab, I would say, take the ... two original data sets, and then my final merged. Okay Now ... them as tabs. And as you generate graphs and ... even when I have it in these tabs. Okay, that's really cool. ... right, go Oak Team. Well, hi, Dr. Carver, thanks so ... you would just glance at some of these things, and let me know if ... we used Graph Builder to look at the price per neighborhood. And ... help it be a little easier to compare between them. So we kind ... have a lot of experience with New York City. So we plotted ... stand in front of the UN and take a picture with all the ... saying in Gramercy Park or Murray Hill. If we look back at the ... thought we should expand our search beyond that neighborhood to ... just plotted what the averages were for the neighborhoods but ... the modeling, and to model the prediction. So if we could put ... expected price. We started building a model and what we've ... factors. And so then when we put those factors into just a ... more, some of the fit statistics you've told us about in class. ... but mostly it's a cloud around that residual zero line. So ... which was way bigger than any of our other models. So we know ... reasons we use real data. Sometimes, this is real. This is ... looking? Like this is residual values. ... is good. Ah, cool. Cool. Okay, so I'll look for ... is sort of how we're answering our few important questions. And ... was really difficult to clean the data and to join the data. ... wanted to demonstrate how JMP in combination with a real world ... Number one in a real project, scoping is important. We want to ... hope to bring to the to the group.
Pitfall number two, it's vital to explore the ... the area of linking data combining data from multiple ... recoding and making sure that linkable ... reproducible research is vital, especially in a team context, especially for projects that may ... habits of guaranteeing reproducibility. And finally, we hope you notice that in these ... on the computation and interpretation falls by the ...
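The zip-code linkage that the team describes (matching modeling types and data types before a join) is the kind of step that benefits from a script. The sketch below is hypothetical: the table names, column names, and join options are illustrative, not taken from the actual project files.

// Hypothetical sketch of the zip-code cleanup and join discussed in the project
listings = Data Table( "Manhattan Listings" );
tax = Data Table( "Manhattan Tax Data" );

// Zip codes are identifiers, not quantities: store them as character/nominal
// in both tables so the join keys actually match.
listings << New Column( "Zip Code Clean", Character, "Nominal",
	Formula( Char( :Zipcode ) )
);

// Join the listings to the tax table on the cleaned zip code
listings << Join(
	With( tax ),
	By Matching Columns( :Zip Code Clean = :Zip ),
	Drop Multiples( 0, 0 ),
	Include Nonmatches( 0, 0 )
);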
Nascif Neto, Principal Software Developer, SAS Institute (JMP Division) Lisa Grossman, Associate Test Engineer, SAS Institute (JMP division)   The JMP Hover Label extensions introduced in JMP 15 go beyond traditional details-on-demand functionality to enable exciting new possibilities. Until now, hover labels exposed a limited set of information derived from the current graph and the underlying visual element, with limited customization available through the use of label column properties. This presentation shows how the new extensions let users implement not only full hover label content customization but also new exploratory patterns and integration workflows. We will explore the high-level commands that support the effortless visual augmentation of hover labels by means of dynamic data visualization thumbnails, providing the starting point for exploratory workflows known as data drilling or drill down. We will then look into the underlying low-level infrastructure that allows power users to control and refine these new workflows using JMP Scripting Language extension points. We will see examples of "drill out" integrations with external systems as well as how to build an add-in that displays multiple images in a single hover label.     Auto-generated transcript...   Speaker Transcript Nascif Abousalh-Neto Hello and welcome. This is our JMP Discovery presentation, from details on demand to wandering workflows, getting to know JMP hover label extensions. Before we start on the gory details, we always like to talk about the purpose of a new feature introduced in JMP. So in this case, we're talking about hover label extensions. And why do we even have hover labels in the first place? Well, I always like to go back to the visual information seeking mantra from Ben Shneiderman, which he synthesized as: overview first, zoom and filter, and then details on demand. Well, hover labels are all about details on demand. So let's say I'm looking at this bar chart on this new data set, and in JMP, up to JMP 14, as you hover over a particular bar in your bar chart, it's going to pop up a window with a little bit of textual data about what you're seeing here. Right. So you have labeled information, calculated values, just text, very simple. It gives you your details on demand. But what if you could decorate this with visualizations as well? So for example, if you're looking at that aggregated value, you might want to see the distribution of the values behind that calculation. Or you might want to see a breakdown of the values behind that aggregated value. This is what we're going to let you do with this new feature. But on top of that, it's the famous "wait, there is more." This new visualization basically allows you to go on and start a visual exploratory workflow. If you click on it, you can open it up in its own window, which can also have its own visualization, which you can also click to get even more detail. And so you go down that technique called drill down, and eventually you might get to a point where you're decorating a particular observation with information you're getting from maybe even Wikipedia, in that case. I'm not going to go into a lot of details; we're going to learn a lot about all that pretty soon. But first, I also wanted to talk a little bit about the design decisions behind the implementation of this feature.
Because we wanted to have something that was very easy to use, that didn't require programming or lots of time reading the manual, and we knew that would satisfy 80% of the use cases. But for those 20% of really advanced use cases, or for those customers who know their JSL and just want to push the envelope on what JMP can do, we also wanted to make available something that you could do through programming, on top of the context of ??? on those visual elements. So we decided to go with an architectural pattern called plumbing and porcelain, something we borrowed from the Git source code control application. Basically, you have one layer that is very rich and, because it's very rich, very complex; it gives you access to all that information and allows you to customize what happens when the visualization is generated or when you click on that visualization. And on top of that, we built a layer that is more limited and purpose driven, but is very easy to use and requires no coding at all. That's the porcelain layer, and that's the one Lisa is going to talk about now. Over to you, Lisa. I'm going to stop sharing and Lisa is going to take over. Lisa Grossman Okay, so we are going to take a high-level look at some of the features and the kinds of customizations that make the graphic ??? So let us first go through some of the basics. By default, when you hover over a data point or an element in your graph, you see information displayed for the X and Y roles used in the graph, as well as any drop-down roles such as overlay, and if you choose to manually label a column in the data table, that will also appear in the hover label. Here we have an example of a labeled expression column that contains an image, and we can see that the image is then populated in the hover label in the back. To add a graphlet to your hover label, you have the option of selecting some predefined graphlet presets, which you can access via the right mouse menu under Hover Label. These presets have dynamic graph role assignments and derive their roles from the variables used in your graph. Presets are also preconfigured to be recursive, which supports drilling down. And for preset graphlets that have categorical columns, you can specify which columns to filter by using the Next in Hierarchy column property in your data table. So now I'm going to demo real quick how to make a graphlet preset. I'm going to bring up the penguins data table that we're going to be using, open up Graph Builder, and make a bar chart here. Then, right-clicking under Hover Label, you can see that there is a list of different presets to choose from, but we're going to use Histogram for this example. Now that we have set our preset, if you hover over a bar, you can see that a histogram graphlet pops up in your hover label. It is also filtered based on our bar here, which is the island Biscoe. And the great thing about graphlets is that if I hover over this other bar, I can see another graphlet, and so now you can easily compare these two graphlets to see the distribution of bill lengths for both the islands Dream and Biscoe.
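To make the connection to scripting concrete: clicking one of those histogram thumbnails (as Lisa shows next) opens a window equivalent to the short Graph Builder script below, a histogram of the measurement with a Local Data Filter restricted to the hovered bar. The column names are the ones used in the demo and may differ in your copy of the penguins table; this is a sketch of what the preset generates for you, not code you need to write.

// Roughly what the launched histogram graphlet amounts to: the hovered level becomes a local data filter.
Graph Builder(
	Variables( X( :bill_length_mm ) ),
	Elements( Histogram( X ) ),
	Local Data Filter( Add Filter( columns( :island ), Where( :island == "Biscoe" ) ) )
);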
And then you can take it a step further and click on the thumbnail of the graphlet, and it will launch a Graph Builder instance in its own window. It's totally interactive, so you can open up the control panel of Graph Builder and customize the graph further. And as you can see, there's a local data filter already applied to this graph, and it is filtered by Biscoe, which is the thumbnail I launched; so that is how the graphlets are filtered. One last thing: if I hover over these histogram bars, you can see that the histogram graphlet continues on, which shows how these graphlet presets are preconfigured to be recursive. So, closing these and returning to our PowerPoint. I only showed the example of the histogram preset, but there are a number that you can go and play with. These graphlet presets help us answer the question of what is behind an aggregated visual element. The scatter plot preset shows you the exact values, whereas the histogram, box plot, or heat map presets show you the distribution of your values. If you want to break down your graph by another category, then you might be interested in using a bar, pie, treemap, or line preset. And if you'd like to examine the raw data of the table, then you can use the Tabulate preset. But if you'd like to further customize your graphlet, you have the option to do so with paste graphlet. Paste graphlet can be achieved in three easy steps. First you build the graph that you want to use as a graphlet, and we want to note here that it does not have to be one built from Graph Builder. Then, from the little red triangle menu, you save the script of the graph to your clipboard. And then, returning to your base or top graph, you right-click and, under Hover Label, there will be a Paste Graphlet option. That's really all there is to it. We also want to note that paste graphlet has static role assignments and is not recursive, since you are creating these graphlets to drill down one level at a time. But if you'd like to create a visualization with multiple drill-downs, you have the option to do so by nesting paste graphlet operations together, starting from the bottom layer and going up to your top or base layer. This is what we would consider our Russian doll example, and I can demo how you can achieve that. So we'll pull up our penguins data table again, start with Graph Builder, and build our very top layer: a bar chart. Then let's build our second layer: a pie chart with species. And for our very last layer, let's do a scatter plot. OK, so now I have all three layers of what we will use to nest, so I will save the script of the scatter plot to my clipboard. Then, on the pie, I right-click and choose Paste Graphlet. Now when you hover, you can see that the scatter plot is in there and it is filtered by the species in this pie. I'm going to close this just for clarity, and now we can do the same thing to the pie: save its script, because it already has the scatter plot embedded, then go over to our bar and paste graphlet again. And now we have a workflow
that you can click and hover over and you can see all three layers that pop up when you're hovering over this bar. So that's how you would do your nested paste graphlets. And so we do want to point out that there are some JMP analytical platforms that already have pre integrated graphlets available. So these platforms include the functional data explorer, process screening, principal components, and multivariate control charts, and process capabilities. And we want to go ahead and quickly show you an example using the principal components. Lost my mouse. There we go. So I launch our table again and open up principal components. And let's do run this analysis. And if I open up the outlier analysis and hover over one of these points, boom, I can see that these graphlets are already embedded into this platform. So we highly suggest that you go and take a look at these platforms and play around with it and see what you like. And so that was a brief overview of some quick customizations you can do with hover label graphlets and I'm going to pass this presentation back to Nascif so he can move you through the plumbing that goes behind all of these features. Nascif Abousalh-Neto Thank you, Lisa. Okay, let's go back to my screen here. And we... I think I'll just go very quickly over her slides and we're back to plumbing, and, oh my god, what is that? This is the ugly stuff that's under the sink. But that's where you have all the tubing and you can make things really rock, and let me show them by giving a quick demo as well. So here Lisa was showing you the the histogram... the hover label presets that you have available, but you can also click here and launch the hover label editor and this is the guy where you have access to your JSL extension points, which is where you make, which is how those visualizations are created. Basically what happens is that when you hover over, JMP is gone to evaluate the JSL block and capture that as an in a thumbnail and put that thumbnail inside your hover label. That's pretty much, in a nutshell, how it goes. And the presets that you also have available here in the hover label, right, they basically are called generators. So if I click here on my preset and I go all the way down, you can see that it's generating the Graph Builder using the histogram element. That's how it does its trick. Click is a script that is gonna react to when you click on that thumbnail, but by default (and usually people stick with the default), if you don't have anything here, it's just, just gonna launch this on its own window, instead of capturing and scale down a little image. In here on the left you can see two other extension points we haven't really talked much about yet. But we will very soon. So I don't want to get ahead of myself. So, So let's talk about those extension points. So we created not just one but three extension points in JMP 15. And they are, they're going to allow you to edit and do different functionality to different areas of your hover label. So textlets, right, so let's say for example you wanted to give a presentation after you do your analysis, but you want to use the result of that analysis and present it to an executive in your company or maybe we've an end customer that wants a little bit more of detail in in a way that they can read, but you would like make that more distinct. So textlet allows you to do that. 
But since you're interfacing with data, you also want that to be not a fixed block of text but something dynamic, based on the data you're hovering over. So to define a textlet, you go back to that hover label editor and you can define JSL variables, or not; but if you want it to be dynamic, typically what you do is define a variable that's going to hold the content you want to display, and then you decorate that value using HTML notation. That is how you can select the font, select background and foreground colors, make it italic, and basically make the text as pretty or as rich as you need it to be. The next hover label extension is the one we call gridlet. If you remember, the original, or current, JMP hover label is basically a grid of name-value pairs. On the left you have the names, the equivalent of your column names, and on the right you have the values, which might be just a column cell for a particular row if it's a marked point; but if it's aggregated, like a bar chart, it's going to be a mean or a median, something like that. The default content, like Lisa said before, is derived both from whatever labeled columns you have in your data table and from whatever role assignments you have in your graph. So if it's a bar chart, you have your X, you have your Y, you might have an overlay variable, and everything that at some point contributes to the creation of that visual element. Well, with gridlets you now have pretty much total control of that little display. You can remove entries; it's very common that people don't want to see the very first row, which has the label and the number of rows. Some people find that redundant, and they can take it out. You can add something that is completely under your control; basically it's going to evaluate a JSL script to figure out what you want to display there. One use case I found was when someone wanted an aggregated value for a column that was not in the visualization; some people call those hidden columns or hidden calculations. Now you can do that, and have an aggregation over the same rows as the rest of the data being displayed in that visualization. You can rename entries. We usually add the summary statistic to the left of anything that comes from a calculated Y column; if you don't like that, now you can remove it or replace it with something else. And you can adjust details like changing the numeric precision, or making text bold, italic, or red. You can even, for example, make a value red and bold only if it is above a particular threshold, so that as I move over here, if the value is over the average of my data, I make it red and bold to call attention to it, and that happens automatically for you. And finally, graphlets, which we believe are going to be the most useful and most used extension. They certainly draw the most attention, because you have a whole image inside your tooltip. We've been seeing examples with data visualizations, but it's an image, so it can be a picture as well. It can be something you're downloading from the internet on the fly by making a web call; that's how I got the image of this little penguin. It's coming straight from Wikipedia; as you hover over, we download it, scale it, and put it here.
Or you can, for example, that's a very recent use case, someone had a database of pictures in the laboratory and they have pictures of the samples they were analyzing and they didn't want to put them on the data table because the data table would be too large. Well, now you can just get a column, turn that column into a file name, read from the file name, and boom, display that inside your tool tip. So when you're doing your analysis, you know, exactly, exactly what you're looking at. And just like graph...gridlets, we're talking about clickable content. So again, for example, if I wanted and I showed that when I click on this little thumbnail here, I can open a web page. So you can imagine that even as a way to integrate back with your company. Let's say you have web services that they're supported in your company, and you want to, at some point, maybe click on an image to make a call to kind of register or capture some data. Go talking for a web call to that web service. Now that's something you can do. So I like to call, we talk about drill in and drill down, that would be a drill out. That's basically JMP talking to the outside world using data content from your exploration. So let's look at those things in the little bit more detail. So those those visualizations that we see here inside the hover label, they are basically... that's applied to any visualization. Actually it's a combination of a graph destination and the data subset. So in the Graph Builder, for example, you'll say, I want the bar chart of islands by on my x axis and on my y axis, I want to show the average of the body mass of the penguins on that island. Fine. How do you translate that to a graphlet, right? Well, basically when you select the preset or when you write in your code if you want to do it, but the preset is going to is going to use our graph template. So basically, some of the things are going to be predefined like that. The bar element, although if you're writing it your own, you could even say I want to change my visualization depending on my context. That's totally possible. And you're going to fill that template with a graph roles and values and table data, table metadata. So, for example, let's say I have a preset of doing that categorical drill down. I know it's going to be a bar chart. I don't know what a bar chart is going to be, what's going to be on my y or my x axis. That's going to come from the current state of my baseline graph, for example, I'm looking at island. So I know I want to do a bar chart of another category. So that's when the next in hierarchy and the next column comes into play. I'm making that decision on the fly, based on the information that user is giving me and the graph that's being used. For example, if you look here at the histogram, it was a bar chart of island by body mass. This is a histogram of body mass as well. If I come here to the graph and change this column and then I go back and hover, this guy is going to reflect my new choice. That's this idea of getting my context and having a dynamic graph. The other part of the definition of visualization is the data subset. And we have a very similar pattern, right. We have...LDF is local data filter. So that's a feature that we already had in JMP, of course, right. And basically, I have a template that is filled out from my graph roles here. It's like if it was a bar chart, which means my x variable is going to be a grouping variable of island. 
I know I wanted to have a local data filter on island, and I want it to select this particular value so that it matches the value I was hovering over. This happens both when you're creating the hover label and when you're launching from it; but when we create the hover label, this is invisible. We basically create a hidden window to capture the image, so you never see that window. When you launch it, though, the local data filter is there, and as Lisa has shown, you can interact with it and even make changes to it so that you can progress your visual exploration on your own terms. So I've been talking about context a lot. This is actually something you need to be familiar with to develop your own graphlets. We call it the hover label execution context. You'll find information about it in our documentation; if you remember your JSL, it's basically a local block with lots of local variables that we define for you, and those variables capture all kinds of information that might be useful for someone writing a graphlet or a gridlet or a textlet. It's available for all of those extension points. Typically these are variables whose names start with an underscore, to prevent collisions with your data table column names, so they're kind of like reserved names. You'll see here that this is code that comes from one of our presets; by the way, that code is available to you through the hover label editor, so you can study it and see how it works. Here we're trying to find a new column to use in our new graph; it's that idea of being dynamic and reactive to the context. This function is going to look into the data table for that metadata: a list of measurement columns, so if the baseline graph is looking at body mass, body mass is going to be in this value; and a list of my groupings, so if it was a bar chart of island by body mass, we're going to have island here. Those are lists of column names. And then we also have the numeric values; anything that's calculated is going to be available to you. Maybe, like I said, you want to make a logical decision based on the value being above or below a threshold, so that you can color a particular line red or make it bold; you're going to use values that we provide to you. We also provide things that allow you to go back to the data table and fetch data yourself, like the row index of the first row in the list of rows that your visual element is covering; that's available to you as well. And there's even more data, like the where clause that corresponds to the local data filter you're executing in the context of, and the drill depth, which lets you keep track of how many times you have clicked on a thumbnail and opened a new visualization, and so on. For example, when we're talking about recursive visualizations, every recursion needs an exit condition. So here, for example, is how one of our presets calculates its exit condition: if I don't have anything more to show, I return empty, meaning no visualization; or if I would only show you one value, or my drill depth is greater than one, meaning I was drilling until I got to a point where there is only one value to show and some visualizations don't make sense, I can return empty as well.
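Here is a rough sketch of the kind of guard that sits at the top of a recursive graphlet Picture script. The names beginning with an underscore are stand-ins for the hover label execution context variables Nascif is describing (the real names are listed in the Hover Label Editor and the documentation), and the hard-coded columns are only for illustration, so read this as the shape of the logic rather than the preset's actual code.

// Illustrative only: context variable names and columns are placeholders.
If( local:_drillDepth > 1 & N Items( local:_rowList ) <= 1,
	Empty(),                                     // nothing left to break down, so suppress the graphlet
	Graph Builder(                               // otherwise draw the next-level breakdown
		Variables( X( :species ), Y( :body_mass_g ) ),   // a preset would pick X dynamically from the context
		Elements( Bar( X, Y ) )
	)
);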
That's just an example of the kinds of decisions that you can make your code using the hover label execution context. Now, I just wanted to kind of gives you a visual representation of how all those things come together again using the preset example. When you're selecting a preset, you're basically selecting the graph template, which is going to have roles that are going to be fulfilled from the graph roles that are in your hover label execution context. And so that's your data, your graph definition. And that date graph definition is going to be combined with the subset of observations resulting from the, the local data filter that was also created for you behind the scenes, based on the visual element you're hovering over. So when you put those things together, you have a hover label, we have a graphlet inside. And if you click on that graphlet, it launches that same definition in here and it makes the, the local data filter feasible as well. When, like Lisa was saying, this is a fully featured life visualization, not just an image, you can make changes to this guy to continue your exploration. So now we're talking, you should think in terms of, okay, now I have a feature that creates visualizations for me and allow me to create one visualization from another. I'm basically creating a visual workflow. And it's kind of like I have a Google Assistant or an Alexa in JMP, in the sense that I can...JMP is making me go faster by creating, doing visualizations on my behalf. And they might be, also they might be not, just an exploration, right. If you're happy with them, they just keep going. If you're not happy with them, you have two choices and maybe it's easier if I just show it to you. So like I was saying, I come here, I select a preset. Let's say I'm going to get a categoric one bar chart. So that gives me a breakdown on the next level. Right. And if I'm happy with that, that's great. Maybe I can launch this guy. Maybe I can learn to, whoops... Maybe I can launch another one for this feature. At the pie charts, they're more colorful. I think they look better in that particular case. But see, now I can even do things like comparing those two bar charts side by side. And let's...but let's say that if I keep doing that and it isn't a busy chart and I keep creating visualizations, I might end up with lots of windows, right. So that's why we created some modifiers to...(you're not supposed to do that, my friend.) You can just click. That's the default action, it will just open another window. If you alt-click, it launches on the previous last window. And if you control-click it launches in place. What do I mean by that? So, I open this window and I launched to this this graphlet and then I launched to this graphlet. So let's say this is Dream and Biscoe and Dream and Biscoe. Now I want to look at Torgersen as well. Right. And I want to open it. But if I just click it opens on its own window. If I alt-click, (Oh, because that's the last one. I hope. I'm sorry. So let me close this one.) Now if I go back here in I alt-click on this guy. See, it replaced the content of the last window I had open. So this way I can still compare with visualizations, which I think it's a very important scenario. It's a very important usage of this kind of visual workflow. Right. But I can kind of keep things under control. And I don't just have to keep opening window after window. And the maximum, the real top window management feature is if I do a control-click because it replaces the window. 
And then, then it's a really a real drill down. I'm just going on the same window down and down and now it's like okay, but what if I want to come back. Or if you want to come back and just undo. So you can explore with no fear, not going to lose anything. Even better though, even the windows you launch, they have the baseline graph built in on the bottom of the undo stack. So I can come here and do an undo and I go back to the visualizations that were here before. So I can drill down, come back, branch, you can do all kinds of stuff. And let's remember, that was just with one preset. Let's do something kind of crazy here. We've been talking, we've been looking at very simple visualizations. But this whole idea actually works for pretty much any platform in JMP. So let's say I want to do a fit of x by y. And I want to figure out how...now, I'm starting to do real analytics. How those guys fit within the selection of the species. Right. So I have this nice graph here. So I'm going to do that paste graphlet trick and save it to the clipboard. And I'm going to paste it to the graphlet now. So as you can see, we can use that same idea of creating a context and apply that to my, to my analysis as well. And again, I can click on those guys here and it's going to launch the platform. As long as the platform supports local data filters, (I should have given this ???), this approach works as well. So it's for visualizations but in...since in JMP, we have this spectrum where the analytics also have a visual component, so works with our analytics as well. And I also wanted to show here on that drill down. This is my ??? script. So I have the drill down with presets all the way, and I just wanted to go to the the bottom one where I had the one that I decorated with this little cute penguin. But what I wanted to show you is actually back on the hover label editor. Basically what I'm doing here, I'm reading a small JSL library that I created. I'm going to talk about that soon, right, and now I can use this logic to go and fetch visualizations. In this case I'm fetching it from Wikipedia using a web call. And that visualization comes in and is displayed on my visualization. It's a model dialogue. But also my click script is a little bit different. It's not just launching the guy; it's making a call to this web functionality after getting a URL, using that same library as well. So what exactly is it going to do? So when I click on the guy, it opens a web page with a URL derived from data from my visualization and this can be pretty much anything JSL can do. I just want to give us an example of how this also enables you integration with other systems, even outside of JMP. Maybe I want to start a new process. I don't know. All kinds of possibilities. That I apologize. So So there are two customized...advanced customization examples, I should say, that illustrate how you can use graphlets as a an extensible framework. They're both on the JMP Community, you can click here if you get the slides, but one is called the label viewer. I am sorry. And basically what it does is that when you hover over a particular aggregated graph, it finds all the images on the graph...on the data table associated with those rows and creates one image. And that's something customers have asked for a while. I don't want to see just one guy. I want to see if you have more of them, all of them. Or, if possible, right. So when you actually use this extension, and you click on...actually no, I don't have it installed so... 
And the wiki reader, which was the other one, is the one I just showed you. What I was saying is that when you click and launch on this particular image, it launches a small application that allows you to page through the different images in your data table, with a filter that you can control and so on. This is one that was done completely in JSL on top of this framework. So just to close up, what did we learn today? I hope you found that it's now very easy to add visualizations; you can visualize your visualizations, if you will. It's very easy to add those data visualization extensions using the porcelain features. You get not just richer detail in your thumbnails, but a new exploratory visual workflow, which you can customize to meet your needs by using either paste graphlet, if you want something easy, or JSL through the hover label editor. We're both very curious to see how you are going to use this in the field, so if you come up with some interesting examples, please let us know and send us a screenshot in the JMP Community. That's all we have today. Thank you very much. And when we give this presentation, we're going to be here for Q&A. So, thank you.
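As a footnote to the drill-out idea described in this talk: the Click script of a graphlet can be as simple as a one-line web call. Web() is the standard JSL function for opening a URL in the default browser; the context variable used to build the URL here is a made-up placeholder for whichever hovered value you want to look up.

// Hypothetical Click script for a drill-out: open a web page whose URL is derived from the hovered value.
Web( "https://en.wikipedia.org/wiki/" || Char( local:_hoverLabel ) );

The same pattern can, with appropriate setup, call a web service inside your organization instead, for example through New HTTP Request, which is one way the "drill out" integrations mentioned above could be wired up.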
Jeremy Ash, JMP Analytics Software Tester, JMP   The Model Driven Multivariate Control Chart (MDMVCC) platform enables users to build control charts based on PCA or PLS models. These can be used for fault detection and diagnosis of high dimensional data sets. We demonstrate MDMVCC monitoring of a PLS model using the simulation of a real world industrial chemical process, the Tennessee Eastman Process. During the simulation, quality and process variables are measured as a chemical reactor produces liquid products from gaseous reactants. We demonstrate fault diagnosis in an offline setting. This often involves switching between multivariate control charts, univariate control charts, and diagnostic plots. MDMVCC provides a user-friendly way to move between these plots. Next, we demonstrate how MDMVCC can perform online monitoring by connecting JMP to an external database. Measuring product quality variables often involves a time delay before measurements are available, which can delay fault detection substantially. When MDMVCC monitors a PLS model, the variation of product quality variables is monitored as a function of process variables. Since process variables are often more readily available, this can aid in the early detection of faults. Example Files Download and extract streaming_example.zip.  There is a README file with some additional setup instructions that you will need to perform before following along with the example in the video.  There are also additional fault diagnosis examples provided. Message me on the community if you find any issues or have any questions.       Auto-generated transcript...   Speaker Transcript Jeremy Ash Hello, I'm Jeremy Ash. I'm a statistician in JMP R&D. My job primarily consists of testing the multivariate statistics platforms in JMP, but I also help research and evaluate methodology, and today I'm going to be analyzing the Tennessee Eastman process using some statistical process control methods in JMP. I'm going to be paying particular attention to the Model Driven Multivariate Control Chart platform, which is a new addition to JMP. I'm really excited about this platform, and these data provided a new opportunity to showcase some of its features. First, I'm assuming some knowledge of statistical process control in this talk. The main thing you need to know about is control charts. If you're not familiar with these, they are charts used to monitor complex industrial systems to determine when they deviate from normal operating conditions. I'm not going to have much time to go into the methodology behind Model Driven Multivariate Control Chart, so I'll refer you to these other great talks, which are freely available, for more details. I should also mention that Jianfeng Ding was the primary developer of Model Driven Multivariate Control Chart, in collaboration with Chris Gotwalt, and Tanya Malden and I were testers. So the focus of this talk will be using multivariate control charts to monitor a real-world chemical process. Another novel aspect of this talk will be using control charts for online process monitoring; this means we'll be monitoring data continuously as it's added to a database and detecting faults in real time. So I'm going to start with the obligatory slide on the advantages of multivariate control charts. Why not just use univariate control charts? There are a number of excellent options in JMP, and univariate control charts are excellent tools for analyzing a few variables at a time.
However, quality control data sets are often high dimensional, and the number of charts that you need to look at can quickly become overwhelming. So multivariate control charts summarize a high-dimensional process in just a few charts, and that's a key advantage. But that's not to say that univariate control charts aren't useful in this setting; you'll see throughout the talk that fault diagnosis often involves switching between multivariate and univariate control charts. Multivariate control charts give you a sense of the overall health of a process, while univariate control charts allow you to look at specific aspects, so the information is complementary, and one of the main goals of Model Driven Multivariate Control Chart was to provide some tools that make it easy to switch between those two types of charts. One disadvantage of univariate control charts is that observations can appear to be in control when they're actually out of control in the multivariate sense. Here I have two IR control charts, for oil and density, and these two observations in red are in control; but oil and density are highly correlated, and these observations are outliers in the multivariate sense. In particular, observation 51 severely violates the correlation structure. So multivariate control charts can pick up on these types of outliers when univariate control charts can't. Model Driven Multivariate Control Chart uses projection methods to construct its control charts. I'm going to start by explaining PCA, because it's easy to build up from there. PCA reduces the dimensionality of your process variables by projecting into a low-dimensional space. This is shown in the picture to the right: we have p process variables and n observations, and we want to reduce the dimensionality of the process to a, where a is much less than p. To do this we use the P loading matrix, which provides the coefficients for linear combinations of our X variables; these give the score variables shown in the equations on the left. T times P transpose gives you predicted values for your process variables from the low-dimensional representation, and there's some prediction error; the score variables are selected in a way that minimizes this squared prediction error. Another way to think about it is that you're maximizing the amount of variance explained in X. PLS is more suitable when you have a set of process variables and a set of quality variables and you really want to ensure that the quality variables are kept in control; but these quality variables are often expensive or time-consuming to collect, and a plant can be making out-of-control quality for a long time before a fault is detected. PLS models allow you to monitor your quality variables as a function of your process variables, and you can see here that PLS will find score variables that maximize the variance explained in the Y variables. The process variables are often cheaper and more readily available, so PLS models can allow you to detect quality faults early and can make process monitoring cheaper. From here on out, I'm just going to focus on PLS models, because that's more appropriate for our example. So PLS partitions your data into two components. The first component is your model component; this gives you the predicted values. Another way to think about this is that your data have been projected into a model plane defined by your score variables, and T² charts will monitor variation in this model plane.
The second component is your error component. This is the distance between your original data and the predicted data, and squared prediction error charts, or SPE charts, will monitor variation in this component. We also provide an alternative, distance to model X plane (DModX), which is just a normalized version of SPE. The last concept that's important to understand for the demo is the distinction between historical and current data. Historical data are typically collected when the process is known to be in control. These data are used to build the PLS model and define normal process variation, and this allows a control limit to be obtained. Current data are assigned scores based on the model but are independent of the model. Another way to think about this is that we have a training and a test set, and the T² control limit is lower for the training data because we expect lower variability for observations used to train the model, whereas there's greater variability in T² when the model generalizes to a test set. Fortunately, there's some theory that's been worked out for the variance of T² that allows us to obtain control limits based on some distributional assumptions. In the demo we'll be monitoring the Tennessee Eastman process, so I'm going to present a short introduction to these data. This is a simulation of a chemical process developed by Downs and Vogel, two chemists at Eastman Chemical, and it was originally written in Fortran, but there are wrappers for it in MATLAB and Python now. The simulation was based on a real industrial process, but it was manipulated to protect proprietary information. The simulation covers the production of two liquid products from gaseous reactants; F is a byproduct that will need to be siphoned off from the desired product. The Tennessee Eastman process is pervasive in the literature on benchmarking multivariate process control methods. So this is the process diagram. It looks complicated, but it's really not that bad, so I'm going to walk you through it. The gaseous reactants A, D, and E flow into the reactor here; the reaction occurs, and product leaves as a gas. It's then cooled and condensed into a liquid in the condenser. Then we have a vapor-liquid separator that will remove any remaining vapor and recycle it back to the reactor through the compressor, and there's also a purge stream here that will vent byproduct and an inert chemical to prevent them from accumulating. The liquid product is then pumped through a stripper, where the remaining reactants are stripped off, and the final purified product leaves here in the exit stream. The first set of variables being monitored are the manipulated variables. These look like bow ties in the diagram; I think they're actually meant to be valves. The manipulated variables mostly control the flow rate through different streams of the process. These variables can be set to specific values within limits and have some Gaussian noise, and they can be sampled at any rate; we're using the default three-minute sampling interval. Some examples of the manipulated variables are the flow rate of the reactants into the reactor, the flow rate of steam into the stripper, and the flow of coolant into the reactor. The next set of variables are the measurement variables.
These are shown as circles in the diagram, and they're also sampled at three-minute intervals; the difference is that the measurement variables can't be manipulated in the simulation. Our quality variables will be the percent composition of the two liquid products; you can see the analyzer measuring the composition here. These variables are collected with a considerable time delay, so we're looking at the product in this stream because these variables can be measured more readily than the product leaving in the exit stream. And we'll also be building a PLS model to monitor our quality variables by means of our process variables, which have substantially less delay and a faster sampling rate. Okay, so that's the background on the data. In total there are 33 process variables and two quality variables. The process of collecting the variables is simulated with a series of differential equations, so this is just a simulation, but you can see that a considerable amount of care went into modeling this as a real-world process. So here's an overview of the demo I'm about to show you. We'll collect data on our process and then store these data in a database. I wanted to have an example that was easy to share, so I'll be using a SQLite database, but this workflow is relevant to most types of databases; most databases support ODBC connections. Once JMP connects to the database, it can periodically check for new observations and update the JMP table as they come in. And then, if we have a Model Driven Multivariate Control Chart report open with automatic recalc turned on, we have a mechanism for updating the control charts as new data come in. The whole process of adding data to a database will likely be going on on a separate computer from the computer doing the monitoring, so I have two sessions of JMP open to emulate this. Both sessions have their own journal, and the materials are provided on the Community. The first session will add simulated data to the database, and it's called the streaming session; the second session will update reports as data come into the database, and I'm calling that the monitoring session. One thing I really liked about the Downs and Vogel paper was that they didn't provide a single metric to evaluate the control of the process. I have a quote from the paper here: "We felt that the trade-offs among possible control strategies and techniques involved much more than a mathematical expression." So here are some of the goals they listed in their paper which are relevant to our problem: maintain the process variables at desired values, minimize variability of the product quality during disturbances, and recover quickly and smoothly from disturbances. We will assess how well our process achieves these goals using our monitoring methods. Okay, so to start off, I'm in the monitoring session journal, and I'll show you our first data set. The data table contains all the variables I introduced earlier: the first set are the measurement variables, the next set are the composition variables, and the last set are the manipulated variables. The first script attached here will fit a PLS model; it excludes the last hundred rows as a test set. And just as a reminder, this model is predicting our two product composition variables as a function of our process variables, but PLS modeling is not the focus of the talk, so I've already fit the model and output score columns here.
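As a quick reference for the charts that follow, these are the standard formulas behind the quantities Jeremy describes; they follow the usual PCA/PLS monitoring literature, and the exact scaling the platform uses may differ in detail.

$$X \approx \hat{X} = T P^{\top}, \qquad E = X - \hat{X}$$

$$T^2_i = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_a^2}, \qquad \mathrm{SPE}_i = \sum_{j=1}^{p} \left( x_{ij} - \hat{x}_{ij} \right)^2, \qquad \mathrm{DModX}_i \propto \sqrt{\mathrm{SPE}_i},$$

where $t_{ia}$ is the score of observation $i$ on component $a$, $s_a^2$ is the variance of that score in the historical data, $A$ is the number of model components, and $p$ is the number of process variables.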
And if we look at the column properties, you can see that there's an MDMCC Historical Statistics property that contains all the information about the model that you need to construct the multivariate control charts. One of the reasons Model Driven Multivariate Control Chart was designed this way is that, imagine you're a statistician and you want to share your model with an engineer so they can construct control charts: all you need to do is provide the data table with these formula columns. You don't need to share all the gory details of how you fit your model. So next I will use the score columns to create our control charts. On the left I have two control charts, T² and SPE. There are 860 observations that were used to estimate the model, and these are labeled as historical; then I have 100 observations that were held out as a test set. You can see in the limit summaries down here that I performed a Bonferroni correction for multiple testing based on the historical data. I did this up here in the red triangle menu, where you can set the alpha level to anything you want. I did this correction because the data are known to be from normal operating conditions, so we expect no observations to be out of control, and after this multiplicity adjustment there are zero false alarms. On the right are the contribution proportion heat maps. These indicate how much each variable contributes to the out-of-control signal; each observation is on the Y axis, and the contributions are expressed as proportions. You can see in both of these plots that the contributions are spread pretty evenly across the variables. And at the bottom I have a score plot. Right now we're just plotting the first score dimension versus the second score dimension, but you can look at any combination of the score dimensions using these drop-down menus or this arrow. Okay, so now that we're oriented to the report, I'm going to switch over to the monitoring session, which will stream data into the database. In order to do anything for this example, you'll need to have a SQLite ODBC driver installed. It's easy to do; you can just follow this link here. I don't have time to talk about this, but I created the SQLite database I'll be using in JMP, and I have instructions on how to do this and how to connect JMP to the database on my Community web page. This example might be helpful if you want to try this out on data of your own. I've already created a connection to this database, and I've shared the database on the Community. So I'm going to take a peek at the data tables in Query Builder; I can do that with a table snapshot. The first data set is the historical data; I've used this to construct the PLS model, and there are 960 observations that are in control. The next data table is the monitoring data table. This just contains the historical data at first, but I'll gradually add new data to it, and this is what our multivariate control chart will be monitoring. And then I've simulated the new data already and added it to this data table here; you can see it starts at timestamp 961, and there are another 960 observations, but I've introduced a fault at some time point. I wanted to have something easy to share, so I'm not going to run my simulation script and add to the database that way. I'm just going to take observations from this new data table and move them over to the monitoring data table using some JSL with SQL statements.
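For readers following along with the example files, here is a minimal sketch of what such a transfer can look like in JSL. The DSN, table names, and column handling are simplified placeholders rather than the script shipped with the talk, and character columns would need quoting in a real INSERT statement; the database functions themselves (Open Database, Create Database Connection, Execute SQL) are the standard JSL ones.

// Sketch only: DSN and table names are hypothetical; see the Community example for the real script.
dtNew = Open Database( "DSN=TEP;", "SELECT * FROM new_data", "new_data" );
dbc = Create Database Connection( "DSN=TEP;" );
biteSize = 20;                                   // rows added per "bite"
For( start = 1, start <= N Rows( dtNew ), start += biteSize,
	For( i = start, i <= Min( start + biteSize - 1, N Rows( dtNew ) ), i++,
		vals = "";                               // build one INSERT statement per row
		For( j = 1, j <= N Col( dtNew ), j++,
			vals ||= Char( dtNew[i, j] ) || If( j < N Col( dtNew ), ",", "" )
		);
		Execute SQL( dbc, "INSERT INTO monitoring VALUES (" || vals || ");" );
	);
	Wait( 2 );                                   // slow the stream so the chart updates are visible
);
Close Database Connection( dbc );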
And this is just a simple example emulating the process of new data coming into a database somehow. You might not actually do this with JMP, but it's an opportunity to show how you can do it with JSL. Next I'll show you the script we'll use to stream in the data. This is a simple script, so I'm just going to walk you through it real quick. The first set of commands will open the new data table from the SQLite database; it opens up in the background, so I have to deal with the window. Then I'm going to take pieces from this new data table and move them to the monitoring data table. I'm calling the pieces "bites," and the bite size is 20. Then this will create a database connection, which will allow me to send the database SQL statements, and this last bit of code will iteratively construct SQL statements that insert new data into the monitoring data table. So I'm going to initialize, and show you the first iteration of this loop. This is just a simple INSERT INTO statement that inserts the first 20 observations. I'll comment that out so it runs faster. And there's a wait statement down here; this will just slow down the stream so that we have enough time to see the progression of the data in the control charts. If I didn't have this, the streaming example would just be over too quickly. Okay, so I'm going to switch back to the monitoring session and show you some scripts that will update the report. I'll move this over to the right so you can see the report and the scripts at the same time. This "read from monitoring data" script is a simple script that checks the database every 0.2 seconds and adds new data to the JMP table, and since the report has automatic recalc turned on, the report will update whenever new data are added. I should add that, realistically, you probably wouldn't use a script that just iterates like this; you would probably use Task Scheduler on Windows or Automator on Macs to schedule the runs. And then the next script here will push the report to JMP Public whenever the report is updated. I was really excited that this is possible in JMP. It enables any computer with a web browser to view updates to the control chart; you can even view the report on your smartphone. So this makes it easy to share results across organizations. You could also use JMP Live if you wanted the reports to be on a restricted server. And then this script will recreate the historical data in the data table in case you want to run the example multiple times. Okay, so let's run the streaming script and look at how the report updates. You can see the data are in control at first, but then a fault is introduced and there's a large out-of-control signal; there's a plant-wide control system that's been implemented in the simulation, which brings the system to a new equilibrium. I'll give this a second to finish. And now that I've updated the control chart, I'm going to push the results to JMP Public. On my JMP Public page I have at first the control chart with the data in control at the beginning, and this should be updated with the addition of the new data. So if we zoom in on when the process first went out of control, it looks like that was sample 1125. I'm going to color that and label it so that it shows up in other plots. In the SPE plot it looks like this observation is still in control; which chart will catch faults earlier depends on your model
and how many factors you've chosen. We can also zoom in on that time point in the contribution plot, and you can see that when the process first goes out of control, there are a large number of variables contributing to the out-of-control signal; but when the system reaches a new equilibrium, only two variables have large contributions. So I'm going to remove these heat maps so that I have more room in the diagnostics section, and I've made everything pretty large so that the text shows up on your screen. If I hover over the first point that's out of control, you get a peek at the top 10 contributing variables. This is great for quickly identifying which variables are contributing the most to the out-of-control signal. I can also click on that plot and append it to the diagnostics section, and you can see that there are a large number of variables contributing to the out-of-control signal. Let me zoom in here a little bit. If one of the bars is red, that means the variable is out of control in a univariate control chart, and you can see this by hovering over the bars. I'm going to open a couple of those. These graphlets are IR charts for the individual variables with three-sigma control limits. You can see that for the stripper pressure variable, the observation is out of control in the univariate control chart, but the variable is eventually brought back under control by our control system, and that's true for most of the large contributing variables. I'll also show you one of the variables where the observation is in control. So once the control system responds, many variables are brought back under control and the process reaches a new equilibrium, but there's obviously a shift in the process. To identify the variables that are contributing to the shift, one thing you can look at is a mean contribution plot. If I sort this and look at the variables that contribute the most, it looks like just two variables have large contributions, and both of these are measuring the flow rate of reactant A in stream 1, which is coming into the reactor. These are measuring essentially the same thing, except one is a measurement variable and one is a manipulated variable. And you can see in the univariate control chart that there's a large step change in the flow rate, in this one as well, and this is the step change that I programmed into the simulation. So these contributions allow us to quickly identify the root cause. I'm going to present a few other alternate methods to identify the same cause of the shift. The reason is that in real data, process shifts are often more subtle, and some of the tools may be more useful in identifying them than others; we will consistently arrive at the same conclusion with these alternate methods, so this will show some of the ways these methods are connected. Down here I have a score plot, which can provide supplementary information about shifts in the T² plot. It's more limited in its ability to capture high-dimensional shifts, because only two dimensions of the model are visualized at a time; however, it can provide a more intuitive view of the process, as it visualizes it in a low-dimensional representation. In fact, one of the main reasons why multivariate control charts are split into T² and SPE in the first place is that this provides enough dimensionality reduction to easily visualize the process in a scatter plot.
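Earlier in this passage, Jeremy opens IR charts for individual contributing variables from inside the report. The same chart can also be launched directly; a one-line sketch is below, where the column name is a stand-in for whichever process variable you want to inspect.

// Individual and Moving Range chart for one process variable, with the default three-sigma limits.
Control Chart Builder( Variables( Y( :Stripper Pressure ) ) );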
So we want to identify the variables that are causing the shift. I'm going to color the points before and after the shift so that they show up in the score plot. Typically we would look through all combinations of the six factors, but that's a lot of score plots to look through, so something that's very handy is the ability to cycle through all the combinations quickly with this arrow down here; we can look through the factor combinations and find one where there's large separation. And if we want to identify where the shift first occurred in the score plots, we can connect the dots and see that the shift occurred around sample 1125 again. Another useful tool, if you want to identify the score dimensions where an observation shows the largest separation from the historical data without looking through all the score plots, is the normalized score plot. So I'm going to select a point after the shift and look at the normalized score plot. I'm actually going to choose another one, because I want to look at dimensions five and six. These plots show the magnitude of the score in each dimension, normalized so that the dimensions are on the same scale. And since the mean of the historical data is zero for each score dimension, the dimensions with the largest magnitude will show the largest separation between the selected point and the historical data. It looks like dimensions five and six show the greatest separation here, so I'm going to move to those. There's large separation here between our shifted data and the historical data, and the score plot visualization can also be more interpretable because you can use the variable loadings to assign meaning to the factors. Here we have too many variables to see all the labels for the loading vectors, but you can hover over them and see them, and if I look in the direction of the shift, the two variables that were the cause show up there as well. We can also explore differences between subgroups in the process with the group comparisons tool. To do that, I'll select all the points before the shift and call that the reference group, and everything after and call that the group I'm comparing to the reference. This contribution plot will give me the variables that are contributing the most to the difference between these two groups, and you can see that this also identifies the variables that caused the shift. The group comparisons tool is particularly useful when there are multiple shifts in a score plot or when you can see more than two distinct subgroups in your data. In our case, as we're comparing a group in our current data to the historical data, we could also just select the data after the shift and look at a mean contribution score plot, which will give us the average contributions of each variable to the scores in the orange group. And since large scores indicate a large difference from the historical data, these contribution plots can also identify the cause. These use the same formula as the contribution formula for T², but now we're just using the two factors from the score plot. Okay, let me find my PowerPoint again. So real quick, I'm going to summarize the key features of Model Driven Multivariate Control Chart that were shown in the demo. The platform is capable of performing both online fault detection and offline fault diagnosis.
There are many methods provided in the platform for drilling down to the root cause of the faults. I'm showing here some plots from the popular book Fault Detection and Diagnosis in Industrial Systems; throughout the book, the authors demonstrate how one needs to use multivariate and univariate control charts side by side to get a sense of what's going on in the process. One particularly useful feature of Model Driven Multivariate Control Chart is how interactive and user friendly it is to switch between these types of charts. So that's my talk. Here's my email if you have any further questions, and thanks to everyone who tuned in to watch this.
Meijian Guan, JMP Research Statistician Developer, SAS   Single-cell RNA-sequencing technology (scRNA-seq) provides a higher resolution of cellular differences and a better understanding of the function of an individual cell in the context of its microenvironment. Recently, it has been used to combat COVID-19 by characterizing transcriptional changes in individual immune cells. However, it also poses new challenges in data visualization and analysis due to its high dimensionality, sparsity, and varying heterogeneity across cell populations. JMP Project is a new way to organize data tables, reports, scripts as well as external files. In this presentation, I will show how to create an integrated Basic scRNA-seq workflow using JMP Project that performs standard exploration on a scRNA-seq data set. It first selects a set of high variable genes using a dispersion or a variance-stabilizing transformation (VST) method. Then it further reduces data dimension and sparsity by performing a sparse SVD analysis. It then generates an interactive report that consists of data overview, variable gene plot, hierarchical clustering, feature importance screening, and a dynamic violin plot on individual gene expression levels. In addition, it utilizes the R integration feature in JMP to perform t-SNE or UMAP visualizations on the cell populations if appropriate R packages are installed.     Auto-generated transcript...   Speaker Transcript Meijian Guan All right. Um, hi, everyone. Thank you so much for attending this presentation. I'm so happy that I have this opportunity to share the work I have   have been doing with JMP Life Science group and SAS Institute. So today's topic is going to be building a single-cell RNA-sequencing workflow with JMP Project. So this is a new feature I developed for JMP Genomics 10. If you don't know what is JMP Genomics, I will give you a brief   overview about it and the JMP project is a new feature, released on 14 and it's very nice tool can help you to organize your reports. So we took advantage of this new platform and organized a single-cell RNA-sequencing   workflow into it. So first of all, I just want to give you a little bit background about JMP, JMP Genomics is   one of the products from JMP family is built on top of SAS and JMP Pro. So it's taking advantage of both products which makes a very powerful analytical tool.   So it's designed for genomic data, so it can read in different types of genomic data, it can do preprocessing, it can handle next generation sequencing that analysis.   It is really good at differential gene expression and biomarker discovery, and many scientists using it for crop and livestock breeding. So it's a very powerful tool. I encourage everyone to check it out if you are doing anything related to genomics.   And next thing I want to share with you is the single-cell RNA sequencing. Many of you may not be very familiar with it.   So this is a relatively new technology used to examine that on a level from individual cells.   And comparing to the traditional RNA sequencing technology which is survey, the average expression level of a group of cells.   This, this new technology provides a higher resolution of cellular differences and it gives you a better understanding of the function of the individual cell in the context of its micro environment.   And it can help to do a lot of stuff like uncover new and rare cell populations, track trajectories of cell development, and identify differentially expresed genes between cell types. 
So it has very wide application.   One application recently is scientists using it to combat Covid 19 so it because it can be used to   characterizing transcriptional changes in immune cells and how to develop the vaccines and treatment. Also in addition to that, it's widely used in cancer research and widely used in immunology and in many other research fields. Um, so it's very powerful tool, but   it does have some challenges to analyze that data, so that's why we put together this workflow.   Just wanted to give you an overview of the top line of the single-cell RNA sequencing. So the first thing you need is to get a sample,   either from human or from animals. It could be a tumor or lab sample. And then you can isolate those samples into individual cells.   So after you isolate, you can do sequencing on every individual cell for all the genes you have. For example, in humans, we have about 30,000 genes. So the final product will look like this in our read count table. We have genes...30,000 genes in rows and we have   about sometimes half million cells as columns. So as you can see, meeting these very large data set has very high dimensions. Also you can notice the zeros in the table because   Because of the technical or biological limitations, there's no way we can detect every single gene in every single cell. So it's not uncommon to see 90% of cells actually are   zeros. So it's very sparse. Sparsity is another challenge when you analyze single cell RNA sequencing data.   But after you do preprocessing, cleaning up, and do dimension reduction, you can apply regular, like clustering and principal components, differential gene expression analysis on this data. So those will be mentioned in my workflow.   And I already mentioned this out that that I noticed challenges, including high dimensionality, high sparsity, and also there are varying heterogeneity across cell populations.   Technical noises and reproducibility, since there are so many different sequencing protocols so many different analytical packages.   In R or Python or other tools, it's very hard for you to follow exact steps to analyze your data. And if you mixed up the steps and didn't do things in correct order,   you may not be able to get a reproducible results. So that's one of the problems that we tried to solve here.   Just want to show you an example of single-cell RNA-sequencing data. This data will be used in my demonstration and it's a reduced blood sample data or ppm. See that I said we have cells in rows and genes in columns so it's   about 8000 columns and 100   rows, which would mean cells. And you can see those zeros, pretty much everywhere. I, I believe it's more than 90% of sparsity in this specific data set.   So what's in our new single-cell RNA-sequencing workflow in JMP Genomics 10? So for this workflow, we tried to build it   for those people who do not have very good technical background or not, do not have time to learn how to code and all those statistics. So in this workflow we put those steps in the right order for users to automatically execute the other steps in the workflow. And we also   provide a very interactive reports to help users navigate with us and change the parameters and check outs different selections.   So what's in this workflow, including data import progress, preprocessing and we have a variable gene selection method, which is the backbone of this workflow actually.   So it for variable gene selection that the goal for this method is to reduce a dimension of the genes.   
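As a rough sketch of the two selection criteria described next (these are the definitions commonly used in the single-cell literature, not necessarily the exact JMP Genomics implementation): the dispersion method ranks each gene by a normalized variance-to-mean ratio, while VST ranks genes by their variance after standardizing against a loess-fitted mean-variance trend:

\[
d_g \;=\; \frac{\sigma_g^{2}}{\mu_g},
\qquad
s_g^{2} \;=\; \frac{1}{n-1}\sum_{c=1}^{n}\left(\frac{x_{gc}-\mu_g}{\hat{\sigma}(\mu_g)}\right)^{2},
\]

where μ_g and σ_g² are the mean and variance of gene g across the n cells, x_gc is the count for gene g in cell c, and σ̂(μ_g) is the standard deviation predicted for that mean by the fitted trend. The top-ranked genes (for example, the 2,000 requested in the dialog) are kept for downstream analysis.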
Because for humans sample, we have 30,000 genes and not all of them are informative. So we try to pick the most informative ones.   dispersion method and   variability stabilizing transformation method based on lowest regression. So, I will not go into the details, but these two methods are widely used the research community and I'm pretty happy that we were able to reproduce them.   And we also apply sparse SVD to further reduce the dimensions. And so we also applied hierarchical clustering and a k means clustering.   We have feature importance screening using a boosting forest method in JMP and if you have R packages installed, we are directly call out to T-SNE and UMAP visualization   which is very popular using a single-cell RNA-sequencing analysis. And we also provided some dynamic visualizations including violin plot, ridgeline plot, dop plot; we also do differential gene expression. And so all the reports will be organized in a very integrated reports with JMP project.   So next I will do a demo. There are two goals in this demo. First one is to classify the cell populations in this PBMC data set. We try to find what are the cell types in this data set. The second goal is just to identify differentially expressed jeans across subtypes and conditions.   So first of all, let's go to JMP Genomics starter. So JMP Genomics interface looks quite different from regular JMP but   it's pretty easy to navigate. If you want to find the workflows basics   and basic single-cell RNA-sequencing workflow lives here, you click that you can bring up this interface. So the interface is pretty intuitive. I'll say you just provide a data set.   And you specify the QC options. What, what kinds of genes or cells do you want to remove for your analysis and variable gene selections. Which one method that you want to use, right. If you select a VST, you can also specify the number of genes you want to keep, 2000 or 3000   And the clustering options, right, how many principal components you want to use for the clustering and   either you wants hierarchical or k means clustering algorithms. And the more options, we have marker genes   to help you to add a list of marker genes you want to use to identify the cell populations, which is very handy tool here. And you can launch ANOVA and differential expression analysis. So this is a separate report.   I will not discuss this in this talk. So another thing we had is experiment example, right. If you add that basically you can provide any information related to start design like treatment information,   sex information. So this is   the simulated data here. I would just want to show you how we're gonna compare the gene expression levels and different measurements between groups.   And finally, we have embedding options which can call out to t-SNE or UMAP R packages if you have them installed. You can change different parameters for this to our algorithms.   So after you specify all those options, just go to run and then you have the report that looks like this one. So this is a   tabular report. There are a total of seven tabs in this report. I organized them in the order that you want to   how many genes in   in the cells, how many read counts or what's the percentage of mitochondria gene counts in your data and the correlations between this three measurements.   And we, you notice this left side, we have the action box. You can expand it and find options in this, in this box you can do many things with it.   
In in this tab, specifically, you can split the graph, based on the conditions you provided in the experimental design file. For example, we can do a treatment and we split   Drug1, Drug2, placebo. Then you can see if there's any difference between different groups, right, and we can unsplit if you want to go back to our original plot.   And the second tab is variable gene selection, which is the backbone of this workflow. The red dots mean those genes I selected for subsequent analysis and these   gray dots are the genes that will be discarded in analysis. And if you expand action box, you can see, since we use the VST, we specified 2000 genes   in this analysis. But if you change your mind, you can, you can, whenever you change your mind, you can type in a different number of genes and then click OK. So all the tabs will be refreshed as based on this new number.   So after you have a list of variable genes, what you are going to do is to further reduce the dimensions by performing sparse SVD analysis, which is equivalent to principal component analysis.   So after you apply SVD analysis, you can plot out the top two SVDs or principal components. Try to check the global structure of your data set.   So in this case, we can see there are two big groups in this data set, which is interesting. And also we provide a 3D plot to help you to further explore   your data. Sometimes there's, there are some insights that you cannot, that you cannot identify in a 2D plot; 3D plots sometimes can really provide additional value.   And we have those SVDs, depending on how many you selected (20 or 30), you can use them to   perform clustering. In this case we selected hierarchical clustering and we find nine clusters in your data set.   In addition to dendogram, we also offer a constellation plot   which I really like because this plot is similar to t-SNE or UMAP. It gives you a better idea about the distance between different groups, right.   For example, there are three groups, big groups, three clusters   Kind of distinct from other groups. And if you want to see where are they in the global structure here, this is the top three clusters I detect. Look at 3D plots, we can see   all those highlighted ones are over here. And then go to 2D again, this is one of two big clusters, you know, that I said I highlighted so it's interactivity really help you to   observe and visualize your data in multiple ways. And we also provide a parallel plot to help you to further identify the different patterns across different groups.   And the next tab is embedding, which means t-SNE and UMAP parts if you have R packaging installed. I will   call out to R, run the analysis, and bring back the data and visualize it in JMP. So here is a t-SNE plot. We have nine clusters very nicely being separated.   On the bottom is exactly the same plot but this time I colored them with the marker genes you provided; we have 14 marker genes. So you can, using this feature, switch to click through   to see where these genes are expressed, right. For example, there's a GNLY gene, highly expressed in this little cluster and we are wondering, what's our data? We select and go back and now we see all, most of them are from cluster nine, cluster eight.   So GNLY is a gene for NK cells. This is a marker gene for NK cells. So now we have idea about what a group of this cell is, right.   And also we have action buttons here, help you to do more things. If you want to switch to UMAP, if you prefer that, you can do it. 
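For readers curious what the call-out to R looks like in general, JMP's R integration in JSL follows the pattern sketched below. This is a minimal, hypothetical example assuming the Rtsne package is installed and that the retained SVD scores are in the current data table; the actual JMP Genomics workflow wraps these steps for you.

// Minimal JSL sketch of a round trip to R for a t-SNE embedding (hypothetical)
scores = Current Data Table() << Get As Matrix;   // numeric matrix of the retained SVD/PC scores
R Init();                                         // start the R session
R Send( scores );                                 // make the matrix available to R as 'scores'
R Submit( "library(Rtsne); set.seed(1); emb <- Rtsne(scores, dims = 2)$Y" );
emb = R Get( emb );                               // bring the two-column embedding back to JSL
R Term();                                         // close the R session
// emb can then be added as two new columns and plotted in Graph Builder, colored by cluster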
Now the plots are associated with UMAP. It's the exact same thing, but UMAP gives you a little bit better separation and can preserve more global structure in the visualization. We also provide some ways to help you remove cells that might be contaminated or have quality problems. For example, if we don't like a group of cells here, we can remove them from the visualization to make it cleaner, but you can always bring them back. And again, we can split the plots with the split graph button. This time we can split by gender, female and male, so we can compare the gene expression levels across the gender groups, which is pretty useful sometimes. And we can unsplit. The next tab provides more tools to visualize gene expression levels in the nine groups of cells. The first plot is a violin plot. Again, we have a feature switcher to help you go through all those genes in the different clusters. Depending on how tall each part of the graph is and what its density looks like, you can clearly see where those genes are highly expressed. For example, using the gene GNLY again, you can see it's highly expressed in cluster eight. In the middle, the second plot we provide is a ridgeline plot. A ridgeline plot organizes the clusters on the Y axis and the gene expression level on the X axis, but it basically shows you a similar thing, depending on what you prefer. For GNLY, again, we can see that cluster eight has GNLY highly expressed but not the other clusters. At the bottom we have another plot called a dot plot. This is a new plot we just added to this report. In addition to showing you the gene expression levels, the dot plot can also show you the percentage of the cells expressing that gene. For example, take a look at the PPBP gene. We can see that 100% of the cells in cluster seven express this PPBP gene, so this gene is the marker gene for PPBP cells. Now it's very clear that cluster seven is one type of blood cell, a PPBP cell. If you look at others, for example cluster two, only 12% of the cells express this gene. So there might be some contamination, but that group of cells is definitely not PPBP cells. This plot shows you both the expression level and the expression percentage in each cluster, which offers additional information. The next tab is also very useful. It's called feature screening. What it does is fit a boosting forest algorithm and then use the genes to predict the clusters, so the most important genes, the ones that contribute to the separation of the cells, are ranked in this table. The correct way to view these genes is to open the action box, select the top genes you want to visualize, maybe the top 35, and click OK. The next tab will then show only the 35 genes you selected. Those are the most informative genes; they can explain why those different groups of cells are separated. So again, you can click through the feature switcher and try to see the patterns. And if you notice, a lot of these genes, LYZ, CST3 and NKG7, are already in the marker genes we provided, which means this feature screening method is really successful at picking up the most important genes in your data set.
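The feature screening above is built into JMP Genomics. As a rough stand-in in plain JMP (not the same implementation), the Predictor Screening platform ranks predictors by their contribution in a bootstrap forest; a hypothetical call with the cluster label as the response and a few of the gene columns named above as predictors might look like this:

// Hypothetical sketch: rank genes by how well they predict cluster membership
// (Predictor Screening uses a bootstrap forest; the JMP Genomics report may use a different forest method)
dt = Current Data Table();
dt << Predictor Screening(
	Y( :Cluster ),                      // assumed name of the cluster label column
	X( :GNLY, :LYZ, :CST3, :NKG7 )      // in practice, all of the selected variable genes
);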
Another way you can do visualization is through the GTEx database. GTEx is a tissue-specific database that tells you which genes are expressed in which tissues of the human body. We can directly send the gene list to the database; you just click OK, we open the GTEx website, and it provides you a heat map with the top 35 genes. Now you can see where they are expressed across human tissues and organs, which is a very convenient way to see additional information. So now, with those marker genes being used, you can probably identify what groups of cells they are. There's one function here called recode. When you open it, you can recode those cluster numbers into actual cell names. For example, we already know that eight is NK cells. I already have names for every single one of them, so I just type them in: Monocytes; two is DC cells; three is FCGR3A+ monocytes; these are Naive CD4+ T cells; group five is Memory CD4+ T cells; group six is CD8+ T cells; seven is PPBP, as we already saw; and nine is B cells. With those entered, we click recode. Since all the plots and tabs are connected, you can now see that all the numbers have changed into actual cell names. It just helps you explore your data more easily: you know what those cells are, and you can do some more exploration in your plots, including the clustering plots, where you can see the cluster names have been changed into the actual cell names. So that's it for today's topic. If you have any questions, you can send me an email or leave a message on the JMP Community. Thank you so much for your time.
John Cromer, Sr. Research Statistician Developer, JMP   While the value of a good visualization in summarizing research results is difficult to overstate, selection of the right medium for sharing with colleagues, industry peers and the greater community is equally important. In this presentation, we will walk through the spectrum of formats used for disseminating data, results and visualizations, and discuss the benefits and limitations of each. A brief overview of JMP Live features sets the stage for an exciting array of potential applications. We will demonstrate how to publish JMP graphics to JMP Live using the rich interactive interface and scripting methods, providing examples and guidance for choosing the best approach. The presentation culminates with a showcase of a custom JMP Live publishing interface for JMP Clinical results, including the considerations made in designing the dialog, the mechanics of the publishing framework, the structure of JMP Live reports and their relationship to the JMP Clinical client reports and a discussion of potential consumption patterns for published reviews.     Auto-generated transcript...   Speaker Transcript John Cromer Hello everyone, Today I'd like to talk about two powerful products that extend JMP in exciting ways. One of them, JMP Clinical, offers rich visualization, analytical and data management capabilities for ensuring clinical trial safety and efficacy. The other, JMP Live, extends these visualizations to a secure and convenient platform that allows for a wider group of users to interact with them from a web browser. As data analysis and visualization becomes increasingly collaborative, it is important that both creating and sharing is easy. By the end of this talk, you'll see just how easy it is. First, I'd like to introduce the term collaborative visualization. Isenberg, et al., defines it as the shared use of computer supported interactive visual representations of data on more than one person with a common goal of contribution to join information processing activities. As I'll later demonstrate, this definition captures the essence of what JMP, JMP Clinical and JMP Live can provide. When thinking about the various situations in which collaborative visualization occurs, it is useful to consult the Space Time Matrix. In the upper left of this matrix, we have the traditional model of classroom learning and office meetings, with all participants at the same place at the same time. Next in the upper right, we have participants at different places interacting with the visualization at the same time. In the lower left, we have participants interacting at different times at the same location, such as in the case of shift workers. And finally, in the lower right, we have flexibility in both space and time with participants potentially located anywhere around the globe and interacting with the visualization at any time of day. So JMP Live can facilitate this scenario. A second way to slice through the modes of collaborative visualization is by thinking about the necessary level of engagement for participants. When simply browsing a few high-level graphs or tables, sometimes simple viewing can be sufficient. But with more complex graphics and for those in which the data connections have been preserved between the graphs and underlying data tables, users can greatly benefit by also having the ability to interact with and explore the data. 
This may include choosing a different column of interest, selecting different levels in a data filter, and exposing detailed data point hover text. Finally, authors who create visualizations often have a need to share them with others and, by necessity, will also have the ability to view, interact with and explore the data. JMP Live is well suited for viewing, interacting and exploring, and JMP and JMP Clinical for authors who require all of these abilities. A third way to think about formats and solutions is by the interactivity spectrum. Static reports, such as PDFs, are perhaps the simplest and most portable, but generally the least interactive. Interactive HTML, also known as HTML5, offers responsive graphics and hover text. JMP Live is built on an HTML5 foundation, but also offers server-side computations for regenerating the analysis. While the features of JMP Live will continue to grow over time, JMP offers even more interactivity. And finally, there are industry-specific solutions such as JMP Clinical, which are built on a framework of JMP and SAS and offer all of JMP's interactivity with some additional specialization. So when we lay these out on the interactivity spectrum, we can see that JMP Live fills the sweet spot of being portable enough for those with only a web browser to access, while offering many of the prime interactive features that JMP provides. So the product that I'll use to demonstrate creating a visualization is JMP Clinical. JMP Clinical, as I mentioned before, offers a way to conveniently assess clinical trial safety and efficacy. With several role-based workflows for medical monitors, writers, clinical operations and data managers, and three review templates, predefined or custom workflows can be conveniently reused on multiple studies, producing results that allow for easy exploration of trends and outliers. Several formats are available for sharing these results, from static reports and the in-product review viewer to, new to JMP Clinical, JMP Live reports. The product I'll use to demonstrate interacting on a shared platform is JMP Live. JMP Live allows users with only a web browser to securely and conveniently interact with the visualizations, and it lets you specify access restrictions for who can view both the graphics and the underlying data tables. With the ability to publish a local data filter and column switcher, the view can be refreshed in just a matter of seconds. Users can additionally organize their web reports through titles, descriptions and thumbnails, and leave comments that facilitate discussion between all interested parties. So explore the data on your desktop with JMP or JMP Clinical, publish to JMP Live with just a few quick steps, share the results with colleagues across your organization, and enrich the shared experience through communication and automation. So now I would like to demonstrate how to publish a simple graphic from JMP to JMP Live. I'm going to open the demographics data set from the sample study Nicardipine, which is included with JMP Clinical. I can do this either through the File > Open menu, where I can navigate to my data set, or with a short script: dt = Open(), followed by the path to my data table. So I'm going to click run script to open that data table. Okay. Now I'd like to create a simple visualization. Let's say I'd like a simple box plot. I'll click Graph, Graph Builder. Here I have a dialog for moving variables into roles. I'm going to move the study site identifier into the X role, age into Y, click box plot, and click Done.
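The scripted version of these steps, which the demo turns to next, might look roughly like the sketch below. The path placeholder and column names are taken from the narration and are assumptions; the presenter's actual script may differ in its options.

// Sketch: box plot of Age by Study Site Identifier wrapped with a local data filter
dt = Open( "<path to the Nicardipine demographics table>" );   // placeholder path
New Window( "Age by Study Site",
	Data Filter Context Box(
		H List Box(
			dt << Data Filter( Local, Add Filter( Columns( :Age, :Study Site Identifier ) ) ),
			dt << Graph Builder(
				Variables( X( :Study Site Identifier ), Y( :Age ) ),
				Elements( Box Plot( X, Y ) )
			)
		)
	)
);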
So here's one quick and easy way to create a visualization in JMP. Alternatively, I can do the same thing with the script. And so this block of code I have here, this encapsulates a data filter and a Graph Builder box plot into a data filter context box. So I'm going to run this block of code. And here you see, I have some filters and a box plot. Now, notice how interactive this filter is and the corresponding graph. I can select a different lower bound for age; I can type in a precise value, let's say, I'd like to exclude those under 30 and suppose I am interested in only the first 10 study side identifiers. OK. So now I'd like to share this visualization with some of my colleagues who don't have JMP but they have JMP Live. So one way to publish this to JMP Live is interactively through the file published menu. And here I have options for for my web report. Can see I have options for specifying a title, description. I can add images. I can choose who to share this report with. So at this point, I could publish this, but I'd like to show you how to do so using the script. So I have this chunk of code where I create a new web report object. I add my JMP report to the web report object. I issue the public message to the web report, and then I automatically open the URL. So let me go ahead and run that. You can see that I'm automatically taken to JMP Live with a very similar structure as my client report. My filter selections have been preserved. I can make filter selection changes. For example, I can move the lower bound for age down and notice also I have detailed data point hover text. I have filter-specific options. And I also have platform-specific options. So any time you see these menus. You can further explore those to see what options are available. Alright, so now that you've seen how to publish a simple graphic from JMP to JMP Live. How about a complex one, as in the case of a JMP Clinical report. So what I'm going to do is open a new review. I will add the adverse events distribution report to this review. I will run it with all default settings. And now I have my adverse events distribution report, which consists of column switchers for demographic grouping and stalking, report filters, an adverse events counts graph, tabulate object for counts and some distributions. Suppose I'm interested in stacking my adverse events by severity. I've selected that and now I have my stoplight colors that I've set for my adverse events for mild, moderate and severe. At this point I'm...I'd like to share these results with a colleague who maybe in this case has JMP, but there are certain times where they prefer to work through a web browser to to inspect and take a look at the visualizations. So this point, I will click this report level create live report button. I will... ...and that...and now I have my dialogue, I can choose to publish to either file or JMP Live. I can choose whether to publish the data tables or not, but I would always recommend to publish them for maximum interactivity. I can also specify whether to allow my colleagues to download the data tables from JMP Live. In addition to the URL, you can specify whether to share the results only with yourself, everyone at your organization or with specific groups. So for demonstration purposes, I will only publish for myself. I'll click OK. Got a notification to say that my web report has been published. Over on JMP Live, I have a very similar structure. 
At my report filters, my column switchers with my column, a column of interest preserved. You can see my axes and legends and colors have also carried over. Within this web report, I can easily collapse or expand particular report sections, and many of the sections off also offer detailed data point hover text and responsive updates for data filter changes. Another thing I'd like to point out is this Details button in the upper right of the live report, where I can get detailed creation information, a list of the data tables that republished, as well as the script. And because I've given users the ability to download these tables and scripts, these are download buttons for those for that purpose. I can also leave comments from my colleagues that they can then read and take further action on, for example, to follow up on an analysis. All right, so from my final demo, I would simply like to extend my single clinical report to a review consisting of two other reports enrollment patterns, and findings bubble plot. So I'm going to run these reports. Enrollment patterns plots patient enrollment over the course of a study by things like start date of disposition event, study day and study site identifier. Findings bubble plot, I will run on the laboratory test results domain. And this report features a prominent animated bubble plot, in which you can launch this animation. You can see how specific test results change over the course of a study. You can pause the animation. You can scroll to specific, precise values for study day and you can also hover over data points to reveal the detailed information for each of those points. create live report for review. I have a...have the same dialogue that you've seen earlier, same options, and I'm just going to go ahead and publish this now so you can see what it looks like when I have three clinical reports bundled together and in one publication. So when this operation completes, you will see that will be taken to an index page corresponding to report sections. And each thumbnail on this page corresponds to report section in which we have our binoculars icon on the lower left, that indicates how many views each page had. I have a three dot menu, where you can get back to that details view. If you click Edit, from here you can also see creation information and a list of data tables and scripts. And by clicking any of these thumbnails, I can get down to the report, the specific web report of interest. So just because this is one of my favorite interactive features, I've chosen to show you the findings bubble plot on JMP Live. Notice that it has carried over our study day, where we left off on the client, on study day 7. I can continue this animation. You can see study day counting up and you can see how our test results change over time. I can pause this again. I can get to a specific study day. I can do things like change bubble size to suit your preference. Again, I have data point hover text, I can select multiple data points and I have numerous platform specific options that will vary, but I encourage you to take a look at these anytime you see this three dot menu. So to wrap up, let me just jump to my second-last slide. So how was all this possible? Well, behind the scenes, the code to publish a complex clinical report is simply a JSL script that systematically analyzes a list of graphical report object references and pairs them with the appropriate data filters, column switchers, and report sections into a web report object. 
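At its core, that pattern reduces to a few JSL messages, the same ones described for the simple example earlier: create a web report object, add reports to it, publish, and open the URL. This is a minimal sketch of that pattern, not the presenter's actual publishing framework:

// Minimal JSL sketch of publishing open reports to JMP Live
webreport = New Web Report();       // container for one or more report sections
webreport << Add Report( gb );      // gb = a reference to an open platform report (e.g., Graph Builder)
url = webreport << Publish();       // publish to the connected JMP Live server and return the URL
Web( url );                         // open the published report in a browser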
The JSL publish command takes care of a lot of the work for you, for bundling the appropriate data tables into the web report and ensuring that the desired visibility is met. Power users who have both products can use the download features on JMP Live to conveniently share to conveniently adjust the changes ...to to... make changes on their clients and to update their... the report that was initially published, even if they were not the original authors. And then the cycle can continue, of collaboration between those on the client and those on JMP Live. So, as you can see, both creating and sharing is easy. With JMP and JMP Clinical, collaborative visualization is truly possible. I hope you've enjoyed this presentation, and I look forward to any questions that you may have.  
Mike Anderson, SAS Institute, SAS Institute Anna Morris, Lead Environmental Educator, Vermont Institute of Natural Science Bren Lundborg, Wildlife Keeper, Vermont Institute of Natural Science   Since 1994, the Vermont Institute of Natural Science’s (VINS) Center for Wild Bird Rehabilitation (CWBR), has been working to rehabilitate native wild birds in the northeastern United States. One of the most common raptor patients CWBR treats is the Barred Owl. Barred Owls are fairly ubiquitous east of the Rocky Mountains. Their call is the familiar “Who cooks for you, who cooks for you all.” They have adapted swiftly to living alongside people and, because of this, are commonly presented to CWBR for treatment. As part of a collaboration with SAS, technical staff from JMP and VINS have been analyzing the admission records from the rehabilitation center. Recently we have used a combination of Functional Data Analysis, Bootstrap Forest Modeling, and other techniques to explore how climate and weather patterns can affect the number of Barred Owls that arrive at VINS for treatment — specifically for malnutrition and related ailments. We found that a combination of temperature and precipitation patterns results in an increase in undernourished Barred Owls being presented for treatment. This session will discuss our findings, how we developed them, and potential implications in the broader context of climate change in the Northeastern United States.       Auto-generated transcript...   Speaker Transcript Mike Anderson Welcome, everyone, and thank you for joining us. My name is Anna Morris and I'm the lead environmental Educator at the Vermont Institute of Natural Science or VINS in Quechee, Vermont. I'm Bren Lundborg, wildlife keeper at VINS center for wildlife rehabilitation and I'm Mike Anderson JMP systems engineer at SAS. We're excited to present to you today our work on the effects of local weather patterns on the malnutrition and death rates of wild barred owls in Vermont. This study represents 18 years of data collected on wild owls presented for care at one avian rehabilitation clinics and unique collaboration between our organization and the volunteer efforts of Mike Anderson at JMP. Let's first get to know our study species, the barred owl, with the help of a non releasable rehabilitative bird serving as an education ambassador at the VINS nature center. Yep, this owl was presented for rehabilitation in Troy, New Hampshire in 2013 and suffered eye damage from a car collision from which she was unable to recover. Barred owls like this one are year-round residents of the mixed deciduous forests of New England, subsisting on a diet that includes mammals, birds, reptiles, amphibians, fish and a variety of terrestrial and aquatic invertebrates. However, the prey they consume differs seasonally, with small mammals composing a larger portion of the diet in the winter. Their hunting styles differ in winter as well, due to the presence of snowpack, which can shelter small mammals from predation. Barred owls are known to use the behavior of snow punching or pouncing downward through layers of snow to catch prey detected auditorially. Here's a short video demonstrating this snow punching behavior. I've seen in that quick clip barred owls can be quite tolerant of human altered landscapes, with nearly one quarter of barred owl nests utilizing human structures. 
There are also the most frequently observed owl species by members of the public in Vermont, according to the citizen science project, iNaturalist, with 468 research grade observations of wild owls logged. As such, barred owls are commonly presented to wildlife rehabilitation clinics by people who discover injured animals. The Vermont Institute of Natural Sciences Center for wild bird rehabilitation or CWBR is the federal and state licensed wildlife rehabilitation facility located in Quechee, Vermont. All wild living avian species that are legal to rehabilitate in the state are submitted as patients to CWBR and we received an average of 405 patients yearly from 2001 to 2019, representing 193 bird species. 90% of patients presented at CWBR come from within 86 kilometers of the facility. Of the patients admitted during the 18 year period of the study, 11% were owls of the order Strix varia, comprising six species, with barred owls being the most common. However, year to year, the number of barred owls received as patients by CWBR has varied widely compared to another commonly received species, the American Robin. Certain years, such as the winter of 2018 to 2019 had been anecdotally considered big barred owl years by CWBR staff and other rehabilitation centers in the Northeastern US for the large number presented as patients. One explanation proposed by local naturalists attributes the big year phenomenon to shifts in weather patterns. When freeze/thaw cycles occur over short time scales, these milder, wetter winters are thought to pose challenges to barred owls relying on snow plunging for prey capture. Specifically the formation of a layer of ice on top of the snow can prevent owls from capturing prey using this snow plunging technique as the owls may not be able to penetrate this ice layer. In order to feed...I lost my place. In order to feed the animals may therefore use alternative hunting locations or styles or suffer from weakness due to malnutrition, which could lead to adverse interactions with humans, resulting in injury. This study was undertaken to determine if a relationship exists between higher than average winter precipitation and the number of barred owls presented during those years at CWBR for rehabilitation. Though there are several possible explanations for the variation in the number of patients associated with regional weather, we sought to determine if there was support for the ice layer hypothesis by further investigating whether barred owls presented during wetter winters exhibited malnutrition as part of the intake diagnosis in greater proportion than in dryer winters. This would suggest that obtaining food was a primary difficulty, leading to the need for rehabilitation, rather than a general population increase, which would likely lead to a proportional increase in all intake categories. Initially we expected that there would be a fairly simple time series analysis relationship to this. We went and looked at the original data for the admissions and just to compare as, as Bren said, just to compare the data between the barred owls and the American robins, you can see for bad years, which I've marked here in blue, except for the the gray one which is actually had a hurricane involved, we can see there's a very strong periodic signal associated with the robins. We can see that the year-round resident barred owls should have something resembling a fairly steady intake rate, but we see some significant changes in that year to year. 
Looking at the contingency analysis, we can see that the green bands, the starvation, correlates fairly nicely with those years where we have big barred owl years. Again, pointing out 2008, 2015, 2019, these being ski season years instead, which I'll make clear in a moment. 2017 doesn't show up, but it does have a big band of unknown trauma and cause, and that was from a difference in how they were triaging the incoming animals that year. The one, the one trick to working with this is that we needed to use functional data analysis to be able to take the year over year trends and turn them into a signal that we can analyze effectively against weather patterns and other data that we were able to find. Looking here, it's fairly easy to see that those years that we would call bad years have a very distinctive dogear...dogear...dogleg type pattern. You can see 2008, 2017, 2019, 2015. Again, Most importantly, those signals tend to correlate most strongly with this first Eigen function in our principal component analysis. You can see quite clearly here that component one does a great job with discriminating between the good years and the bad years with that odd hurricane year right in the middle where it should be. You can also look at the profiler for FPC one and you can see that as we raise and lower that profiler, we see that dogleg pattern become more pronounced. The next question is, is how do we get the data for that kind of an analysis? How do we get the weather data that we think is important? Well, it turns out that there's a great organization that's a ski resort about 20 miles away from here that has been collecting data from as far back as the 50s. And they've also been working with naturalists and conservation efforts, providing their ecological or their environmental data to researchers for different projects, and they gave us access to their database. This is an example of base mountain temperature at Killington, Vermont, and you can see that the the bad years, again colored in blue here, tend to have a flatter belly in their low temperature. You can see for instance, looking at 2007, the first one in the upper left corner, you can see that there's a steep drop down, followed by a steep incline back up. Whereas 2008, which is one of the bad years for for owl admissions, we have a fairly flat, and if not maybe in a slightly inverted peak in the middle. And that's fairly consistent, with the exception of maybe 2015, throughout the other throughout the other the other years. So I took all of that data and used functional data explorer to get the principal components for our responses. We end up having, therefore, a functional component on the response and a functional component on the factors. This is an example of one of those for the...what turns out to be one of the driving factors of this analysis, and you can see it does a very nice job of pulling out the principal components. The one we're going to be interested in in a moment is this Eigenfunction4. It doesn't look like much right now, but it turns out to be quite important. So let's put all this together. I use the combination of generalized regression, along with the autovalidation strategy that was pioneered by Gotwalt and Ramsay a few years ago to build a model of this of the behavior. We can see we get a fairly good actual by predictive plot for that. We get a nice r square around 99% and looking at the reduced model, we see that we have four primary drivers, the cumulative rain that shows up. That makes sense. 
We can't have rain without...we can't have ice without rain. Also a temperature factor, we need temperature to have a strong...to have ice. But also we have the sum of the daily snowfall or the daily snowfall. That's a max total snowfall per year, and the sum of the daily...the daily rainfall as well. And taking all of this, we can put together start to put together a picture of what bad barred owl years look like from a data driven standpoint. We can see fairly clearly. I'm going to show you first again what a bad barred...what a bad year looks like from the standpoint of the of the of the the admission rates. And we can see here. Let me show you what a bad, bad year looks like. That's a bad year; that's a good year, fairly dramatic difference. Now we're going to have to pay fairly close attention to the... We're gonna have to pay fairly close attention to the other factors to see because it's a very subtle change in the the temperatures, in the rain falls that trigger this good year/bad year. It's it's kind of interesting how how tiny the effects are. So first, this is the total snowfall per year. And we're going to pay attention to the slope of this curve for a good year and then for a bad year. Fairly tiny change, year over year. So it's a it's a subtle change, but that subtle change is one of the big drivers. We need to have a certain amount of snowfall present in order to facilitate the snow diving. The other thing, if we look at rain, we're going to look at the belly of this rainfall right here, around, around week 13 in the in the ski season. There's a good year. And there's a bad year. Slightly more rain earlier in the year, and with a flatter profile going into spring. And again, looking at the cumulative rain over the season, a good year tends to be a little bit drier than a bad year. And lastly, most importantly, the temperature. This one is actually fairly...this is that belly effect that we were seeing before. We see in early years or in good years that we have that strong decline down and strong climb out in the temperature, but for bad years we get just slightly more bowlshaped effect overall. And I'm going to turn it over to Bren to talk about what that means in terms of barred owl malnutrition. Malnutrition has a significant negative impact upon survival of both free ranging owls and those receiving treatment at a rehabilitation facility. Detrimental effects include reduced hunting success, lessened ability to compete with other animals or predator species for food, and reduced immunocompetence. Some emaciated birds are found too weak to fly and are at high risk for complications such as refeeding syndrome during care. For birds in care, the stress of captivity, as well as healing from injuries such as fractures and traumatic brain injuries can double the caloric needs of the patient, thus putting further metabolic stress on an already malnourished bird. Additionally, scarcity or unavailability of food may push owls closer to human populated areas, leading to increased risk for human related causes of mortality. Vehicle strikes are the most common cause of intakes for barred owls in all years and hunting near roads and human occupied habitats increases that risk. In the winter of 2018 to 2019, reports of barred owls hunting at bird feeders and stalking domestic animals, such as poultry, were common. 
Hunting at bird feeders potentially increases exposure to pathogens, as they are common sites of disease transmission, it may lead to higher rates of infectious diseases such as salmonellosis and trichomoniasis. Difficult winters also provide extra challenges for first year barred owls. Clutching barred owls are highly dependent on their parents and will remain with them long after being able to fly and hunt. And once parental support ends, they are still relatively inexperienced hunters facing less prey availability and harsher conditions in their first winter. Additionally, the lack of established territories may lead them to be more likely to hunt near humans, predisposing them to risks such as vehicle collision related injuries. Previous research on a close relative of the barred owl, the northern spotted owl of the Pacific Northwest, shows a decline in northern spotted owl in fecundity and survival associated with cold, wet weather in winter and early spring. In Vermont, the National Oceanic and Atmospheric Administration has projected an increase in winter precipitation of up to 15% by the middle of the 21st century, which may have specific impacts on populations of barred owls and their prey sources. The findings of this study provide important implications for the management of barred owl populations and those of related species in the wake of a change in climate. Predicted changes to regional weather patterns in Vermont and New England forecast that cases of malnourished barred owls will only increase in frequency over the next 20 to 30 years as we continue to see unusually wet winters. Barred owls, currently listed by the International Union for Conservation of Nature as a species of least concern with a population trend that is increasing, will likely not find themselves threatened with extinction rapidly. However, ignoring this clear threat to local populations may cascade through the species at large and exacerbate the effects of other conservation concerns, such as accidental poisoning and nest site loss. These findings also highlight the need for protocols to be established on the part of wildlife rehabilitators and veterinarians for the treatment of severe malnourishment in barred owls, such as to avoid refeeding syndrome, and provide the right balance of nutrients for recovery from an often lethal condition. Rehabilitation clinics would benefit from a pooling of knowledge and resources to combat this growing issue. Finally, this study shows yet another way in which climate change is currently affecting the health of wildlife species around us. Individual and community efforts to reduce human impacts on the climate will not be sufficient to reduce greenhouse gas emissions at the scale necessary to halt or reverse the damage that has been done. Action on the part of governments and large corporations must be taken, and individuals and communities have the responsibility to continue to demand that action. We would like to thank the staff and volunteers at the Vermont Institute of Natural Science, as well as at JMP, who helped collect and analyze the data presented here, especially Gray O'Tool. We'd also like to thank the Killington Ski Resort for providing us with the detailed weather data. Thank you.  
Roland Jones, Senior Reliability Engineer, Amazon Lab126 Larry George, Engineer who does statistics, Independent Consultant Charles Chen SAE MBB, Quality Manager, Applied Materials Mason Chen, Student, Stanford University OHS Patrick Giuliano, Senior Quality Engineer, Abbott Structural Heart   The novel coronavirus pandemic is undoubtedly the most significant global health challenge of our time. Analysis of infection and mortality data from the pandemic provides an excellent example of working with real-world, imperfect data in a system with feedback that alters its own parameters as it progresses (as society changes its behavior to limit the outbreak). With a tool as powerful as JMP it is tempting to throw the data into the tool and let it do the work. However, using knowledge of what is physically happening during the outbreak allows us to see what features of the data come from its imperfections, and avoid the expense and complication of over-analyzing them. Also, understanding of the physical system allows us to select appropriate data representation, and results in a surprisingly simple way (OLS linear regression in the ‘Fit Y by X’ platform) to predict the spread of the disease with reasonable accuracy. In a similar way, we can split the data into phases to provide context for them by plotting Fitted Quantiles versus Time in Fit Y by X from Nonparametric density plots. More complex analysis is required to tease out other aspects beyond its spread, answering questions like "How long will I live if I get sick?" and "How long will I be sick if I don’t die?". For this analysis, actuarial rate estimates provide transition probabilities for Markov chain approximation to SIR models of Susceptible to Removed (quarantine, shelter etc.), Infected to Death, and Infected to Cured transitions. Survival Function models drive logistics, resource allocation, and age-related demographic changes. Predicting disease progression is surprisingly simple. Answering questions about the nature of the outbreak is considerably more complex. In both cases we make the analysis as simple as possible, but no simpler.     Auto-generated transcript...   Speaker Transcript Roland Jones Hi, my name is Roland Jones. I work for Amazon Lab 126 is a reliability engineer.   When myself and my team   put together our abstracts for the proposal at the beginning of May, we were concerned that COVID 19 would be old news by October.   At the time of recording on the 21st of August, this is far from the case. I really hope that by the time you watch this in October, there will...things will be under control and life will be returning to normal, but I suspect that it won't.   With all the power of JMP, it is tempting to throw the data into the tool and see what comes out. The COVID 19 pandemic is an excellent case study   of why this should not be done. The complications of incomplete and sometimes manipulated data, changing environments, changing behavior, and changing knowledge and information, these make it particularly dangerous to just throw the data into the tool and see what happens.   Get to know what's going on in the underlying system. Once the system's understood, the effects of the factors that I've listed can be taken into account.   Allowing the modeling and analysis to be appropriate for what is really happening in the system, avoiding analyzing or being distracted by the imperfections in the data.   It also makes the analysis simpler. 
The overriding theme of this presentation is to keep things as simple as possible, but no simpler.   There are some areas towards the end of the presentation that are far from simple, but even here, we're still working to keep things as simple as possible.   We started by looking at the outbreak in South Korea. It had a high early infection rate and was a trustworthy and transparent data source.   Incidentally, all the data in the presentation comes from the Johns Hopkins database as it stood on the 21st of August when this presentation was recorded.   This is a difficult data set to fit a trend line to.   We know that disease naturally grows exponentially. So let's try something exponential.   As you can see, this is not a good fit. And it's difficult to see how any function could fit the whole dataset.   Something that looks like an exponential is visible here in the first 40 days. So let's just fit to that section.   There is a good exponential fit. Roland Jones What we can do is partition the data into different phases and fit functions to each phase separately.   1, 2, 3, 4 and 5.   Partitions were chosen where the curve seem to transition to a different kind of behavior.   Parameters in the fit function were optimized for us in JMP' non linear fit tool. Details of how to use this tool are in the appendix.   Nonlinear also produced the root mean square error results, the sigma of the residuals.   So for the first phase, we fitted an exponential; second phase was logarithmic; third phase was linear; fourth phase, another logarithmic; fifth phase, another linear.   You can see that we have a good fit for each phase, the root main square error is impressively low. However, as partition points were specifically chosen where the curve change behavior, low root mean square area is to be expected.   The trend lines have negligible predictive ability because the partition points were chosen by looking at existing data. This can be seen in the data present since the analysis, which was performed on the 19th of June.   Where extra data is available, we could choose different partition points and get a better fit, but this will not help us to predict beyond the new data.   Partition points do show where the outbreak behavior changes, but this could be seen before the analysis was performed.   Also no indication is given as to why the different phases have a different fit function.   This exercise does illustrate the difficulty of modeling the outbreak, but does not give us much useful information on what is happening or where the outbreak is heading. We need something simpler.   We're dealing with a system that contains self learning.   As we as society, as a society, learn more about the disease, we modify behavior to limited spread, changing the outbreak trajectory.   Let's look into the mechanics of what's driving the outbreak, starting with the numbers themselves and working backwards to see what is driving them.   The news is full of COVID 19 numbers, the USA hits 5 million infections and 150,000 deaths. California has higher infections than New York. Daily infections in the US could top 100,000 per day.   Individual numbers are not that helpful.   Graphs help to put the numbers into context.   The right graphs help us to see what is happening in the system.   Disease grows exponentially. One person infects two, who infect four, who infect eight.   Human eyes differentiate poorly between different kinds of curves but they differentiate well between curves and straight lines. 
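That point can be written down directly: if infections grow at a roughly constant daily rate r, the growth is exponential, and taking logs turns it into a straight line whose slope is that rate:

\[
N(t) = N_0\,e^{r t}
\quad\Longrightarrow\quad
\ln N(t) = \ln N_0 + r\,t .
\]

So exponential growth or decline plots as a straight line on a log scale, with slope +r or −r, which is far easier to judge by eye than curvature on a linear scale.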
Plotting on a log scale changes the exponential growth and exponentially decline into straight lines.   Also on the log scale early data is now visible where it was not visible on the linear scale. Many countries show one, sometimes two plateaus, which were not visible   in the linear graph. So you can see here for South Korea, there's one plateau, two plateaus and, more recently, it's beginning to grow for third time.   How can we model this kind of behavior?   Let's keep digging.   The slope on the log infections graph is the percentage growth.   Plotting percentage growth gives us more useful information.   Percentage growth helps to highlight where things changed.   If you look at the decline in the US numbers, the orange line here, you can see that the decline started to slacken off sometime in mid April and can be seen to be reversing here in mid June.   This is visible but it's not as clear in the infection graphs. It's much easier to see them in the percentage growth graph.   Many countries show a linear decline in percentage growth when plotted on a log scale. Italy is a particularly fine example of this.   But it can also be seen clearly in China,   in South Korea,   and in Russia, and also to a lesser extent in many other countries.   Why is this happening?   Intuitively, I expect that when behavior changes, growth would drop down to a lower percent and stay there, not exponentially decline toward zero.   I started plotting graphs on COVID 19 back in late February, not to predict the outbreak, but because I was frustrated by the graphs that were being published.   After seeing this linear decline in percentage growth, I started paying an interest in prediction.   Extrapolating that percentage growth line through linear regression actually works pretty well as a predictor, but it only works when the growth is declining. It does not work at all well when the growth is increasing.   Again, going back to the US orange line, if we extrapolate from this small section here, where it's increasing which is from the middle of June to the end...to the beginning of July,   we can predict that we will see 30% increase by around the 22nd of July, that will go up to 100% weekly growth by the 20th...26th of August, and it will keep on growing from there, up and up and up and up.   Clearly, this model does not match reality.   I will come back to this exponential decline in percentage growth later. For now, let's keep looking at the, at what is physically going on as the disease spreads.   People progress from being susceptible to disease to being infected to being contagious   to being symptomatic to being noncontagious to being recovered.   This is the Markoff SIR model. SIR stands for susceptible, infected, recovered. The three extra stages of contagious, symptomatic and noncontagious helped us to model the disease spread and related to what we can actually measure.   Note the difference between infected and contagious. Infected means you have the disease; contagious means that you can spread it to others. It's easy to confuse the two, but they are different and will be used in different ways, further into this analysis.   The timing shown are best estimates and can vary greatly. Infected to symptomatic can be from three to 14 days and for some infected people,   they're never symptomatic.   The only data that we have access to is confirmed infections, which usually come from test results, which usually follow from being symptomatic.   
Even if testing is performed on non-symptomatic people, there's about a five-day delay from being infected to having a positive test result. So we're always looking at old data. We can never directly observe the true number of people infected. So the disease progresses through individuals from top to bottom in this diagram. We have a pool of people that are contagious; that pool is fed by people that are newly infected becoming contagious, and the pool is drained by people that are contagious becoming non-contagious. The disease spreads through the population from left to right. New infections are created when susceptible people come into contact with contagious people and become infected. The newly infected people join the queue waiting to become contagious, and the cycle continues. This cycle is controlled by transmission, how likely a contagious person is to infect a susceptible person per day, and by reproduction, the number of people that a contagious person is likely to infect while they are contagious. This whole cycle revolves around the number of people contagious and the transmission or reproduction. The time individuals stay contagious should be relatively constant unless COVID-19 starts to mutate. The transmission can vary dramatically depending on social behavior and the size of the susceptible population. Our best estimate is that the days contagious averages out at about nine. So we can estimate people contagious as the number of people confirmed infected in the last nine days. In some respects, this is an underestimate because it doesn't include people that are infected but not yet symptomatic, or that are asymptomatic, or that don't yet have a positive test result. In other respects, it's an overestimate because it includes people who were infected a long time ago but are only now being tested as positive. It's an estimate. From the estimate of people contagious, we can derive the percentage growth in contagious. It doesn't matter if the people contagious is an overestimate or underestimate. As long as the percentage error in the estimate remains constant, the percentage growth in contagious will be accurate. Percentage growth in contagious matters because we then use it to derive transmission. The derivation of the equation relating the two can be found in the appendix. Note that this equation allows you to derive transmission, and then reproduction, from the percentage growth in contagious, but it cannot tell you the percentage growth in contagious for a given transmission. This can only be found by solving numerically. I have outlined how to do this using JMP's Fit Model tool in the appendix. Reproduction and transmission are very closely linked, but reproduction has the advantage of ease of understanding. If it is greater than one, the outbreak is expanding, out of control. Infections will continue to grow and there will be no end in sight. If it is less than one, the outbreak is contracting, coming under control. There are still new infections, but their number will gradually decline until they hit zero. The end is in sight, though it may be a long way off. The number of people contagious is the underlying engine that drives the outbreak. People contagious grows and declines exponentially. We can predict the path of the outbreak by extrapolating this growth or decline in people contagious. Here we have done it for Russia, for Italy and for China.
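The contagious-pool bookkeeping described above can be sketched in a few lines. The rolling nine-day sum comes straight from the talk; the closed-form link from growth rate to transmission and reproduction used here (transmission ≈ growth + 1/days contagious, reproduction = transmission × days contagious) is a common SIR-style simplification assumed for illustration, not necessarily the derivation in the presenter's appendix, and the case counts are hypothetical.

```python
# Sketch of the estimates described above: people contagious as the rolling
# 9-day sum of new confirmed infections, and transmission/reproduction derived
# from its growth rate. The growth -> transmission formula is an assumed
# SIR-style simplification, not the presenter's appendix derivation.
import numpy as np
import pandas as pd

DAYS_CONTAGIOUS = 9
new_cases = pd.Series([120, 135, 150, 170, 185, 210, 230, 260, 290,
                       320, 350, 390, 430, 470, 520, 570, 630])  # hypothetical

contagious = new_cases.rolling(DAYS_CONTAGIOUS).sum().dropna()
daily_growth = (contagious.iloc[-1] / contagious.iloc[0]) ** (1 / (len(contagious) - 1)) - 1

transmission = daily_growth + 1 / DAYS_CONTAGIOUS     # per contagious person per day
reproduction = transmission * DAYS_CONTAGIOUS         # people infected while contagious
print(f"growth={daily_growth:.1%}/day, transmission={transmission:.3f}, R={reproduction:.2f}")
```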
Remember the interesting observation from earlier: the percentage growth in infections declines exponentially. Here's why. If reproduction is less than one and constant, people contagious will decline exponentially towards zero. People contagious drives the outbreak. The percentage growth in infections is proportional to the number of people contagious. So if people contagious declines exponentially, the percentage growth in infections will also decline exponentially. Mystery solved. The slope of people contagious plotted on a log scale gives us the contagious percentage growth, which then gives us transmission and reproduction through the equations on the last slide. Notice that there's a weekly cycle in the data. This is particularly visible in Brazil, but it's also visible in other countries as well. This may be due to numbers getting reported differently at the weekends, or to people being more likely to get infected at the weekend. Either way, we'll have to take this seasonality into account when using people contagious to predict the outbreak. Because social behavior is constantly changing, transmission and reproduction change as well, so we can't use the whole of the data to estimate reproduction. We chose 17 days as the period over which to estimate reproduction. We found that one week was a little too short to filter out all of the noise, two weeks gave better results, and two and a half weeks was even better. Having the extra half week evened out the seasonality that we saw in the data. There is a time series forecast tool in JMP that will do all of this for us, including the seasonality, but because we're performing the regression on small sections of the data, we didn't find the tool helpful. Here are the derived transmission and reproduction numbers. You can see that they can change quickly. It is easy to get confused by these numbers. South Korea is showing a significant increase in reproduction, but it's doing well. The US, Brazil, India and South Africa are doing poorly, but seem to have a reproduction of around one or less. This is a little confusing. To help reduce the confusion around reproduction, here's a little bit of calculus. Driving a car, the gas pedal controls acceleration. To predict where the car is going to be, you need to know where you are, how fast you're traveling, and how much you're accelerating or decelerating. In a similar way, to know where the pandemic is going to be, we need to know how many infections there are, which is the equivalent of distance traveled. We need to know how fast the infections are expanding, or how many people are contagious, both of which are the equivalent of speed. And we need to know how fast the people contagious is growing, which is the transmission or reproduction, which is the equivalent of acceleration. There is a slight difference. Distance grows linearly with speed, and speed grows linearly with acceleration. Infections do grow linearly with people contagious, but people contagious grows exponentially with reproduction. There is a slight difference, but the principle's the same. The US, Brazil, India and South Africa have all traveled a long distance: they have high infections, and they're traveling at high speed: they have high contagious. Even a little bit of acceleration has a very big effect on the number of infections. South Korea, on the other hand, is not going fast; it has low contagious.
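A minimal sketch of the regression just described: fit a straight line to the log of people contagious over the most recent 17 days, read the slope as the daily contagious growth rate, and extrapolate. The counts below are made up.

```python
# Sketch of the 17-day window regression: the slope of log(people contagious)
# is the daily contagious growth rate, and 17 days spans the weekly reporting
# cycle about 2.5 times, damping the weekend seasonality. Hypothetical values.
import numpy as np

contagious_17 = np.array([5200, 5100, 5150, 4980, 4900, 4790, 4850, 4700,
                          4620, 4560, 4610, 4480, 4400, 4330, 4380, 4250, 4180.0])
days = np.arange(contagious_17.size)

slope, intercept = np.polyfit(days, np.log(contagious_17), 1)
print(f"contagious growth ~ {100 * slope:.1f}% per day")    # negative => declining
forecast_14d = np.exp(intercept + slope * (days[-1] + 14))  # extrapolate 2 weeks
print(f"projected contagious in 14 days: {forecast_14d:,.0f}")
```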
So has the headroom to respond to the blip in acceleration and get things back under control without covering much distance   Also, when the number of people contagious is low, adding a small number of new contagious people produces a significant acceleration. Countries that have things under control are prone to these blips in reproduction.   You have to take all three factors into account   (number of infections, people contagious and reproduction) to decide if a country is doing well or doing poorly.   Within JMP there are a couple of ways to perform the regression to get the percentage growth of contagious. There's the Fit Y by X tool and there's the nonlinear tool. I have details on how to use both these tools in the appendix. But let's compare the results they produce.   The graphs shown compare the results from both tools. The 17 data points used to make the prediction are shown in red.   The prediction line from both tools are just about identical, though there are some noticeable differences in the confidence lines.   The confidence lines for the non linear, tool are much better. The Fit Y by X tool transposes that data into linear space before finding the best fit straight line.   This results in a lower cost...in the lower conference line pulling closer to the prediction line after transposing back into the original space.   Confidence lines are not that useful when parameters that define the outbreak are constantly changing. Best case, they will help you to see when the parameters have definitely changed.   In my scripts, I use linear regression calculated in column formulas, because it's easy to adjust with variables. This allows the analysis to be adjusted on the fly without having to pull up the tool in JMP.   I don't currently use the confidence lines in my analysis. So I'm working on a way to integrate them into the column formulas.   Linear regression is simpler and produces almost identical results. Once again, keep it simple.   We have seen how fitting an exponential to the number of people contagious can be used to predict whether people contagious will be in the future, and also to derive transmission.   Now that we have a prediction line for people contagious, we need to convert that back into infections.   Remember new infections equals people contagious and multiplied by transmission.   Transmission is the probability that a contagious person will infected susceptible person per day.   The predicted graphs that results from this calculation are shown. Note that South Korea and Italy have low infections growth.   However, they have a high reproduction extrapolated from the last 17 days worth of data. So, South Korea here and Italy here, low growth, but you can see them taking off because of that high reproduction number.   The infections growth becomes significance between two and eight weeks after the prediction is made.   For South Korea, this is unlikely to happen because they're moving slowly and have the headroom to get things back under control.   South Korea has had several of these blips as it opens up and always manages to get things back under control.   In the predicted growth percent graph on the right, note how the increasing percentage growth in South Korea and this leads will not carry on increasing indefinitely, but they plateau out after a while.   Percentage growth is still seen to decline exponentially, but it does not grow exponentially.   It plateaus out.   So to summarize,   the number of people contagious is what drives the outbreak.   
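Converting a contagious forecast back into infections, as described above, can be sketched like this. It uses the relationship quoted in the talk (new infections per day ≈ people contagious × transmission), with the same simplified transmission link as the earlier sketch; every number is a hypothetical placeholder.

```python
# Sketch of turning a contagious forecast back into cumulative infections.
# The transmission formula is the same assumed simplification as before.
import numpy as np

contagious_now = 4180.0       # from the regression sketch above
daily_growth   = -0.014       # slope of log(contagious), per day
transmission   = daily_growth + 1 / 9       # assumed SIR-style simplification
cum_infections = 250_000.0    # confirmed infections to date (hypothetical)

for day in range(1, 29):                       # extrapolate four weeks ahead
    contagious = contagious_now * np.exp(daily_growth * day)
    cum_infections += contagious * transmission
print(f"projected cumulative infections in 4 weeks: {cum_infections:,.0f}")
```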
This metric is not normally reported, but it's close to the number of new infections over a fixed period of time.   New infections in the past week is the closest regular reported proxy, the number of people contagious. This is what we should be focusing on, not the number of infections or the number of daily new infections.   Exponential regression of people contagious will predict where the contagious numbers are likely to be in the future.   Percentage growth in contagious gives us transmission and reproduction.   The contagious number and transmission number can be combined to predict the number of new infections in the future.   That prediction method assumes the transmission and reproduction are constant, which they aren't. They change their behavior.   But the predictions are still useful to show what will happen if behavior does not change or how much behavior has to change to avoid certain milestones.   The only way to close this gap is to come up with a way to mathematically model human behavior.   If any of you know how to do this, please get in touch. We can make a lot of money, though only for short amount of time.   This is the modeling. Let's check how accurate it is by looking at historical data from the US.   As mentioned, the prediction works well when reproduction's constant but not when it's changing.   If we take a prediction based on data from late April to early May, it's accurate as long as the prediction number stays at around the same level of 1.0   The reproduction number stays around 1.0.   After the reproduction number starts rising, you can see that the prediction underestimates the number of infections.   The prediction based on data from late June to mid July when reproduction was at its peak as states were beginning to close down again,   that prediction overestimates the infections as reproduction comes down.   The model is good at predicting what will happen if behavior stays the same but not when behavior is changing.   How can we predict deaths?   It should be possible to estimate the delay between infection and death.   And the proportion of infections that result in deaths and then use this to predict deaths.   However, changes in behavior such as increasing testing and tracking skews the number of infections detected.   So to avoid this skew also feeding into the predictions for deaths, we can use the exact same mathematics on deaths that we used on infections. As with infections, the deaths graph shows accurate predictions when deaths reproduction is stable.   Note that contagious and reproduction numbers for deaths don't represent anything real.   This method works because because deaths follow infections and so follow the same trends and the same mathematics. Once again, keep it simple.   We have already seen that the model assumes constant reproduction. It also does not take into account herd immunity.   We are fitting an exponential, but the outbreak really follows the binomial distribution.   Binomial and a fitted exponential differ by less than 2% with up to 5% of the population infected. Graphs demonstrating this are in the appendix.   When more than 5% of the population is no longer susceptible due the previous infection or to vaccination, transmission and reproduction naturally decline.   So predictions based on recent reproduction numbers will still be accurate, however long-term predictions based on an old reproduction number with significantly less herd immunity will overestimate the number of infections.   
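Because the deaths forecast reuses the infections machinery unchanged, one natural structure is a single function applied to either series. This is a structural sketch only, with placeholder data and assumed window lengths.

```python
# The deaths forecast reuses the infections machinery: wrap the rolling-pool
# estimate and log-linear regression in one function and feed it either series.
import numpy as np
import pandas as pd

def forecast(series: pd.Series, window: int = 9, fit_days: int = 17, ahead: int = 14) -> float:
    """Rolling 'pool' estimate, log-linear fit over fit_days, extrapolate ahead days."""
    pool = series.rolling(window).sum().dropna().iloc[-fit_days:]
    x = np.arange(len(pool))
    slope, intercept = np.polyfit(x, np.log(pool), 1)
    return float(np.exp(intercept + slope * (x[-1] + ahead)))

daily_cases  = pd.Series(np.linspace(900, 600, 60))   # placeholder data
daily_deaths = pd.Series(np.linspace(40, 25, 60))
print(forecast(daily_cases), forecast(daily_deaths))  # same math, both series
```

As the talk notes, the "contagious" and "reproduction" values produced for deaths don't represent anything physical; the method works only because deaths follow infections and therefore follow the same trends.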
On the 21st of August, the US had per capita infections of 1.7%. If only 34% of infected people have been diagnosed as infected, and there is data that indicates that this is likely, we are already at the 5% level where herd immunity begins to have a measurable effect. At 5% it reduces reproduction by about 2%. What can the model show us? Reproduction tells us whether the outbreak is expanding (greater than 1, which is the equivalent of accelerating) or contracting (less than 1, the equivalent of decelerating). The estimated number of people contagious tells us how bad the outbreak is, how fast we're traveling. Per capita contagious is the right metric to choose appropriate social restrictions. The recommendations for social restrictions listed on this slide are adapted from those published by the Harvard Global Health Institute; there's a reference in the appendix. What they recommend is: when there are fewer than 12 people contagious per million, test and trace is sufficient. When we get up to 125 contagious per million, rigorous test and trace is required. At 320 contagious per million, we need rigorous test and trace and some stay-at-home restrictions. Greater than 320 contagious per million, stay-at-home restrictions are necessary. At the time of writing, the US had 1,290 contagious per million, down from 1,860 at the peak in late July. It's instructive to look at the per capita contagious in various countries and states when they decided to reopen. China and South Korea had just a handful of people contagious per million. Europe was in the tens of people contagious per million, except for Italy. The US had hundreds of people contagious per million when it decided to reopen. We should not really have reopened in May. This was an emotional decision, not a data-driven decision. Some more specifics about the US reopening. As I said, the per capita contagious in the US at the time of writing was 1,290 per million, with a reproduction of 0.94. With this per capita contagious and reproduction, it will take until the ninth of December to get below 320 contagious per million. The lowest reproduction during the April lockdown was 0.86.
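For reference, the per-capita-contagious bands quoted above (adapted from the Harvard Global Health Institute guidance the presenter cites) could be encoded in a small hypothetical helper like this:

```python
# Hypothetical helper encoding the restriction bands quoted in the talk.
def recommended_restrictions(contagious_per_million: float) -> str:
    if contagious_per_million < 12:
        return "test and trace"
    if contagious_per_million < 125:
        return "rigorous test and trace"
    if contagious_per_million < 320:
        return "rigorous test and trace + some stay-at-home restrictions"
    return "stay-at-home restrictions"

print(recommended_restrictions(1290))   # the U.S. figure quoted for 21 August
```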
PATRICK GIULIANO, Senior Quality Engineer, Abbott Charles Chen, Continuous Improvement Expert, Statistics-Aided Engineering (SAE), Applied Materials Mason Chen, High School Student, Stanford Online High School   Cooked foods such as dumplings are typically prepared without precise process control on cooking parameters. The objective of this work is to customize the cooking process for dumplings based on various dumpling product types. During the cooking process in dumpling preparation, the temperature of the water and time duration of cooking are the most important factors in determining the degree to which dumplings are cooked (doneness). Dumpling weight, dumpling type, and batch size are also variables that impact the cooking process. We built a structured JMP DSD platform with special properties to build a predictive model on cooking duration. Internationally recognized ISO 22000 Food Safety Management and the Hazard Analysis Critical Control Point (HACCP) schemas were adopted. JMP Neural Fit techniques using modern data mining algorithms were compared to RSM. Results demonstrated the prevalence of larger main effects from factors such as: boiling temperature, product type, dumpling size/batch as well as interaction effects that were constrained by the mixture used in the dumpling composition. JMP Robust Design Optimization, Monte Carlo Simulation and HACCP Control Limits were employed in this design/analysis approach to understand and characterize the sensitivity of dumpling cooking factors on the resulting cooking duration. The holistic approach showed the synergistic benefit of combining models with different projective properties, where recursive partition-based AI models estimate interaction effects using classification schema and classical (Stepwise) regression modeling provides the capability to interpret interactions of 2nd order, and higher, including potential curvature in quadratic terms. This paper has demonstrated a novel automated dumpling cooking process and analysis framework which may improve process throughout, lower the cost of energy, and reduce the cost of labor (using AI schema). This novel methodology has the potential to reshape thinking on business cost estimation and profit modeling in the food-service industry.     Auto-generated transcript...   Speaker Transcript Patrick Giuliano All right. Well, welcome everyone.   Thank you all for taking the time to watch this presentation.   Preparing the Freshest Steamed Dumplings.   And my name is Patrick Giuliano and my co authors are Mason Chen and Charles Chen from Applied Materials, as well as Yvanny Chang.   Okay, so today I'm going to tell you about how I harnessed...my team and I harness the power of JMP to really understand about   dumpling cooking.   And so that the general problem statement here is that most foods like dumplings are made without precise control of cooking parameters.   And the taste of a dumpling, and as well as other outputs to measure how good a dumpling is, is adversely affected by improper cooking time and this is intuitive to everyone who's enjoyed food so   we needn't to talk too much about that. But   Sooner or later AI and robotics will be an important part of the food industry.   
And our recent experience with Covid 19 has really highlighted that and and so I'm going to talk a little bit about how we   can understand the dumpling process better using a very multi faceted modeling approach, which uses many of JMP's modeling capabilities, including robust Monte Carlo design optimization.   So why dumplings?   Well dumplings are very easy to cook.   And by cooking them, of course, we kill any foreign particles that may be living on them.   And cooking can involve very limited human interaction.   So of course with that, the design and the process space related to cooking is very intuitive and extendable   and we can consider the physics associated with this problem and try to use JMP to help us really understand and investigate the physics better.   AI is really coming sooner or later because of Covid 19, of course, and   why would robotic cooking of dumplings be coming? Well   And also other questions might be, what are the benefits? What are the challenges of cooking dumplings in an automated way in a robotic setting?   And of course, this could be a challenge because actually robots don't have the nose to smell. And so because of that, that's a big reason why, in addition to   an advanced and multifaceted modeling approach, it's important to consider some other structured criteria.   And later in this presentation, I'm going to talk a little bit, a little bit about the HACCP criteria and how we integrated that in order to solve our problem in a more structured way.   Okay, so   before I dive into a lot of the interesting JMP analysis, I'd like to briefly provide an introduction into heat transfer physics, food science and how different heat transfer mechanisms affect the cooking of dumplings.   So as you can see in this slide, there's a   Q   at the top of the diagram and the upper right and that Q is referred to...it refers to the heat flux density, which is the amount of energy that   flows through a unit area per unit time, and the direction of temperature change.   From the point of view of physics, proteins and raw and boiled meat differ in their amounts of energy. An activation energy barrier has to be overcome in order to turn raw meat protein structure into a denatured or compactified structure as shown here.   in this picture at the left.   So the first task of the cook, when boiling meat in terms of physics, is to increase the temperature throughout the volume of the piece   at least   To reach the temperature of the denaturation.   Later, I'm going to talk about the most interesting finding of this particular phase of the experiment where we discovered that there was a temperature cut off.   And and intuitively, you would think that below a certain temperature dumplings would be cooked...wouldn't be cooked properly, they would be to soggy, and above a certain temperature, perhaps they would also be too soggy or they may be burned or crusty.   One other final note about the physics here is that at the threshold for boiling, the surface temperature of the water fluctuates and bubbles will rise to the surface of the boiler   and break apart and collapse and that can make it difficult to gather...capture and...excuse me...to capture accurate readings of temperature.   So that leads us into some...what are some of the tools that we used to conduct an experiment?   Well,   of course, we used a boiling cooker and that's very important.   
Of course, we used a something to measure temperature and for this we used an infrared thermometer and we used a timer, of course, and we used a mass balance to weigh the dumpling and all the constituents going into the dumpling.   We might consider something called Gare R&R in future studies and where we may quantify the repeatability and reproducibility of our of our measurement tools.   In this experiment, we didn't, but that is very important, because this helps us maximize the precision of our model estimates by minute...minimizing the noise components associated with our measurement process.   And those noise components could not only be   a fact...a factor of say that accuracy tolerance for the gauge, but they they could also be due to how the person interacts with the with the measurement itself.   And, and, in particular, I'm going to talk a little bit about the challenge with measuring boiling and cooking temperature at high at high temperature.   Okay so briefly,   this...we set this experiment up as a designed experiment. And so we had to decide on the tools first.   We had to decide on how we would make the dumpling. So we needed a manufacturing process and appropriate   role players in that process. And then we had to design a structured experiment. And to do that we use the definitive screening design   and looked at some characteristics of the design to ensure that the design was performing optimally for our experiment.   Next we executed the experiment.   And then   we measured and recorded the response.   And of course, finally, the fun part,   we got to effectively interpret the data and JMP.   And these are graphs that the right here that are showing scatter plot matrices generated in JMP, just using the graph function.   And these actually give us an indication of the uniformity that prediction space. I'll talk a little bit of more more about that later...then next...in the coming slides.   Okay, so here's our data. Here's our data collection plan and at the right is the response that we measured, which is a dumpling rising time or cooking time.   We collected 18 runs in a DSD, which we generated in JMP using the DSD platform and in under the DOE menu.   And we collected information on the mass of the meat, the mass of the veggies going in, the mass of the...the type of the meat, rather, the composition of the vegetables (being either cabbage or mushroom),   and of course the total weight, the sizes of the batch that we cooked, the number of dumplings per batch, and the water temperature.   So this slide just highlights some of the the amazing power of a DSD and and I won't go into this too much, but DSDs are very lauded for their flexible and powerful modeling characteristics.   And they allow the great potential for conducting Blitz screening and optimise optimization in a single experiment.   This chart at the at the right is a correlation matrix generated in JMP, in its designed diagnostics platform of the DOE menu and it and it's it's particularly powerful for showing   the extent of aliasing or confounding among all the factor effects in your model. And what this graphic shows clearly is that   by the darkest blue, there's no correlation and, as the as the correlation increases and we get it to shades of gray, and then finally, as we get to very positive correlation, we get to shades of red. 
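The aliasing check behind that correlation map is straightforward to reproduce outside JMP: build the interaction and quadratic columns from the coded factor settings and look at their absolute pairwise correlations. The tiny coded matrix below is made up for illustration; it is not the actual 18-run dumpling design, so its correlations will not match the plot.

```python
# Sketch of the aliasing check shown in the correlation map: given coded factor
# settings, build two-factor-interaction and quadratic columns and inspect the
# |correlation| between all terms (dark = near 0, red = near 1 in the JMP plot).
import itertools
import pandas as pd

X = pd.DataFrame({                      # made-up coded runs, -1/0/+1
    "water_temp": [-1, 1, -1, 1, 0, -1, 1, 0],
    "meat":       [-1, -1, 1, 1, -1, 0, 0, 1],
    "batch":      [ 1, -1, -1, 1, 0, 0, -1, 1],
})
cols = X.copy()
for a, b in itertools.combinations(X.columns, 2):
    cols[f"{a}*{b}"] = X[a] * X[b]      # two-factor interactions
for a in X.columns:
    cols[f"{a}^2"] = X[a] ** 2          # quadratic terms

alias_map = cols.corr().abs()           # |correlation| between all terms
print(alias_map.round(2))
```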
So what we're really seeing is that main effects are completely uncorrelated with each other, which is what we like to see, and main effects are uncorrelated with two-factor interactions and with quadratic effects, which are up in this right quadrant. The quadratic effects are actually only partially correlated with each other, and then you have these higher-order interaction terms, which are really only partially correlated with interaction effects. These types of characteristics make this particular design superior to the typical Resolution III and Resolution IV fractional factorial designs that we used to be taught before DSDs. Okay. So just quickly discussing a little bit about the design diagnostics. Here's a similar correlation plot, except the factors have actually been given their particular names after running this DSD design. And this is just a gray-and-white version of a correlation matrix to help you see the extent of orthogonality, or the lack of it, among the factors. And what you can see in our experiment is that we actually did observe a little bit of confounding between batch size and meat, unsurprisingly, and then, of course, between meat and the interaction between meat and the vegetables that are in the dumpling. And note that we imposed one design constraint here, which we did observe some confounding with, which is the very intuitive constraint that the total mass of the dumpling is the sum of each of the components of the dumpling itself. So why are we doing this? Why are we assessing this quote-unquote uniformity with the scatterplot matrix here, and what is it telling us? Well, in order to maximize prediction capability throughout the space of the predictors of rising time, in this case, we want to find the combinations of the factors that minimize the white areas, because the white areas are where the prediction accuracy is thought to be weaker. And this is why we take the design and put it into a scatterplot matrix. This is analogous to the homogeneity-of-error assumption in ANOVA, where we look for the space of prediction to be equally probable, or the equal variance assumption in linear regression. We want this space to be equally probable across the range of the predictors. So in this experiment, in order to reduce the number of factors that we're looking at, first we used our understanding of the engineering and the physics of the problem. For identification, we identified six independent variables, or variables that were least confounded with each other, and we proceeded with the analysis on the basis of these primary variables. Okay. So the first thing we did is we took our generated design and used stepwise regression to try to simplify the model and identify only the active factors in the model. Here you can use forward selection, backward selection, or mixed, together with a stopping criterion, in order to determine the model that explains the most variation in your response. And I can also model meat type as discrete numeric, and in this way I can use this labeling to make the factor coding correspond to the meat type being the shrimp or the pork, which we used.
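As a rough stand-in for the stepwise step described above, here is a minimal forward-selection sketch that greedily adds the term that most improves BIC and stops when nothing helps. The data are synthetic and the factor names are placeholders; JMP's stepwise platform offers more stopping rules than this.

```python
# Minimal forward-selection sketch in the spirit of the stepwise step above:
# add the candidate term that most improves BIC, stop when no term improves it.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.uniform(-1, 1, size=(18, 5)),
                 columns=["water_temp", "meat", "veggie", "batch", "weight"])
y = 3 * X["water_temp"] + 1.5 * X["meat"] + rng.normal(0, 0.3, 18)

def bic(cols):
    """BIC of an ordinary least squares fit with intercept plus the given columns."""
    A = np.column_stack([np.ones(len(X))] + [X[c] for c in cols])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    rss = float(resid @ resid)
    n, k = len(y), A.shape[1]
    return n * np.log(rss / n) + k * np.log(n)

selected, remaining = [], list(X.columns)
while remaining:
    scores = {c: bic(selected + [c]) for c in remaining}
    best = min(scores, key=scores.get)
    if selected and scores[best] >= bic(selected):
        break                                  # no term improves BIC: stop
    selected.append(best)
    remaining.remove(best)
print("selected terms:", selected)
```

In the talk, the adjusted R² check flagged the overfit risk of stepwise fits on only 18 runs; the same caution applies to this sketch.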
So what kind of a stopping rule can you use in the framework of this type of a regression model? Well,   unfortunately, when I ran this model again and again, I wasn't really able to reproduce it exactly. And model reproduction can be somewhat inconsistent, since the fit...this type of a fitting schema involves a computational procedure to iterate to a solution.   And so therefore, in this stepwise regression, the overfit risk is is typically higher.   And oftentimes if there's any curvature in the model   or there are two factor interactions, for example, the, the explanatory variants across...is shared across both of those factors where you can't tease apart that variability associated with one or the other.   And so what we can see here clearly, based on the adjusted R squared, is that we're getting a very good fit, and probably a fit that's too good.   Meaning that we can't predict in the future based on them on the fit to this particular model.   Okay.   So here's where it gets pretty interesting. So   one of the things that we did first off, after running the stepwise is that we assigned independent uniform inputs to each of the factors in the model.   And this is a sort of Monte Carlo implementation in JMP.   A different kind of Monte Carlo implementation and and and   it's a what's what's what's important to understand in this particular framework is that the difference between the main effect and and the total effect can indicate the extent of interaction.   hat that this the extent of interaction associated with a particular factor in the model. And so this is showing that in particular, water temperature and meat,   in addition to being most explanatory in terms of total effect, may likely interact with other factors in this particular model.   And what what you see, of course, is that we identified water temperature, meat, and and and the meat type as our top predictors, using the Paredo plot for transformed estimates.   The other thing I'd like to highlight here before I launch into some of the other slides is the sensitivity indicate indicator that we can invoke here and   under the profiler after we assign independent uniform inputs,   we can colorize the profile profiler to indicate the strength of the relationship between each of the input input factors and the and the response. And we can also use   the sensitivity indicator, which is represented by these purple triangles, to show us the sensitivity or the you can say the strength of the relationship   similar to the linear regression coefficient would indicate the strength, where the taller the triangle and the steeper the relationship,   the stronger either in the positive or the negative direction and the wider and flatter the triangle, the less strong that relationship that's that factor plays.   Okay.   So we went about reducing our model and using some engineering sense and using the stepwise platform.   And what we get   is a this is a just a snapshot of our model fit from the originating from the DSD design and it has RSM structured as curvature. And you can see this is an interaction plot which shows the extent of interaction among all the factors in a in a pairwise way.   And we've indicated where some of the interactions are present and what those interactions look like.   
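The Monte Carlo step described above, assigning independent uniform inputs and pushing them through the fitted model, can be sketched as follows. The prediction formula, factor ranges, and threshold are hypothetical stand-ins for the JMP profiler model.

```python
# Sketch of the independent-uniform-inputs Monte Carlo step: draw each input
# uniformly over its experimental range, push the draws through a fitted
# prediction formula, and look at the distribution of predicted rise time.
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
water_temp = rng.uniform(80, 95, n)      # deg C, assumed experimental range
meat_mass  = rng.uniform(4, 8, n)        # g per dumpling (assumed)
batch_size = rng.uniform(6, 12, n)       # dumplings per batch (assumed)

def predicted_rise_time(t, m, b):
    """Hypothetical fitted surface: strong temperature effect plus an interaction."""
    return 260 - 2.1 * (t - 80) + 6.0 * m + 3.5 * b - 0.15 * (t - 80) * m

y = predicted_rise_time(water_temp, meat_mass, batch_size)
print(f"mean={y.mean():.0f}s, sd={y.std():.0f}s, "
      f"P(rise time > 300 s)={np.mean(y > 300):.1%}")
```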
So this is a model that really, we can get more of a handle on   Okay, so   I think one other thing to mention is that the design constraint that we imposed in is is similar to what you might consider a mixture design, where all the components add together and the constraint has to sum to 100%.   Okay, so here's just a high level picture of the profiler and we we can adjust or modulate each of the input factors and then   observe the, the impact on the response and and we did this in a very manual way   just to get gain some intuition into how the model is performing   And of course to optimize our cooking time what we confirmed was that the time has to be faster, of course, the variants associated with the cooking time should be lower.   And the throughput the throughput and the power savings should be optimized, maximized. And those are two additional   responses that we derived based on cooking time.   Okay, so here's where we get into the optimization more fully into that optimization   of the of the cooking process. And so as I mentioned before, we designed or we created two additional response variables that are connected to the physics, where we have maximum throughput and that depends on in how many dumpling...   I'm sorry, depends on how many dumplings, but also weight and time.   And power savings, which is the which is the product of the power consumed and the time for cooking, which is an energy component.   And so in order to engage in this optimization process,   we need to introduce error associated with each of the input factors and that's represented by these distributions at the bottom here.   And and we also need to consider that the practical HACCP control window and of course measurement device capability, which is something that we would like to look at in future studies.   And so here's just a, a nice picture of the HACCP control plan that we use and this is follows very similar to something like a failure modes and effect analysis in the quality profession. And it's just a structured approach to   experimentation or manufacturing process development and where key variables are identified,   and key owners and then what criteria are being measured against and how that criteria being validated. And so HACCP is actually common in the food science industry and it stands for Hazard Analysis Critical Control Point monitoring.   And I think   in addition to all of these preparation activities, mainly I was involved in the context of this experiment as a data collector and and data integrity is a very important thing. And so   transcribing data appropriately is is definitely is definitely important.   So all the HACCP control points need to be incorporated into the Monte Carlo simulation range and ultimately the HACCP tolerance range can be used to derive the process performance requirement.   Okay, so   we consider a range of inputs where Monte Carlo can confirm that this expected range is practical for the cooking time.   We want to consider a small change in each of the input factors at each at each level each at each HACCP component level and   and this is determined by the control point range. Based on the control point range, we can determine the delta x and the delta in each of the inputs from the delta y response time.   And we can continue to increase the delta x incrementally iteratively,   while hoping that the the increase is small enough so that the change in y is small enough to meet the specification. 
And usually in industry that's a design tolerance and in this case, it's our HACCP control point control parameter range or control parameter limit.   And if if that iterative procedure fails, then we have to make the increment and X smaller and we call this procedure tolerance allocation. Okay.   We did this sort of manually and using sort of as our own special recipe. Although this can be done in a more automated way in JMP and   and in this case, you can see we have more all of our responses. So using multiple response optimization, as well as which would involve using invoking the desirability functions and maximizing the desirability under the prediction profiler   as well as a Gaussian process modeling,   also available under the prediction profiler.   Okay. So next in the vein of, you know, using tools to complement each other and try to understand further understand our product...our process and our experiment, we use the, the neural modeling capability under the analyze menu, under the prediction modeling tools.   And   We, we tried to utilize it to facilitate our prediction.   This model uses a TanH function, which can be more powerful to detect curvature in non linear effects,   but it's sort of a black box and it doesn't really tie back to the physics. So   while it's also robust to non normal responses and somewhat robust to aliasing confounding.   it, it has its limitations, particularly with a small sample size, such as that we have here, and you can actually see that the r squared between the training and validation sets are not particularly   the same or they vary so this model isn't particularly consistent for the purposes of prediction.   Finally, we used the the partition platform in JMP to run recursive partitioning on our   response time response.   And and this model is is definitely relatively easy to interpret in general, but I think particularly for our experiment because we can see that for the rising time we have this temperature cut off at about 85 degrees C,   and that and as well as some temperature differentiation with respect to maximum throughput, but in particular is 85 degrees cut off is most...is very interesting.   The R squared note in this model is about .7, at least with respect to the rising temp response, which is pretty good for this type of a model, considering the small sample size.   And   what's most interesting with respect the to this cut off is that below 85 C, the water really wasn't boiling. There wasn't   much bubbling, no turbulence. The reading was very stable. However, as we increased the temperature, the water started to circulate, turbulence in the water caused non uniform temperature, cavitation bubble collapse, steam rising, and it's basically an unstable temperature environment.   In this type of environment convection dominates rather than conduction.   And steam also blocks light of the infrared thermometer, which also increases increases the uncertainty associated with the temperature measurement.   And and the steam itself presents a burn risk which, in addition to safety, it may impact how the operator adjusts the thermometer and puts the the adjust the distance in which the operator places the thermometer, which is very important for accuracy of measurement.   So, and this, in fact, was why we capped our design at 95 C because it was really impossible to measure water temperature accurately any more above that.   Okay.   So what are ...where have we arrived here? 
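Before the summary, here is a small sketch of the partition idea just described: a shallow regression tree on rise time whose first split lands near the temperature where the boiling behavior changes. The data are synthetic and deliberately built so the cutoff appears near 85 °C; they are not the experimental table.

```python
# Sketch of the recursive-partition step: a shallow regression tree on rise
# time whose first split falls near the ~85 C behavior change. Synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(4)
temp = rng.uniform(75, 95, 200)
batch = rng.integers(6, 13, 200)
# Hypothetical behavior: much longer rise times below ~85 C, plus noise
rise_time = np.where(temp < 85, 420, 300) + 4 * batch + rng.normal(0, 15, 200)

X = np.column_stack([temp, batch])
tree = DecisionTreeRegressor(max_depth=2).fit(X, rise_time)
print(export_text(tree, feature_names=["water_temp", "batch_size"]))
```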
Well,   in summary, we...in this experiment we use DSD (DOE) to collect the data only.   Then we use stepwise regression to narrow down the important effects, but we didn't go too deep into the stepwise regression. And we use common sense to minimize the number of factors in our experiment as well as engineering expertise.   We also use independent uniform inputs,   which is very powerful   for giving us an idea of the magnitude of effects using, for example by colorizing the profile or by looking at the rank of the effects and also by looking at the difference between the main effect and the total effect to give us an indication of interaction present in the model.   We also added sensitivity indicators under the profiler to help us quantify our global sensitivity for the purposes of the Monte Carlo optimization   schema that we employed.   The main effects in the model really, temperature, of course, and physics really explained explains why temperature's the number one factor as, as I've shared in our findings.   And in addition, from between 80-90 degrees C, what we see from the profilers that we observed sort of a rapid transition and an increase in the sensitivity of the relationship between rising time and temperature which is, of course, consistent with our experimental observations.   Secondly, with respect to the the effects of factors interacting with each other and because there are two different types of physics really interacting, basic physics...physics modes interacting are convection and conduction,   the stepwise on the DSD is a good starting point, because it gives us a continuous model with no transformation   With no   Advanced neural or black box type transformation, wo we can at least get a good handle on on global sensitivity to begin with.   And our neural models in our partition models couldn't show us this, particularly given the small sample size in our experiment.   And finally, we use Monte Carlo simulate, robust Monte Carlo simulation   in our own framework. And we also did a little bit of multiple response optimization on rising time and throughput in power consumption versus our important factors. And through this experiment, we began to really qualify and further our understanding of the importance of   the most important factors in this experiment using a multi disciplinary modeling approach.   Finally, I will share some references here for you for of interest. Thank you very much for your time and
Shamgar McDowell, Senior Analytics and Reliability Engineer, GE Gas Power Engineering   Faced with the business need to reduce project cycle time and to standardize the process and outputs, the GE Gas Turbine Reliability Team turned to JMP for a solution. Using the JMP Scripting Language and JMP’s built-in Reliability and Survival platform, GE and a trusted third party created a tool to ingest previous model information and new empirical data which allows the user to interactively create updated reliability models and generate reports using standardized formats. The tool takes a task that would have previously taken days or weeks of manual data manipulation (in addition to tedious copying and pasting of images into PowerPoint) and allows a user to perform it in minutes. In addition to the time savings, the tool enables new team members to learn the modeling process faster and to focus less on data manipulation. The GE Gas Turbine Reliability Team continues to update and expand the capabilities of the tool based on business needs.       Auto-generated transcript...   Speaker Transcript Shamgar McDowell Maya Angelou famously said, "Do the best you can, until you know better. Then when you know better, do better." Good morning, good afternoon, good evening. I hope you're enjoying the JMP Discovery Summit, you're learning some better way ways of doing the things you need to do. I'm Shamgar McDowell, senior reliability and analytics engineer at GE Gas Power. I've been at GE for 15 years and have worked in sourcing, quality, manufacturing and engineering. Today I'm going to share a bit about our team's journey to automating reliability modeling using JMP. Perhaps your organization faces a similar challenge to the one I'm about to describe. As I walk you through how we approach this challenge, I hope our time together will provide you with some things to reflect upon as you look to improve the workflows in your own business context. So by way of background, I want to spend the next couple of slides, explain a little bit about GE Gas Power business. First off, our products. We make high tech, very large engines that have a variety of applications, but primarily they're used in the production of electricity. And from a technology standpoint, these machines are actually incredible feats of engineering with firing temperatures well above the melting point of the alloys used in the hot section. A single gas turbine can generate enough electricity to reliably power hundreds of thousands of homes. And just to give an idea of the size of these machines, this picture on the right you can see there's four adult human beings, which just kind of point to how big these machines really are. So I had to throw in a few gratuitous JMP graph building examples here. But the bubble plot and the tree map really underscore the global nature of our customer base. We are providing cleaner, accessible energy that people depend upon the world over, and that includes developing nations that historically might not have had access to power and the many life-changing effects that go with it. So as I've come to appreciate the impact that our work has on everyday lives of so many people worldwide, it's been both humbling and helpful in providing a purpose for what I do and the rest of our team does each day. So I'm part of the reliability analytics and data engineering team. 
Our team is responsible for providing our business with empirical risk and reliability models that are used in a number of different ways by internal teams. So in that context, we count on the analyst in our team to be able to focus on engineering tasks, such as understanding the physics that affect our components' quality and applicability of the data we use, and also trade offs in the modeling approaches and what's the best way to extract value from our data. These are, these are all value added tasks. Our process also entails that we go through a rigorous review with the chief engineers. So having a PowerPoint pitch containing the models is part of that process. And previously creating this presentation entailed significant copying and pasting and a variety of tools, and this was both time consuming and more prone to errors. So that's not value added. So we needed a solution that would provide our engineers greater time to focus on the value added tasks. It would also further standardize the process because those two things greater productivity and ability to focus on what matters, and further standardization. And so to that end, we use the mantra Automate the Boring Stuff. So I wanted to give you a feel for the scale of the data sets we used. Often the volume of the data that you're dealing with can dictate the direction you go in terms of solutions. And in our case, there's some variation but just as a general rule, we're dealing with thousands of gas turbines in the field, hundreds of track components in each unit, and then there's tens of inspections or reconditioning per component. So in in all, there's millions of records that we're dealing with. But typically, our models are targeted at specific configurations and thus, they're built on more limited data sets with 10,000 or fewer records, tens of thousands or fewer records. The other thing I was going to point out here is we often have over 100 columns in our data set. So there are challenges with this data size that made JMP a much better fit than something like an Excel based approach to doing this the same tasks. So, the first version of this tool, GE worked with a third party to develop using JMP scripting language. And the name of the tool is computer aided reliability modeling application or CARMA, with a c. And the amount of effort involved with building this out to what we have today is not trivial. This is a representation of that. You can see the number of scripts and code lines that testified to the scope and size of the tool as it's come to today. But it's also been proven to be a very useful tool for us. So as its time has gone on, we've seen the need to continue to develop and improve CARMA over time. And so in order to do this, we've had to grow and foster some in-house expertise in JSL coding and I oversee the work of developers that focus on this and some related tools. Message on this to you is that even after you create something like CARMA, there's going to be an ongoing investment required to maintain and keep the app relevant and evolve it as your business needs evolve. But it's both doable and the benefits are very real. A survey of our users this summer actually pointed to a net promoter score of 100% and at least 25% reduction in the cycle time to do a model update. So that's real time that's being saved. And then anecdotally, we also see where CARMA has surfaced issues in our process that we've been able to address that otherwise might have remained hidden and unable to address. 
And I have a quote, it's kind of long, but I wanted to just pass on this caveat on automation from Bill Gates, who knows a thing or two about software development: "The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency." So that's the end of the quote, but this is just a great reminder that automation is not a silver bullet that will fix a broken process; we still need people to do that today. Okay, so before we do a demonstration of the tool, I just wanted to give a high-level overview of the tool and the inputs and outputs in CARMA. The user has to point the tool to the input files. So over here on the left, you see we have an active models file, which is essentially the already approved models, and then we have empirical data. Then in the user interface, the user does some modeling activities, and the outputs are running models, that is, updates to the active models, and a PowerPoint presentation. And we'll also look at that. As background for the data I'll be using in the demo, I just wanted to pass on that I started with the locomotive data set, which, as we'll see, JMP provides as sample data. That gives one population, and then I also added in two additional populations of models. And the big message here I wanted to pass on is that what we're going to see is all made-up data. It's not real; it doesn't represent the functionality or the behavior of any of our parts in the field; it's just all contrived. So keep that in mind as we go through the results, but it should give us a way to look at the tool, nonetheless. So I'm going to switch over to JMP for a second, and I'm using JMP 15.2 for the demo. This data set is simplified compared to what we normally see, but like I said, it should exercise the core functionality in CARMA. So first, I'm just going to go to the Help menu, Sample Data, and you'll see the Reliability and Survival menu here. So that's where we're going. One of the nice things about JMP is that it has a lot of different disciplines and functionality and specialized tools for them, and so for my case with reliability, there's a lot here, which also lends to the value of using JMP as a home for CARMA. But I wanted to point you to the locomotive data set and just show you... this originally came out of a textbook, referenced here, Applied Life Data Analysis. In that, there's a problem that asks you what the risk is at 80,000 exposures, and we're going to model that today in our data set, in what we've called an oxidation model; essentially CARMA will give us an answer. Again, a really simple answer, but I was just going to show you that you can get it the same way by clicking in the analysis menu. So we go down to Analyze, Reliability and Survival, Life Distribution. Put the time and censor columns where they need to go. We're going to use Weibull, and just the two-parameter fit, so it creates a fit for that data. Two parameters I was going to point out are the beta, 2.3, and then what's called a Weibull alpha here; in our tool, it'll be called eta, and it's 183. Okay, so we see how to do that here. Now I just want to jump over and look at a couple of the other files, the input files, so I will pull those up. Okay, this is the model file. I mentioned I made three models, and these are the active models that we're going to be comparing the data against.
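For readers who want to see the life-distribution math outside JMP, here is a sketch: a maximum-likelihood fit of a two-parameter Weibull to right-censored data, followed by the fraction expected to fail by 80 (in the same thousands-of-exposures units). The failure and suspension times below are made up, so the estimates will not reproduce the beta of about 2.3 and eta of 183 quoted above; with those reported values, the same CDF works out to roughly 13-14%, consistent with the risk figure given a little later in the demo.

```python
# Sketch of a censored Weibull fit: maximize the likelihood over (beta, eta)
# using exact failure times and right-censored suspensions, then evaluate the
# CDF at 80. The data here are a made-up stand-in for the locomotive table.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

fail_times = np.array([23., 45., 60., 72., 88., 95., 110., 130., 150., 170.])
cens_times = np.full(20, 135.0)           # units still running at 135 (censored)

def neg_loglik(params):
    beta, eta = params
    if beta <= 0 or eta <= 0:
        return np.inf
    ll = weibull_min.logpdf(fail_times, beta, scale=eta).sum()   # failures
    ll += weibull_min.logsf(cens_times, beta, scale=eta).sum()   # suspensions
    return -ll

res = minimize(neg_loglik, x0=[1.5, 100.0], method="Nelder-Mead")
beta_hat, eta_hat = res.x
risk_at_80 = weibull_min.cdf(80.0, beta_hat, scale=eta_hat)
print(f"beta={beta_hat:.2f}, eta={eta_hat:.0f}, risk at 80 = {risk_at_80:.1%}")
```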
You'll see that oxidation is the first one, I mentioned that, and then you want...one also, in addition to having model parameters, it has some configuration information. This is just two simple things here (combustion system, fuel capability) I use for examples, but there's many, many more columns, like it. But essentially what CARMA does, one of the things I like about it is when you have a large data set with a lot of different varied configurations, it can go through and find which of those rows of records applies to your model and do the sorting real time, and you know, do that for all the models that you need to do in the data set. And so that's what we're going to use that to demonstrate. Excuse me. Also, just look, jump over to the empirical data for a minute. And just a highlight, we have a sensor, we have exposures, we have the interval that we're going to evaluate those exposures at, modes, and then these are the last two columns I just talked about, combustion system and fuel capability. Okay, so let's start up CARMA. As an add in, so I'll just get it going. And you'll see I already have it pointing to the location I want to use. And today's presentation, I'm not gonna have time to talk through all the variety of features that are in here. But these are all things that can help you take and look at your data and decide the best way to model it, and to do some checks on it before you finalize your models. For the purposes of time, I'm not going to explain all that and demonstrate it, but I just wanted to take a minute to build the three models we talked about create a presentation so you can see that that portion of the functionality. Excuse me, my throat is getting dry all the sudden so I have to keep drinking; I apologize for that. So we've got oxidation. We see the number of failures and suspensions. That's the same as what you'll see in the text. Add that. And let's just scroll down for a second. That's first model added Oxidation. We see the old model had 30 failures, 50 suspensions. This one has 37 and 59. The beta is 2.33, like we saw externally and the ADA is 183. And the answer to the textbook question, the risk of 80,000 exposures is about 13.5% using a Weibull model. So that's just kind of a high level of a way to do that here. Let's look at also just adding the other two models. Okay, we've got cracking, I'm adding in creep. And you'll see in here there's different boxes presented that represent like the combustion system or the fuel capability, where for this given model, this is what the LDM file calls for. But if I wanted to change that, I could select other configurations here and that would result in changing my rows for FNS as far as what gets included or doesn't. And then I can create new populations and segment it accordingly. Okay, so we've gotten all three models added and I think, you know, we're not going to spend more time on that, just playing with the models as far as options, but I'm gonna generate a report. And I have some options on what I want to include into the report. And I have a presentation and this LDM input is going to be the active models, sorry, the running models that come out as a table. All right, so I just need to select the appropriate folder where I want my presentation to go And now it's going to take a minute here to go through and and generate this report. This does take a minute. 
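The configuration matching CARMA performs, finding which empirical rows belong to each model before counting failures and suspensions, can be sketched with ordinary data-frame filtering. The column names, censor coding, and values below are all hypothetical; the real tool handles many more configuration columns.

```python
# Hypothetical sketch of the record-matching step: keep the empirical rows
# whose configuration columns match a model's configuration, then count
# failures and suspensions. Censor coding (0 = failure) is an assumption.
import pandas as pd

models = pd.DataFrame([
    {"model": "Oxidation", "combustion_system": "A", "fuel_capability": "Gas"},
    {"model": "Cracking",  "combustion_system": "B", "fuel_capability": "Dual"},
])
data = pd.DataFrame({
    "exposures": [34, 72, 110, 45, 90],
    "censor":    [0, 1, 1, 0, 1],
    "combustion_system": ["A", "A", "B", "B", "A"],
    "fuel_capability":   ["Gas", "Gas", "Dual", "Dual", "Gas"],
})

for _, m in models.iterrows():
    subset = data[(data["combustion_system"] == m["combustion_system"]) &
                  (data["fuel_capability"] == m["fuel_capability"])]
    n_fail = int((subset["censor"] == 0).sum())
    n_susp = int((subset["censor"] == 1).sum())
    print(m["model"], n_fail, "failures,", n_susp, "suspensions")
```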
But I think what I would just contrast it to is the hours that it would take normally to do this same task, potentially, if you were working outside of the tool. And so now we're ready to finalize the report. Save it. And save the folder and now it's done. It's, it's in there and we can review it. The other thing I'll point out, as I pull up, I'd already generated this previously, so I'll just pull up the file that I already generated and we can look through it. But there's, it's this is a template. It's meant for speed, but this can be further customized after you make it, or you can leave placeholders, you can modify the slides after you've generated them. It's doing more than just the life distribution modeling that I kind of highlighted initially. It's doing a lot of summary work, summarizing the data included in each model, which, of course, JMP is very good for. It, it does some work comparing the models, so you can do a variety of statistical tests. Use JMP. And again, JMP is great at that. So that, that adds that functionality. Some of the things our reviewers like to see and how the models have changed year over year, you have more data, include less. How does it affect the parameters? How does it change your risk numbers? Plots of course you get a lot of data out of scatter plots and things of that nature. There's a summary that includes some of the configuration information we talked about, as well as the final parameters. And it does this for each of the three models, as well as just a risk roll up at the end for for all these combined. So that was a quick walkthrough. The demo. I think we we've covered everything I wanted to do. Hopefully we'll get to talk a little more in Q&A if you have more questions. It's hard to anticipate everything. But I just wanted to talk to some of the benefits again. I've mentioned this previously, but we've seen productivity increases as a result of CARMA, so that's a benefit. Of course standardization our modeling process is increased and that also allows team members who are newer to focus more on the process and learning it versus working with tools, which, in the end, helps them come up to speed faster. And then there's also increased employee engagement by allowing engineers to use their minds where they can make the biggest impact. So I also wanted to be sure to thank Melissa Seely, Brad Foulkes, Preston Kemp and Waldemar Zero for their contributions to this presentation. I owe them a debt of gratitude for all they've done in supporting it. And I want to thank you for your time. I've enjoyed sharing our journey towards improvement with you all today. I hope we have a chance to connect in the Q&A time, but either way, enjoy the rest of the summit.