Brian Corcoran, JMP Director of Research and Development, SAS
Dieter Pisot, JMP Principal Application Developer, SAS
Eric Hill, JMP Distinguished Software Developer, SAS

You know the value of sharing insights as they emerge. JMP Live — the newest member of the JMP product family — reconceptualizes sharing by taking the robust statistics and visualizations in JMP and extending them to the web, privately and securely. If you'd like a more iterative, dynamic and inclusive path to showing your data and making discoveries, join us. We'll answer the following questions: What is JMP Live? How do I use it? How do I manage it? For background information on the product, see this video from Discovery Summit Tucson 2019 and the JMP Live product page.

JMP Live Overview (for Users and Managers) – Eric Hill
- What is JMP Live?
- Why use JMP Live?
- Interactive publish and replace
- What happens behind the scenes when you publish
- Groups, from a user perspective
- Scripted publishing: stored credentials, API key, replacing reports

Setup and Maintenance (for JMP Live Administrators) – Dieter Pisot
- Administering users and groups
- Limiting publishing
- Setting up JMP Live: Windows services and .env files
- Upgrading and applying a new license
- Using Keycloak single sign-on

Installing and Setting Up the Server (for IT Administrators) – Brian Corcoran
- Choosing architectural configurations based on expected usage
- Understanding SSL certificates and their importance
- Installing the JMP Live database component
- Installing the JMP Pro and JMP Live components on a separate server
- Connecting JMP Live to the database
- Testing the installed configuration to make sure it is working properly
Simon Stelzig, Head of Product Intelligence, Lohmann   JMP, and later JMP Pro, was used to guide the development of a novel structural adhesive tape from initial experiments to an optimized product ready for sale. The basis was a seven-component mixture design created with JMP's Custom Design function. Unfortunately, almost 40% of the runs could be formulated but not processed. Even with this crippled design, predictions of processable optima for changing customer requests were possible using a new response and JMP's model platform. Augmenting the DoE with the Augment Design function steadily increased the number of experiments, enabling fine-tuning of the model and, finally, the prediction of a functioning prototype tape and product. After switching from JMP to JMP Pro within a follow-up project based on the original experiments, modelling became drastically more efficient and reliable thanks to JMP Pro's better protection against poor models, which we had encountered when not using the Pro version. The growing number of runs and the capabilities of JMP Pro opened the way from classical DoE analysis toward machine learning methods. In this way, development speed was increased even further, almost down to prediction and verification alone, for customer requests falling in the vicinity of our formulation. Editor's note: The presentation that @shs references at the beginning of his presentation is Using REST API Through HTTP Request and JMP Maps to Understand German Brewery Density (2020-EU-EPO-388)
Carlos Ortega, Project Leader, Avantium Daria Otyuskaya, Project Leader, Avantium Hendrik Dathe, Services Director, Avantium   Creativity is at the center of any research and development program. Whether it is a fundamental research topic or the development of new applications, the basis of solid research rests on robust data that you can trust. Within Avantium, we focus on executing tailored catalysis R&D projects, which vary from customer to customer. This requires a flexible solution to judge the large amount of data obtained in our high-throughput catalyst testing equipment with up to 64 reactors. We use JMP and JSL scripts to improve the data workflow and its integration. In any given project, the data is generated by different sources, including our proprietary catalyst testing equipment (Flowrence®), on-line and off-line analytical equipment (e.g., GC, S&N analyzers and SimDis) or manual data records (e.g., MS Excel files). The data from these sources are automatically checked by our JSL scripts, and with the statistical methods available in JMP we are able to calculate key performance parameters, create key performance plots and generate automatic reports that can be shared directly with the clients. The use of scripts guarantees that the data handling process is consistent, as every data set in a given project is treated the same way. This provides seamless integration of results and reports, which are ready to share on a software platform known to our customers.     Auto-generated transcript...   Speaker Transcript Hendrik Dathe Yeah. Hi, and welcome to our presentation at the JMP Discovery Summit. Of course, we would have liked to give this presentation in person, but under the current circumstances, this is the best way we can still share how we are using JMP in our day-to-day work and how it helps us rework our data. However, a presentation in this form, as a video, also has an advantage for you as a viewer, because if you want to grab a coffee right now, you can just hit pause and continue when the coffee is ready. But looking at the time, I guess the summit is right now well under way, and most likely you have already heard quite some exciting presentations on how JMP can help you make more sense out of your data with statistical tools, gain deeper insight and dive into more parts of your data. However, what we want to talk about today (and this is also hidden under the title, about data quality assurance) is the scripting engine, everything which has to do with JSL scripting, because this helps us a lot in our day-to-day work to prepare the data so that they are ready to be used for data analysis. And by "we" I mean Carlos Ortega, Daria Otyuskaya, and myself, whom I now want to introduce a bit, to give you a better feeling for who is doing this. But of course, as usual, there are some rules to this, which are the disclaimer about the data we are using. And if you're a lawyer, for sure you're going to press pause to study this in detail. For everyone else, let's dive into the presentation. And of course, nothing better than to start with a short introduction of the people. You already see the location we all have in common, which is Amsterdam in the Netherlands, and we also have in common that we work at Avantium, a provider of sustainable technologies. However, the locations we originally come from are all over the world.
We have, on the one hand side on the left side, Carlos Ortega, a chemical engineer from Venezuela, which lives in Holland, about six years and works at Avantium about two years as a project leader and services. Then we have on the right side Daria Otyuskaya from Russia also working here for about two years and spending the last five years in the Benelux area where she made her PhD in chemical engineering. And myself. I have the only advantage, can that I can travel home by car as I origin from Germany. I live in Holland since about 10 years and join Avantium about three years ago. But now, let's talk a bit more about Avantium. I just want to briefly lay out a bit of the things we are doing. Avantium, as I mentioned before, provider for sustainable technologies and has three business units. One is Avantium Renewable Polymers, where we actually develop biodegradable polymer called a PEF, which is hundred percent plant based and recyclable. Second, we have a business unit called Avantium Renewable Chemistries, which offers renewable technologies to produce chemicals like MEG or industrial sugars from non food biomass. And last but not least, a very exciting technologies, where we turn CO2 from the air into chemicals via electro chemistry. But not too much to talk about these two business units because Carlos, myself and Daria are all working in the Avantium Catalysis, which was founded in 20 years ago and it's still the founding...the fundamental of Avantiums technology innovations. We are actually providing their We are a service provider in accelerating the research in your company in the catalysts research, to be more more specific. And we offer there, as you can see on the right hand side, systems services and a service called refinery catalyst testing. And what we help companies really to develop the R&D, as you see at the bottom for this. But this is enough about Avantium. Let's talk a bit how we are developing how we are working in projects and how JMP actually can help us there to accelerate the stuff and get better data out of it, which Carlos then later on the show in a demo for us. As mentioned before, we are a service provider and as a service provider, we get a lot of requests from customers to actually develop better catalysts, or better process. And now you might ask yourself, what's the catalyst. A catalyst is actually a material which participates in a reaction when you transform A to A, but doesn't get consumed in a reaction. The most common example of people, which you can see in your day-to-day life is, for example, the exhaust gas catalyst which is installed in your car, which turns off gases from your ...from your car into CO2 and water as an exhaust. And this is things which we get as requests. People come to us and say, "Oh, I would like to develop a new material," or things like, "I have this process, and I want to come with...accelerate my research and Develop a new process for this." And what they use there is when we have an experiment in our team, we are designing experiments of... designing experiments. We are trying to optimize the testing for this and is all we use JMP, but this is not what we want to talk today about. Because as I said before, we are using JMP also to actually merge our data, process them and make them ready for things, which is the two parts, which you see at the bottom of the presentation. 
We are executing research projects for customer in our proprietary tool called Flowrence, where the trick is that we don't experiment...don't execute tests, one after another, but we execute in parallel. Traditionally, I mean, I remember myself in my PhD, you execute a test one reactor after another, after another, after another. But we are applying up to 64 reactors in parallel, which makes the execute more challenging but allows a data-driven decision. It allows actually to make more reliable data and make them statistically significant. And then we are reporting this data to our customers, which then can either to continue in their tools with their further insights or completely actually rely on us for executing this data and extracting the knowledge. But yeah, enough about the company. And now let me hand over to Carlos, which will explain how JMP and JMP script actually helps us to make us our life significantly easier. Thank you, Hendrik,for the nice introduction. And thank you also for the organizers for this nice opportunity to participate in the JMP discovery summit. So as Hendrik was mentioning, we develop and execute research projects for third parties. And if we think about it, we need to go from design of experiments (and that's of course one very powerful feature from JMP), but also we need to manage information and in this case, as Hendrik was was mentioning, we want to focus on JSL script that allows us to easily handle information and create seamless integration of a process workflows. I'm a project leader in the R&D department and so a day...a regular day in my life here would look something like this. And so very simplistic view. You would have clients who are interested and have a research question and I design experiments and we execute these in our own proprietary technology called Flowrence. So in a simple view the data generated in the Flowrence unit will go through me after some checks and interpretation will goes back to the client. But the reality is somewhat more complex and on one hand, we also have internal customers. That is part of...for example our development team...business development team. And on the other side, we also have our own staff that actually interacts directly with the unit. So they control how the unit operates and monitor everything goes according to the plan. And the data, as you see here with broken lines, the data cannot be struck directly from the unit. The data is actually sent to a data warehouse and then we need a set of tools that allows us to first retrieve information, merge information that comes from different sources, execute a set of tasks that go from cleaning, processing, visualizing information, and eventually we export that data to the client so that the client can get the information that they actually need and that is most relevant for them. If you'll allow me to focus for one second on these different tasks, what we observed initially in the retrieve a merge is that data can actually come from various sources. So in the data warehouse, we actually collect data from the Florence unit, but we also collect data from the analyzer. So for those that they're performing tests in a laboratory, you might be familiar with the mass spectrometry or gas chromatography, for example, and we also collect data on the unit performance. So we also verify that the unit is is behaving as expected. In...as in any laboratory, we would also have manual inputs. 
And these could be, for example, information on the catalysts that we are testing or calibration of the analytical equipment. Those manual inputs are always of course stored in a laboratory notebook, but also we include that information into an Excel file. And this is where JMP is actually helping us drive the work flow of information to the next level. So what we have developed is a combination of an easy to use vastly known Excel file with powerful features from a JSL script. And not only we include manual data that is available in laboratory notebooks, but we also include in this Excel file formulas that are then interpreted by the JSL script and executed. That allows us to calculate key performance parameters that are tailored or specifically suited for different clients. If we look in more detail into the JSL script, and in a moment I will go into a demo, you will observe that the JSL script has three main sections. One section will prepare the local environment. So on one side we would say we want to clear all the symbols and close tables, but probably the most important feature is when we define "names default to here." So that would allow us actually to run parallel scrapes without having any interference between variables that are named the same in different scripts. Then we have section that is collapsed in this case so that we can show it actually that creates a graphical user interface. And then the user does not interact with the script itself, but actually works through a simple graphical user interface with the buttons that have descriptive button names. And then we have a set of tasks that are already coded in the script. In this case, they are in the form of expressions. Because well, it has two main advantages. One would be a it's easy to later on implement on the graphical user interface. And second, when you have an expression, you can use this expression several times in your code. OK, so moving on into the demo simulation. So I mentioned earlier that we have different sources of data. And on one side we have data that is in fact... that is in fact stored in our database. And this database will contain probably different sources of information, like the unit or different analyzers. In this case, you will see or you see an example Excel table. This only for illustration. So this data is actually taken from the data warehouse directly with our JSL script. So we don't look at this Excel table as a search. We let the software collect the information from the data warehouse. And probably what is most important is that this data, as you see here, can come again from different analyzers, and we're structuring somehow that the first column contains divided names. In this case, we have made some domain names. So, for reasons of confidentiality, but also you will see that all the observations are arranged in rows. So every single row is an observation. And depending on the type of test and the unit we are using, we could think that overall in one day we can collect up to half a million data points in one single day. That depends of course on the analyzer, but you immediately are faced with the amount of data that you have to handle and how JSL script that helps you process information can help you with this activity. Then we also use another Excel file. And this one is also very important, which is an input table file. And this files, specifically with the JSL script, are the ones creating the synergy to allows us to process data easy. 
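Editor's note: The three-section script structure Carlos describes (prepare the local environment, code tasks as expressions, wrap them in a graphical user interface) can be sketched in JSL roughly as below. The file paths and button actions are hypothetical placeholders, not Avantium's actual script.

```jsl
// Minimal sketch of the three-section structure described above.
// Paths and button actions are placeholders, not Avantium's code.

// 1) Prepare the local environment
Names Default To Here( 1 );
Clear Symbols();
Close All( Data Tables, No Save );

// 2) Tasks coded as expressions, so they can be reused and wired to buttons
retrieveData = Expr(
	dt = Open( "$DESKTOP/raw_database_extract.csv" )   // placeholder path
);
exportReport = Expr(
	If( !Is Empty( dt ),
		dt << Save( "$DESKTOP/exported_files/report.jmp" )
	)
);

// 3) A simple graphical user interface with descriptive button names
New Window( "Project Data Workflow",
	V List Box(
		Button Box( "Retrieve and merge data", retrieveData ),
		Button Box( "Export report", exportReport )
	)
);
```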
What you see in this case, for example, is a reactor loading table and we see different reactors with different catalysts. And this information that seems... is not quantitative, but the qualitive the value is important. And then if we move to a second tab, and these steps are all predefined across our projects, we see the response factors for the analyzers. Different analyzers will have different response factors and it's important to log this information into use through the calculations to be able to get quantitative results. In this case, we observed that the condition that the response factors are targeted by condition instead. Then we have a formula tab. And this is probably a key tab for our script. You can input formulas in this Excel file. You make sure that the variable names are enclosed into square brackets. And the formula, you can use any formula in Excel. Anyone can use Excel; we're very much used to it. So if you type a formula here, that follows ??? syntax in Excel, it will be executed by our JSL script. Then we also included an additional feature we thought it was interesting to have conditionals. And for the JSL script to read this conditional, the only requirement is that the conditionals are enclosed in braces. There are two other tabs I would like to show you, which are highly relevant. One is a export tables tab and the reason that we have this table is because we generate many columns or many variables from my unit, probably 500 variables. But actually the client is only interested in 10, 20 or 50 of them. Those are the ones that really add value to their research. So we can input those variables here and send it to the client. And last but not least, I think many of us have been in that situation where we send an email to a wrong address and that can be actually something frightening when you're talking about confidential information. So we always double, triple check the email addresses and but does it...is it really necessary? So what we are doing here is that we have one Excel file that contains all manual inputs, including the email address of our clients. And these email addresses are fixed so there is no room for error. Whenever you have run the JSL script the right email addresses will be read and the email will be created and these we will see in one minute. So now going into the JSL script, I would like to highlight the following. So the JSL script is initially located in one single file in one single folder and the JSL script only needs one Excel file to write that contains different tabs that we just saw in the previous slide Once you open the JSL script, you can click on the run script button and that will open the graphical user interface that you see on the right. Here we have different options. In this case we want to highlight the option where we retrieve data from a project in that given period. We have selected here only one day this year, in particular, and then we see different buttons that allows us to create updates, for example. Once we have clicked on this button, you will see to the left on the folder that two directories were created. The fact that we create these directories automatically help us to have harmony or to standardize how is a folder structured also across our projects. If you look into the raw database data, you will see the two files were created. One contains the raw data that comes directly from the data warehouse. 
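Editor's note: As a rough illustration of how a bracketed Excel-style formula can be turned into a JMP column formula by a JSL script, here is a hedged sketch. The formula text, column names and new column name are invented for the example; the actual script also handles the brace-enclosed conditionals mentioned above and the necessary error checking.

```jsl
// Hypothetical example: convert "[Flow] / [Mass] * 100" from an Excel formula tab
// into a JMP column formula. Bracketed variable names become column references.
formulaText = "[Flow] / [Mass] * 100";

// Replace every [Name] with :Name("Name") so JSL can resolve the column reference
jslText = Regex( formulaText, "\[([^\]]+)\]", ":Name(\!"\1\!")", GLOBALREPLACE );

// Create the new KPI column with the parsed expression as its formula
dt = Current Data Table();
Eval( Eval Expr(
	dt << New Column( "KPI (example)", Numeric, Continuous,
		Formula( Expr( Parse( jslText ) ) )
	)
) );
```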
And the second, the data table contains all merge information from the Excel file and different tables that are available in the data warehouse. The exported files folder does not contain anything at this moment, because we have not evaluated and assessed the data that we created in our unit is actually relevant and valuable for the client. We do this, we are, we ??? and you see here that we have created a plot of reactor temperature versus the local time. And different reactors would be plotted so we have up to 64 in one of our units. And in this case we color the reactors, depending on the location on the unit. Another tab we have here, as an example, is about the pressure. And you see that you can also script maximum target and minimum values and define, for example, alerts to see if value is drifting away. The last table I want to show is a conversion and we see here different conversions collapsed by catalyst. So once we click the export button, we will see that our file is attached into an email and the email already contains the addresses...the email addresses we want to use. And again, I want to highlight how important it is to send the information to the right person. Now this data set is actually located into the exported files folder, which was not there before. And we always can keep track of what information has been exported and sent to the client. With this email then it's only a matter of filling in the information. So in this case, it's a very simple test. So this is your data set, but of course we would give some interpretation or gave maybe some advice to the client on how to continue the tests. And of course, once you have covered all these steps you will close the graphical user interface and that will also close all open tables and the JSL script. Something that I would like to highlight at this point is that these workflow using a JSL script is is rather fast. So what you saw at this moment, of course, it's a bit accelerated because it's only a demonstration, but you don't spend time looking for data and different sources, trying to merge them with the right columns. All these processes are integrated into a single script and that allows us to report to the client on a daily basis amounts of data that otherwise would be would...would not be possible. And the client can actually take data driven decisions with a very fast pace. That's probably the key message that I want to deliver with with this script that we see at this moment. Now, well, I would like to wrap up the presentation with with some concluding remarks and some closing remarks. And so on one side, we developed a distinctive approach for data handling and processing. And when we say distinctive it's because we have created a synergy between an Excel file that most people can use because you are very familiar with Microsoft Office and a JSL script which doesn't need any effort to run. So you click Run, you will get a graphical user interface and a few buttons to execute tasks. Then we have a standardized workflow. And that's also highly relevant when you work with multiple clients and also also from a practical point of view. For example, if one of my colleagues would go on holiday, it will be easy for another project leader for myself, for example, to take over the project and know that all the folder structures are the same, that all the scripts are the same and the buttons execute the same actions. 
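Editor's note: For readers who want a feel for what the scripted monitoring plots and the email export might look like in JSL, here is a small hedged sketch. The column names, file paths and address are invented, and the Mail() call assumes a configured mail client on Windows.

```jsl
// Sketch only: a reactor-temperature trend like the one shown in the demo,
// plus an automated email export. Names, paths and addresses are placeholders.
dt = Current Data Table();

// Reactor temperature versus local time, one line per reactor
dt << Graph Builder(
	Variables(
		X( :Name( "Local Time" ) ),
		Y( :Name( "Reactor Temperature" ) ),
		Overlay( :Name( "Reactor" ) )
	),
	Elements( Line( X, Y ) )
);

// Attach the exported data set to an email with a fixed, pre-checked client address
exportPath = "$DESKTOP/exported_files/client_report.jmp";
dt << Save( exportPath );
Mail(
	"client@example.com",                      // fixed address read from the input Excel file
	"Daily data update",                       // subject
	"Please find attached today's data set.",  // body
	exportPath                                 // attachment
);
```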
Finally, we can also...we can guarantee seamless integration of data and these fast updates of information with thousands or even half a million data points per day can be quickly sent to clients and then this allows them to take almost online data driven decisions. At the end, our purpose is to maximize the customer satisfaction through a consistent, reliable and robust process. Well, with this, I would like to thank, again, the organizers of these discovery summit. Of course, to all our colleagues at Avantium, who have made this possible, especially to those that have worked intensively on the development of these scripts. And if you are curious about our company or the work we do in Catalysis, please visit one of the links you see here. And with this, I'd like to conclude, thank you very much for for your attention. And yeah, we look forward to your questions.  
Wenjun Bao, Chief Scientist, Sr. Manager, JMP Life Sciences, SAS Institute Inc Fang Hong, Dr., National Center for Toxicological Research, FDA Zhichao Liu, Dr., National Center for Toxicological Research, FDA Weida Tong, Dr., National Center for Toxicological Research, FDA Russ Wolfinger, Director of Scientific Discovery and Genomics, JMP Life Sciences, SAS   Monitoring the post-marketing safety of drug and therapeutic biologic products is very important to the protection of public health. To help facilitate the safety monitoring process, the FDA has established several database systems, including the FDA Online Label Repository (FOLP). FOLP collects the most recent drug listing information companies have submitted to the FDA. However, navigating through hundreds of drug labels and extracting meaningful information is a challenge; an easy-to-use software solution could help.   The most frequent single cause of safety-related drug withdrawals from the market during the past 50 years has been drug-induced liver injury (DILI). In this presentation we analyze 462 drug labels with DILI indicators using JMP Text Explorer. Terms and phrases from the Warnings and Precautions section of the drug labels are matched to DILI keywords and MedDRA terms. The XGBoost add-in for JMP Pro is used to predict DILI indicators through cross validation of XGBoost predictive models built on the term matrix. The results demonstrate that a similar approach can be readily used to analyze other drug safety concerns.        Auto-generated transcript...   Speaker Transcript wenjba It's my pleasure to talk about obtaining high-quality information from the FDA drug labeling system here at JMP Discovery. Today I'm going to cover four portions. First, I'll give some background information about drug post-marketing monitoring and the efforts of the FDA, regulatory agencies and industry. Then I'm going to use a drug label data set to analyze the text using Text Explorer in JMP, use the JMP add-in XGBoost to analyze the DILI information, and then give the conclusion. The XGBoost tutorial by Dr. Russ Wolfinger is also presented at this JMP Discovery Summit, so please go to his tutorial if you're interested in XGBoost. So drug development, according to the FDA's description of the process, can be divided into five stages. The first two stages, discovery research and preclinical, are mainly animal studies and chemical screening, and the later three stages involve humans. And JMP has three product lines, JMP Genomics, JMP Clinical, and JMP and JMP Pro, that cover every stage of drug discovery. JMP Genomics is the omics system that can be used for omics and clinical biomarker selection, and JMP Clinical is specific to clinical trials and post-marketing monitoring of drug safety and efficacy. JMP and JMP Pro can be used for drug data cleaning, mining, target identification, formulation development, DOE, QbD, bioassay, etc., so they can be used at every stage of drug development. In drug development there is one most frequent single cause, called DILI, for which a clinical trial can be stopped, a drug can be rejected for approval by the FDA or another regulatory agency, or a drug can be recalled once it is on the market. This most frequent single cause, DILI, is described in FDA guidance and other scientific publications. So what is DILI?
This actually is drug-induced liver injury, called DILI and you have FDA, back almost more than 10 years ago in 2009, they published a guide for DILI, how to evaluation and follow up, FDA offers multiple years of the DILI training for the clinical investigator and those information can still find online today. And they have the conferences, also organized by FDA, just last year. And of course for the DILI, how you define the subject or patient could have a DILI case, they have Hy's Law that's included in the FDA guidance. So here's an example for the DILI evaluation for the clinical trial, here in the clinical trial, in the JMP Clinical by Hy's Law. So the Hy's Law is the combination condition for the several liver enzymes when they elevate to the certain level, then you would think it would be the possible Hy's Law cases. So you have potentially that liver damages. So here we use the color to identify the possible Hy's Law cases, the red one is a yes, blue one is a no. And also the different round and the triangle were from different treatment groups. We also use a JMP bubble plot to to show the the enzymes elevations through the time...timing...during the clinical trial period time. So this is typical. This is 15 days. Then you have the subject, starting was pretty normal. Then they go kind of crazy high level of the liver enzyme indicate they are potentially DILI possible cases. So, the FDA has two major databases, actually can deal with the post-marketing monitoring for the drug safety. One is a drug label and which we will get the data from this database. Another one is FDA Adverse Event Reporting System, they then they have from the NIH and and NCBI, they have very actively built this LiverTox and have lots of information, deal with the DILI. And the FDA have another database called Liver Toxic Knowledge Base and there was a leading by Dr. Tong, who is our co are so in this presentation. They have a lot of knowledge about the DILI and built this specific database for public information. So drug label. Everybody have probably seen this when you get prescription drug. You got those wordy thing paper come with your drug. So they come with also come with a teeny tiny font words in that even though it's sometimes it's too small to read, but they do contain many useful scientific information about this drug Then two potions will be related to my presentation today, would be the sections called warnings and precautions. So basically, all the information about the drug adverse event and anything need be be warned in these two sections. And this this drug actually have over 2000 words describe about the warnings and precautions. And fortunately, not every drug has that many side effects or adverse events. Some drugs like this, one of the Metformin, actually have a small section for the warning and precautions. So the older version of the drug label has warnings and precautions in the separate sections, and new version has them put together. So this one is in the new version they put...they have those two sections together. But this one has much less side effects. So JMP and the JMP clinical have made use by the FDA to perform the safety analysis and we actually help to finalize every adverse event listed in the drug labels. So this is data that got to present today. So we are using the warning and precaution section in the 462 drug labels that extracted by the FDA researchers and I just got from them. And the DILI indicator was assigned to each drug. 1 is yes and the zero is no. 
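Editor's note: The Hy's Law screening mentioned above can be sketched as a simple formula column for readers who want to try something similar on their own laboratory data. The column names and the multiples-of-ULN cutoffs used here (liver enzymes at 3x the upper limit of normal together with bilirubin at 2x) are common rule-of-thumb values and assumptions for illustration, not numbers quoted in the talk; JMP Clinical ships its own Hy's Law analysis.

```jsl
// Illustrative only: flag potential Hy's Law cases with a formula column.
// Assumes lab columns already expressed as multiples of the upper limit of
// normal (xULN); names and cutoffs are assumptions, not taken from the talk.
dt = Current Data Table();
dt << New Column( "Possible Hy's Law Case", Character, Nominal,
	Formula(
		If( (:Name( "ALT xULN" ) >= 3 | :Name( "AST xULN" ) >= 3) & :Name( "BILI xULN" ) >= 2,
			"Yes",
			"No"
		)
	)
);
```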
So from this this distribution, you can see there's about one...164 drugs has potential DILI cases and 298 doesn't and the original format for the drug label data is in the XML format and that can be imported by JMP multiply at once. So for the DILI keywords and was a many years effort by the FDA to come up this keyword list. Then they actually by the expert, reading hundreds of drug label and then decided what could potentially become the DILI cases. So then they come up with those about 44 words or terms to be indicated as a keyword, to be indicated for the drug could be the DILI cases. And you may also heard about MedDRA, which is a medical dictionary for regulatory activities. They have different levels of a standardized terms and most popular one is preferred term. I'm going to be using today. So in the warning and precaution, you can see if we pull everything together, you have over 12,000 terms in the warnings and the precautions section. And you can see that "patients" and "may" is a dominant which made not...should not be related to the medical cases and the medical information in this case. So we can remove that, you can see that not any other words are so dominant in this word cloud, but it still have many medical unrelated words like "use" and like "reported" that we could put into... could remove them to our analysis list. So in the in the Text Explorer, we can put them into the stop word and also we normally were using the different Text Explorer technology is stemming, tokenizing, regex, recoding and deleting manually. to clean up the list. But it had 12,000 terms, so it could be very time consuming. But since we have the list we are interested in, so we want to take advantage that we already knew what we are interested in the terms in this system. So what we're going to do and I'm going to show you in the demo that we'll only use the DILI keywords, plus the preferred term from the MedDRA to generate the interesting terms and the phrases to do the prediction. So here is the example we saw using only the DILI keywords. Then you see everything over here, you can see even in the list. You have a count number showed at the side for each of terms, how many times they are repeated in the warnings and precaution section and also you can see more colorful, more graphic in the world cloud to get a pattern recognized. And then we add the medical terms, that was the medical related terms. So it's still come down from the 12,000 terms to the 1190 terms that was including DILI keywords and medical preferred terms. So we think this would be the good term list to start with to do our analysis. So what we do is in the JMP Text Explorer, we can save the term...document term matrix. That means if you see 1 that means this document have seen this term, if it says, if this is 0, this means this document has not see, have a case of this word. So then we, in the XGBoost will make k fold, and three k folds, use each one with five columns. So we use in this machine learnign and use XGBoost tree model which is add in for the JMP Pro and we...using the DILI indicator to as a target variable and they use the DILI keywords and also the MedDRA preferred terms that have shown up more than 20 times to...as a predictor. Then we use a cross validation XGBoost then it 300 times interation. 
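Editor's note: Concretely, the document term matrix saved from Text Explorer is a binary indicator matrix over the 462 drug labels and the retained DILI keywords and MedDRA preferred terms, which is what the XGBoost models take as input:

```latex
X_{ij} =
\begin{cases}
1 & \text{if term } j \text{ appears in the Warnings and Precautions text of drug label } i,\\
0 & \text{otherwise.}
\end{cases}
```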
Now we got statistical performance metrics, we get term importance to DILI, and we get, we can use the prediction profiler for interactions and also we can generate and the save the prediction formula for new drug prediction. So I'm going to the demo. So this is a sample table we got in the in JMP. So you have a three columns. Basically you have the index, which is a drug ID. Then you have the warnign and precaution, it could have contain much more words that it's appeared, so basically have all the information for each drug. Now you have a DILI indicator. So we do the Text Explorer first. We have analysis, you go to the Text Explorer, you can use this input, which is a warning and precaution text and you would you...normally you can do different things over here, you can minimize characters, normally people go to 2 or do other things. Or you could use the stemming or you could use the regex and to do all kind of formula and in our limitation can be limited. For example, you can use a customize regex to get the all the numbers removed. That's if only number, you can remove those, but since we're going to use a list, we'll not touch any of those, we can just go here simply say, okay, So it come up the whole list of this, everything. So now I'm going to say, I only care about oh, for this one, you can do...you can show the word cloud. And we want to say I want to center it and also I want to the color. So you see this one, you see the patient is so dominant, then you can say, okay this definitely...not the... should not be in the in analysis. So I just select and right click add stop word. So you will see those being removed and no longer showed in your list and no longer show in the word cloud. So now I want to show you something I think that would speed up the clean up, because there's so many other words that could be in the system that I don't need. So I actually select and put everything into the stop word. So I removed everything, except I don't know why the "action" cannot be removed. And but it's fine if there's only one. So what I do is I go here. I said manage phrase, I want to import my keywords. Keyword just have a... very simple. The title, one column data just have all the name list. So I import that, I paste that into local. This will be my local library. And I said, Okay. So now I got only the keyword I have. OK, so now this one will be...I want to do the analysis later. And I want to use all of them to be included in my analysis because they are the keywords. So I go here, the red triangle, everything in the Text Explorer, hidden functions, hidden in this red triangle. So I say save matrix. So I want to have one and I want 44 up in my analysis. I say okay. So you will see, everything will get saved to my... to the column, the matrix. So now I want to what I want to add, I want to have the phrase, one more time. I also want to import those preferred terms. into the my database, my local data. Then also, I want to actually, I want to locally to so I say, okay. So now I have the mix, both of the the preferred terms from the MedDRA and also my keywords. So you can see now the phrases have changed. So that I can add them to my list. The same thing to my safe term matrix list and get the, the, all the numbers...all the terms I want to be included. And the one thing I want to point out here is for these terms and they are...we need to change the one model format. This is model type is continuing. I want to change them to nominal. I will tell you why I do that later. 
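Editor's note: For readers who prefer to script these Text Explorer steps rather than point and click, the launch looks roughly like the sketch below. The column name is illustrative, and the most reliable way to get the remaining messages (stop words, phrase management, saving the document term matrix) is to perform the steps once interactively and save the platform script from the red triangle.

```jsl
// Hedged sketch of the Text Explorer launch used in the demo; column name is illustrative.
dt = Current Data Table();
te = dt << Text Explorer(
	Text Columns( :Name( "Warning and Precaution" ) ),
	Language( "English" )
);

// The stop word, phrase-import and document-term-matrix steps shown in the demo are
// red-triangle commands; capturing the saved platform script after an interactive run
// gives their exact JSL messages (for example, a Save Document Term Matrix message).
```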
So now I have, I can go to the XGBoost, which is in the add in. We can make...k fold the columns that make sure I can do the cross validation. I can use just use index and by default is number of k fold column to create is three and the number of folds (k) is within each column is five, we just go with the default. Say, okay, it will generate three columns really quickly. And at the end, you are seeing fold A, B, C, three of them. So we got that, then we have... Another thing I wanted to do is in the... So we can We can create another phrase which has everything...that have have everything in...this phrase have everything, including the keywords and PT, but I want to create one that only have the only have only have the the preferred term, but not have the keyword, so I can add those keywords into the local exception and say, Okay. So those words will be only have preferred terms, but not have the keywords. So this way I can create another list, save another list of the documentation words than this one I want to have. So have 1000, but this term has just 20. So what they will do is they were saved terms either meet... have at least show up more than 20 times or they reach to 1000, which one of them, they will show up in the my list. So now I have table complete, which has the keywords and also have the MedDRA terms which have more than 20, show more than 20 times, now also have ??? column that ready for the analysis for the XGBoost. So now what I can do is go to the XGBoost. I can go for the analysis now. So what I'm going to do show you is I can use this DILI indicator, then the X response is all my terms that I just had for the keyword and the preferred words. Now, I use the three validation then click OK to run. It will take about five minutes to run. So I already got a result I want to show you. So you have... This is what look like. The tuning design. And we'll check this. You have the actual will find a good condition for you to to to do so. You can also, if you have as much as experience like Ross Wolfinger has, he will go in here, manually change some conditions, then you probably get the best result. But for the many people like myself don't have many experienced in XGBoost, I would rather use this tuning design than just have machine to select for me first, then I can go in, we can adjust a little bit, it depend on what I need to do here. So this is a result we got. You can use the...you can see here is different statistic metrics for performance metrics for this models and the default is showed only have accuracy and you can use sorting them by to click the column. You can sorting them and also it has much more other popular performance metrics like MCC, AUC, RMSE, correlation. They all show up if you click them. They will show up here. So whatever you need, whatever measurement you want to do, you can always find here. So now I'm going to use, say I trust the validation accuracy, more than anything else for this case. So I want to do is I want to see just top model, say five models. So what here is I choose five models. Then I go here, say I want to remove all the show models. So you will see the five models over here and then you can see some model, even though the, like this 19 is green, it doesn't the finish to the halfway. So something wrong, something is not appropriate for this model. I definitely don't want to use that one, so others I can choose. Say I want to choose this 19, I want to remove that one. So I can say I want to remove the hidden one. 
So basically just whatever you need to do. So if you compare, see this metrics, they're actually not much, not much different. So I want to rely on this graphic to help me to choose the best one to do the performance. So then you choose the good one. You can go here to say, I like the model 12 so I can go here, say I want to do the profiler. So this is a very powerful tool, I think quite unique to JMP. Not many tools have this function. So this gives you an opportunity to look at individual parameters in the in the active way and see how they how they change the result. For example those two was most frequently show up in the DILI cases. And you can see the slope is quite steep and that means if you change them, they will affect the final result predictions quite a bit. So you can see when the hepatitis and jaundice both says zero, you actually have very low possibility to get the DILI as one. So is low case for the possible DILI cases. But if you change this line, to the 1, you can see the chance you get is higher. And if you move those even higher. So you have, you will have a way to analyze, if they are the what is the key parameters or predictor to affect your result. And for this, some of them, even their keyword, they're pretty flat. So that means if you change that, it will not affect the result that much. So So this is and also we here, we gave the list you can get to to see what is the most important features to the calculate variables prediction. So you can see over here is jaundice and others are quite important. And for the for the feature result, once you get the data in, this is all the results that we we have. And you can say, well, what...how about the new things coming? Yes, we have here, you can say, I want to save prediction formula. And you can see it's actively working on that. And then in the table, by the end of table, you will see the prediction. So remember we had one...this was, say, well, the first drug, second was pretty much predict it will be the DILI cases and the next two, third, and the fourth, and the fifth was close to zero. So we go back to this DILI indicator and we found out they actually list. The first five was right one. So, in case you have...don't have this indicator when you have the new data come in, you don't have to read all the label. You run the model. You can see the prediction. Pretty much you knew if it is it is DILI cases or not. So my deomo would be end here, and now I'm going to give a conclusion. So we are using the Text Explorer to extract the data keyword and MedDRA terms using Stop Words and Phrase Management without manually selection, deletion and recoding. So we use a visualization and we created a document term matrix for prediction. And also we use machine learning for the using the XGBoost modeling and we want to quickly to run the XGBoost to find the best model and perform predict profile. And also we can save the predict formula to predict the new cases. Thank you. And I stop here.  
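Editor's note: As a small follow-on to the last point, once the prediction formula column has been saved to the table, scoring a new drug label does not require re-running the model. A hedged sketch, with placeholder column names (the saved formula column's actual name will vary):

```jsl
// Sketch only: score a new drug label with the saved XGBoost prediction formula.
// The term indicator columns and the prediction column name are placeholders.
dt = Current Data Table();
dt << Add Rows( 1 );
newRow = N Rows( dt );

// Fill in the term indicators for the new label (1 = term present, 0 = absent)
dt:Name( "hepatitis" )[newRow] = 1;
dt:Name( "jaundice" )[newRow] = 0;

// The saved prediction formula column recomputes automatically; read the predicted
// DILI probability for the new row (replace the column name with the one JMP saved):
Show( dt:Name( "Prob DILI 1" )[newRow] );
```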
Monday, October 12, 2020
Jordan Hiller, JMP Senior Systems Engineer, JMP Mia Stephens, JMP Principal Product Manager, JMP   For most data analysis tasks, a lot of time is spent up front — importing data and preparing it for analysis. Because we often work with data sets that are regularly updated, automating our work using scripted repeatable workflows can be a real time saver. There are three general sections in an automation script: data import, data curation, and analysis/reporting. While the tasks in the first and third sections are relatively straightforward — point-and click to achieve the desired result and capture the resulting script — data curation can be more challenging for those just starting out with scripting. In this talk we review common data preparation activities, discuss the JSL code necessary to automate the process, and provide advice for generating JSL code for data curation via point-and-click.     The Data Cleaning Script Assistant Add-in discussed in this talk can be found in the JMP File Exchange.     Auto-generated transcript...   Speaker Transcript mistep Welcome to JMP Discovery summit. I'm Mia Stephens and I'm a JMP product manager and I'm here with Jordan Hiller, who is a JMP systems engineer. And today we're going to talk about automating the data curation workflow. And we're going to split our talk into two parts. I'm going to kick us off and set the stage by talking about the analytic workflow and where data curation fits into this workflow. And then I'm going to turn it over to Jordan for the meat, the heart of this talk. We're going to talk about the need for reproducible data curation. We're going to see how to do this in JMP 15. And then you're going to get a sneak peek at some new functionality in JMP 16 for recording data curation steps and the actions that you take to prepare your data for analysis. So let's think about the analytic workflow. And here's one popular workflow. And of course, it all starts with defining what your business problem is, understanding the problem that you're trying to solve. Then you need to compile data. And of course, you can compile data from a number of different sources and pull these data in JMP. And at the end, we need to be able to share results and communicate our findings with others. Probably the most time-consuming part of this process is preparing our data for analysis or curating our data. So what exactly is data curation? Well, data curation is all about ensuring that our data are useful in driving analytics discoveries. Fundamentally, we want to be able to solve a problem with the day that we have. This is largely about data organization, data structure, and cleaning up data quality issues. If you think about problems or common problems with data, it generally falls within four buckets. We might have incorrect formatting, incomplete data, missing data, or dirty or messy data. And to talk about these types of issues and to illustrate how we identify these issues within our data, we're going to borrow from our course, STIPS And if you're not familiar with STIPS, STIPS is our free online course, Statistical Thinking for Industrial Problem Solving, and it's set up in seven discrete modules. Module 2 is all about exploratory data analysis. And because of the interactive and iterative nature of exploratory data analysis and data curation, the last lesson in this module is data preparation for analysis. And this is all about identifying quality issues within your data and steps you might take to curate your data. 
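Editor's note: To make the abstract's three general sections concrete, here is a minimal JSL skeleton with placeholder paths and steps (not the presenters' script); the curation section in the middle is what the rest of this talk is about filling in.

```jsl
// Minimal skeleton of an automation script with the three sections named in the
// abstract: data import, data curation, analysis/reporting. Everything is a placeholder.
Names Default To Here( 1 );

// 1) Data import
dt = Open( "$DESKTOP/components.csv" );

// 2) Data curation (the captured point-and-click steps go here)
Column( dt, "process" ) << Set Modeling Type( "Nominal" );

// 3) Analysis and reporting
dt << Distribution( Continuous Distribution( Column( :Name( "scrap rate" ) ) ) );
```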
So let's talk a little bit more about the common issues. Incorrect formatting. So what do we mean by incorrect formatting? Well, this is when your data are in the wrong form or the wrong format for analysis. This can apply your data table as a whole. So, for example, you might have your data in separate columns, but for analysis, you need your data stacked in one column. This can apply to individual variables. You might have the wrong modeling type or data type or you might have date data, data on dates or times that's not formatted that way in JMP. It can also be cosmeti. You might choose to remove response variables to the beginning of the data table, rename your variables, group factors together to make it easier to find them with the data table. Incomplete data is about having a lack of data. And this can be on important variables, so you might not be capturing data on variables that can ultimately help you solve your problem or on combinations of variables. Or it could mean that you simply don't have enough observations, you don't have enough data in your data table. Missing data is when values for variables are not available. And this can take on a variety of different forms. And then finally, dirty or messy data is when you have issues with observations or variables. So your data might be incorrect. The values are simply wrong. You might have inconsistencies in terms of how people were recording data or entering data into the system. Your data might be inaccurate, might not have a capable measurement system, there might be errors or typos. The data might be obsolete. So you might have collected the information on a facility or machine that is no longer in service. It might be outdated. So the process might have changed so much since you collected the data that the data are no longer useful. The data might be censored or truncated. You might have columns that are redundant to one another. They have the same basic information content or rows that are duplicated. So dirty and messy data can take on a lot of different forms. So how do you identify potential issues? Well, when you take a look at your data, you start to identify issues. And in fact, this process is iterative and when you start to explore your data graphically, numerically, you start to see things that might be issues that you might want to fix or resolve. So a nice starting point is to start by just scanning the data table. When you scan your data table, you can see oftentimes some obvious issues. And for this example, we're going to use some data from the STIPS course called Components, and the scenario is that a company manufactures small components and they're trying to improve yield. And they've collected data on 369 batches of parts with 15 columns. So when we take a look at the data, we can see some pretty obvious issues right off the bat. If we look at the top of the data table, we look at these nice little graphs, we can see the shapes of distributions. We can see the values. So, for example, batch number, you see a histogram. And batch number is something you would think of being an identifier, rather than something that's continuous. So this can tell us that the data coded incorrectly. When we look at number scrapped, we can see the shape of the distribution. We can also see that there's a negative value there, which might not be possible. we see a histogram for process with two values, and this can tell us that we need to change the modeling type for process from continuous to nominal. 
You can see more when you when you take a look at the column panel. So, for example, batch number and part number are both coded as continuous. These are probably nominal And if you look at the data itself, you can see other issues. So, for example, humidity is something we would think of as being continuous, but you see a couple of observations that have value N/A. And because JMP see text, the column is coded as nominal, so this is something that you might want to fix. we can see some issues with supplier. There's a couple of missing values, some typographical errors. And notice, temperature, all of the dots indicate that we're missing values for temperature in these in these rows. So this is an issue that we might want to investigate further. So you identify a lot of issues just by scanning the data table, and you can identify even more potential issues when you when you visualize the data one variable at a time. A really nice starting point, and and I really like this tool, is the column viewer. The column viewer gives you numeric summaries for all of the variables that you've selected. So for example, here I'm missing some values. And you can see for temperature that we're missing 265 of the 369 values. So this is potentially a problem if we think the temperature is an important factor. We can also see potential issues with values that are recorded in the data table. So, for example, scrap rate and number scrap both have negative values. And if this isn't isn't physically possible, this is something that we might want to investigate back in the system that we collected the data in. Looking at some of the calculated statistics, we can also see other issues. So, for example, batch number and part number really should be categorical. It doesn't make sense to have the average batch number or the average part number. So this tells you you should probably go back to the data table and change your modeling type. Distributions tell us a lot about our data and potential issues. We can see the shapes of distributions, the centering, the spread. We can also see typos. Customer number here, the particular problem here is that there are four or five major customers and some smaller customers. If you're going to use customer number and and analysis, you might want to use recode to group some of those smaller customers together into maybe an other category. we have a bar chart for humidity, and this is because we have that N/A value in the column. And we might not have seen that when we scan the data table, but we can see it pretty clearly here when we look at the distribution. We can clearly see the typographical errors for supplier. And when we look at continuous variables, again, you can look at the shape, centering, and spread, but you can also see some unusual observations within these variables. So, after looking at the data one variable at a time, a natural, natural progression is to explore the data two or more variables at a time. So for example, if we look at scrap rate versus number scrap in the Graph Builder. We see an interest in pattern. So we see these these bands and it could be that there's something in our data table that helps us to explain why we're seeing this pattern. In fact, if we color by batch size, it makes sense to us. So where we have batches with 5000 parts, there's more of an opportunity for scrap parts than for batches of only 200. We can also see that there's some strange observations at the bottom. 
In fact, these are the observations that had negative values for the number of scrap and these really stand out here in this graph. And when you add a column switcher or data filter, you can add some additional dimensionality to these graphs. So I can look at pressure, for example, instead of... Well, I can look at pressure or switch to dwell. What I'm looking for here is I'm getting a sense for the general relationship between these variables and the response. And I can see that pressure looks like it has a positive relationship with scrap rate. And if I switch to dwell, I can see there's probably not much of a relationship between dwell and scrap rate or temperature. So these variables might not be as informative in solving the problem. But look at speed, speed has a negative relationship. And I've also got some unusual observations at the top that I might want to investigate. So you can learn a lot about your data just by looking at it. And of course, there are more advanced tools for exploring outliers and missing values that are really beyond the scope of this discussion. And as you get into the analyze phase, when you start analyzing your data or building models, you'll learn much much more about potential issues that you have to deal with. And the key is that as you are taking a look at your data and identifying these issues, you want to make notes of these issues. Some of them can be resolved as you're going along. So you might be able to reshape and clean your data as you proceed through the process. But you really want to make sure that you capture the steps that you take so that you can repeat the steps later if you have to repeat the analysis or if you want to repeat the analysis on new data or other data. And at this point is where I'm going to turn it over to to Jordan to talk about reproducible data curation and what this is all about. Jordan Hiller Alright thanks, Mia. That was great. And we learned what you do in JMP to accomplish data curation by point and click. Let's talk now about making that reproducible. The reason we worry about reproducibility is that your data sets get updated regularly with new data. If this was a one-time activity, we wouldn't worry too much about the point and click. But when data gets updated over and over, it is too labor-intensive to repeat the data curation by point and click each time. So it's more efficient to generate a script that performs all of your data curation steps, and you can execute that script with one click of a button and do the whole thing at once. So in addition to efficiency, it documents your process. It serves as a record of what you did. So you can refer to that later for yourself and remind yourself what you did, or for people who come after you and are responsible for this process, it's a record for them as well. For the rest of this presentation, my goal is to show you how to generate a data curation script with point and click only. We're hoping that you don't need to do any programming in order to get this done. That program code is going to be extracted and saved for you, and we'll talk a little bit about how that happens. So there are two different sections. There's what you can do now in JMP 15 to obtain a data curation script, and what you'll be doing once we release JMP 16 next year. In JMP 15 there are some data curation tasks that generate their own reusable JSL scripting code. You just execute your point and click, and then there's a technique to grab the code. I'm going to demonstrate that. 
So tools like recode, generating a new formula column with a calculation, and reshaping data tables: these tools are in the Tables menu. There's stack, split, join, concatenate, and update. All of these tools in JMP 15 generate their own script after you execute them by point and click. There are other common tasks that do not generate their own JSL script, and in order to make it easier to accomplish these tasks and make them reproducible, I built an add-in called the Data Cleaning Script Assistant. It helps with the following tasks, mostly column stuff: changing the data types of columns, the modeling types, changing the display format, renaming, reordering, and deleting columns from your data table, also setting column properties such as spec limits or value labels. So the Data Cleaning Script Assistant is what you'll use to assist you with those tasks in JMP 15. We are going to give you a sneak preview of JMP 16, and we're very excited about new features in the log in JMP 16; I think it's going to be called the enhanced log mode. The basic idea is that in JMP 16 you can just point and click your way through your data curation steps as usual. The JSL code that you need is generated and logged automatically. All you need to do is grab it and save it off. So super simple and really useful; excited to show that to you. Here's a cheat sheet for your reference. In JMP 15 these are the tasks on the left, common data curation tasks; it's not an exhaustive list. The middle column shows how you accomplish them by point and click in JMP. The method for extracting the reusable script is listed on the right. I'm not going to cover everything in here, but this is for your reference later. Let's get into a demo, and I'll show how to address some of those issues that Mia identified with the components data table. I'm going to start in JMP 15. The first thing that we're going to talk about are some of those column problems: changing the data types, the modeling types, that kind of thing. Now, if you were just concerned with point and click in JMP, what you would ordinarily do, let's say for humidity (this is the column you'll remember that has some text in it, and it's coming in mistakenly as a character column), is to right click, get into the column info, and address those changes there. This is one of those JMP tasks that doesn't leave behind usable script in JMP 15. So for this, we're going to use the Data Cleaning Script Assistant instead. So here we go. It's in the add-ins menu because I've installed it; you can install it too. Data Cleaning Script Assistant: the tool that we need for this is Victor the Cleaner. This is a graphical user interface for making changes to columns, so we can address data types and modeling types here. We can rename columns, we can change the order of columns, and delete columns, and then save off the script. So let's make some changes here. For humidity, that's the one with the N/A values that caused it to come in as text. We're going to change it from a character variable to a numeric variable, and we're going to change it from nominal to continuous. We also identified that batch number needs to get changed to nominal; part number as well needs to get changed to nominal; and process, which is a number right now, should also be nominal. The facility column has only one value, fab tech, so that's not useful for me. Let's delete the facility column. I'm going to select it here by clicking on its name and click Delete.
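For reference, the changes Victor just made correspond to ordinary JSL column messages. This is only a sketch of equivalent code, not the script the add-in writes, and the column names are taken from the demo:

dt = Current Data Table();
// Humidity came in as character because of the N/A entries; make it numeric and continuous
Column( dt, "humidity" ) << Data Type( Numeric ) << Set Modeling Type( Continuous );
// Identifier-style columns should be nominal
Column( dt, "batch number" ) << Set Modeling Type( Nominal );
Column( dt, "part number" ) << Set Modeling Type( Nominal );
Column( dt, "process" ) << Set Modeling Type( Nominal );
// The facility column holds a single value, so drop it
dt << Delete Columns( "facility" );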
Here are a couple of those cosmetic changes that Mia mentioned. Scrap rate is at the end of my table. I want to move it earlier. I'm going to move it to the fourth position after customer number. So we select it and use the arrows to move it up in the order to directly after customer number. Last change that I'm going to make is I'm going to take the pressure variable and I'm going to rename it. My engineers in my organization called this column psi. So that's the name that I want to give that column. Alright, so that's all the changes that I want to make here. I have some choices to make. I get to decide whether the script gets saved to the data table itself. That would make a little script section over here in the upper left panel. Where to save it to its own window, let's save it to a script window. You can also choose whether or not the cleaning actions you specified are executed when you click ok. Let's let's keep the execution and click OK. So now you'll see all those changes are made. Things have been rearrange, column properties have changed, etc. And we have a script. We have a script to accomplish that. It's in its own window and this little program will be the basis. We're going to build our data curation script around it. Let's let's save this. I'm going to save it to my desktop. And I'm going to call this v15 curation script. changing modeling types, changing data types, renaming things, reordering things. These all came from Victor. I'm going to document this in my code. It's a good idea to leave little comments in your code so that you can read it later. I'm going to leave a note that says this is from the Victor tool. And let's say from DCSA, for data cleaning script assistant Victor. So that's a comment. The two slashes make a line in your program; that's a comment. That means that the program interpreter won't try to execute that as program code. It's recognized as just a little note and you can see it in green up there. Good idea to leave yourself little comments in your script. All right, let's move on. The next curation task that I'm going to address is a this supplier column. Mia told us how there were some problems in here that need to be addressed. We'll use the recode tool for this. Recode is one of the tools in JMP 15 that leaves behind its own script, just have to know where to get it. So let's do our recode and grab the script, right click recode. And we're going to fix these data values. I'm going to start from the red triangle. Let's start by converting all of that text to title case, that cleaned up this lower case Hersch value down here. Let's also trim extra white space, extra space characters. That cleaned up that that leading space in this Anderson. Okay. And so all the changes that you make in the recode tool are recorded in this list and you can cycle through and undo them and redo them and cycle through that history, if you like. All right, I have just a few more changes to make. I'll make the manually. Let's group together the Hershes, group together the Coxes, group together all the Andersons. Trutna and Worley are already correct. The last thing I'm going to do is address these missing values. We'll assign them to their own category of missing. That is my recode process. I'm done with what I need to do. If I were just point and clicking, I would go ahead and click recode and I'd be done. But remember, I need to get this script. So to do that, I'm going to go to the red triangle. 
Down to the script section and let's save this script to a script window. Here it is saved to its own script window and I'm just going to paste that section to the bottom of my curation script in process. So let's see. I'm just going to grab everything from here. I don't even really have to look at it. Right. I don't have to be a programmer, Control C, and just paste it at the bottom. And let's leave ourselves a note that this is from the recode red triangle. Alright, and I can close this window. I no longer need it. And save these updates to my curation scripts. So that was recode and the way that you get the code for it. All right, then the next task that we're going to address is calculating a yield. Oh, I'm sorry. What I'm going to do is I'm going to actually execute that recode. Now that I've saved the script, let's execute the recode. And there it is, the recoded supplier column. Perfect. All right, let's calculate a yield column. This is a little bit redundant, I realize we already have the scrap rate, but for purposes of discussion, let's show you how you would calculate a new column and extract its script. This is another place in JMP 15 where you can easily get the script if you know where to look. So making our yield column. New column, double click up here, rename it from column 16 to yield. And let's assign it a formula. To calculate the yield, I need to find how many good units I have in each batch, so that's going to be the batch size minus the number scrapped. So that's the number of good units I have in every batch. I'm going to divide that by the total batch size and here is my yield column. Yes, you can see that yield here is .926. Scrap rate is .074, 1 minus yield. So good. The calculation is correct. Now that I've created that yield column, let's grab its script. And here's the trick, right click, copy columns. from right click, copy columns. Paste. And there it is. Add a new column to the data table. It's called yield and here's its formula. Now, I said, you don't need to know any programming, I guess here's a very small exception. You've probably noticed that there are semicolons at the end of every step in JSL. That separates different JSL expressions and if you add something new to the bottom of your script, you're going to want to make sure that there's a semicolon in between. So I'm just typing a semicolon. The copy columns function did not add the semicolon so I have to add it manually. All right, good. So that's our yield column. The next thing I'd like to address is this. My processes are labeled 1 and 2. That's not very friendly. I want to give them more descriptive labels. We're going to call Process Number 1, production; and Process Number 2, experimental. We'll do that with value labels. Value labels are an example of column properties. There's an entire list of different column properties that you can add to a column. This is things like the units of measurement. This is like if you want to change the order of display in a graph, you can use value ordering. If you want to add control limits or spec limits or a historical sigma for your quality analysis, you can do that here as well. Alright. So all of these are column properties that we add, metadata that we add to the columns. And we're going to need to use the Data Cleaning Script Assistant to access the JSL script for adding these column properties. So here's how we do it. At first, we add the column properties, as usual, by point and click. I'm going to add my value labels. 
Process Number 1, we're going to call production. Add. Process Number 2, we're going to call experimental. And by adding that value label column property, I now get nice labels in my data table. Instead of seeing Process 1 and Process 2, I see production and experimental. Then, to capture the script, we go back to the add-ins menu, Data Cleaning Script Assistant, and choose the property copier. A little message has popped up saying that the column property script has been copied to the clipboard, and then we'll go back to our script in process, leave a note that this is from the DCSA property copier, and then paste, Control V to paste. There is the script that we need to assign those two value labels. It's done. Very good. Okay, I have one more data curation step to go through, something else that we'll need the Data Cleaning Script Assistant for. We want to consider only, let's say, the rows in this data table where vacuum is off. So there are 313 of those rows, and I just want to get rid of the rows in this data table where vacuum is on. The way you do it by point and click is selecting those, much as I did right now, and then running the table subset command. In order to get usable code, we're going to have to use the Data Cleaning Script Assistant once again. So here's how to subset this data table to only the rows where vacuum is off. First, under the Rows menu, under the row selection submenu, we'll use this Select Where command in order to get some reusable script for the selection. We're going to select the rows where vacuum is off. And before clicking OK to execute that selection, again I will go to the red triangle, save script to the script window, Control A, Control C to copy that, and let's paste that at the bottom of our curation script, with a note that it's from Rows, Select Where. Control V. So there's the JSL code that selects the rows where vacuum is off. Now I need, one more time, to use the Data Cleaning Script Assistant to get the selected rows. Oh, let us first actually execute the selection. There it is. Now with the rows selected, we'll go up again to add-ins, Data Cleaning Script Assistant, subset selected rows. I'm being prompted to name my new data table that has the subset of the data. Let's call it vacuum off. That's my new data table name. Click OK, another message that the subset script has been copied to the clipboard. And so we paste it to the bottom. There it is. And this is now our complete data curation script to use in JMP 15, so let's just run through what it's like to use it in practice. I'm going to close the data table that we've been working on and doing our curation on. Let's close it and revert back to the messy state. Make sure I'm in the right version of JMP. All right. Yes, here it is, the messy data. And let's say some new rows have come in, because it's a production environment and new data is coming in all the time. I need to replay my data curation workflow, so I run the script. It performed all of those operations. Note the value labels. Note that humidity is continuous. Note that we've subset to only the rows where vacuum is off. The entire workflow is now reproducible with a JSL script. So that's what you need to keep in mind for JMP 15. Some tools you can extract the JSL script from directly; for others, you'll use my add-in, the Data Cleaning Script Assistant. And now we're going to show you just how much fun and how easy this is in JMP 16. I'm not going to work through the entire workflow over again, because it would be somewhat redundant, but let's just go through some of what we went through.
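Pulled together, the core of the JMP 15 curation script assembled above might look roughly like the following. This is a hand-written sketch rather than the code the tools themselves generate; the column names (supplier, batch size, number scrap, process, vacuum) and the value text are taken from the demo and should be treated as assumptions.

dt = Current Data Table();

// From the recode step: trim whitespace, fix case, and give missing suppliers their own label
// (a hand-rolled equivalent of what Recode writes for itself)
For Each Row(
	:supplier = Trim( :supplier );
	:supplier = Match( Lowercase( :supplier ),
		"hersch", "Hersch",
		"cox", "Cox",
		"anderson", "Anderson",
		"", "Missing",
		:supplier        // leave already-correct values such as Trutna and Worley alone
	);
);

// New formula column: yield = good units / batch size
dt << New Column( "yield", Numeric, "Continuous",
	Formula( (:Name( "batch size" ) - :Name( "number scrap" )) / :Name( "batch size" ) )
);

// Value labels so process 1 and 2 display as production and experimental
Column( dt, "process" ) << Set Property( "Value Labels", {1 = "production", 2 = "experimental"} );

// Keep only the rows where vacuum is off, in a new table
dt << Select Where( :vacuum == "off" );   // "off" is the assumed value text
dtOff = dt << Subset( Selected Rows( 1 ) );
dtOff << Set Name( "vacuum off" );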
Here we are in JMP 16 and I'm going to open the log. The log looks different in JMP 16 and you're going to see some of those differences presently. Let's open the the messy components data. Here it is. And you'll notice in the log that it has a section that says I've opened the messy data table. And down here. Here is that JSL script that accomplishes what we just did. So this is like a running log that that automatically captures all of the script that you need. It's not complete yet. There are new features still being added to it. And I, and I assume that will be ongoing. But already this this enhanced log feature is very, very useful and it covers most of your data curation activities. I should also mention that, right now, what I'm showing to you is the early adopter version of JMP. It's early adopter version 5. So when we fully release the production version of JMP 16 early next year, it's probably going to look a little bit different from what you're seeing right now. Alright, so let's just continue and go through some of those data curation steps again. I won't go through the whole workflow, because it would be redundant. Let's just do some of them. I'll go through some of the things we used to need Victor for. In JMP 16 we will not need the Data Cleaning Script Assistant. We just do our point and click as usual. So, humidity, we're going to change from character to numeric and from nominal to continuous and click OK. Here's what that looks like in the structured log. It has captured that JSL. All right, back to the data table. We are going to change the modeling type of batch number and part number and process from continuous to nominal. That's done. That has also been captured in the log. We're going to delete the facility column, which has only one value, right click Delete columns. That's gone. PSI. OK, so those were all of the tool...all of the things that we did in Victor in JMP 15. Here in JMP 16, all of those are leaving behind JMP script that we can just copy and reuse down here. Beautiful. All right. Just one more step I will show you. Let's show the subset to vacuum is off. Much, much simpler here in JMP 16. All we need to do is select all the off vacuums; I don't even need to use the rows menu, I can just right click one of those offs and select matching cells, that selects the 313 rows where vacuum is off. And then, as usual, to perform the subset, to select to subset to only the selected rows, table subset and we're going to create a new table called vacuum off that has only our selected rows and it's going to keep all the columns. Here we go. That's it. We just performed all of those data curation steps. Here's what it looks like in the structured log. And now to make this a reusable, reproducible data curation script, all that we need to do is come up to the red triangle, save the script to a script window. I'm going to save this to my data...to my desktop as a v16 curation script. And here it is. Here's the whole script. So let's uh let's close all the data in JMP 16 and just show you what it's like to rerun that script. Here I am back in the home window for JMP 16. Here's my curation script. You'll notice that the first line is that open command, so I don't even need to open the data table. It's going to happen in line right here. All I need to do is, when there's new data that comes in and and this file has been updated, all that I need to do to do my data curation steps is run the script. And there it is. 
All the curation steps and the subset to the to the 313 rows. So that is using the enhanced log in JMP 16 to capture all your data curation work and change it into a reproducible script. Alright, here's that JMP 15 cheat sheet to remind you once again, these, this is what you need to know in order to extract the reusable code when you're in JMP 15 right now, and you won't have to worry about this so much once we release JMP 16 in early 2021. So to conclude, Mia showed you how you achieve data curation in JMP. It's an exploratory and iterative process where you identify problems and fix them by point and click. When your data gets updated regularly with new data, you need to automate that workflow in order to save time And also to document your process and to leave yourself a trail of breadcrumbs when you when you come back later and look at what you did. The process of automation is translating your point and click activities into a reusable JSL script. We discussed how in JMP 15 you're going to use a combination of both built in tools and tools from the Data Cleaning Script Assistant to achieve these ends. And we also gave you a sneak preview of JMP 16 and how you can use the enhanced log to just automatically passively capture your point and click data curation activities and leave behind a beautiful reusable reproducible data curation script. All right. That is our presentation, thanks very much for your time.  
Serkay Ölmez, Sr Staff Data Scientist, Seagate Technology Fred Zellinger, Sr Staff Engineer, Seagate   With many users and multiple developers, it becomes crucial to manage and source control JSL scripts. This talk outlines how to set up an open source system that integrates JSL scripts with GIT for source control and remote access. The system can also monitor the usage of scripts as well as crash log collection for debugging. Other features such as VBA scripting for PPT generation, unit testing, and user customization are also integrated to create high quality JSL scripts for a wide range of user base.     Auto-generated transcript...   Speaker Transcript Serkay Olmez Hello this is Serkay and Fred from Seagate and today we are going to talk about how we built a JSL ecosystem to manage our JSL scripts. So we are using Git to manage the source controls, the source of our JSL scripts, as well as to distribute them. So that will be the main point of the talk today. But in addition to that, I will be also talking about VBA for PowerPoint integration and also crash log collection, as well as unit tests. I will jump to the outline quickly. So what I want to first talk about is about our history which JSL. I've been scripting in JSL for about 10 years or so. And we are very happy with where we are at now, but it took us quite a bit of time to get here and I want to talk about the milestones of our experience, what we did so far and why we did so. And then we'll be talking about Git and how we enabled Git to source control our JSL scripts and how it enabled us to build further features. For example, once you enable Git, you can add more features such as monitoring your scripts. So you know that the developers can know that their scripts are in use, so they can monitor the usage. And probably more importantly, they can also collect logs. If the script crashes, the developer know and he or she can go back and fix those bugs. And you can... you can add one more feature going one step further. You can even create automated bug tickets you already know that your script crashed. So you can automatically create a bug ticket for that. And you can track those tickets using a tracking software such as JIRA or Atlassian. And I will end up with our best practices and lessons learned so far. So, And in the appendix I also have a manual for a script I will be talking about. It's a script that can push images to PowerPoint. And I have a very detailed manual for that, and I should just let you know that everything I talk about here, the data and the scripts will be available. And they are posted in a public repository in the references section here and you can just go there and grab those files if you choose to do so. So let me start with with the brief history here. So 10...I've been working with JSL for about 10 years or so, and started, I started with very basics and I didn't really know much about JSL scripting. And then what once you start doing that you realize that you have to have some proper source control and you you need proper ways of distributing your scripts. So the thing about JMP scripting is that it has zero barrier to entry. So you can literally do a plot manually and then go grab the script. It's written for you. And you can also use Community to JMP in JMP.com to ask questions and get answers. And you...the scripting's so efficient that you're doing your job so efficiently and people will notice. They will ask you how you do things and you'll say I have a script for that. 
And they will ask, "Can you share that with me?" And all of a sudden you become a developer, although you didn't intend to do so. And then you have to deal with distributing your scripts. The first idea that comes to mind is just attach them, which is a horrible idea, and I've done that for quite a while. And it's kind of obvious why that's not a good idea, because you attach a script and then you send it out and then the next day you make a revision and then you have to send it again, and you don't even know if the user will go with this next one. So I'm just illustrating the point here. Recently, I got an email from a colleague and he was referring to a script I created in 2017 and I was just numbering my scripts with these version numbers, which which is not a good idea. So it comes back after three, four years and you realize that people are still using three year old script, because they didn't update. One way of solving this problem is to use shared drive and based on the interaction I had with people in in them in the Discovery Summits, is that many people, many companies are using this one. So what what developers do is to dump their scripts into a shared drive and users will be pulling directly from the shared drive, which solves half of the problem that distribution problem, but it doesn't do anything about source controlling. You don't...you cannot trace the changes you did in the code. And that's why we actually moved to Git. And that was a breakthrough for us and enabled lots of features. So what do you do with Git is that developers will push their scripts to a repository and users will be pulling their scripts directly from Git. So that was a big improvement for us and it enabled us to collect crash logs and usage of the scripts, etc. And I just want to talk about a couple more things I learned from the Summits as well. They have been quite useful to improve my scripting skills and I attended a summit in 2018 and I learned quite a bit about expressions, etc. So people may want to go back and listen to those presentations, because they do help with the scripting skills. And one other milestone for us was about the testing. So I was inspired by this talk in the Summit last year, which was about unit testing and that enables you to automatically test your scripts before you publish them. Today I will be mostly talking about integration tests because unit tests are...unit tests are required, but not sufficient. Because you can do a unit test you can test all of your modules and they check out fine. But when you put them together, they won't work. They will crash, as illustrated here. Each drawer is tested, probably, but when you put them together, they won't operate. One nice feature that helps the developer quite a bit is about log collection. It is so helpful to know that your script crashed and, you know, how it crashed. And you can do that by collecting the logs from users and you can go back and fix your script and push the changes, and users will have their fixed script right away. And the final feature we are rolling out rather recently is about automated bug reporting. Since you already have the crash report, why not act upon that information? And so we create an automated monitoring system, which will track those crash logs and it will create bug reports automatically so that the developer can work on those. And on top of it, you can add the user as a watcher so that the user will know that somebody is working on the problem. 
So that solves the information gap between the user and the developer. This shows the timeline, and I'll be spending some time on the individual items, but we'll start with Git, and I will build this quickly and let Fred talk to this. Fred Zellinger Okay. So, Git is just a version controlling system, but rather than the words "version control," I want to say that Git helps you by giving you, effectively, unlimited redo, even if you use Git on your own local computer only. It gives you the ability to track revisions of your files and go back to old ones, in case you decide that some work you did the last few weeks was incorrect and you want to go back to something from several weeks ago. So Git is just software that you install on your local computer. It creates a repository, and that repository can then be pushed up to other repositories, such as GitLab or GitHub. TortoiseGit is a GUI interface to Git on your local machine that makes things easier. So, Serkay, if you could open the next slide. The model that I've tried to get developers to use is, on their local computer, have them install the Git client and start version controlling the files on their local PC. Once they get in the habit of doing that, then we can connect them to a remote repository. Git connects to remote repositories over SSH generally, or other methods, and they can push copies of their Git database up to the remote repository. Once it's up on the remote repository, then it's available for an HTTP web server to share back out, and JMP can then point to the URLs on that HTTP web server and load scripts from it. So instead of using a shared drive, we're using an HTTP web server that was populated by a push to a Git repository. And then the bullet items down below just point out the benefits of that and how exactly it works. We can go to the next slide. Serkay Olmez So I will take over here, Fred, and I will show a very basic illustration implementation, and the scripts are available in the References. This will be a very basic code. What I'm showing here is the code that the developer is developing. It's the bare minimum, right. It's just a dialog box that says, hello world. And assume the developer wants to pass this code to users. Instead of giving this code, what the developer passes is this code, which is a link to the repository. So this is the URL to the hello JSL script which lives in the remote repository. What the user does is just grab the script from the URL. So this is static code. It doesn't need any change, so the developer can change the script whenever he or she wants, but the user doesn't have to do anything. And I will just show you an illustration here, and I just want to go to full screen. This just illustrates how this thing works with a particular GUI, which is GitHub Desktop. What you do is you install this software on your computer and create a local repository, and it will keep track of the changes you have made. For example, in this case, I just created this hello JSL script, and this software will know that there is a change and it will highlight it automatically. What you do is, you first commit it to your local repository, which pushes those things into your local repo, and then you will be pushing it up to the origin, which is a remote repository.
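A minimal sketch of what that static user stub might look like follows; the URL is a hypothetical placeholder, and the pattern assumes the HTTP Request object available in recent JMP versions:

// Static stub the user keeps; the developer only ever updates the file in the repository
request = New HTTP Request(
	URL( "https://example.com/jsl-repo/hello.jsl" ),   // hypothetical URL to the hosted script
	Method( "GET" )
);
script = request << Send;              // body of the response as text
If( !Is Empty( script ) & script != "",
	Eval( Parse( script ) )            // run whatever the developer last pushed
);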
Now I will be pushing it to the remote, which will make it available to the users. So once you do that, the users will be able to pull this new code. And I'm now switching back to the user role here, and this is the script the user runs, and once you do that, the dialog box will show up. So you are running the script that you pulled from the repository. I will illustrate this a little better with a more advanced code, which will also be related to PowerPoint. People do lots of analysis in JMP, and many people still, at the end of the day, want to push their results into PowerPoint. I know JMP has some capabilities to push images directly to PowerPoint, but we wanted something a little more sophisticated. We want to manage the template of the PowerPoint. We want to do some more stuff within the PowerPoint, decide how many images we want to put per slide, etc. And in order to do that, you first need to connect JMP with PowerPoint, and you can do it in a very good way so you don't even have to leave JMP. What you can do is locate where your PowerPoint executable file is, and it's typically under Program Files. Then you create a batch file that will trigger this PowerPoint executable, and it will go and grab the PowerPoint file you want to run and just run it. You can do all of this in JMP. Basically what this does is it searches for the PowerPoint executable file. Once it locates it, it bundles it into this batch file, and it also includes the path to the PowerPoint you want to run and the macro, and so the PowerPoint will have a macro in it to do the management within the PowerPoint. So this is how you tie JMP to PowerPoint, and you don't even have to leave JMP to do that. But the question is, how do you get this PowerPoint file to your users to begin with? You could ask them to go and download it, which is not ideal, because you want to manage this automatically and at the same time you want to source control your PowerPoint as well. So it needs to be a good candidate; it needs to be a good part of the whole ecosystem. You don't want to put it outside. And one other requirement is that you don't want to download the PowerPoint every time the script runs. You want to download the PowerPoint only if it has changed in the repository or there is something new with the PowerPoint. A way of doing this is to use a little JMP command, which is Creation Date. What I'm doing here is checking the date of the local file. If the local PowerPoint is older than what I have in the repository, then I will go and download it. It will go and download the PowerPoint using HTTP Request, and it looks something like this. What you do is you check for the PowerPoint on your local computer. If it doesn't exist, you just go and pull it from the repository using HTTP Request. If it exists, you look at its date, and if it is old, you still go and pull the new one. So the bottom line here is that you can integrate PowerPoint seamlessly into the JMP environment, so you can push your results into PowerPoint, including tables and images. And you can do all of this without breaking source control, and I will show a demonstration here. This will be the code, for example, you would give to your users. Again, this is static code. It really has nothing except for a URL, which links to the script in the remote repository.
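The date check described here might be sketched roughly as below. The paths, URL, and reference date are hypothetical, and a plain JSL text file stands in for the binary PowerPoint template to keep the example simple:

localPath = "$DOCUMENTS/report_template.jsl";                     // stand-in for the local copy
repoUrl   = "https://example.com/jsl-repo/report_template.jsl";   // hypothetical repository URL
lastPush  = 01Sep2020;                                            // assumed date of the last known push

// Download only if the local copy is missing or older than the repository version
If( !File Exists( localPath ) | Creation Date( localPath ) < lastPush,
	request = New HTTP Request( URL( repoUrl ), Method( "GET" ) );
	fresh = request << Send;
	Save Text File( localPath, fresh );   // refresh the local copy
);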
So it will go and grab the JMP script from the repository and those scripts are again available and you can just pull them from the References. From the, from the, from the developer side, this is the actual script, right. This is the script, the developer has developed and he or she pushes it to remote repository. And the nice thing about this script is that, I just want to point out a couple of features here quickly. So if, for example, look at this. This is a standalone text script and you don't you don't have to distribute PowerPoint files or additional scripts separately. What you do is you link them in your script and they are linked in the repository here, including the PowerPoint. So this script will manage all the distribution. So, it will go and grab the PowerPoint. It will go and grab other additional files if needed. So everything is bundled in together into the script, so it does all the management for you. So let me show this quickly. I will pull this back and put it into full screen so you can see clearly. So this will be a demonstration of triggering PowerPoint automatically for it for a particular case. And what this script does is it, it gets the paths from the table, this a JMP table that includes lots of image paths, and those are referring to this particular server, and it will take those paths and it will just take the highlighted ones here and it will push them into PowerPoint. I will just run this and this script is available to you if you want to give it a try. And it's a fully functional useful script. So, and I am running this script here. Again, it is referring to the repository. So it doesn't have the row script, it's just pulling it from the repository. You run it, it retrieves the code, runs a dialogue. And I will simply run this. I won't go into the details, and once once you run it, it will just trigger the PowerPoint automatically. And PowerPoint will launch and then it will...it's starting...it's building the slides here. It will pull 20 images and this will take 10 or 20 seconds and then you will have the PowerPoint done. So this is the PowerPoint you get and everything is done automatically and everything was pulled from the repository. So what is next? How would you revise this PowerPoint? So, i assukme you want to make some changes to your template. And you can see more details about this in the appendix, but what I want to show is is how you change the script in the PowerPoint. And I'll go to the full screen, and I will just just run this. What we are starting from is what we what we're left with in the previous slide, right, so you had this these images. What I want to do is to change the template in a trivial way just illustrate the point. So you go to... you go to the macro, you scroll down and find the thing you want to change, and I will be doing a simple change here. I will be changing the header color just to make a point, right. So you change this and save it. And this is this is going real time. You save this and can delete the slides so that it goes faster to repository, because you will be pushing this to the repository. And then you push this out using Git GUI. It will go into the repository and it will make it available to your users. So all of your users will get the modification instantly. So, so you don't you don't have to ask them to go and update the PowerPoint or anything. This is just going to the repository. And I will switch back to JMP and in the user mode and then run the script again, which will retrieve the new PowerPoint. 
And if you run it again, it will go through the slides, it will create the slides. And what you will notice is that you do have the changes, which were about changing the color of the header line. So, what else do we have? Script monitoring. I think this was one of the best features we developed, because this gives the developer the ability to see whether his or her script is appreciated or not. Is it run? Is somebody running it on a regular basis? So that's one benefit. The other benefit is about crash log collection. If the script fails, you can capture the failure by using the log capture functionality of JMP. You sandwich your subroutines into log capture. If it fails, the log of the failure will be returned into this log return text. And then what you also do is you enclose it in a try, so that the script still survives to do the reporting. And then you check whether it's empty. If it's empty, that means there was no crash at all, so your script survived. But if it's not empty, that means there was a crash and you contained it. You grab the log and now you can report it. And the way to do it is to use an HTTP request to send the user ID, the script name, and the log note, whether it was a crash or just a regular run, maybe some performance metrics. So the bottom line is you can transmit some metadata about your script back to the developer. What the server will do is log them, and you will have a set of files stored on the server and you can monitor them. And I'm just showing a sample table out of these. It's a crash log, which has the date stamp and the script name, so this script has failed at this date with this particular crash. This is extremely useful for the developer, because he or she can go back and fix the issue. So I will collapse this and this, and I will do a quick demonstration here that shows how you contain the crash and how you report it to the user. What I will do here is make a subroutine crash. I will create an undefined parameter and JMP will complain. It will say this thing is not defined, so the script is crashing. But what I will do is call that subroutine inside the log capture functionality, and everything will be sandwiched under a try. So although the script crashed within the subroutine, it will survive overall, and it will be able to report the crash. You give a nicely formatted notification to the user saying that the script crashed with this particular error, that we created a log for it, and that we are working on it. That's the notification you give to the user. One key thing I learned in the Summits was about automated testing. I used to do my testing manually, which is very frustrating, because it takes time and you cannot capture each and every corner of your script. It's impossible to do hours of testing when you do a small change. So I started getting into automated testing, and it is very important to do unit testing. Unit testing refers to the testing of individual subroutines. You have a function and you want to test it multiple times before you put it into your overall system. So you can do these unit tests for individual modules, but it won't be enough. And I can show you a couple of examples of that.
For example, NASA lost their Mars Climate Orbiter for a very strange...because of a very strange error because there were two software teams, one in Europe, one in the US. And one of them was working with the units of pounds and the one in the, yeah that was the one in the US, and the one in Europe was using Newtons. So they they certainly did their unit tests. But when they put together their code, it didn't work, because one was expecting Newtons and the other one was getting pounds. So they they literally lost the Orbiter, just because they they forgot to convert from pounds to Newtons. And more recently as Starliner, Boeing lost their Starliner and then they admitted that they could have caught the error if they had done a rigorous integration test. And at the end of day, what counts is the integration test because modules don't live alone in the script, they talk together. And this is a funny illustration of the problem here. So you can have two objects which are tested thoroughly. It's a window, right, what can go wrong with a window? But if you have two of them put together, they won't even open, so you have to do some integration tests to see if they are combined, if they work okay together. And I want to illustrate a quick point here that shows how I started doing automated testing. And this will refer to a particular case, which is a difficult case, to be fair, because this will be testing of a modal window and modal windows won't go away unless you click on click on a button. So you have to create a JMP script that clicks on a button, that kind of mimics a human behavio,r and what I'm doing here is to is to inject a tester into into the modal window here. And the nice thing about this approach, I think, is that you can still distribute this thing, this script to your users and the users won't even realize that there's actually some testing routine in it. So it has the hooks for a tester, but then you run this alone, what it does is it looks for this test mode parameter and it's not set. So if it's not set the script will set it to zero that that will disable all the hooks in the table. So you can give this to your users and they can run it as this. However, if you want to test your script, what you do is you build a tester code on top of it. So the tester code will set the test mode to one and it will also create the tester... tester object you want to inject into your script. For example, this particular case, what it does is, it's selecting a particular column and then it's assigning it into a role by clicking a button and then it's running the script, right. So that's literally mimicking a human behavior. So then it will load the actual script. So it's basically injecting those parameters into the script that you want to test. Now you're running the script automatically, so it will load the script and run it, and it will close the UI after clicking this button and that button. And then what you do is you run it again in a log capture functionality, so that you know if something goes wrong, you will be capturing the failure. And you also put it in the enter try so that overall, the script survives to report the log. And then if nothing was wrong, you will have empty log return and that means your script did not fail. If something has gone wrong, you will know it in the log capture. So that's the idea. So, We have the ability to monitor the scripts. We have the ability to capture logs of crashes. So we thought why, why don't we act on the crash log? 
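For reference, the try/log-capture sandwich that produces these crash logs might be sketched like this; the routine name and the user notification are hypothetical stand-ins, and the actual transmission to the server is left as a comment:

logReturned = "";
Try(
	// Log Capture returns whatever the wrapped code writes to the log
	logReturned = Log Capture( RunMainRoutine() ),
	// if the wrapped code throws, grab the exception text instead
	logReturned = Char( exception_msg )
);
If( logReturned != "",
	// contained a crash: tell the user, then ship the log plus user ID and script name to the server
	Caption( "The script hit an error. A crash log was recorded and the developer will be notified." )
);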
If you see a crash log that means something has gone wrong, and script that...the user already knows that because the script crashed obviously, and the developer knows that the script crashed because the logs are stored in a server. So they are there for you to act on. But the thing is, this is not a closed loop yet, because the user doesn't know that the developer knows. And you can close that loop by creating automated tickets. So since you have the information already, you can use a bug tracking software such as the JIRA Atlassian and then you can use the REST functionality to collect the information from the server, create a ticket, assign it to the developer, and you can also assign the user as a watcher. The watcher means that whenever the developer does anything about the bug and enters that information into the, into JIRA ,that will be looped back to the to the user so he or she will be notified and will know that somebody is working on the bug. And there are multiple ways of doing it, depending on the flavor of JIRA you have, and I am just giving you the basic code here. It's it's a curl code that you can create and you collect the metadata that you have from the crash logs and you embed that into a JSL file and then it will be transmitted through REST and it will go into the JIRA, which is...which will track it for you. And then JIRA will, if you set up properly, JIRA will notify it...JIRA will notify the developer and possibly the user for which...for whom the script crashed. And so this is all tied together and and the developer will know there's a bug that he or she needs to work on and the user will know that somebody will be working on the bug. Okay. I think this takes me to my closing notes. Just the takeaways you can get out of this presentation. Git has been the cornerstone of our system and it has it has enabled us to do lots of nice features and I didn't even mention the basic ones, which which are kind of obvious, because that gives you the ability to collaborate as well. So if you're multiple people working on the same script. You can use all the functionalities of the  
Stanley Siranovich, Principal Analyst, Crucial Connection LLC   Much has been written in both the popular press and in the scientific journals about the safety of modern vaccination programs. To detect possible safety problems in U.S.-licensed vaccines, the CDC and the FDA have established the Vaccine Adverse Event Reporting System (VAERS). This database system now covers 20 years, with several data tables for each year. Moreover, these data tables must be joined to extract useful information from the data. Although a search and filter tool (WONDER) is provided for use with this data set, it is not well suited for modern data exploration and visualization. In this poster session, we will demonstrate how to use JMP Statistical Discovery Software to do Exploratory Data Analysis for the MMR vaccine over a single year using platforms such as Distribution, Tabulate, and Show Header Graphs. We will then show how to use JMP Scripting Language (jsl) to repeat, simply and easily, the analysis for additional years in the VAERS system.     Auto-generated transcript...   Speaker Transcript Stan Siranovich Good morning everyone. Today we're going to do a exploratory data analysis of the VAERS database. Now let's do a little background on what this database is. VAERS, spelled V-A-E-R-S, is an acronym for Vaccine Adverse Effect Reporting System. It was created by the FDA and the CDC. It gets about 30,000 updates per year and it's been public since 1990 so there's quite a bit of data on it. And it was designed as an early warning system to look for some effects of vaccines that have not previously been reported. Now these are adverse effects, not side effects, that is they haven't been linked to the vaccination yet. It's just something that happened after the vaccination. Now let's talk about the structure. VAERS VAX and VAERS DATA. Now there is a tool for examining the online database and it goes by the acronym of WONDER. And it is traditional search tool where you navigate the different areas of the database, select the type of data that you want, click the drop down, and after you do that a couple of times, or a couple of dozen times, what you do is send in the query. And without too much latency, get a result back. But for doing exploratory data analysis and some visualizations, there's a slight problem with that. And that is that you have to know what you want to get in the first place, or at least at the very good idea. So that's where JMP comes in. And as I mentioned, we're going to do an EDA and some visualization on on specific set of data, that is data for the MMR vaccine for measles, mumps, and rubella. And we're going to do for the most recent full year available, which will be 2019. So let me move to a new window. Okay, the first thing we did and which I omitted here was to download the CSVs and open them up in JMP. Now I want to select my data and JMP makes it very easy. After I get the window open, I simply go through rows, rows selection and select where and down here is a picture that I want the VAX_TYPE and I wanted it to equal MMR. Now there's some other options here besides equals, which we'll talk about in a second. And after we click the button, and we've selected those rows, the next thing we want to do is decide on which data that that we want. So I've highlighted some of the columns and in a minute or so you'll see why. And then when I do that, oh, before we go there, let's note row nine and row 18 right here. Notice we have MMRV and MMR. MMRV is a different vaccine. 
And if we wanted to look at that also, we could have selected contains here from the drop down. But that's not what we wanted to do. So we click OK and we get our table. Now what we want to do is join that VAERS VAX table which contains data about the vaccine, such as a manufacturer, the lot and so forth with the VAERS DATA table, which contains data on on the effects of vaccine, so it's it's got things like whether or not the patient had allergies, whether or not the patient was hospitalized, number of hospital days, that sort of thing. And it also contains demographic data such as age and sex. So what we want to do is join and simply go to tables join and we select The VAERS VAX and VAERS DATA tables and we want to join them on the VAERS ID. And again, JMP makes it pretty easy. We just click the column in each one at one of the separate tables and we put them here in the match window and after that we go over to the table windows and we select the columns that we want. And this is what our results table looks like. Now let me reduce that and open up and JMP table. There we go, and I'll expand that. And for the purposes of this demonstration I just selected these...these columns here. We've got the VAERS ID, which you see identification obviously, the type which are all MMR. And looks like Merck is the manufacturer. And there's a couple of unknowns scattered through here. And I selected VAX LOT, because that would be important if there's something the matter with one lot, you want to be able to see that. This looks like cage underscore year, but that is calculated age in years. There are several H columns and I just selected one. And I selected sex because we'd like to know if somebody is is more affected, if males are more affected than females or vice versa. And HOSPDAYS is the number of days in the hospital if they had an adverse effect that was severe enough to put them into the hospital. And NUMDAYS is the number of days between vaccination and the appearance of the adverse effects and it looks like we we have quite, quite a range right here. So let's, let's get started on our analysis. show header graphs. So I'm going to click on that and show header graphs. And we get some distribution, and some other information up here. We'll skip the ID and see that the VAX_TYPE is all MMR, you have no others there. And the vax manufacturer, yes, it's either a Mercks & Co Inc or unknown and one nice feature about this is we can click on the bar and it will highlight the rows for us and click away and it's unhighlighted. Moving on to VAX_LOT, we have quite a bit of information squeezed into this tiny little box here. First of all, we have the top five lots in occurrence in in our data table and here they are, and here's how many times they appear. And it also tells us that we have 413 other lots included in table, plus five by my calculation, that's something like 418 individual lots. Now we go over the calculated age in years and in we see most of our values are between zero and whatever, they're during zero bin, which makes sense because it is a vaccination and we'll just make a note of that. And we go over to the sex column and it looks like we have significantly more females than males. Now, that tells us right away if we want to do, side by side group comparisons, we're going to have to randomly select from females, so that they equal the males and we also have some unknowns here, quite a few unknowns. And we simply note that and move on. And we see hospital days. And we've see NUMDAYS. 
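A rough JSL equivalent of the selection and join steps just described is sketched below. The table names assume the 2019 VAERS CSV files were opened as-is, so treat them as assumptions; VAX_TYPE and VAERS_ID are the column names mentioned in the talk.

dtVax  = Data Table( "2019VAERSVAX" );
dtData = Data Table( "2019VAERSDATA" );

// Keep only MMR rows (a Contains match would also pick up MMRV, which we don't want here)
dtVax << Select Where( :VAX_TYPE == "MMR" );
dtMMR = dtVax << Subset( Selected Rows( 1 ) );

// Join vaccine records to the outcome and demographic records on VAERS_ID
dtJoined = dtMMR << Join(
	With( dtData ),
	By Matching Columns( :VAERS_ID = :VAERS_ID )
);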
Now here's another really really nice feature. Let's say we'd like more details and we want to do a little bit of exploration to see how the age is distributed, we simply right click, right click, select open in distribution. And here we are in the distribution windows, but quite a bit of information here. For our purposes right now, we don't really do much here about the quantiles. So let's click close and it's still taking up some space. So let's go down here and select outline close orientation and let's go with vertical. And we're left with a nice easy to read window. It's got some information in there. We of course see our distribution down here and we've got a box and whisker plot up here. There's not a whole lot of time to go into that, that, that just displays data in a different way. And we see from our summary statistics that the mean happens to be 16.2, with the standard deviation 20.6. Not an ideal situation. So if you want to do anything more with that, we may want to split the years in two groups where most of them are down here and and then where, where this, where all the skewed data is and then the rest of them and along the right and examine that separately. And I will minimize that window and we can do the same with hospital days and number of days. And let me just do that real quick. And here we see the same sorts of data and I won't bother clicking through that and reducing it. But we might note also when again we have the mean of 6.7 and standard deviation of 13.2, again, not a very ideal situation and we simply make note of that. And I will close that. Now let's say we want to do a little bit more exploratory analysis, something caught our eye and all that. And that is simple to do here. We don't have to go back to the online database, and select through everything, click the drop downs, or whatever. We can simply come up here to analyze and fit Y by X. So let's say that we would like to examine the relationship between oh, hospital days, number of days spent in the hospital and calculated age in years. We simply do that. We have two continuous variables so we're going to get a bivariate plot out of that. We click OK. And we get another nice display of the data. And yes, we can see that currently, the mean is down around 5 or 6, which is a good, good thing better than 10 or 12. We can, for purposes of references, go up here to the red triangle, select fit mean and we get the mean, right here. And we noticed there's quite a few outliers. Let's say we want to examine them right now and decide whether or not we want to delve into them a little bit further. So if we hover over one of our outlier points or any of the points for that matter, we see we get this pop up window and it tells us that particular data point represents row 868. Calculated age is in the one year bucket, and this patient happened to spend 90 days in the hospital. Now we could right click and color this row or put some sort of marker in there. I won't bother doing that, but I will move the cursor over here into the window, and we see this little symbol up in the right hand corner, click that and that pins it. So we can, of course, repeat that. And we can get the detail for further examination. I found this to be quite handy when giving presentations to groups of people like to call attention to one particular point. That's a little bit overbearing so let's right click, select...select font, not edit. And we get the font window come up and see we're using 16 point font. 
Let's, I don't know, let's go down to 9. And that's a little bit better and it gives us more room if we'd like to call attention to some of the other outliers. So in summary, let me bring up the PowerPoint again. In summary, we were able to import and shape two large data tables from a large online government maintained database. We were able to subset tables, able to join the tables and select our output data all seamlessly. And we were able to generate summaries and distributions, pointing out the areas that may be of interest and for more detailed analysis. And of course, that was all seamless and all occured with within the same software platform. Now, supply some links right over here to the various data site. This, this is the main site, which has all the documentation that the government did quite a good job there. And here is the actual data itself in the zip..  
Ruth Hummel, JMP Academic Ambassador, SAS Rob Carver, Professor Emeritus, Stonehill College / Brandeis University   Statistics educators have long recognized the value of projects and case studies as a way to integrate the topics in a course. Whether introducing novice students to statistical reasoning or training employees in analytic techniques, it is valuable for students to learn that analysis occurs within the context of a larger process that should follow a predictable workflow. In this presentation, we'll demonstrate the JMP Project tool to support each stage of an analysis of Airbnb listings data. Using Journals, Graph Builder, Query Builder and many other JMP tools within the JMP Project environment, students learn to document the process. The process looks like this: Ask a question. Specify the data needs and analysis plan. Get the data. Clean the data. Do the analysis. Tell your story. We do our students a great favor by teaching a reliable workflow, so that they begin to follow the logic of statistical thinking and develop good habits of mind. Without the workflow orientation, a statistics course looks like a series of unconnected and unmotivated techniques. When students adopt a project workflow perspective, the pieces come together in an exciting way.       Auto-generated transcript...   Speaker Transcript So welcome everyone. My name is ... Ambassador with JMP. I am now a retired professor of Business ... between a student and a professor working on a project. ... engage students in statistical reasoning, teach that ... to that, current thinking is that students should be learning about reproducible workflows, ... elementary data management. And, again, viewing statistics as ... wanted to join you today on this virtual call. Thanks for having ... and specifically in Manhattan, and you'd asked us, so you ... And we chose to do the Airbnb renter perspective. So we're ... expensive. So we started filling out...you gave us ... separate issue, from your main focus of finding a place in ... you get...if you get through the first three questions, you've ... know, is there a part of Manhattan you're interested in? ... repository that you sent us to. And we downloaded the really ... thing we found, there were like four columns in this data set ... figured out so that was this one, the host neighborhood. So ... figured out that the first two just have tons of little tiny ... Manhattan. So we selected Manhattan. And then when we had ... that, and then that's how we got our Manhattan listings. So ... data is that you run into these issues like why are there four ... restricted it to Manhattan, I'll go back and clean up some ... data will describe everything we did to get the data, we'll talk ... know I'm supposed to combine them based on zip, the zip code, ... columns, it's just hard to find the ... them, so we knew we had to clean that up. All right, we also had ... journal of notes. In order to clean this up, we use the recode ... Exactly. Cool. Okay, so we did the cleanup ... Manhattan tax data has this zip code. So I have this zip code ... day of class, when we talked about data types.
And notice in the ... the...analyze the distribution of that column, it'll make a funny ... Manhattan doesn't really tell you a thing. But the zip code clean data in ... just a label, an identifier, and more to the point, when you want to join or merge ... important. It's not just an abstract idea. You can't merge ... nominal was the modeling type, we just made sure. ... about the main table is the listings. I want to keep ... to combine it with Manhattan tax data. Yeah. Then what? Then we need to ... tell it that the column called zip clean, zip code clean... Almost. There we go. And the column called zip, which ... Airbnb listing and match it up with anything in ... them in table every row, whether it matches with the other or ... main table, and then only the stuff that overlaps from the second ... another name like, Air BnB IRS or something? Yeah, it's a lot ... do one more thing because I noticed these are just data tables scattered around ... running. Okay. So I'll save this data table. Now what? And really, this is the data ... anything else, before we lose track of where we are, let's ... or Oak Team? And then part of the idea of a project ... thing. So if you grab, I would say, take the ... two original data sets, and then my final merged. Okay Now ... them as tabs. And as you generate graphs and ... even when I have it in these tabs. Okay, that's really cool. ... right, go Oak Team. Well, hi, Dr. Carver, thanks so ... you would just glance at some of these things, and let me know if ... we used Graph Builder to look at the price per neighborhood. And ... help it be a little easier to compare between them. So we kind ... have a lot of experience with New York City. So we plotted ... stand in front of the UN and take a picture with all the ... saying in Gramercy Park or Murray Hill. If we look back at the ... thought we should expand our search beyond that neighborhood to ... just plotted what the averages were for the neighborhoods but ... the modeling, and to model the prediction. So if we could put ... expected price. We started building a model and what we've ... factors. And so then when we put those factors into just a ... more, some of the fit statistics you've told us about in class. ... but mostly it's a cloud around that residual zero line. So ... which was way bigger than any of our other models. So we know ... reasons we use real data. Sometimes, this is real. This is ... looking? Like this is residual values. ... is good. Ah, cool. Cool. Okay, so I'll look for ... is sort of how we're answering our few important questions. And ... was really difficult to clean the data and to join the data. ... wanted to demonstrate how JMP in combination with a real world ... Number one in a real project, scoping is important. We want to ... hope to bring to the group.
Pitfall number two, it's vital to explore the ... the area of linking data, combining data from multiple ... recoding and making sure that linkable ... reproducible research is vital, especially in a team context, especially for projects that may ... habits of guaranteeing reproducibility. And finally, we hope you notice that in these ... on the computation and interpretation falls by the ...
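Editor's note: the zip-code join the student and professor work through above can also be scripted. The JSL below is a rough sketch under assumed table and column names for the Airbnb listings and the Manhattan tax data; the key point from the discussion is that the zip columns should be nominal on both sides before matching.

dtListings = Data Table( "Manhattan Listings" );
dtTax = Data Table( "Manhattan Tax Data" );
// Inner join: keep only rows whose zip codes appear in both tables
dtJoined = dtListings << Join(
	With( dtTax ),
	By Matching Columns( :Name( "Zip Code Clean" ) = :Name( "Zip" ) ),
	Drop Multiples( 0, 0 ),
	Include Nonmatches( 0, 0 ),
	Preserve Main Table Order( 1 )
);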
Jeremy Ash, JMP Analytics Software Tester, JMP   The Model Driven Multivariate Control Chart (MDMVCC) platform enables users to build control charts based on PCA or PLS models. These can be used for fault detection and diagnosis of high dimensional data sets. We demonstrate MDMVCC monitoring of a PLS model using the simulation of a real world industrial chemical process — the Tennessee Eastman Process. During the simulation, quality and process variables are measured as a chemical reactor produces liquid products from gaseous reactants. We demonstrate fault diagnosis in an offline setting. This often involves switching between multivariate control charts, univariate control charts, and diagnostic plots. MDMVCC provides a user-friendly way to move between these plots. Next, we demonstrate how MDMVCC can perform online monitoring by connecting JMP to an external database. Measuring product quality variables often involves a time delay before measurements are available, which can delay fault detection substantially. When MDMVCC monitors a PLS model, the variation of product quality variables is monitored as a function of process variables. Since process variables are often more readily available, this can aid in the early detection of faults. Example Files Download and extract streaming_example.zip.  There is a README file with some additional setup instructions that you will need to perform before following along with the example in the video.  There are also additional fault diagnosis examples provided. Message me on the community if you find any issues or have any questions.       Auto-generated transcript...   Speaker Transcript Jeremy Ash Hello, I'm Jeremy Ash. I'm a statistician in JMP R&D. My job primarily consists of testing the multivariate statistics platforms in JMP, but I also help research and evaluate methodology, and today I'm going to be analyzing the Tennessee Eastman process using some statistical process control methods in JMP.   I'm going to be paying particular attention to the Model Driven Multivariate Control Chart platform, which is a new addition to JMP. I'm really excited about this platform, and these data provided a new opportunity to showcase some of its features.   First, I'm assuming some knowledge of statistical process control in this talk.   The main thing you need to know about is control charts. If you're not familiar with these, these are charts used to monitor complex industrial systems to determine when they deviate from normal operating conditions.   I'm not going to have much time to go into the methodology in Model Driven Multivariate Control Chart, so I'll refer to these other great talks, which are freely available, for more details. I should also mention that Jianfeng Ding was the primary developer of Model Driven Multivariate Control Chart, in collaboration with Chris Gotwalt, and Tanya Malden and I were testers.   So the focus of this talk will be using multivariate control charts to monitor a real world chemical process.   Another novel aspect of this talk will be using control charts for online process monitoring; this means we'll be monitoring data continuously as it's added to a database and detecting faults in real time.   So I'm going to start with the obligatory slide on the advantages of multivariate control charts. So why not use univariate control charts? There are a number of excellent options in JMP.   Univariate control charts are excellent tools for analyzing a few variables at a time.
However, quality control data sets are often high dimensional, and the number of charts that you need to look at can quickly become overwhelming. So multivariate control charts summarize a high dimensional process in just a few charts, and that's a key advantage.   But that's not to say that univariate control charts aren't useful in this setting; you'll see throughout the talk that fault diagnosis often involves switching between multivariate and univariate control charts.   Multivariate control charts give you a sense of the overall health of a process, while univariate control charts allow you to look at specific aspects. So the information is complementary, and one of the main goals of Model Driven Multivariate Control Chart was to provide some tools that make it easy to switch between those two types of charts.   One disadvantage of the univariate control chart is that observations can appear to be in control when they're actually out of control in the multivariate sense. So I have two control charts, IR charts for oil and density, and these two observations in red are in control, but oil and density are highly correlated, and these observations are outliers in the multivariate sense; in particular, observation 51 severely violates the correlation structure.   So multivariate control charts can pick up on these types of outliers when univariate control charts can't.   Model Driven Multivariate Control Chart uses projection methods to construct its control charts. I'm going to start by explaining PCA because it's easy to build up from there.   PCA reduces the dimensionality of your process variables by projecting into a low dimensional space.   This is shown in the picture to the right: we have p process variables and n observations, and we want to reduce the dimensionality of the process to A, where A is much less than p. To do this we use this loading matrix P, which provides the coefficients for linear combinations of our X variables that give the score variables, shown in the equations on the left.   T times P transpose will give you predicted values for your process variables from the low dimensional representation, and there's some prediction error. Your score variables are selected in a way that minimizes this squared prediction error. Another way to think about it is that you're maximizing the amount of variance explained in X.   PLS is more suitable when you have a set of process variables and a set of quality variables and you really want to ensure that the quality variables are kept in control, but these variables are often expensive or time consuming to collect.   A plant can be making out-of-control quality for a long time before a fault is detected, so PLS models allow you to monitor your quality variables as a function of your process variables. And you can see here that PLS will find score variables that maximize the variance explained in the Y variables.   The process variables are often cheaper and more readily available, so PLS models can allow you to detect quality faults early and can make process monitoring cheaper.   So from here on out, I'm just going to focus on PLS models because that's more appropriate for our example.   So PLS partitions your data into two components. The first component is your model component; this gives you the predicted values.   Another way to think about this is that your data have been projected into a model plane defined by your score variables, and T squared charts will monitor variation in this model plane.
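Editor's note: for readers who want the formulas behind this description, the decomposition and the two monitoring statistics (T squared here, and the squared prediction error introduced next) take the following standard forms. This is a reconstruction in conventional notation, not copied from the slides.

$$X \approx T P^{\top}, \qquad E = X - T P^{\top}$$

$$T_i^{2} = \sum_{a=1}^{A} \frac{t_{ia}^{2}}{s_{t_a}^{2}}, \qquad \mathrm{SPE}_i = e_i^{\top} e_i = \sum_{j=1}^{p} \left( x_{ij} - \hat{x}_{ij} \right)^{2}$$

Here T holds the scores, P the loadings, and E the residuals; T squared tracks variation within the model plane, while SPE (and its normalized version, the distance to the model plane) tracks variation off the plane.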
The second component is your error component. This is the distance between your original data and that predicted data and squared prediction error charts are sp charts will monitor   Variation in this component   We also provide an alternative distance to model x plane, this is just a normalized version of sp.   The last concept that's important to understand for the demo is the distinction between historical and current data.   historical data typically collected when the process is known to be in control. These data are used to build the PLS model and define   Normal process variation. And this allows a control limit to be obtained current data are assigned scores based on the model, but are independent of the model.   Another way to think about this is that we have a training and a test set, and the t squared control limit is lower for the training data because we expect lower variability for   Observations used to train the model, whereas there's greater variability and t squared. When the model generalized is to a test set. And fortunately, there's some theory that's been worked out for the   Variants of T square that allows us to obtain control limits based on some distributional assumptions.   In the demo will be monitoring the Tennessee Eastman process. I'm going to present a short introduction to these data.   This is a simulation of a chemical process developed by downs and Bogle to chemists at Eastman Chemical and it was originally written in Fortran, but there are rappers for it in MATLAB and Python now.   The simulation was based on a real industrial process, but it was manipulated to protect proprietary information.   The simulation processes. The, the production of to liquids.   By gassing reactants and F is a byproduct that will need to be siphoned off from the desired product.   The two season processes pervasive in the in the literature on benchmarking multivariate process control methods.   So this is the process diagram. It looks complicated, but it's really not that bad. So I'm going to walk you through it.   The gaseous reactants ad and he are flowing into the reactor here, the reaction occurs and product leaves as a gas. It's been cooled and condensed into a liquid and the condenser.   Then we have a vapor liquid separator that will remove any remaining vapor and recycle it back to the reactor through the compressor and there's also a purge stream here that will   Vent byproduct and an art chemical to prevent it from accumulating and then the liquid product will be pumped through a stripper where the remaining reactants are stripped off and the final purified product leaves here in the exit stream.   The first set of variables that are being monitored are the manipulated variables. These look like bow ties and the diagram.   Think they're actually meant to be valves and the manipulative variables, mostly control the flow rate through different streams of the process.   These variables can be set to specific values within limits and have some Gaussian noise and the manipulative variables can be sampled at any rate, we're using a default three minutes sampling in   Some examples of the manipulative variables are the flow rate of the reactants into the reactor   The flow rate of steam into the stripper.   And the flow of coolant into the reactor   The next set of variables are measurement variables. 
These are shown as circles in the diagram, and they're also sampled in three minute intervals; the difference is that the measurement variables can't be manipulated in the simulation.   Our quality variables will be percent composition of two liquid products; you can see the analyzer measuring the composition here.   These variables are collected with a considerable time delay, so we're looking at the product in this stream because these variables can be measured more readily than the product leaving in the exit stream. And we'll also be building a PLS model to monitor our quality variables by means of our process variables, which have substantially less delay and a faster sampling rate.   Okay, so that's the background on the data. In total there are 33 process variables and two quality variables.   The process of collecting the variables is simulated with a series of differential equations. So this is just a simulation, but you can see that a considerable amount of care went into modeling this as a real world process.   So here's an overview of the demo I'm about to show you. We will collect data on our process and then store these data in a database.   I wanted to have an example that was easy to share, so I'll be using a SQLite database, but this workflow is relevant to most types of databases.   Most databases support ODBC connections; once JMP connects to the database, it can periodically check for new observations and update the JMP table as they come in.   And then if we have a Model Driven Multivariate Control Chart report open with automatic recalc turned on, we have a mechanism for updating the control charts as new data come in.   And the whole process of adding data to a database will likely be going on on a separate computer from the computer doing the monitoring.   So I have two sessions of JMP open to emulate this; both sessions have their own journal, and the materials are provided on the Community.   The first session will add simulated data to the database, and it's called the streaming session; the next session will update reports as they come into the database, and I'm calling that the monitoring session.   One thing I really liked about the Downs and Vogel paper was that they didn't provide a single metric to evaluate the control of the process. I have a quote from the paper here: "We felt that the trade-offs among possible control strategies and techniques involved much more than a mathematical expression."   So here are some of the goals they listed in their paper which are relevant to our problem: maintain the process variables at desired values, minimize variability of the product quality during disturbances, and recover quickly and smoothly from disturbances.   So we will assess how well our process achieves these goals, using our monitoring methods.   Okay.   So to start off, I'm in the monitoring session journal, and I'll show you our first data set. The data table contains all the variables I introduced earlier: the first set are the measurement variables, the next set are the composition variables, and the last set are the manipulated variables.   And the first script attached here will fit a PLS model; it excludes the last hundred rows as a test set.   And just as a reminder, this model is predicting our two product composition variables as a function of our process variables, but the PLS model, or PLS itself, is not the focus of the talk. So I've already fit the model and output score columns here.
And if we look at the column properties, you can see that there's an MDMVCC historical statistics property that contains all the information on your model that you need to construct the multivariate control charts. One of the reasons why Model Driven Multivariate Control Chart was designed this way: imagine you're a statistician and you want to share your model with an engineer so they can construct control charts. All you need to do is provide the data table with these formula columns. You don't need to share all the gory details of how you fit your model.   So next I will use the score columns to create our control charts.   On the left, I have two control charts, T squared and SPE. There are 860 observations that were used to estimate the model, and these are labeled as historical, and then I have 100 observations that were held out as a test set.   And you can see in the limit summaries down here that I performed a Bonferroni correction for multiple testing based on the historical data. I did this up here in the red triangle menu; you can set the alpha level to anything you want, and I did this correction because the data are known to be at normal operating conditions. So we expect no observations to be out of control, and after this multiplicity adjustment, there are zero false alarms.   On the right are the contribution proportion heat maps. These indicate how much each variable contributes to the out-of-control signal; each observation is on the Y axis, and the contributions are expressed as a proportion.   And you can see in both of these plots that the contributions are spread pretty evenly across the variables.   And at the bottom, I have a score plot.   Right now we're just plotting the first score dimension versus the second score dimension, but you can look at any combination of the score dimensions using these drop down menus, or this arrow.   Okay, so now that we're pretty oriented to the report, I'm going to switch over to the streaming session, which will stream data into the database.   In order to do anything for this example, you'll need to have a SQLite ODBC driver installed. It's easy to do; you can just follow this link here.   And I don't have time to talk about this, but I created the SQLite database I'll be using in JMP. I have instructions on how to do this and how to connect JMP to the database on my Community web page.   This example might be helpful if you want to try this out on data of your own.   I've already created a connection to this database, and I've shared the database on the Community. So I'm going to take a peek at the data tables in Query Builder. I can do that with a table snapshot.   The first data set is the historical data; I've used this to construct a PLS model, and there are 960 observations that are in control.   The next data table is the monitoring data table; it just contains the historical data at first, but I'll gradually add new data to it, and this is what our multivariate control chart will be monitoring.   And then I've simulated the new data already and added it to this data table here; you can see it starts at timestamp 961, and there's another 960 observations, but I've introduced a fault at some time point.   And I wanted to have something easy to share, so I'm not going to run my simulation script and add to the database that way.   I'm just going to take observations from this new data table and move them over to the monitoring data table using some JSL with SQL statements.
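Editor's note: a stripped-down version of that JSL-plus-SQL approach might look like the sketch below. The DSN, table names, and bite size are assumptions for illustration, not the exact script from the talk (which is available on the Community page).

// Connect to the SQLite database through its ODBC data source (DSN name is hypothetical)
dbc = Create Database Connection( "DSN=TennesseeEastman;" );
biteSize = 20;
For( i = 0, i < 960, i += biteSize,
	// Move the next "bite" of simulated rows into the monitoring table
	sql = "INSERT INTO monitoring SELECT * FROM new_data WHERE rowid > " || Char( i ) ||
		" AND rowid <= " || Char( i + biteSize ) || ";";
	Execute SQL( dbc, sql );
	Wait( 1 );  // slow the stream so the chart updates are visible
);
Close Database Connection( dbc );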
And this is just a simple example emulating the process of new data coming into a database somehow; you might not actually do this with JMP, but this is an opportunity to show how you can do it with JSL.   Next, I'll show you the script we'll use to stream in the data.   This is a simple script, so I'm just going to walk you through it real quick.   The first set of commands will open the new data table from the SQLite database; it opens up in the background, so I'll have to deal with that window later. And then I'm going to take pieces from this new data table and move them to the monitoring data table. I'm calling the pieces bites, and the bite size is 20.   And then this will create a database connection, which will allow me to send the database SQL statements. And then this last bit of code will iteratively construct SQL statements that insert new data into the monitoring data table. So I'm going to initialize that,   okay, and show you the first iteration of this loop.   So this is just a simple INSERT INTO statement that inserts the first 20 observations.   I'll comment that out so it runs faster. And there's a wait statement down here; this will just slow down the stream   so that we have enough time to see the progression of the data in the control charts. If I didn't have this, the streaming example would just be over too quickly.   Okay, so I'm going to   switch back to the monitoring session and show you some scripts that will update the report.   I'll move this over to the right so you can see the report and the scripts at the same time.   So,   this read-from-monitoring-data script is a simple script that checks the database every 0.2 seconds and adds new data to the JMP table. And since the report has automatic recalc turned on,   the report will update whenever new data are added. And I should add that, realistically, you probably wouldn't use a script that just iterates like this; you'd probably use Task Scheduler on Windows or Automator on Macs to better schedule the runs.   And then the next script here   will push the report to JMP Public whenever the report is updated.   I was really excited that this is possible in JMP.   It enables any computer with a web browser to view updates to the control chart. You can even view the report on your smartphone. So this makes it easy to share results across organizations. You can also use JMP Live if you want the reports to be on a restricted server.   And then this script will recreate the historical data in the data table in case you want to run the example multiple times.   Okay, so let's run the streaming script   and look at how the report updates.   You can see the data is in control at first, but then a fault is introduced; there's a large out-of-control signal, but there's a plant-wide control system that's been implemented in the simulation, which brings the system to a new equilibrium.   I'll give this a second to finish.   And now that I've updated the control chart, I'm going to push the results to JMP Public.   On my JMP Public page I have, at first, the control chart with the data in control at the beginning.   And this should be updated with the addition of the new data.   So if we zoom in on when the process first went out of control,   it looks like that was sample 1125. I'm going to color that   and label it   so that it shows up in other plots, and then   in the SPE plot it looks like this observation is still in control.   And which chart will catch faults earlier depends on your model.
And how many factors, you've chosen   We can also zoom in on   That time point in the contribution plot. And you can see when the process. First goes out of control. There's a large number of variables that are contributing to the out of control signal. But then when the system reaches a new equilibrium, only two variables have large contributions.   So I'm going to remove these heat maps so that I'm more room in the diagnostic section.   And to make everything pretty pretty large so that the text would show up on your screen.   If I hover over the first point that's out of control. You can get a peek at the top 10 contributing variables.   This is great for quickly identifying what variables are contributing the most to the out of control signal. I can also click on that plot and appended to the diagnostic section and   You can see that there's a large number of variables that are contributing to the out of control signal.   zoom in here a little bit.   So if one of the bars is red. This means that variable is out of control.   In a universal control chart. And you can see this by hovering over the bars.   I'm gonna pan, a couple of those   And these graph, let's our IR charts for the individual variables with three sigma control limits.   You'd see for the stripper pressure variable. The observation is out of control in the university control chart, but the variables eventually brought back under control by our control system. And that's true for   Most of the   Large contributing variables and also show you one of the variables where observation is in control.   So once the control system responds many variables are brought back under control and the process reaches   A new equilibrium   But there's obviously a shift in the process. So to identify the variables that are contributing to the shift. And one thing you can look at is a main contribution.   Plot   If I sort this and look at   The variables that are most contributing. It looks like just two variables have large contributions and both of these are measuring the flow rate of react in a in a stream one which is coming into the reactor   And these are measuring essentially the same thing except one is a measurement variable and one's a manipulated variable. And you can see   In the university control chart that there's a large step change in the flow rate.   This one as well. And this is the step change that I programmed in the simulation. So these contributions allow us to quickly identify the root cause.   So I'm going to present a few other alternate methods to identify the same cause of the shift. And the reason is that in real data.   Process shifts are often more subtle and some of the tools may be more useful and identifying them than others and will consistently arrive at the same conclusion with these alternate methods. So it'll show some of the ways that these methods are connected   Down here, I have a score plant which can provide supplementary information about shifts in the t squared plant.   It's more limited in its ability to capture high dimensional shifts, because only two dimensions of the model are visualized at a time, however, we can provide a more intuitive visualization of the process as it visuals visualizes it in a low dimensional representation   And in fact, one of the main reasons why multivariate control charts are split into t squared and SPE in the first place is that it provides enough dimensionality reduction to easily visualize the process and the scatter plot.   
So we want to identify the variables that are   Causing the shift. So I'm going to, I'm going to color the points before and after the shift.   So that they show up in the score plot.   Typically, when we look through all combinations of the six factors, but that's a lot of score plots to look through   So something that's very handy is the ability to cycle through all combinations quickly with this arrow down here and we can look through the factor combinations and find one where there's large separation.   And if we wanted to identify where the shift first occurred in the score plots, you can connect the dots and see that the shift occurred around 1125 again.   Another useful tool. If you want to identify   Score dimensions, where an observation shows the largest separation from the historical data and you don't want to look through all the score plots is the normalized score plot. So I'm going to select a point after the shift and look at the normalized score plot.   I'm actually going to choose another one.   Okay. Jeremy Ash Because I want to look at dimensions, five, and six. So the   These plots show the magnitude of the score and each dimension normalized, so that the dimensions are on the same scale. And since the mean of the historical data is is that zero for each score to mention the dimensions with the largest magnitude will show the largest separation.   Between the selected point and the historical data. So it looks like here, the dimensions, five and six show the greatest separation and   I'm going to move to those   So there's large separation here between our   Shifted data and the historical data and square plot visualization is can also be more interpreted well because you can use the variable loadings to assign meaning to the factors.   And   Here I have   We have too many variables to see all the labels for them.   Loading vectors, but you can hover over and see them. And you can see, if I look in the direction of the shift that the two variables that were the cause show up there as well.   We can also explore differences between sub groups in the process with the group comparisons to do that I'll select all the points before the shift in call that the reference group and everything after in call that the group I'm comparing to the reference   These   And this contribution plot will will give me the variables that are contributing the most to the difference between these two groups. And you can see that this also identifies the variables that caused the shift.   The group comparisons tool is particularly useful when there's multiple shifts in a score plot are when you can see more than two distinct subgroups in your data.   In our case, as, as we're comparing a group in our current data to the historical data. We could also just select the data after the shift and look at a main contribution score plot.   And this will give us   The average contributions of each variable to the scores in the orange group. And since large scores indicate large difference from the historical data. These contribution plots can also identify the cause.   These are using the same formula is the contribution formula for t squared. But now we're just using the, the two factors from the score plot.   Okay, I'm gonna find my PowerPoint again.   So real quick, I'm going to summarize the key features of the model driven multi variant control chart that were shown in the demo.   The platform is capable of performing both online fault detection and offline fault diagnosis. 
There are many methods provided in the platform for drilling down to the root cause of the faults.   I'm showing you here some plots from the popular book Fault Detection and Diagnosis in Industrial Systems; throughout the book, the authors   demonstrate how one needs to use multivariate and univariate control charts side by side to get a sense of what's going on in the process.   And one particularly useful feature in Model Driven Multivariate Control Chart is how interactive and user friendly it is to switch between these types of charts.   So that's my talk. Here's my email if you have any further questions, and thanks to everyone who tuned in to watch this.
John Cromer, Sr. Research Statistician Developer, JMP   While the value of a good visualization in summarizing research results is difficult to overstate, selection of the right medium for sharing with colleagues, industry peers and the greater community is equally important. In this presentation, we will walk through the spectrum of formats used for disseminating data, results and visualizations, and discuss the benefits and limitations of each. A brief overview of JMP Live features sets the stage for an exciting array of potential applications. We will demonstrate how to publish JMP graphics to JMP Live using the rich interactive interface and scripting methods, providing examples and guidance for choosing the best approach. The presentation culminates with a showcase of a custom JMP Live publishing interface for JMP Clinical results, including the considerations made in designing the dialog, the mechanics of the publishing framework, the structure of JMP Live reports and their relationship to the JMP Clinical client reports and a discussion of potential consumption patterns for published reviews.     Auto-generated transcript...   Speaker Transcript John Cromer Hello everyone, Today I'd like to talk about two powerful products that extend JMP in exciting ways. One of them, JMP Clinical, offers rich visualization, analytical and data management capabilities for ensuring clinical trial safety and efficacy. The other, JMP Live, extends these visualizations to a secure and convenient platform that allows for a wider group of users to interact with them from a web browser. As data analysis and visualization becomes increasingly collaborative, it is important that both creating and sharing is easy. By the end of this talk, you'll see just how easy it is. First, I'd like to introduce the term collaborative visualization. Isenberg, et al., defines it as the shared use of computer supported interactive visual representations of data on more than one person with a common goal of contribution to join information processing activities. As I'll later demonstrate, this definition captures the essence of what JMP, JMP Clinical and JMP Live can provide. When thinking about the various situations in which collaborative visualization occurs, it is useful to consult the Space Time Matrix. In the upper left of this matrix, we have the traditional model of classroom learning and office meetings, with all participants at the same place at the same time. Next in the upper right, we have participants at different places interacting with the visualization at the same time. In the lower left, we have participants interacting at different times at the same location, such as in the case of shift workers. And finally, in the lower right, we have flexibility in both space and time with participants potentially located anywhere around the globe and interacting with the visualization at any time of day. So JMP Live can facilitate this scenario. A second way to slice through the modes of collaborative visualization is by thinking about the necessary level of engagement for participants. When simply browsing a few high-level graphs or tables, sometimes simple viewing can be sufficient. But with more complex graphics and for those in which the data connections have been preserved between the graphs and underlying data tables, users can greatly benefit by also having the ability to interact with and explore the data. 
This may include choosing a different column of interest, selecting different levels in a data filter and exposing detailed data point hover text. Finally, authors who create visualizations often have a need to share them with others and by necessity will also have the ability to view, interact with and explore the data. and JMP and JMP Clinical for authors who require all abilities. A third way to think about formats and solutions is by the interactivity spectrum. Static reports, such as PDFs, are perhaps the simplest and most portable, but generally, the least interactive Interactive HTML, also known as HTML5, offers responsive graphics and hover text. JMP Live is built on an HTML5 foundation, but also offer server-side computations for regenerating the analysis. While the features of JMP Live will continue to grow over time, JMP offers even more interactivity. And finally, There are industry-specific solutions such as JMP Clinical which are built on a front framework of JMP and SAS that offer all of JMP's interactivity, but with some additional specialization. So when we lay these out on the interactivity spectrum, we can see that JMP Live fills the sweet spot of being portable enough for those with only a web browser to access, while offering many of the prime interactive features that JMP provides So the product that I'll use to demonstrate creating a visualization is JMP Clinical. JMP Clinical, as I mentioned before, offers a way to conveniently assess clinical trial safety and efficacy. With several role-based workflows for medical monitors, writers, clinical operations and data managers, and three review templates, predefined or custom workflows can be conveniently reused on multiple studies, producing results that allow for easy exploration of trends and outliers. Several formats are available for sharing these results, from static reports and in-product review viewer and new to JMP Clinical ??? and JMP Live reports. The product I'll use to demonstrate interacting with on a shared platform is JMP Live. JMP Live allows users with only a web browser to securely and conveniently interact with the visualizations, and they could specify access restrictions for who can view both the graphics and the underlying data tables with the ability to publish a local data filter and column switcher. The view can be refreshed in just a matter of seconds. Users can additionally organize their web reports through titles, descriptions and thumbnails and leave comments that facilitate discussion between all interested parties. So explore the data on your desktop with JMP or JMP Clinical, published a JMP Live with just a few quick steps, share the results with colleagues across your organization, and enrich the shared experience through communication and automation. So now I would like to demonstrate how to publish a simple graphic from JMP to JMP Live. I'm going to open the demographics data set from the sample study Nicardipine, which is included with JMP Clinical. I can do this either through the file open menu where I can navigate to my data set dt= open then the path to my data table. So I'm going to click run scripts to open that data table. Okay. So now I'd like to create a simple visualization. I'm going to, let's say, I'd like to create a simple box plot. Or click graph, Graph Builder. And here I have a dialogue from moving variables into roles. I'm going to move the study site identifier into the X role. Age into Y. And click box plot. And click Done. 
So here's one quick and easy way to create a visualization in JMP. Alternatively, I can do the same thing with the script. And so this block of code I have here, this encapsulates a data filter and a Graph Builder box plot into a data filter context box. So I'm going to run this block of code. And here you see, I have some filters and a box plot. Now, notice how interactive this filter is and the corresponding graph. I can select a different lower bound for age; I can type in a precise value, let's say, I'd like to exclude those under 30 and suppose I am interested in only the first 10 study side identifiers. OK. So now I'd like to share this visualization with some of my colleagues who don't have JMP but they have JMP Live. So one way to publish this to JMP Live is interactively through the file published menu. And here I have options for for my web report. Can see I have options for specifying a title, description. I can add images. I can choose who to share this report with. So at this point, I could publish this, but I'd like to show you how to do so using the script. So I have this chunk of code where I create a new web report object. I add my JMP report to the web report object. I issue the public message to the web report, and then I automatically open the URL. So let me go ahead and run that. You can see that I'm automatically taken to JMP Live with a very similar structure as my client report. My filter selections have been preserved. I can make filter selection changes. For example, I can move the lower bound for age down and notice also I have detailed data point hover text. I have filter-specific options. And I also have platform-specific options. So any time you see these menus. You can further explore those to see what options are available. Alright, so now that you've seen how to publish a simple graphic from JMP to JMP Live. How about a complex one, as in the case of a JMP Clinical report. So what I'm going to do is open a new review. I will add the adverse events distribution report to this review. I will run it with all default settings. And now I have my adverse events distribution report, which consists of column switchers for demographic grouping and stalking, report filters, an adverse events counts graph, tabulate object for counts and some distributions. Suppose I'm interested in stacking my adverse events by severity. I've selected that and now I have my stoplight colors that I've set for my adverse events for mild, moderate and severe. At this point I'm...I'd like to share these results with a colleague who maybe in this case has JMP, but there are certain times where they prefer to work through a web browser to to inspect and take a look at the visualizations. So this point, I will click this report level create live report button. I will... ...and that...and now I have my dialogue, I can choose to publish to either file or JMP Live. I can choose whether to publish the data tables or not, but I would always recommend to publish them for maximum interactivity. I can also specify whether to allow my colleagues to download the data tables from JMP Live. In addition to the URL, you can specify whether to share the results only with yourself, everyone at your organization or with specific groups. So for demonstration purposes, I will only publish for myself. I'll click OK. Got a notification to say that my web report has been published. Over on JMP Live, I have a very similar structure. 
At my report filters, my column switchers with my column, a column of interest preserved. You can see my axes and legends and colors have also carried over. Within this web report, I can easily collapse or expand particular report sections, and many of the sections off also offer detailed data point hover text and responsive updates for data filter changes. Another thing I'd like to point out is this Details button in the upper right of the live report, where I can get detailed creation information, a list of the data tables that republished, as well as the script. And because I've given users the ability to download these tables and scripts, these are download buttons for those for that purpose. I can also leave comments from my colleagues that they can then read and take further action on, for example, to follow up on an analysis. All right, so from my final demo, I would simply like to extend my single clinical report to a review consisting of two other reports enrollment patterns, and findings bubble plot. So I'm going to run these reports. Enrollment patterns plots patient enrollment over the course of a study by things like start date of disposition event, study day and study site identifier. Findings bubble plot, I will run on the laboratory test results domain. And this report features a prominent animated bubble plot, in which you can launch this animation. You can see how specific test results change over the course of a study. You can pause the animation. You can scroll to specific, precise values for study day and you can also hover over data points to reveal the detailed information for each of those points. create live report for review. I have a...have the same dialogue that you've seen earlier, same options, and I'm just going to go ahead and publish this now so you can see what it looks like when I have three clinical reports bundled together and in one publication. So when this operation completes, you will see that will be taken to an index page corresponding to report sections. And each thumbnail on this page corresponds to report section in which we have our binoculars icon on the lower left, that indicates how many views each page had. I have a three dot menu, where you can get back to that details view. If you click Edit, from here you can also see creation information and a list of data tables and scripts. And by clicking any of these thumbnails, I can get down to the report, the specific web report of interest. So just because this is one of my favorite interactive features, I've chosen to show you the findings bubble plot on JMP Live. Notice that it has carried over our study day, where we left off on the client, on study day 7. I can continue this animation. You can see study day counting up and you can see how our test results change over time. I can pause this again. I can get to a specific study day. I can do things like change bubble size to suit your preference. Again, I have data point hover text, I can select multiple data points and I have numerous platform specific options that will vary, but I encourage you to take a look at these anytime you see this three dot menu. So to wrap up, let me just jump to my second-last slide. So how was all this possible? Well, behind the scenes, the code to publish a complex clinical report is simply a JSL script that systematically analyzes a list of graphical report object references and pairs them with the appropriate data filters, column switchers, and report sections into a web report object. 
The JSL publish command takes care of a lot of the work for you, for bundling the appropriate data tables into the web report and ensuring that the desired visibility is met. Power users who have both products can use the download features on JMP Live to conveniently share to conveniently adjust the changes ...to to... make changes on their clients and to update their... the report that was initially published, even if they were not the original authors. And then the cycle can continue, of collaboration between those on the client and those on JMP Live. So, as you can see, both creating and sharing is easy. With JMP and JMP Clinical, collaborative visualization is truly possible. I hope you've enjoyed this presentation, and I look forward to any questions that you may have.  
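Editor's note: the scripted-publishing pattern described in this talk, build the graph, wrap it in a web report object, publish, then open the URL, looks roughly like the sketch below. This is a minimal, hedged example: the sample table and Graph Builder settings are placeholders, and the exact Publish arguments (connection URL, credentials, sharing options) depend on your JMP and JMP Live versions, so check the Scripting Index for the form your release supports.

dt = Open( "$SAMPLE_DATA/Big Class.jmp" );  // any table with a graph worth sharing
gb = dt << Graph Builder(
	Variables( X( :age ), Y( :height ) ),
	Elements( Box Plot( X, Y ) )
);
// Bundle the report and push it to JMP Live
wr = New Web Report();
wr << Add Report( gb );
url = wr << Publish();              // uses or prompts for your configured JMP Live connection
If( !Is Missing( url ), Web( url ) );  // open the published report in a browser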
Lucas Beverlin, Statistician, Intel, Corp.   The Model Comparison platform is an excellent tool for comparing various models fit within JMP Pro. However, it also has the ability to compare models fit from other software as well. In this presentation, we will use the Model Comparison platform to compare various models fit to the well-known Boston housing data set in JMP Pro15, Python, MATLAB, and R. Although JMP can interact with those environments, the Model Comparison platform can be used to analyze models fit from any software that can output its predictions.     Auto-generated transcript...   Speaker Transcript Lucas Okay, thanks everyone for coming and listen to my talk. My name is Lucas Beverlin. I'm a statistician at Intel. And today I'm going to talk about using JMP to compare models from various environments. Okay so currently JMP 15 Pro is the latest and greatest that JMP has out and if you want to fit the model and they've got several different tools to do that. There's the fit model platform. There's the neural platform on neural network partition platform. If you want classification and regression trees. The nonlinear platform for non linear modeling and there's several more. And so within JMP 15 I think it came out in 12 or 13 but this model comparison platform is a very nifty tool that you can use to compare model fits from various platforms within JMP. So if you have a tree and a neural network and you're not really sure which one's better. Okay, you could flip back and forth between the two. But now with this, you have everything on the same screen. It's very quick and easy to tell, is this better that that, so on so forth. So that being said, JMP can fit a lot of things, but, alas, it can't fit everything. So just give a few ideas of some things that can't fit. So, for example, those that do a lot of machine learning and AI might fit something like an auto encoder or convolutional neural network that generally requires lots of activation functions or yes, lots of hidden layers nodes, other activation functions than what's offered by JMP so JMP's not going to be able to do a whole lot of that within JMP. Another one is something called projection pursuit regression. Another one is called multivariate adaptive regression splines. So there are a few things unfortunately JMP can't do. R, Python, and MATLAB. There's several more out there, but I'm going to focus on those three. Now that being said, the ideas I'm going to discuss here, you want to go fit them in C++ or Java or Rust or whatever other language comes to mind, you should be able to use a lot of those. So we can use the model comparison platform, as I've said, to compare from other software as well. So what will you need? So the two big things you're going to need are the model predictions from whatever software you use to fit the model. And generally, when we do model fitting, particularly with larger models, you may split the data into training validation and/or test sets. You're going to need something that tells all the software which is training, which is validation, which is test, because you're going to want those to be consistent when you're comparing the fits. OK, so the biggest reason I chose R, Python and MATLAB to focus on for this talk is that turns out JMP and scripting language can actually create their own sessions of R and run code from it. So this picture here just shows very quickly if I wanted to fit a linear regression model to some output To to the Boston housing data set. 
I'll work a lot more with that data set later. But if you wanted to just very quickly fit a linear regression model in R and spit out the predictive values, you can do that. Of course you can do that JMP. But just to give a very simple idea. So, one thing to note so I'm using R 3.6.3 but JMP can handle anything as long as it's greater than 2.9. And then similarly, Python, you can call your own Python session. So here the picture shows I fit the linear regression with Python. I'm not going to step through all the lines of code here but you get the basic idea. Now, of course, with Python be a little bit careful in that the newest version of Python 3.8.5. But if you use Anaconda to install things, JMP has problems talking to it when it's greater than 3.6 so since I'm using 3.6.5 for this demonstration. And then lastly, we can create our own MATLAB session as well. So here I'm using MATLAB 2019b. But basically, as long as your MATLAB version has come out in the last seven or eight years, it should work just fine. Okay, so how do we tie everything together? So really, there's kind of a four-step process we're going to look at here. So first off, we want to fit each model. So we'll send each software the data and which set each observation resides. Once we have the models fit, we want to output those fits and their predictions and add them to a data table that JMP can look at. So of course my warning is, be sure you name things that you can tell where did you fit the model or how did you fit the model. I've examples of both coming up. So step three, depending on the model and you may want to look at some model diagnostics. Just because a model fits...appears to fit well based on the numbers, one look at your residual plot, for example, and you may find out real quickly the area of biggest interest is not fit very well. Or there's something wrong with residuals so on so forth. So we'll show how to output that sort of stuff as well. And then lastly we'll use the model comparison platform, really, to bring everything into one big table to compare numbers, much more easily as opposed to flipping back and forth and forth and back. Okay, so we'll break down the steps into a little more detail now. So for the first step where we do model fitting, we essentially have two options. So first off, we can tell JMP via JSL to call your software of choice. Send it the data and the code to fit it. And so, in fact, I'm gonna jump out of this for a moment and do exactly that. So you see here, I have some code for actually calling R. And then once it's done, I'll call Python and once it's done, I'll call MATLAB and then I'll tie everything together. Now I'll say more about the code here in a little bit, but it will take probably three or four minutes to run. So I'm going to do that now. And we'll come back to him when we're ready. So our other option is we create a data set with the validation. Well, we create a data set with the validation column and and/or a test column, depending on how many sets were splitting our data into. We're scheduled to run on whatever software, we need to run on, of course output from that whatever it is we need. So of course a few warnings. Make sure you're...whatever software you're using actually has what you need to fit the model. Make sure the model is finished fitting before you try to compare it to things. Make sure the output format is something JMP can actually read. Thankfully JMP can read quite a few things, so that's not the biggest of the four warnings. 
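Editor's note: as a concrete illustration of the first option (JMP driving R through JSL), a minimal sketch along the lines of the screenshot described here might look like the following. The new-column name and the "Predicting" property arguments are assumptions for illustration; the response column in JMP's sample Boston Housing table is mvalue.

dt = Open( "$SAMPLE_DATA/Boston Housing.jmp" );
R Init();        // start an R session from JSL
R Send( dt );    // the table arrives in R as a data frame named dt
R Submit(
	"fit <- lm( mvalue ~ ., data = dt )
	pred <- predict( fit )"
);
pred = R Get( pred );  // pull the predictions back into JMP
R Term();
dt << New Column( "Pred mvalue (R lm)", Numeric, "Continuous", Set Values( pred ) );
// Tag the column so Model Comparison knows what it predicts and which software created it
// (property form as described in the talk; exact arguments may vary by version)
Column( dt, "Pred mvalue (R lm)" ) << Set Property( "Predicting", {:mvalue, Creator( "R" )} );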
But as I've warned you earlier, make sure the predictions from each model correspond to the correct observations from the original data set. And so that comes back to the if it's training, if it's a training observation, when you fit it in JMP, it better be a training observation when you fit it in whatever software using. If it's validation in JMP, it is the validation elsewhere. It's test in JMP, it's going to be test elsewhere. So make sure things correspond correctly because the last thing we want to find out is to look at test sets and say, "Oh, this one fit way better." Well, it's because the observations fit in it were weighted different and didn't have any real outliers. So that ends up skewing your thinking. So a word of caution, excuse me, a word of caution there. Okay. So as I've alluded to, I have an example I'm currently running in the background. And so I want to give a little bit of detail as far as what I'm doing. So it turns out I'm going to fit neural networks in R and Python and MATLAB. So if I want to go about doing that, within R, two packages I need to install in R on top of whatever base installing have and that's the Keras package and the Tensorflow package. numpy, pandas and matplotlib. So numpy to do some calculations pretty easily; pandas, pandas to do data...some data manipulation; and matplotlib should be pretty straightforward to create some plots. And then in MATLAB I use the deep learning tool box, whether you have access to that are not. Okay. So step two, we want to add predictions to the JMP data table. So if you use JMP to call the software, you can use JSL code to retrieve those predictions and add them into a data table so then you can compare them later on. So then the other way you can go about doing is that the software ran on its own and save the output, you can quickly tell JMP, hey go pull that output file and then do some manipulation to bring the predictions into whatever data table you have storing your results. So now that we can also read the diagnostic plots. In this case what I generally am going to do is, I'm going to save those diagnostic plots as graphics files. So for me, it's going to be PNG files. But of course, whichever graphics you can use. Now JMP can't hit every single one under the sun, but I believe PNG that maps jpgs and some and they...they have your usual ones covered. So the second note I use this for the model comparison platform, but to help identify what you...what model you fit and where you fit it, I generally recommend adding the following property for each prediction column that you add. And so we see here, we're sending the data table of interest, this property called predicting. And so here we have the whatever it is you're using to predict things (now here in value probably isn't the best choice here) but but with this creator, this tells me, hey, what software did I use to actually create this model. And so here I used R. It shows R so this would actually fit on the screen. Python and MATLAB were a little too long, but we can put whatever string we want here. You'll see those when I go through the code in a little more detail here shortly. So, and this comes in handy because I'm going to fit multiple models within R later as well. So if I choose the column names properly and I have multiple ones where R created it, I still know what model I'm actually looking at. Okay, so this is what the typical model comparison dialog box looks like. 
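Editor's note: For the neural network piece, a minimal Keras sketch along the lines described (standardize the inputs, two ReLU hidden layers, predictions written out for JMP) might look like the following. The layer sizes, epochs, and file names are illustrative assumptions, not the presenter's exact configuration.

```python
import pandas as pd
from tensorflow import keras

df = pd.read_csv("boston_with_validation.csv")      # same hypothetical export as above
X = df.drop(columns=["medv", "Validation"]).to_numpy(dtype="float32")
y = df["medv"].to_numpy(dtype="float32")
train = (df["Validation"] == 0).to_numpy()
val = (df["Validation"] == 1).to_numpy()

# Standardize the inputs using training-set statistics only.
mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
Xs = (X - mu) / sd

# A small feed-forward network with two ReLU hidden layers.
model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(Xs.shape[1],)),
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(Xs[train], y[train],
          validation_data=(Xs[val], y[val]),
          epochs=200, batch_size=32, verbose=0)

# Predict every row and save the column for the Model Comparison step in JMP.
df["Pred_NN_Python"] = model.predict(Xs).ravel()
df[["Pred_NN_Python"]].to_csv("python_nn_predictions.csv", index=False)
```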
So one thing I'm going to note is that this is roughly what it would look like if I did a point and click at the end of all the model fitting. So you can see I have several predictors. So I've got neural nets for MATLAB, Python and R. Various prediction formulas; I used JMP to fit a few things. Now, oftentimes what folks will do is, they'll put this validation column as a group, so that it'll group the training, validation and test. I actually like the output a bit better when I stick it in the By statement here. So I'll show that here a little later. But you can put it either way; I like the output this way better is the long and short of it. So this is the biggest reason why: now I can clearly see, these are all the training, these are all the validation (as you can see by the headers) and these are all the test. If you use validation as a group variable, you're going to get one big table with 21 entries in it. Now, there'll be an extra column. It says training, validation, test or in my case, it will be 0, 1, 2, but this way with the words, I don't have to think as hard. I don't have to explain to anyone what 0, 1, 2 means and so on and so forth. So that was why I made the choice that I did. Okay, so in the example I'm gonna break down here, I'm going to use the classic Boston housing data set. Now this is included within JMP. So that's why I didn't include it as a file in my presentation, because if you have JMP, you've already got it. So Harrison and Rubinfeld had several predictors of the median value of the house, such things as per capita crime rate, the proportion of non-retail business acres per town, average number of rooms within whatever it is you're trying to buy, pupil-to-teacher ratio by town (so if there's a lot of teachers and not quite as many students, that generally means better education is what a lot of people found) and several others. I'm not gonna really try to go through all 13 of them here. Okay, so let me give a little bit of background as far as what models I looked at here. And then I'm going to delve into the JSL code and how I fit everything. So some of the models I've looked at. So first off, the quintessential linear regression model. So here you just see a simple linear regression. I just fit the median value by, looks like, tax rate. But of course I'll use a multiple linear regression and use all of them. But with 13 different predictors, and knowing some of them might be correlated to one another, I decided that maybe a few other types of regression would be worth looking at. One of them is something called ridge regression. So really all it is, is it's linear regression with essentially an added constraint that the squared values of my parameters can't be larger than some constant. It turns out I can actually rewrite this as an optimization problem where some value of lambda corresponds to some value of C. And so then I'm just trying to minimize this with this extra penalty term, as opposed to the typical least squares that you're used to seeing. Now, this is called a shrinkage method because of course as I make this smaller and smaller, it's going to push all these closer and closer to zero. So, of course, some thought needs to be put into how restrictive I want it to be. Now with shrinkage, it's going to push everybody slowly towards zero. But with another type of penalty term, I can actually eliminate some terms altogether. And I can use something called the lasso.
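Editor's note: In symbols, the ridge regression setup described above can be written as follows (a reconstruction from the description; C and lambda are the constants the presenter mentions).

```latex
% Ridge regression, constrained form (C is the constant mentioned above):
\hat{\beta}^{\text{ridge}} \;=\; \arg\min_{\beta}\;
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^{2} \le C .

% Equivalent penalized form, where some \lambda \ge 0 corresponds to that C:
\hat{\beta}^{\text{ridge}} \;=\; \arg\min_{\beta}\;\Biggl\{
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  \;+\; \lambda\sum_{j=1}^{p}\beta_j^{2} \Biggr\}.

% The lasso, discussed next, replaces the penalty \sum_j \beta_j^2 with \sum_j |\beta_j|.
```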
And so the contraint here is, okay, instead of the squared parameter estimates, I'm just going to take the sum of the absolute value of those parameter estimates. And so it turns out from that, what'll actually happen is those that are very weak actually get their parameter estimates set to zero itself, which kind of serves as a elimination, if you will, of unimportant terms. So to give a little bit of a visual as to what lasso and ridge regression are doing. So for ridge regression, the circle here represents the penalty term. And here we're looking at the parameter space. And so the true least squares estimates would be here. So we're not quite getting there, because we have this additional constraint. So in the end, we find where does...where do we get the minimum value that touches the circle, basically. And so this is, this would be our ridge regression parameter estimates. For lasso, similar drawing, but you can see now with the absolute value, this is more of a diamond as opposed to a circle. Now note, this is two dimensions, of course, we're going to get into hyper spheres and all those shapes. But you can see here, notice it touches right at the tip of the diamond. And so in this case beta one is actually zero. So that's how it eliminates terms. Okay, so another thing we're going to look at is what's called a regression tree. Now JMP uses the partition platforms to do these. So just to give a very quick demo of what this shows, in that, ok so I have all of my data. And so my first question I'll ask myself is, how many rooms are in the dwelling, and I know I can't have .943 of a room, so basically, I have six rooms or less. So come down this part of the tree, let's not come down this part of the tree. Now if I have seven rooms or more, this tells me immediately I'm going to predict my median value to be 37. Remember it's in tens of thousands of dollars, so make that $370,000. If it's less than seven, then the next question I asked is, well, how big is lstat? So if it's bigger than or equal to 14.43, I'll look at this node and suddenly my median housing estimates about 150 grand and if I come over here, it's gonna be about 233 grand. So what regression trees really do is they're partitioning your input space into different areas. And we'er giving the same prediction to every value within that area. So you can see here I've partitioned...now on this case, I'm taking a two dimensional one because it's easier to draw... and so you can see this tree here, where I first look at x1. Now look at x2 here and ask another question about x1 and ask another question about x2, and this is how I end up partitioning the input space. Now each of these five is going to have a prediction value. And that's essentially what this looks like. I look at this from up top. I'm going to get exactly this. But you can see here that the prediction is a little bit different depending upon which of the five areas right. Now, I'm not going to get into too much of the details on how exactly to fit one of these, but James, Witten, Tibshirani and Friedman give a little bit; Leo ??? wrote the seminal book on it so you can take a look there. So next off, I'll come to neural networks, which are being used a lot in machine learning and whatnot these days. And so this kind of gives a visual of what a neural network looks like. So here, this visual just uses five and the 13 inputs when passing them to these...this hidden layer. And each of these is transformed via an activation function. 
And for each of these activation functions, you get an output, and oftentimes we'll just use a linear regression of these outputs to predict the median value. Okay, so really, neural networks are nonlinear models at the end of the day, and really, they're called neural networks because the representation generally is how we viewed neurons as working within the human brain. So each input can be passed to nodes in a hidden layer. At the hidden layer your inputs are pushed through an activation function and some output is calculated, and each output can be passed to a node in another hidden layer or be an output of the network. Now within JMP, you're only allowed two hidden layers. Truth be told, as far as creating a neural network, there's nothing that says you can't have 20, for all that we're concerned about now. Truth be told, there's some statistical theory that suggests that hey, we can approximate any continuous function, given a few boundary conditions, with two hidden layers. So that's likely why JMP made the decision that they did. JMP offers three activation functions: linear, hyperbolic tangent and Gaussian radial basis. So in fact, on these nodes here, notice the little curve here. I believe that is for the hyperbolic tangent function; linear will be a straight line going up; and Gaussian radial basis, I believe, will look more like a normal curve. That's the neural network platform. So the last one we'll look at is something called projection pursuit regression. I wanted to pull something that JMP simply can't do, just to kind of give an example here. Um, so projection pursuit regression was a model originally proposed by Jerome Friedman and Stuetzle over at Stanford. Their model makes predictions of the form y equals the sum of beta sub i times f sub i of a linear transformation of your inputs. So really this is somewhat analogous to a neural network. You have one hidden layer here with k nodes, each with activation function f sub i. Turns out with projection pursuit regression, we're actually going to estimate these f sub i as well. Generally they're going to be some sort of smoother or a spline fit. Typically the f sub i are called ridge functions. Now we have alphas, we have betas and we have fs we need to optimize over. So generally a stagewise fitting is done. I'm not going to get too deep in the details at this point. Okay, so I've kind of gone through all my models. So now I'm going to show some output and hopefully things look good. So one thing I'm going to note before I get into JMP is that it's really hard to set seeds for the neural networks in R and Python for Keras. So do note that if you take my code and run it, you're probably not going to get exactly what I got, but it should be pretty close. So with that said, let's see what we got here. So this was the output that I got. Now, unfortunately, things do not appear to have run perfectly. So what do I have here? So I have my training, my validation, and my test. And so we see very quickly that one of these models didn't fit very well. The neural net within R, unfortunately, something horrible happened. It must have caught a bad spot in the input space to start from and whatnot. And so it just didn't fit a very good model. So unfortunately, starting parameters with nonlinear models matter; in some cases, we get bit by them. But if we take a look at everything else, everything else seems to fit decently well. Now what is decently well, we can argue over that, but I'm seeing R squares above .5.
I'm seeing root average squared errors here around five or so, and even our average absolute errors are in the three range. Now for training, it looks like projection pursuit regressions did best. If I come down to the validation data set, it still looks like R projection pursuit did best. But if we look at the test data set, all of a sudden, no, projection pursuit regression was second, assuming we're gonna ignore the neural net from R, second worst. Oftentimes in a framework like this, we're going to look at the test data set the closest because it wasn't used in any way, shape, or form to determine the model fit. And we see based on that, It looks like the ridge regression from JMP fit best. We can see up here, it's R squared was .71 here before was about .73, and about .73 here, so we can see it's consistently fitting the same thing through all three data sets. So if I were forced to make a decision, just based on what I see at the moment, I would probably go with the ridge regression. So that being said, we have a whole bunch of diagnostics and whatnot down here. So if I want to look at what happened with that neural network from R, I can see very quickly, something happened just a few steps into there. As you can see, it's doing a very lousy job of fitting because pretty much everything is predicted to be 220 some thousand. So we know something went wrong during the fitting of this. So we saw the ridge regression looked like the best one. So let's take a look at what it spits out. So I'll show in a moment my JSL code real quick that shows how I did all this but, um, we can see here's the parameter estimates from the ridge regression. We can see the ridge diagnostic plots, so things hadn't really shrunk too much from the original estimates. You can see from validation testing with log like didn't whatnot. And over here on the right, we have our essentially residual plots. These are actual by predicted. So you can see from the training, looks like there was a few that were rather expensive that didn't get predicted very well. We see fewer here than in the test set, it doesn't really look like we had too much trouble. We have a couple of points here a little odd, but we can see for generally when we're in lower priced houses, it fits all three data sets fairly well. Again, we may want to ask ourselves what happened on these but, at the moment, this appears to be the best of the bunch. So we can see from others. See here. So we'll look at MATLAB for a moment. So you can see training test validation here as well. So here we're spitting out...MATLAB spits out one thing of diagnostics and you can see it took a few epochs to finish so. But thankfully MATLAB runs pretty quickly as we can tell. And then the actual by predicted here. We can see all this. Okay, so I'm going to take a few minutes now to take a look at the code. So of course, a few notes, make sure things are installed so you can actually run all this because if not, JMP's just going to fail miserably, not spit out predictions and then it's going to fail because it can't find the predictions. So JMP has the ability to create a validation column with code. So I did that I chose 60 20 20. I did choose that random seed here so that you can use the same training validation test sets that I do. So actually, for the moment, what I end up doing is I save what which ones are training, validation and test. I'm actually going to delete that column for a little bit. 
The reason I do that here is because I'm sending the data set to R, Python and MATLAB and it's easier to code when everything in it is either the output or all the inputs. So I didn't want a validation column that wasn't either and then it becomes a little more difficult do that. So what I ended up doing was I sent it the data set, I sent it which rows of training, validation, and test, and then I call the R code to run it. Now you can actually put the actual R code itself within here. I chose to just write one line here so that I don't have to scroll forever. But there's nothing stopping you. If it's only a few lines of code, like what you saw earlier in the presentation, I would just paste it right in here. So that once it's done, it turns out...this code spits out a picture of the diagnostics. We saw it stopped after six or seven iterations, let's have this say is that out. And also fits the ridge regression in this script so we get two pictures. So I spit that one out as well and save it and outline box. Now, these all put together at the end of all the code. And then I get the output and I'll add those to the data table here in a little while. Okay, so I give a little bit of code here in that. Let's say you have 6 million observations and it's going to take 24 hours to actually fit the model, you're probably not going to want to run it within JMP. So as a little bit of code that you could do from here, you can say, hey, I'm okay, I'm going to just open the data table I care about. I'm going to tell R to go run it somewhere else in the meantime, and once my system, when I gives me the green light that hey, it's done, I can say, okay, well go open the output from that and bring it into my data table. So this would be one way you could go about doing some of that. And of course you want to save those picture file somewhere and use this code as well. But this is gonna be the exact same code. Okay, so for Python, it's going to be very similar. I'm going to pass it these things. Run some Python code, spit out the diagnostic plot and spit out the predictions. And I give some Python code, you can see, it's very, very similar to what we did from JMP. I'm just going to go open some CSV file in this case. Copy the column in and close it, because I don't need it anymore. And then MATLAB again the exact same game. Asset things, run the MATLAB script. I get the PNG file that I spat out of here. Save it where I need to, save the predictions. And if you need to grab it rather than run it from within here, a little bit of sample code will do that. OK, so now that I'm done calling R, Python and Matlab, I bring back my validation columns so that JMP can use it. So since I remember which one's which. So by default, JMP looks at the values within the validation column which we'll use. The smallest value is training, the next largest is validation, the largest is test. Now if you do K fold cross validation, it'll tell it which fold it is. So coursing though 012345678 so on so forth. So then create this. I also then in turn created this here, so that way instead of 012, it'll actually say training, validation, and test in my output, so it's a little clearer to understand. So if I'm going to show someone else that's never run JMP before, they're not going to know what 012 means, but they should know a training, validation and test are. OK, so now I start adding the predictions to my data table. Um, so here's that set property I alluded to earlier in my talk. 
So my creator's MATLAB, and I've given the column name so, hey, I know it's the neural net prediction for MATLAB. So I may not necessarily need the creator, but it's there in case I'm a little sloppy in naming things. Sorry about that. So we can get all the projection pursuit regression, neural nets, and whatnot. Then I also noted that, hey, I fit ridge regression, lasso, and linear regression in JMP. So I did all that here. So here I do my fit model, my generalized regression. Get all these spat out. Save my prediction formulas. Plot my actual by predicted for my full output at the end. And I'm going to fit my neural network. I can specify the validation column. I transform my covariates; generally neural networks tend to fit a little bit better when we scale things around zero as opposed to whatever scale they're usually at. So my first hidden layer has three nodes. My second hidden layer has two nodes. Here they're both linear activation functions. Turns out for the three above, I use the rectified linear units (ReLU) activation function, so slightly different, but I found they seem to fit about the same regardless. I set the number of tours to 5. What that means is, hey, I'm going to try five different sets of starting values, and whichever one does best is what I'm going to keep. As you can tell from my code, I probably should have done that with the R one: done kind of a for loop, done several of them, spit out the one that does best. So for future work, that would be one spot I would go. So then I save that stuff out and now I'm ready for the model comparison. So now I bring all those new columns into the model comparison. Scroll over a little bit. So here I'm doing the by validation, as I alluded to earlier. And so lastly I'm just doing a bit of coding to essentially make it look the way I want it to look. So I get these outline boxes, just to say training diagnostics, validation diagnostics, test diagnostics, instead of the usual stuff that JMP says. I'm gonna get these diagnostic plots. Now here I'm just saying I only want part of the output, so I'm grabbing a handle to that. I'm going to make some residual plots real quick because not all of them instantly spit those out, particularly the ones from MATLAB, Python and R. Set those titles and then here I just create the big old table, or the big old dialog. And then I journal everything, so that it's nice and clean. Close a bunch of stuff out, so I don't have to worry about things. And then what I did here at the end is, what I wanted to happen is when I pop one of these open, everything else below it is immediately open rather than having to click on six or seven different things. You can see, I have to click here and here. And over here, there's three more. I guess one more. Sorry. But this way, I don't have to click on any of these; they're automatically open. So that's what this last bit of code does. Okay. Lucas So this is just different output than running it live, but this is what it can also look like. So as I mentioned, in the code you saw there, you saw something else than what we saw here. Richard Lucas Nope. So to wrap everything up, the model comparison platform is really a very nice tool for comparing the predictive ability of multiple models in one place. You don't have to cut back and forth between various things. You can just look at everything right in front of you. Its flexibility means it can even be used to compare models that weren't fit in JMP.
And so with this, if we need to fit very large models that take a long time to fit, we can tell them to go fit, pull everything into JMP, and very easily look at all the results to try to determine next steps. And with that, thank you for your time.  
Daniel Sutton, Statistician - Innovation, Samsung Austin Semiconductor   Structured Problem Solving (SPS) tools were made available to JMP users through a JSL script center as a menu add-in. The SPS script center allowed JMP users to find useful SPS resources from within a JMP session, instead of having to search for various tools and templates in other locations. The current JMP Cause and Effect diagram platform was enhanced with JSL to allow JMP users the ability to transform tables between wide format for brainstorming and tall format for visual representation. New branches and “parking lot” ideas are also captured in the wide format before returning to the tall format for visual representation. By using JSL, access to mind-mapping files made by open source software such as Freeplane was made available to JMP users, to go back and forth between JMP and mind-mapping. This flexibility allowed users to freeform in mind maps then structure them back in JMP. Users could assign labels such as Experiment, Constant and Noise to the causes and identify what should go into the DOE platforms for root cause analysis. Further proposed enhancements to the JMP Cause and Effect Diagram are discussed.     Auto-generated transcript...   Speaker Transcript Rene and Dan Welcome to structured problem solving, using the JMP cause and effect diagram, open source mind mapping software and JSL. My name is Dan Sutton; I am a statistician at Samsung Austin Semiconductor where I teach statistics and statistical software such as JMP. For the outline of my talk today, I will first discuss what structured problem solving, or SPS, is. I will show you what we have done at Samsung Austin Semiconductor using JMP and JSL to create an SPS script center. Next, I'll go over the current JMP cause and effect diagram and show how we at Samsung Austin Semiconductor use JSL to work with the JMP cause and effect diagram. I will then introduce you to mind mapping software such as Freeplane, a free open source software. I will then return to the cause and effect diagram and show how to use the third column option of labels for marking experiment, controlled, and noise factors. I want to show you how to extend cause and effect diagrams for five why's and cause mapping, and finally recommendations for the JMP cause and effect platform. Structured problem solving. So everyone has been involved with problem solving at work, school or home, but what do we mean by structured problem solving? It means taking unstructured problem solving, such as in a brainstorming session, and giving it structure and documentation, as in a diagram that can be saved, manipulated and reused. Why use structured problem solving? One important reason is to avoid jumping to conclusions for more difficult problems. In the JMP Ishikawa example, there might be an increase in defects in circuit boards. Your SME, or subject matter expert, is convinced it must be the temperature controller on the folder...on the solder process again. But having a saved structure, as in the cause and effect diagram, allows everyone to see the big picture and look for more clues. Maybe it is temperature control on the solder process, but a team member remembers seeing on the diagram that there was a recent change in the component insertion process and that the team should investigate. In the free online training from JMP called Statistical Thinking in Industrial Problem Solving, or STIPS for short, the first module is titled Statistical Thinking and Problem Solving.
Structured problem solving tools such as cause and effect diagrams and the five why's are introduced in this module. If you have not taken advantage of the free online training through STIPS, I strongly encourage you to check it out. Go to www.JMP.com/statisticalthinking. This is the cause and effect diagram shown during the first module. In this example, the team decided to focus on an experiment involving three factors. This is after creating, discussing, revisiting, and using the cause and effect diagram for the structured problem solving. Now let's look at the SPS script center that we developed at the Samsung Austin Semiconductor. At Samsung Austin Semiconductor, JMP users wanted access to SPS tools and templates from within the JMP window, instead of searching through various folders, drives, saved links or other software. A floating script center was created to allow access to SPS tools throughout the workday. Over on the right side of the script center are links to other SPS templates in Excel. On the left side of the script center are JMP scripts. It is launched from a customization of the JMP menu. Instead of putting the scripts under add ins, we chose to modify the menu to launch a variety of helpful scripts. Now let's look at the JMP cause and effect diagram. If you have never used this platform, this is what's called the cause and effect diagram looks like in JMP. The user selects a parent column and a child column. The result is the classic fishbone layout. Note the branches alternate left and right and top and bottom to make the diagram more compact for viewing on the user's screen. But the classic fishbone layout is not the only layout available. If you hover over the diagram, you can select change type and then select hierarchy. This produces a hierarchical layout that, in this example, is very wide in the x direction. To make it more compact, you do have the option to rotate the text to the left or you can rotate it to the right, as shown in here in the slides. Instead of rotating just the text, it might be nice to rotate the diagram also to left to right. In this example, the images from the previous slide were rotated in PowerPoint. To illustrate what it might look like if the user had this option in JMP. JMP developers, please take note. As you will see you later, this has more the appearnce of mind mapping software. The third layout option is called nested. This creates a nice compact diagram that may be preferred by some users. Note, you can also rotate the text in the nested option, but maybe not as desired. Did you know the JMP cause and effect diagram can include floating diagrams? For example, parking lots that can come up in a brainstorming session. If a second parent is encountered that's not used as a child, a new diagram will be created. In this example, the team is brainstorming and someone mentions, "We should buy a new machine or used equipment." Now, this idea is not part of the current discussion on causes. So the team facilitator decides to add to the JMP table as a new floating note called a parking lot, the JMP cause and effect diagram will include it. Alright, so now let's look at some examples of using JSL to manipulate the cause and effect diagram. So new scripts to manipulate the traditional JMP cause and effect diagram and associated data table were added to the floating script center. You can see examples of these to the right on this PowerPoint slide. 
JMP is column based, and the column dialog for the cause and effect platform requires one column for the parent and one column for the child. This table is what is called the tall format. But a wide table format might be more desired at times, such as in brainstorming sessions, to capture the width and depth of the causes. With a click of a script button, our JMP users can change from a tall format to a wide format. In tall table format you would have to enter the parent each time when adding a child. When done in wide format, the user can use the script button to stack the wide C&E table back to tall. Another useful script in brainstorming might be taking a selected cell and creating a new category. The team realizes that it may need to add more subcategories under wrong part. A script was added to create a new column from a selected cell while in the wide table format. The facilitator can select the cell, like wrong part, then by selecting this script button, a new column is created and subcauses can be entered below. To do the same in the diagram itself, you would hover over wrong part, right click, and select Insert below. You can actually enter up to 10 items. The new causes appear in the diagram. And if you don't like the layout, JMP allows moving the text. For example, you can right click and move it to the other side. The JMP cause and effect diagram compacts the window using left and right, up and down, and alternating sides. Some users may want the classic look of the fishbone diagram, but with all bones in the same direction. By clicking on this script button, current C&E all bones to the left side, it sets them to the left and below. Likewise, you can click another script button that sets them all to the right and below. Now let's discuss mind mapping. In this section we're going to take a look at the classic JMP cause and effect diagram and see how to turn it into something that looks more like mind mapping. This is the same fishbone diagram as a mind map using Freeplane software, which is an open source software. Note the free form of this layout, yet it still provides an overview of causes for the effect. One capability of most mind mapping software is the ability to open and close nodes, especially when there is a lot going on in the problem solving discussion. For example, a team might want to close nodes (like components, raw card and component insertion) and focus just on the solder process and inspection branches. In Freeplane, closed nodes are represented by circles, where the user can click to open them again. The JMP cause and effect diagram already has the ability to close a node. Once closed though, it is indicated by three dots or three periods or ellipses. In the current versions of JMP, there's actually no option to open it again. So what was our solution? We included a floating window that will open and close any parent column category. So over on the right, you can see alignment, component insertion, components, etc., are all included as the parent nodes. By clicking on the checkbox, you can close a node, and then clicking again will open it. In addition, the script also highlights the text in red when closed. One reason for using open source mind mapping software like Freeplane is that the source file can be accessed by anyone. And it's not a proprietary format like other mind mapping software. You can actually access it through any kind of text editor. Okay, the entire map can be loaded by using JSL commands that access text strings. Use JSL to look for XML attributes to get the names of each node.
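Editor's note: The presenter does this with JSL commands that read the .mm file as text and XML. Purely as an illustration of the idea, the sketch below uses Python's standard XML library and assumes the usual Freeplane layout in which each node element stores its text in a TEXT attribute; the file name is hypothetical.

```python
import xml.etree.ElementTree as ET

def collect_pairs(node, parent_text=None, pairs=None):
    """Walk nested Freeplane <node> elements, collecting (parent, child) text pairs."""
    if pairs is None:
        pairs = []
    text = node.get("TEXT")
    if parent_text is not None and text is not None:
        pairs.append((parent_text, text))
    for child in node.findall("node"):
        collect_pairs(child, text, pairs)
    return pairs

tree = ET.parse("causes.mm")               # hypothetical Freeplane mind map file
root_node = tree.getroot().find("node")    # the central (effect) node of the map
for parent, child in collect_pairs(root_node):
    print(parent, "->", child)             # parent/child rows, ready for a tall C&E table
```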
A discussion of XML is beyond the scope of this presentation, but see the JMP Community for additional help and examples. And users at Samsung Austin Semiconductor would click on Make JMP table from a Freeplane.mm file. At this time, we do not have a straight JMP to Freeplane script. It's a little more complicated, but Freeplane does allow users to import text from a clipboard, using spaces to nest the nodes. So by placing the text in the journal (the example here is on the left side of this slide), the user can then copy and paste into Freeplane, and you would see the Freeplane diagram on the right. Now let's look at adding labels of experiment, controlled, and noise to a cause and effect diagram. Another use of cause and effect diagrams is to categorize specific causes for investigation or improvements. These are often categorized as controlled or constant (C), noise (N), or experiment (X or E). For those who were taught SPC XL by Air Academy Associates, you might have used or still use the CE/CNX template. So to be able to do this in JMP, to add these characters, we would need to revisit the underlying script for when the optional third label column is used. When a JMP user adds a label column in the script, it changes the text edit box to a vertical list box with two new horizontal center boxes containing two text edit boxes, one with the original child, and now one with the value from the label column. It actually has a default font color of gray and is applied as illustrated here in this slide. Our solution using JSL was to add a floating window with all the children values specified. Whatever was checked could be updated for E, C or N and added to the table and the diagram. And in fact, different colors could be specified by the script by changing the font color option as shown in the slide. JMP cause and effect diagram for five why's and mind mapping causes. While exploring the cause and effect diagram, another use, as a five why's or cause mapping, was discovered. Although these SPS tools do not display well on the default fishbone layout, the hierarchy layout is ideal for this type of mapping. The parent and child become the why and because statements, and the label column can be used to add numbering for your why's; this is what it looks like on the right side. Sometimes there can be more than one reason for a why, and the JMP cause and effect diagram can handle it. This branching or cause mapping can be seen over here on the right. Even the nested layout can be used for a five why. In this example, you can also set up a script to set the text wrap width, so the users do not have to do each box one at a time. Or you can make your own interactive diagram using JSL. Here I'm just showing some example images of what that might look like. You might prompt the user in a window dialog for their why's and then fill in the table and a diagram for the user, once again using the cause and effect diagram, as over on the left side of the slide. Conclusions and recommendations. All right. In conclusion, the JMP cause and effect diagram has many excellent built-in features already for structured problem solving. The current JMP cause and effect diagram was augmented using JSL scripts to add more options when being used for structured problem solving at Samsung Austin Semiconductor.
JSL scripts were also used to make the cause and effect diagram act more like mind mapping software. So, what would be my recommendations? There are currently three layouts, fishbone, hierarchy, and nested, which use different types of display boxes in JSL. How about a fourth type of layout? How about a mind map type that would allow a more flexible mind map layout? I'm going to add this to the wish list. And then finally, how about even a full mind map platform? That would be an even bigger wish. Thank you for your time and thank you to Samsung Austin Semiconductor and JMP for this opportunity to participate in the JMP Discovery Summit 2020 online. Thank you.  
Mandy Chambers, JMP Principal Test Engineer, SAS Kelci Miclaus, Senior Manager Advanced Analytics R&D, JMP Life Sciences, SAS   JMP has many ways to join data tables. Using traditional Join you can easily join two tables together. JMP Query Builder enhances the ability to join, providing a rich interface allowing additional options, including inner and outer joins, combining more than two tables and adding new columns, customizations and filtering. In JMP 13, virtual joins for data tables were developed that enable you to use common keys to link multiple tables without using the time and memory necessary to create a joined (denormalized) copy of your data. Virtually joining tables gives a table access to columns from the linked tables for easy data exploration. In JMP 14 and JMP 15, new capabilities were added to allow linked tables to communicate with row state synchronization. Column options allow you to set up a link reference table to listen for and/or dispatch row state changes among virtually joined tables. This feature provides an incredibly powerful data exploration interface that avoids unnecessary table manipulations or data duplications. Additionally, there are now selections to use shorter column names, auto-open your tables and a way to go a step further, using a Link ID and Link Reference on the same column to virtually “pass through” tables. This presentation will highlight the new features in JMP with examples using human resources data, followed by a practical application of these features as implemented in JMP Clinical. We will create a review of multiple metrics on patients in a clinical trial that are virtually linked to a subject demographic table and show how a data filter on the Link ID table enables global filtering throughout all the linked clinical metric (adverse events, labs, etc.) tables.     Auto-generated transcript...   Speaker Transcript Mandy Okay, welcome to our discussion today. Let's Talk Tables. My name is Mandy Chambers and I'm a principal test engineer on the JMP testing team. And my coworker and friend joining me today is Kelci Miclaus. She's a senior manager in R&D on the JMP life sciences team. Kelci and I actually began working together a few years ago as the Clinical product was starting to be a great consumer of all the things that I happen to test. So I got to know her group pretty well and got to work with them closely on different things that they were trying to implement. And it was really valuable for me to be able to see a live application where a customer would really be using the things that I actually tested in JMP and how they would put them to use in the Clinical product. So in the past, we've done this presentation. It's much longer, and we decided that the best thing to do here was to give you the entire document. So that's what's attached with this recording, along with two sets of data and zip files. You should have data tables, scripts, some journals and different things you need and be able to step through each of the applications, even if I end up not showing you or Kelci doesn't show you something that's in there. You should be able to do that. So let me begin by sharing my screen here, so that you can see what I'm going to talk about today. So as I said, the journal that I had, if I were going to show this in its entirety, would be talking about joining tables and the different ways that you can join tables.
And so this is the part that I'm not going to go into great detail on but just a basic table join. If I click on this, laptop runs and laptop subjects. And under the tables menu, if you're new to JMP or maybe haven't done this before, you can do a table join and this is a for physical join. This will put the tables together. So I would be joining laptop runs to laptops subjects. Within this dialogue, you select the things that you want to join together. You can join by matching, Cartesian join, row join and then you would join the table. I'm not going to do that right now, just for time consumption but that's that's what you would do. And also in here under the tables menu, something else that I would talk about would be JMP query builder. And this has the ability to be able to join more tables together. It will, if you have 3, 4, 5, 6 however many tables you have, you can put them together and we'll make up one table that contains everything. But again, I'm actually not going to do that today. So if I go back into here and I close these tables. Let's get started with how virtual join came about. So let's talk about joining tables first. You have to decide what type of join you want to use. So your...if you're tables are small, it might be easiest to do a physical join. To just do a tables join, like the two tables I showed you weren't very big. If you pull in three or four maybe more tables, JMP query builder is a wonderful tool for building a table. And you may want all of your data in the same table so that may be exactly what you want. You just need to be mindful of disk space and performance, and just understand if you have five or six tables that you have sitting separately and then you join them together physically, you're making duplicate copies. So those are the ways that you might determine which which you would use. Virtual join came about in JMP 13 and it was added with the ability to take a link, a common link ID, and join multiple tables together. It's kind of a concept of joining without joining. It saves space and it also saves duplication of data. And so that...in 13 we we started with that. And then in 14 to 15, we added more features, things that customers requested. Link tables with rows synchronize...rows states synchronization. You can shorten column names. We added being able to auto open linked tables. Being able to have a link ID and a link reference on the same column. And we also added these little hover tips that I'll show you where it can tell you which source is your column source table. So those are the things that we added and I'm going to try to set this up and demonstrate it for you. So I've got this data that I actually got from a... it's just an imaginary high-tech firm. And it's it's HR data and it includes things such as compensation, and headcount, and some diversity, and compliance, education history, and other employment factors. And so if you think about it, it's a perfect kind of data to link because you have usually a unique ID variable, such as an employee ID or something that you can link together and maybe have various data for your HR team that's in different places. So I'm going to open up these two tables and just simply walk through what you would do if you were trying to link these together. So this table here is Employee Scores 1 and then I have Compensation Master 1 in the back. These tables, Employees Scores 1 is my source table. And Compensation Master is my referencing table. 
So you can see these ID, this ID variable here in this table. And it's also in the compensation master table. So I'm going to set up my link ID. So it's a couple of different ways to do this. You can go into column properties. And you can see down here, you have a link ID and reference. The easiest way to do this is with a right click, so there's link ID. And if I look right here, I can see this ID key has been assigned to that column. So then I'm going to go into my compensation master table. And I'm going to go into this column. And again, you can do it with column properties. But you can do the easiest way by going right here to link reference, the table has the ID. So it shows up in this list. I'm going to click on this and voila, there's my link reference icon right there. And I can now see that all the columns that were in this table are...are available to me in this table. You can see you have a large number of columns. You can also see in here that you have...they're kind of long column names, you have the column names, plus this identifier right here which is showing you that this is a referencing column. And so I'm going to run this little simple tabulate I've saved here and just show you very briefly that this is a report and just to simply show you this is a virtual column length of service. And then compensation type is actually part of my compensation table and then gender is a virtual column. So I'm using...building this using virtual columns and also columns that reside in the table. One thing I wanted to point out to you very quickly is that under this little red triangle...let's say you're working with this data and you decide, "Oh, I really want to make this one table. I really want all the columns in one table." There is a little secret tool here called merge reference data. And a lot of people don't know this is there, exactly. But if I wanted to, I could click that and I can merge all the columns into this table. And so, but for time sake, I'm not going to do that right now, but I wanted to point out where that is located. And let me just show you back here in the journal, real quickly. This is possible to do with scripting, so you can set the property link reference and point to your table and list that to use the link columns. So I'm going to close this real quickly and then go back to the same application where I actually had same two tables that I've got some extra saved scripts in here, a couple more things I want to show. So again, I've got employee scores. This is my source table. And then I've got compensation master and they're already linked and you can see this here. So I want to rerun that tabulate and I want to show you something. So you can see that these column names are shorter now. So I want to show what we added in JMP 14. If I right click and bring up the column info dialog, I can see here that it says use linked column names right here. And that sets that these these names will be shorter And that's really a nice feature because when, at the end of the day, when you share this report with someone, they don't really care where the columns are coming from, whether they're in the main table or virtual table. So it's a nice, clean report for you to have. The script is saved so that you can see in the script that it's... it saves the script that shows you a referencing table. So if I look at this, I can see. So you would know where this column is coming from but somebody you're sharing with doesn't necessarily need to know. 
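Editor's note: The JSL way to do this is the Link ID and Link Reference column property shown above. Only as a loose analogy for readers who think in Python, the same one-row-per-ID lookup into a many-row table can be expressed in pandas like this (all table and column names below are invented):

```python
import pandas as pd

# One row per employee: plays the role of the link ID table (Employee Scores).
employees = pd.DataFrame({
    "ID": ["E01", "E02", "E03"],
    "Gender": ["F", "M", "F"],
    "Length of Service": [4, 9, 2],
})

# Many rows per employee: plays the role of the referencing table (Compensation Master).
compensation = pd.DataFrame({
    "ID": ["E01", "E01", "E02", "E03", "E03"],
    "Compensation Type": ["Base", "Bonus", "Base", "Base", "Bonus"],
})

# Look the "virtual" column up through the key instead of materializing a full merge.
lookup = employees.set_index("ID")
compensation["Gender[Employees]"] = compensation["ID"].map(lookup["Gender"])
print(compensation)
```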
So I want to show you this other thing that we added with this dispatching of row states. Real quick example, I'm going to run this distribution. And you notice right away that in this distribution, I've got a title that says these numbers are wrong. And so let me point out what I'm talking about. Employee scores is my employee database table and it has about 3,600 employees. This is a unique reference to employees and it's a current employee database, let's say. My compensation master table is more like a history table and it has 12,000 rows in it, so it has potentially in this table, multiple references to the same employee, let's say, an employee changed jobs and they got a raise, or they moved around. Or it could have employees that are no longer in the company. So running this report from this table doesn't render the information that I really want. I can see down here that my count is off, I've got bigger counts, I don't exactly have what I was looking for. So this is one of the reasons why we created this row state synchronization, and Kelci is going to talk a little bit more about this in a real life application, too. But I'm just simply going to show you this is how you would set up dispatching row states. So what I'm doing is I'm just dispatching selection, color and marker. And what I'm doing is I'm actually sending from compensation master to employee scores, I'm sending the information to this table because (I'm sorry), this is the table that I want my information to be run from. So if I go back and I rerun that distribution, I now have this distribution (it's a little bit different columns), but I have this distribution. And if I look at the numbers right here, I have the exact numbers of my employee database. And that's exactly what I wanted to see. So you need to be careful with dispatching and accepting, and Kelci will speak more to that. But that was just a simple case example of how you would do that. And I will show you real quickly that there is an Online Help link that shows an example of virtually joining columns and showing row states. It'll step you through that. There's some other examples out here too of using virtual join, if you need more information about setting this up. And again, just to remind you, all of this is scriptable. So you can script this right here, by setting up your row states and the different things that you want with that. So as we moved into JMP 15 we added a couple more things. And so what we added was the ability to auto open a table and also to hover over columns and figure out where they're coming from. And I'll explain what that means exactly. So if I click on these. We created some new tables for JMP 15, employeemaster.jmp, which is still part of this HR data. And so if I track this down a little bit and look, a couple things I'll point out about this table. It has a link ID and a link reference. And that was the other thing we added to JMP 15, the ability to be able to have a link ID and link reference on the same column. So if I look at this and I go and look at my home window here, I can see that there's two more tables that are open. They were opened automatically for me. And so I'm going to open these up because I kind of want to string them out so you can see how this works. But this employee master table is linked to a...stack them on top of each other...it's linked to the education history table, which has been, in turn, linked to my predicted termination table.
And you can see there's an employee ID that has a link reference and the link ID, employee ID here. Same thing, and then predicted termination has an ID only. And if you had another table or two that had employee ID unique data and you needed to pull it in, you could continue the string on by assigning a link reference here, and you can keep on going. So just to show you quickly, if I right click and look at this column here, I can see that my link ID is set, and I can also see my link reference is set. And it tells me education history is the table that this table is linked to. I've got it on auto open and I've got on the shorter names. I'm not dispatching row states, so nothing is set there. So all of the columns that are in these other two tables are available to me, for my referencing table here called employee master. And real quickly, you can see that you have a large number of columns in here that are available to you, and the link columns show up as grouped columns down here. So another question that got asked from customers, as they say, is there any way you can tell us where these columns come from so that it is a little clearer? So we added this nice little hover tip. If I hover over this, this tells me that this particular column, disability flag, is coming from predicted termination. So it's actually coming from the table that's last in my series. And if I go down here and I click on one of these, it says the degree program code is coming from education history. So that's a nice little feature that will kind of help you as you're picking out your columns, maybe in what you're trying to run with platforms and so forth. But if I run this distribution, this is just a simple distribution example that's showing that employee level is actually coming from my employee master table. This degree description is coming from the education history table and this performance eval is coming from my predicted termination table. And then you can look some more with some of these other examples that are in here. I did build a context window of dashboards here that shows a Graph Builder box plot. We have a distribution in here, a tabulate and a heat map, using a mix of columns that are from the table and virtual columns, and I've got a filter. So if I want to, I can look at females and look at professionals. I always like to point out the oddities here. So if I go in here and look at these two little places that are kind of hanging out here. This is very interesting to me because comp ratio shows how people are paid; basically, whether they're paid in the right ratio or not for their job description. And it looks like these two outliers are consistently exceeding expectations, but it looks like they're maybe underpaid. So just like this one up here is all by itself and it looks like they seldom meet their expectations, but they may be slightly overpaid, or they could be mistakes. But at any rate, as you zero in on those, you can also see that the selections are being made here. So, in this heat map, I can tell that there is some performance money being spent and training dollars, so maybe they can train that person. So that's actually good to see. So that is about all I wanted to show. I did want to show this one thing, just to reiterate. Education history has access to the columns that are in predicted termination. And so those two tables can talk to each other separately.
And if I run this graph script, I have similar performance and training dollars, but I'm looking at like grade point average, class rank, as to where people fall into the limits here using combinations of columns from just those two tables. So I'm going to pass this on. I believe that was the majority of what I wanted to share. I'm going to stop sharing my screen. And I will pass this back to Kelci and she will take it from here. Kelci J. Miclaus Thanks, Mandy. Mandy said we've given this talk now a couple times and, really it was this combined effort of me working in my group, which is life sciences for the JMP Clinical and JMP Genomics vertical solutions, and finding such perfect examples of where I could really leverage virtual joins and working closely with the development team on how those features were released in the last few versions of JMP. And so for this section I will go through some of the examples, specific to our clinical research and how we've really leveraged this talking table idea around row state synchronization. So as as Mandy mentioned this is now, and if we have time towards the end, this, this idea of virtual joins with row state synchronization is now the entire architecture that drives how JMP Clinical reports and reviews are used for assessing early efficacy and safety and clinical trials reports with our customers. And one of the reasons it fits so well is because of the formatting of typical clinical trial data. So the data example that I'm going to use for all of the examples I have around row state synchronization or row state propagation as I sometimes call it, are example data from a clinical trial that has about 900 patients. It was a real clinical trial carried out about 20-30 years ago looking at subarachnoid hemorrhage and treatment of nicardipine on these patients. The great thing about clinical data is we work with very standard normalized data structures, meaning that each component of a clinical trial is collected, similar to the HR data that Mandy showed...show...showed us is normalized, so that each table has its own content and we can track that separately, but then use virtual joins to create comprehensive stories. So the three data sets I'll walk through are this demography table which has about a little under 900 patients of clinical trials, where here we have one row per patient in our clinical trial. And this is called the demography, that will have information about their birth, age, sex, race, what treatment they were given, any certain flags of occurrences that happened to them during the clinical trial. Similarly, we can have separate tables. So in a clinical trial, they're typically collecting at each visit what adverse events have happened to a patient while on on a new drug or study. And so this is a table that has about 5,000 records. We still have this unique subject identifier, but we have duplications, of course. So this records every event or adverse event that was reported for each of the patients in our clinical trial. And finally I'll also use a laboratory data set or labs data set, which also follows the similar type of record stacked format that we saw on the adverse events. Here we're thinking of the regular visits, where they take several laboratory measurements and we can track those across the course of the clinical trial to look for abnormalities and things like that. So these three tables are very a standard normalized format of what's called the international CDISC standard for clinical trial data. 
And it suits us so well towards using the virtual join. Aas Mandy has said, it is easy to, you know, create a merge table of labs. But here we have 6,000 records of labs and merging in our demography, it would cause a duplication of all of their single instances of their demographic descriptions. And so we want to set up a virtual join with this, which we can do really easily. If we create in our demography table, we're going to set up unique subject identifier as our link ID. And then very quickly, because we typically would want to look at laboratory results and use something like the treatment group they are on to see if there's differences in the laboratories, we can now reference that data and create visualizations or reports that will actually assess and look at treatment group differences in our laboratory results. And so we didn't have to make the merge. We just gained access to these...this planned arm column from our demography table through that simple two-step setting up the column properties of a virtual join. It's also very easy to then look at like lab abnormalities. So here's a plot by each of the different arms or treatment groups who had abnormally high lab tests across visits in a clinical trial. We might also want to do this same type of analysis with our adverse event, which we would also want to see if there's different occurrences in the adverse events between those treatment groups. So once again we can also link this table to our referenced demography and very quickly create counts of the distribution of adverse events that occur separately for, say, a nicardipine, the active treatment, versus a placebo. So now we want them to really talk. And so the next two examples that I want to show with these data are the row state synchronization options we have. So you quickly saw from Mandy's portion that she showed that on the column properties we have the ability to synchronize row states now between tables. Which is really why our talk is called talking tables, because that's the way they can communicate now. And you can either dispatch row states, meaning the table that you're set up the reference to some link ID can send information from that table back to its reference ID table. And I'll walk through a quick example, but as mentioned...as Mandy mentioned, this is by far the more dangerous case sometimes because it's very easy to hit times when you might get inconclusive results, but I'm going to show a case where it works and where it's useful. As you've noticed, just from this analysis, say with the adverse events, it was very easy as the table that we set up a link reference to (the ID table) to gain access to the columns and look at the differences of the treatment groups in this table. There's not really anything that goes the other way though. As Mandy had said, you wouldn't want to use this new join table to look at a distribution of, say, that treatment group, because what you actually have here is numbers that don't match. It looks like there's 5,000 subjects when really, if you go back to our demography table, we have less than 900. So here's that true distribution of about the 900 subjects by treatment group with all their other distributions. Now, there is the time, though, that this table is what you want to use as your analysis table or the goal of where you're going to create an analysis. And you want to gain information from those tables that are virtually linked to it. The laboratory, for example, and the adverse events. 
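Editor's note: Once the link reference is in place, columns from the demography table can be used directly in a launch on the labs table. A hedged JSL sketch follows; the lab result column name is hypothetical, and the referenced column may appear with a bracketed suffix (for example, "Planned Arm[Unique Subject Identifier]") if the shorter-names option is not used.

labs = Data Table( "Labs" );
labs << Oneway(
	Y( :Name( "Lab Result" ) ),      // hypothetical laboratory result column
	X( :Name( "Planned Arm" ) )      // gained through the link reference to Demography
);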
So here we're going to actually use this table to create a visualization that will annotate these subjects in this table with anyone who had an abnormal lab test or a serious adverse event. And now I've cheated, because I've prepared this data. You'll notice in my adverse events data I've already done the analysis to find any case of subjects that were...any adverse events that were considered serious and I've used the row state marker to annotate those records that had...were a serious adverse event. Similarly, in the labs data set, I've used red color to annotate...annotate any of the lab results that were abnormally high. So for example, we can see all of those that had high abnormalities. I've colored red most of this through, just row state selection and then controlling the row states. So with this data where I have these two row states in place, we can go back to our demography table and create a view that is a distribution by site of the ages of our patients in a clinical trial. And now if we go back to each of the linked tables, we can control bringing in this annotated information with row state synchronization. So we're going to change this option here from row states with reference table to none, to actually to dispatch and in this case I want to be careful. The only thing I want this table to tell that link reference table is a marker set. I'm going to click Apply And you'll notice automatically my visualization that I created off that demography table now has the markers of any subjects who had experienced an adverse event from that other table. We can do the same now with labs. Choose to dispatch. In this case, we only want to dispatch color. And now, just by controlling column properties, we're at a place where we have a visualization or an analysis built off our demography table that has gained access to the information from these virtually joined tables using the dispatch row state synchronization or propagation. So that's really cool. I think it's a really powerful feature. But there are a lot of gotchas and things you should be careful with with the dispatch option. Namely the entire way virtual joins work is the link ID table, the date...the data table you set up, in this case demography, is one row per ID and you're using that to merge or virtually join into a data table that has many copies of that usage ID. So we're making a one-to-many; that's fine. Dispatch makes a many-to-one conversation. So in in the document we have an ...in the resource provided with this video, there's a lot of commentary about carefully using this. It shouldn't be something that's highly interactive. If you then decide to change row states, it can be very easy for this to get confusing or nonsensical, that, say if I've marked both with color and marker, it wouldn't know what to do because it was some rows might be saying, "Color this red," but the other linked table might be saying color it blue or black. So you have to be very careful about not mixing and matching and not being too interactive with with that many-to-one merge idea. But in this example, this was a really, really valuable tool that would have required quite a lot of data manipulation to get to this point. So I'm going to close down these examples of the dispatch virtual join example and move on to likely what's going to be more commonly used is the accept... acceptance row state of the virtual join talking tables. And for this case, I'm actually going to go through this with a script. 
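Editor's note: A hedged JSL sketch of the dispatch setup described here, sending only the marker row state from the adverse events table back to demography. The option names inside Options() are assumptions patterned on the dialog wording; the surest way to get the exact syntax is to set the property interactively and copy the script JMP saves with the table.

ae = Data Table( "Adverse Events" );
Column( ae, "Unique Subject Identifier" ) << Set Property(
	"Link Reference",
	{Reference Table( "Demography.jmp" ),                 // hypothetical path
	Options( Row States( Dispatch( Marker( 1 ) ) ) )}     // option names are assumptions
);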
So instead of interactively walking me through the virtual join and row state column properties, we're going to look at this scripting results of that. And the example here, what we wanted to do, is be able to use these three tables (again, the demography, adverse events and laboratory data in a clinical trial) to really create what they call a comprehensive safety profile. And this is really the justification and rationale of our use in JMP Clinical for our customers. This idea that we want to be able to take these data sets, keep them separate but allow them to be used in a comprehensive single analysis so they don't feel separate. So with this example, we want to be able to open up our demography and set it up as a link ID. So this is similar to what I just did interactively that will create the demographic table and create the link ID column property on unique subject identifier. So we're done there. You see the key there that shows that that's now the link ID property. We then want to open up the labs data set. And we're going to set a property on the unique subject identifier in that table to use the link reference to the demography table. And a couple of the options and the options here. We want to show that that property of using shorter names. Use the linked column name to shorten the name of our columns coming from the demography table into the labs table. And here we want to set up row state synchronization as an acceptance of select, exclude and hide. And we're going to do this also for the AE table. So I'll run both of these next snippets of code, which will open up my AE and my lab table. And now you'll see that instead of that dispatch the properties here are said to set to accept with these select, exclude and hide. And similarly the adverse events table has the exact same acceptance. So in this case now, instead of this dispatch, which we were very careful to only dispatch one type of row state from one table and another from another table back to our link ID reference table. Here we're going to let our link ID reference table demography broadcast what happens to it to the other two tables And that's what accept does. So it's going to accept row states from the demography table. And I've cheated a little bit that I actually just have a script attached to our demography table here that is really just setting up some of the visualizations that I've already shown that are scripts attached to each of the table in a single window. And so here we have what you could consider your safety profile. We have distributions of the patient demographic information. So this is sourced from the demography table. You see the correct numbers of the counts of the 443 patients on placebo versus the 427 on nicardipine.  
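Editor's note: A hedged JSL sketch of the scripted setup walked through above: link the labs table to demography, shorten names with the linked column name, and accept the select, exclude, and hide row states from the demography table. As above, the option names are assumptions to verify against the property script JMP writes.

labs = Open( "Labs.jmp" );                                // hypothetical path
Column( labs, "Unique Subject Identifier" ) << Set Property(
	"Link Reference",
	{Reference Table( "Demography.jmp" ),
	Options(                                              // option names are assumptions
		Use Linked Column Name( 1 ),
		Row States( Accept( Select( 1 ), Exclude( 1 ), Hide( 1 ) ) )
	)}
);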
Russ Wolfinger, Director of Scientific Discovery and Genomics, Distinguished Research Fellow, JMP Mia Stephens, Principal Product Manager, JMP   The XGBoost add-in for JMP Pro provides a point-and-click interface to the popular XGBoost open-source library for predictive modeling with extreme gradient boosted trees. Value-added functionality includes: •    Repeated k-fold cross validation with out-of-fold predictions, plus a separate routine to create optimized k-fold validation columns •    Ability to fit multiple Y responses in one run •    Automated parameter search via JMP Design of Experiments (DOE) Fast Flexible Filling Design •    Interactive graphical and statistical outputs •    Model comparison interface •    Profiling Export of JMP Scripting Language (JSL) and Python code for reproducibility   Click the link above to download a zip file containing the journal and supplementary material shown in the tutorial.  Note that the video shows XGBoost in the Predictive Modeling menu but when you install the add-in it will be under the Add-Ins menu.   You may customize your menu however you wish using View > Customize > Menus and Toolbars.   The add-in is available here: XGBoost Add-In for JMP Pro      Auto-generated transcript...   Speaker Transcript Russ Wolfinger Okay. Well, hello everyone.   Welcome to my home here in Holly Springs, North Carolina.   With the Covid craziness going on, this is kind of a new experience to do a virtual conference online, but I'm really excited to talk with you today and offer a tutorial on a brand new add in that we have for JMP Pro   that implements the popular XGBoost functionality.   So today for this tutorial, I'm going to walk you through kind of what we've got we've got. What I've got here is a JMP journal   that will be available in the conference materials. And what I would encourage you to do, if you'd like to follow along yourself,   you could pause the video right now and go to the conference materials, grab this journal. You can open it in your own version of JMP Pro,   and as well as there's a link to install. You have to install an add in, if you go ahead and install that that you'll be able to reproduce everything I do here exactly at home, and even do some of your own playing around. So I'd encourage you to do that if you can.   I do have my dog Charlie, he's in the background there. I hope he doesn't do anything embarrassing. He doesn't seem too excited right now, but he loves XGBoost as much as I do so, so let's get into it.   XGBoost is a it's it's pretty incredible open source C++ library that's been around now for quite a few years.   And the original theory was actually done by a couple of famous statisticians in the '90s, but then the University of Washington team picked up the ideas and implemented it.   And it...I think where it really kind of came into its own was in the context of some Kaggle competitions.   Where it started...once folks started using it and it was available, it literally started winning just about every tabular competition that Kaggle has been running over the last several years.   And there's actually now several hundred examples online if you want to do some searching around, you'll find them.   So I would view this as arguably the most popular and perhaps the most powerful tabular data predictive modeling methodology in the world right now.   Of course there's competitors and for any one particular data set, you may see some differences, but kind of overall, it's very impressive.   
In fact, there are competitive packages out there now that do very similar kinds of things LightGBM from Microsoft and Catboost from Yandex. We won't go into them today, but pretty similar.   Let's, uh, since we don't have a lot of time today, I don't want to belabor the motivations. But again, you've got this journal if you want to look into them more carefully.   What I want to do is kind of give you the highlights of this journal and particularly give you some live demonstrations so you've got an idea of what's here.   And then you'll be free to explore and try these things on your own, as time goes along. You will need...you need a functioning copy of JMP Pro 15   at the earliest, but if you can get your hands on JMP 16 early adopter, the official JMP 16 won't ship until next year, 2021,   but you can obtain your early adopter version now and we are making enhancements there. So I would encourage you to get the latest JMP 16 Pro early adopter   in order to obtain the most recent functionality of this add in...of this functionality. Now it's, it's, this is kind of an unusual new frame setup for JMP Pro.   We have written a lot of C++ code right within Pro in order to integrate to get the XGBoost C++ API. And that's why...we do most of our work in C++   but there is an add in that accompanies this that installs the dynamic library and does does a little menu update for you, so you need...you need both Pro and you need to install the add in, in order to run it.   So do grab JMP 16 pro early adopter if you can, in order to get the latest things and that's what I'll be showing today.   Let's, let's dive right to do an example. And this is a brand new one that just came to my attention. It's got an a very interesting story behind it.   The researcher behind these data is a professor named Jaivime Evarito. He is a an environmental scientist expert, a PhD, professor, assistant professor   now at Utrecht University in the Netherlands and he's kindly provided this data with his permission, as well as the story behind it, in order to help others that were so...a bit of a drama around these data. I've made his his his colleague Jeffrey McDonnell   collected all this data. These are data. The purpose is to study water streamflow run off in deforested areas around the world.   And you can see, we've got 163 places here, most of the least half in the US and then around the world, different places when they were able to collect   regions that have been cleared of trees and then they took some critical measurements   in terms of what happened with the water runoff. And this kind of study is quite important for studying the environmental impacts of tree clearing and deforestation, as well as climate change, so it's quite timely.   And happily for Jaivime at the time, they were able to publish a paper in the journal Nature, one of the top science journals in the world, a really nice   experience for him to get it in there. Unfortunately, what happened next, though there was a competing lab that really became very critical of what they had done in this paper.   And it turned out after a lot of back and forth and debate, the paper ended up being retracted,   which was obviously a really difficult experience for Jaivime. I mean, he's been very gracious and let and sharing a story and hopes.   to avoid this. And it turns out that what's at the root of the controversy, there were several, several other things, but what what the main beef was from the critics is...   I may have done a boosted tree analysis. 
And it's a pretty straightforward model. There's only...we've only got maybe a handful of predictors, each of which are important but, and one of their main objectives was to determine which ones were the most important.   He ran a boosted tree model with a single holdout validation set and published a validation hold out of around .7. Everything looked okay,   but then the critics came along, they reanalyzed the data with a different hold out set and they get a validation hold out R square   of less than .1. So quite a huge change. They're going from .7 to less than .1 and and this, this was used the critics really jumped all over this and tried to really discredit what was going on.   Now, Jaivime, at this point, Jaivime, this happened last year and the paper was retracted earlier here in 2020...   Jaivime shared the data with me this summer and my thinking was to do a little more rigorous cross validation analysis and actually do repeated K fold,   instead of just a single hold out, in order to try to get to the bottom of this this discrepancy between two different holdouts. And what I did, we've got a new routine   that comes with the XGBoost add in that creates K fold columns. And if you'll see the data set here, I've created these. For sake of time, we won't go into how to do that. But there is   there is a new module now that comes with the heading called make K fold columns that will let you do it. And I did it in a stratified way. And interestingly, it calls JMP DOE under the hood.   And the benefit of doing it that way is you can actually create orthogonal folds, which is not not very common. Here, let me do a quick distribution.   That this was the, the holdout set that Jaivime did originally and he did stratify, which is a good idea, I think, as the response is a bit skewed. And then this was the holdout set that the critic used,   and then here are the folds that I ended up using. I did three different schemes. And then the point I wanted to make here is that these folds are nicely kind of orthogonal,   where we're getting kind of maximal information gain by doing K fold three separate times with kind of with three orthogonal sets.   So, and then it turns out, because he was already using boosted trees, so the next thing to try is the XGBoost add in. And so I was really happy to find out about this data set and talk about it here.   Now what happened...let me do another analysis here where I'm doing a one way on the on the validation sets. It turns out that I missed what I'm doing here is the responses, this water yield corrected.   And I'm plotting that versus the the validation sets. it turned out that Jaivime in his training set,   the top of the top four or five measurements all ended up in his training set, which I think this is kind of at the root of the problem.   Whereas in the critics' set, they did...it was balanced, a little bit more, and in particular the worst...that the highest scoring location was in the validation set. And so this is a natural source for error because it's going beyond anything that was doing the training.   And I think this is really a case where the K fold, a K fold type analysis is more compelling than just doing a single holdout set.   I would argue that both of these single holdout sets have some bias to them and it's better to do more folds in which you stratify...distribute things   differently each time and then see what happens after multiple fits. 
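Editor's note: The quick checks shown here use standard launches, so they can be reproduced with a short JSL sketch. Column names are taken from the narration and may not match the table exactly; the fold column name is an assumption.

dt = Current Data Table();
// Look at how the holdout/fold assignments are distributed.
dt << Distribution( Column( :Validation ) );
// One-way of the response against the holdout scheme, as shown in the demo.
dt << Oneway(
	Y( :Name( "water yield corrected" ) ),
	X( :Validation )
);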
So you can see how the folds that I created here look in terms of distribution and then now let's run XGBoost.   So the add in actually has a lot of features and I don't want to overwhelm you today, but again, I would encourage you to follow along and pause the video at places if you if you are trying to follow along yourself   to make sure. But what we did here, I just ran a script. And by the way, everything in the journal has...JMP tables are nice, where you can save scripts. And so what I did here was run XGBoost from that script.   Let me just for illustration, I'm going to rerun this again right from the menu. This will be the way that you might want to do it. So the when you install the add in,   you hit predictive modeling and then XGBoost. So we added it right here to the predictive modeling menu. And so the way you would set this up   is to specify the response. Here's Y. There are seven predictors, which we'll put in here as x's and then you put their fold columns and validation.   I wanted to make a point here about those of you who are experienced JMP Pro users, XGBoost handles validation columns a bit differently than other JMP platforms.   It's kind of an experimental framework at this point, but based on my experience, I find repeated K fold to be very a very compelling way to do and I wanted to set up the add in to make it easy.   And so here I'm putting in these fold columns again that we created with the utility, and XGBoost will automatically do repeated K fold just by specifying it like we have here.   If you wanted to do a single holdout like the original analyses, you can set that up just like normal, but you have to make the column   continuous. That's a gotcha. And I know some of our early adopters got tripped up by this and it's a different convention than other   Other XGBoost or other predictive modeling routines within JMP Pro, but this to me seemed to be the cleanest way to do it. And again, the recommended way would be to run   repeated K fold like this, or at least a single K fold and then you can just hit okay. You'll get this initial launch window.   And the thing about XGBoost, is it does have a lot of tuning parameters. The key ones are listed here in this box and you can play with these.   And then it turns out there are a whole lot more, and they're hidden under this advanced options, which we don't have time at all for today.   But we have tried to...these are the most important ones that you'll typically...for most cases you can just worry about them. And so what...let's let's run the...let's go ahead and run this again, just from here you can click the Go button and then XGBoost will run.   Now I'm just running on a simple laptop here. This is a relatively small data set. And so right....we just did repeated, repeated fivefold   three different things, just in a matter of seconds. XGBoost is pretty well tuned and will work well for larger data sets, but for this small example, let's see what happened.   Now it turns out, this initial graph that comes out raises an immediate flag.   What we're looking at here is the...over the number of iterations, the fitting iterations, we've got a training curve which is the basically the loss function that you want to go down.   But then the solid curve is the validation curve. And you can see what happened here. Just after a few iterations occurred this curve bottomed out and then things got much worse.   So this is actually a case where you would not want to use this default model. 
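Editor's note: The launch can also be run from a saved table script, as Russ does, but the add-in's exact JSL name and arguments are not shown in the talk. The call below is therefore only a hypothetical placeholder for the shape of the launch (response, predictors, fold columns in the validation role); copy the real script the add-in saves to the data table for the working syntax.

// Hypothetical placeholder for the add-in's scripted launch; names and roles
// are assumptions based on the dialog described in the talk.
XGBoost(
	Y( :Name( "water yield corrected" ) ),
	X( :PET ),                                   // plus the other six predictors
	Validation( :Fold A, :Fold B, :Fold C )      // fold column names are assumptions
);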
XGBoost is already overfited,   which often will happen for smaller data sets like this and it does require the need for tuning.   There's a lot of other results at the bottom, but again, they wouldn't...I wouldn't consider them trustworthy. At this point, you would need...you would want to do a little tuning.   For today, let's just do a little bit of manual tuning, but I would encourage you. We've actually implemented an entire framework for creating a tuning design,   where you can specify a range of parameters and search over the design space and we again actually use JMP DOE.   So it's a...we've got two different ways we're using DOE already here, both of which have really enhanced the functionality. For now, let's just do a little bit of manual tuning based on this graph.   You can see if we can...zoom in on this graph and we see that the curve is bottoming out. Let me just have looks literally just after three or four iterations, so one thing, one real easy thing we can do is literally just just, let's just stop, which is stop training after four steps.   And see what happens. By the way, notice what happened for our overfitting, our validation R square was actually negative, so quite bad. Definitely not a recommended model. But if we run we run this new model where we're just going to take four...only four steps,   look at what happens.   Much better validation R square. We're now up around .16 and in fact we let's try three just for fun. See what happens.   Little bit worse. So you can see this is the kind of thing where you can play around. We've tried to set up this dialogue where it's amenable to that.   And you can you can do some model comparisons on this table here at the beginning helps you. You can sort by different columns and find the best model and then down below, you can drill down on various modeling details.   Let's stick with Model 2 here, and what we can do is...   Let's only keep that one and you can clean up...you can clean up the models that you don't want, it'll remove the hidden ones.   And so now we're down, just back down to the model that that we want to look at in more depth. Notice here our validation R square is .17 or so.   So, which is, remember, this is actually falling out in between what Jaivime got originally and what the critic got.   And I would view this as a much more reliable measure of R square because again it's computed over all, we actually ran 15 different modeling fits,   fivefold three different times. So this is an average over those. So I think it's a much much cleaner and more reliable measure for how the model is performing.   If you scroll down for any model that gets fit, there's quite a bit of output to go through. Again...again, JMP is very good about...we always try to have graphics near statistics that you can   both see what's going on and attach numbers to them and these graphs are live as normal, nicely interactive.   But you can see here, we've got a training actual versus predicted and validation. And things almost always do get worse for validation, but that's really what the reality is.   And you can see again kind of where the errors are being made, and this is that this is that really high point, it often gets...this is the 45 degree line.   So that that high measurement tend...and all the high ones tend to be under predicted, which is pretty normal. I think for for any kind of method like this, it's going to tend to want to shrink   extreme values down and just to be conservative. 
And so you can see exactly where the errors are being made and to what degree.   Now for Jaivime's key interest, they were...he was mostly interested in seeing which variables were really driving   this water corrected effect. And we can see the one that leaps out kind of as number one is this one called PET.   There are different ways of assessing variable importance in XGBoost. You can look at straight number of splits, as gain measure, which I think is   maybe the best one to start with. It's kind of how much the model improves with each, each time you split on a certain variable. There's another one called cover.   In this case, for any one of the three, this PET is emerging as kind of the most important. And so basically this quick analysis that that seems to be where the action is for these data.   Now with JMP, there's actually more we can do. And you can see here under the modeling red triangle we've we've embellished quite a few new things. You can save predictive values and formulas,   you can publish to model depot or formula depot and do more things there.   We've even got routines to generate Python code, which is not just for scoring, but it's actually to do all the training and fitting, which is kind of a new thing, but will help those of you that want to transition from from JMP Pro over to Python. For here though, let's take a look at the profiler.   And I have to have to offer a quick word of apology to my friend Brad Jones in an earlier video, I had forgotten to acknowledge that he was the inventor of the profiler.   So this is, this is actually a case and kind of credit to him, where we're we're using it now in another way, which is to assess variable importance and how that each variable works. So to me it's a really compelling   framework where we can...we can look at this. And Charlie...Charlie got right up off the couch when I mentioned that. He's right here by me now.   And so, look at what happens...we can see the interesting thing is with this PET variable, we can see the key moment, it seems to be as soon as PET   gets right above 1200 or so is when things really take off. And so it's a it's a really nice illustration of how the profiler works.   And as far as I know, this is the first time...this is the first...this is the only software that offers   plots like this, which kind of go beyond just these statistical measures of importance and show you exactly what's going on and help you interpret   the boosted tree model. So really a nice, I think, kind of a nice way to do the analysis and I'd encourage that...and I'd encourage you try this out with your own data.   Let's move on now to a other example and back to our journal.   There's, as you can tell, there's a lot here. We don't have time naturally to go through everything.   But we've we've just for sake of time, though, I wanted to kind of show you what happens when we have a binary target. What we just looked at was continuous.   For that will use the old the diabetes data set, which has been around quite a while and it's available in the JMP sample library. And what this this data set is the same data but we've embellished it with some new scripts.   And so if you get the journal and download it, you'll, you'll get this kind of enhanced version that has   quite a few XGBoost runs with different with both a binary ordinal target and, as you remember, what this here we've got low and high measurements which are derived from this original Y variable,   looking at looking at a response for diabetes. 
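Editor's note: A short sketch of the follow-up described here: save the prediction formula from the model's red-triangle menu, then profile it with the Profiler platform. The saved column name is hypothetical and depends on the response name.

dt = Current Data Table();
Profiler(
	Y( :Name( "Pred Formula water yield corrected" ) )   // hypothetical saved-formula column
);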
And we're going to go a little bit even further here. Let's imagine we're in a kind of a medical context where we actually want to use a profit matrix. And our goal is to make a decision. We're going to predict each person,   whether they're high or low but then I'm thinking about it, we realized that if a person is actually high, the stakes are a little bit higher.   And so we're going to kind of double our profit or or loss, depending on whether the actual state is high. And of course,   this is a way this works is typically correct...correct decisions are here and here.   And then incorrect ones are here, and those are the ones...you want to think about all four cells when you're setting up a matrix like this.   And here is going to do a simple one. And it's doubling and I don't know if you can attach real monetary values to these or not. That's actually a good thing if you're in that kind of scenario.   Who knows, maybe we can consider these each to be a BitCoic, to be maybe around $10,000 each or something like that.   Doesn't matter too much. It's more a matter of, we want to make an optimal decision based on our   our predictions. So we're going to take this profit matrix into account when we, when we do our analysis now. It's actually only done after the model fitting. It's not directly used in the fitting itself.   So we're going to run XGBoost now here, and we have a binary target. If you'll notice the   the objective function has now changed to a logistic of log loss and that's what reflected here is this is the logistic log likelihood.   And you can see...we can look at now with a binary target the the metrics that we use to assess it are a little bit different.   Although if you notice, we do have profit columns which are computed from the profit matrix that we just looked at.   But if you're in a scenario, maybe where you don't want to worry about so much a profit matrix, just kind of straight binary regression, you can look at common metrics like   just accuracy, which is the reverse of misclassification rate, these F1 or Matthews correlation are good to look at, as well as an ROC analysis, which helps you balance specificity and sensitivity. So all of those are available.   And you can you can drill down. One brand new thing I wanted to show that we're still working on a bit, is we've got a way now for you to play with your decision thresholding.   And you can you can actually do this interactively now. And we've got a new ... a new thing which plots your profit by threshold.   So this is a brand new graph that we're just starting to introduce into JMP Pro and you'll have to get the very latest JMP 16 early adopter in order to get this, but it does accommodate the decision matrix...   or the profit matrix. And then another thing we're working on is you can derive an optimal threshold based on this matrix directly.   I believe, in this case, it's actually still .5. And so this is kind of adds extra insight into the kind of things you may want to do if your real goal is to maximize profit.   Otherwise, you're likely to want to balance specificity and sensitivity giving your context, but you've got the typical confusion matrices, which are shown here, as well as up here along with some graphs for both training and validation.   And then the ROC curves.   You also get the same kind of things that we saw earlier in terms of variable importances. And let's go ahead and do the profiler again since that's actually a nice...   it's also nice in this, in this case. 
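Editor's note: A small JSL sketch, with made-up numbers, of how a 2x2 profit matrix turns predicted probabilities into an expected profit at a chosen decision threshold; this is the quantity the profit-by-threshold graph summarizes.

// Rows of the matrix are the actual state (High, Low); columns are the decision.
profit = [2 -2, -1 1];            // High&call-High, High&call-Low; Low&call-High, Low&call-Low (made-up values)
pHigh  = [0.9, 0.35, 0.6, 0.1];   // predicted probability of High for four subjects (made-up)
threshold = 0.5;
total = 0;
For( i = 1, i <= N Rows( pHigh ), i++,
	callHigh = pHigh[i] >= threshold;
	// expected profit for this subject under the chosen decision
	total += If( callHigh,
		pHigh[i] * profit[1, 1] + (1 - pHigh[i]) * profit[2, 1],
		pHigh[i] * profit[1, 2] + (1 - pHigh[i]) * profit[2, 2]
	);
);
Show( total );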
We can see exactly what's going on with each variable.   We can see for example here LTG and BMI are the two most important variables and it looks like they both tend to go up as the response goes up so we can see that relationship directly. And in fact, sometimes with trees, you can get nonlinearities, like here with BMI.   It's not clear if that's that's a real thing here, we might want to do more, more analyses or look at more models to make sure, maybe there is something real going on here with that little   bump that we see. But these are kind of things that you can tease out, really fun and interesting to look at.   So, so that's there to play with the diabetes data set. The journal has a lot more to it. There's two more examples that I won't show today for sake of time, but they are documented in detail in the XGBoost documentation.   This is a, this is just a PDF document that goes into a lot of written detail about the add in and walk you step by step through these two examples. So, I encourage you to check those out.   And then, the the journal also contains several different comparisons that have been done.   You saw this purple purple matrix that we looked at. This was a study that was done at University of Pennsylvania,   where they compare a whole series of machine learning methods to each other across a bunch of different data sets, and then compare how many times one, one outperform the other. And XGBoost   came out as the top model and this this comparison wasn't always the best, but on average it tended to outperform all the other ones that you see here. So, yet some more evidence of the   power and capabilities of this of this technique. Now there's some there's a few other things here that I won't get into.   This Hill Valley one is interesting. It's a case where the trees did not work well at all. It's kind of a pathological situation but interesting to study, just so you just to help understand what's going on.   We also have done some fairly extensive testing internally within R&D at JMP and a lot of those results are here across several different data sets.   And again for sake of time, I won't go into those, but I would encourage you to check them out. They do...all of our results come along with the journal here and you can play with them across quite a few different domains and data sizes.   So check those out. I will say just just for fun, our conclusion in terms of speed is summarized here in this little meme. We've got two different cars.   Actually, this really does scream along and it it's tuned to utilize all of the...all the threads that you have in your GPU...in your CPU.   And if you're on Windows, with an NVIDIA card, you can even tap into your GPU, which will often offer maybe another fivefold increase in speed. So a lot of fun there.   So let me wrap up the tutorial at this point. And again, encourage you to check it out. I did want to offer a lot of thanks. Many people have been involved   and I worry that actually, I probably I probably have overlooked some here, but I did at least want to acknowledge these folks. We've had a great early adopter group.   And they provided really nice feedback from Marie, Diedrich and these guys at Villanova have actually already started using XGBoost in a classroom setting with success.   So, so that was really great to hear about that. And a lot of people within JMP have been helping.   Of course, this is building on the entire JMP infrastructure. 
So pretty much need to list the entire JMP division at some point with help with this, it's been so much fun working on it.   And then I want to acknowledge our Life Sciences team that have kind of kept me honest on various things. And they've been helping out with a lot of suggestions.   And Luciano actually has implemented an XGBoost add in, a different add in that goes with JMP Genomics, so I'd encourage you to check that out as well if you're using JMP Genomics. You can also call XGBoost directly within the predictive modeling framework there.   So thank you very much for your attention and hope you can get XGBoost to try.
Scott Wise, Senior Manager, JMP Education Team, SAS   The power of using Text Mining is a great tool in investigating all kinds of unstructured text that commonly resides in our collected data. From notes captured on warranty issues, lab testing/experimental comments, to even looking at food recipes, this new method opens a lot of opportunity to better understand our world. In this presentation, we will show how to use the latest text analytic methods to help solve a family mystery as to the regional source of my Grandma’s delicious chili recipe. Along the way, we will see how to use text mining to create leading terms and phrase lists and word cloud reports. Then we will utilize the resulting document term matrix to perform topic analysis (via latent class analysis clustering) that will enable us to find a solution to our question. You will be left with an understanding of the powerful text mining approaches that you can add to your own toolbox and start solving your own text data challenges!     Auto-generated transcript...   Speaker Transcript Scott Wise Investigative Text Exploration. My name is Scott Wise and I'm Senior Manager of the JMP Education Team.   And I've got a really fun presentation for us to view today and the goals of this presentation are to give you a little more familiarity   with the capabilities and how easy it is to do text exploration in JMP and JMP Pro, as well as show you a different way of looking at text exploration, like, can I do with to investigate something like a detective would?   Okay, so we're going to talk about my grandma's chili and how that relates to a Texas chili cook off. Before we begin,   let's just debunk some of the terminology that is around text mining, and to me, there's really five simple steps. You spend your time summarizing the data, literally, finding out what words in text occur the most often, even what combination of words occur the most often.   So the ??? call this looking through unstructured text right and then this, this could be anything. This could be a sentence your customer gave you about the performance of your product or your service. It could be...   it could be social media, right, where you know you figure out what people like or dislike based on comments,   something about your product or service or it could be this guy, something like a recipe.   I've even seen it done on patents when people are researching what are popular things people are applying patents for and should we be doing these.   So really, summarizing finding out what words out of that are the most important. Now there's some preparation that comes next, which is just getting down to the smaller list of the words you care about.   Then of course we wouldn't be JMP if we aren't going to visualize it and analyze it and as well we can model it. We can even do some advanced modeling. So I'm going to be using JMP Pro to do this, version 15.1,   and I will be sure to point out what you can do in JMP and where I actually throw in a little bit of Pro that help further answer my question.   document, corpus and term. Document is going to be those...   those things you're analyzing, basically the individual body of text, each row of the text. So it's my my recipes for my example.   It could be different customers who have commented back to you on your product or service. Corpus is actually that unstructured text that you're trying to handle, the body of that text which you're going to analyze.   
And then the terms are going to be those words or those combination of words that you care about that can help you answer a question. So let's talk about the story. You know, why did I come up with looking at chili recipes? Well, this comes back to   almost 25 years ago when I first moved to Texas and I came to work for a big company and like most big companies in Texas, they had a big Texas chili cook off.   And I was asked to participate. Everybody participated. But I was also asked to do even more than that, I was asked to be a judge.   So I should have turned down the judge. I thought it was an honor for the new guy, but I think I was the only one they could give it to, I found out why.   But let's talk first about the reaction to the chili I brought. So I had chili recipe that my family's always used and it came from my grandmother, Grandma Lillian.   This is not chili. This is not what Texans considered chili.   Now the good news was I'd have to bring any home because I enjoy eating it. They considered at something else besides chili. Their chili didn't have beans in it. Their chili was not hearty. It was very hot, very soupy.   Mine had beef. Theirs had mostly pork. They had all in. Here's the other thing that got me in trouble. Not only did they not like my grandma's chili, I almost didn't survive the judging   because the real badge of honor of a real Texan is to make the hottest bowl of chili. So you want to beat your neighbor. You want to beat your coworker. So they were throwing all kinds of ungodly hot chiles and spices in here.   And I just thought, almost put a hole my stomach just trying to taste it and you're drinking all this milk, trying to put the heat out. So   it was a baptism by fire. So my recommendation is unless you like the heat, don't enter in as a judge, right. Not a good idea. But I wanted to place; that always bothered me.   What, why didn't my grandma's chili do well? Are there really different types of chili? Where do they come from?   And it turns out chili's got a really cool history. You can see some actually really cool history blogs and papers on it and it most likely came out of   Central America, mostly Mexico. And there are some light dishes, but in San Antonio, it was first observed being sold kind of in the state, we would know it now is chili,   on the on the old San Antonio square there by the Alamo. And it was used on cattle drives and it started to get popular and then it worked its way up the middle of the country all up through the Midwest.   And I was told that, you know, many food innovations got created at the St. Louis World's Fair, you know, really took off when they were shown, this was one of them. Chili made it into the   St. Louis World's Fair and it really got popular.   So there's different varieties. And so the idea was if I...could I use text exploration on ingredients and recipes? Just take the whole recipe and dump it in, see what happens.   And what do I want to compare it to? So I looked up what the traditional regions were; we had several, all the way from Texas up to Michigan.   So in Texas, we've got different varieties. Texas bowl red is the one I was tasting. That's chili con carne. A   Frito pie is something you'll get at a football game, but very popular, that Louisiana has their own version. New Mexico, with green chilies and chicken, have their own version. Oklahoma, Kansas City, Springfield,   Illinois, Cincinnati with the skyline chili hey put it over spaghetti. Michigan. 
Coney Island,the hot dogs, you know, the chili sauce they put it on there is serious chili.   White chili, unknown where it comes from. Vegetarian chili. But there's a lot of styles that are out there now.   So I said, if I took three recipes from each of these styles and I compared it to control my grandma's chili, can I find what I'm looking for?   So we're going to do this and I'm going to show you, as I walk through the steps, I'm going to show you a summary of the steps then I'll go right behind and show you how I do it in JMP.   First steps going to be summarizing the data. We want to find those words we care about.   Now what happens when you enter in   any of this text analytics, this text data, like my ingredients, into a text explorer, it's going to run it through a library   of regular expressions, and think about the word like regular, just things that are just part of everyday speech.   And they're not that helpful. It's not helpful if I have "the" and "and" in my list. So it tries to help pull out words you don't care about. And yes, it can be customized.   And it's got a pretty strong one built in already, and that's what I used. And then stemming. Stemming is where you go and   say how you want to treat like words -- so "dice" is "dice," "dicing," you got plurals, different...different...   different versions of the root word. Do you want to just count it all in the root word or do you want it separated? So that's a consideration.   And then after we summarize. Let me get these words. You can see I've got a little list here of words. The one behind is unedited and then the one in the front   is actually one where we've gone through and kind of sized down that word list. So what you do is, you look for words you don't care about, you call them stop words, basically say, remove these. Add them back to my library of things I don't care about.   You can also bring over phrases that really matter. And once you have these, we are going to be able to visualize some.   And we're going to be able to see, in my case, I got really interested in ingredients. So we're going to be able to see what ingredients came out the most often.   And you often see word clouds here, and word clouds, you know, the bigger the word, the more frequency it is.   And you can look at it in a cloud, just as just a...just a sporadic layout or you can look at an order where, you know, the first thing up there is the biggest word. That's like the   graph that's in front of your slide. And after that, we are going to do something with it. And so we can create some basic models.   And one of the easiest ways to do that is, once we know what words on our list that we care about,   we can add it back to our data table. It's called saving out the document term matrix. In this case, a simple way of doing it is just binaries. So I've got a separate column here, you can see to my right, where   "onion" has a one where it's in that rows, you know, recipe. It has zero if it's not, so you can get zeros and ones here in columns. And you can see what's the most important.   And a lot of times you're trying to model for something like, if I was stuck,   maybe I had the judging scores by this and I can say, well, here's a numeric score tied to each recipe. I want to see what ingredients   are common that result in a high score. That might be something I'm doing, so try to do it a little more predictive.   But in this case, I'm kind of going to look at grouping. 
I'm really interested in grouping, kind of, like recipes together and seeing where my grandma's chili falls. So I'll show you how we're going to do that. But let's first go to JMP and show you how we do these steps in JMP.   So here is my raw data.   In every document here, every row is is...got my unstructured text and in my cases, it's just the raw ingredients.   And if I click on any one of these cells here, you can see it is just literally the copied in ingredients. So there's the one...the first one for Cajun chili from Louisiana.   It's got words I care about, like "tomato" and "chili powder" and "honey." It's got words I don't care about like "can."   It's probably ingredient measures I don't care about so much, like "one," "two pounds," "teaspoons." So how are we going to take care of this? So what we're going to do   is we're going to go to Analyze and we're going to go to Text Explorer and we're just going to put those ingredients up into the text column.   I'm going to ask for stemming, how to stem for all terms. I find that very helpful. And then I'm going to use the built in regular expression.   I say okay and now here is my initial list.   So what I can do now is I can go and select those things I don't care about. I don't care about numbers. So maybe I can go in here and highlight them, right click, and say add a stop word, and then it gets added to the list of things I don't care about.   What about "chili powder"? It sounds like something that needs to be on its own. So I'm going to right click on that phrase. And I'm going to say add phrase and it adds it in.   So you go and you do this until you get a streamlined sized-down list.   I'm just going to run my   My finished list by it. And here's all the words that came off the regular expression that were found, and also   things I added   and stop words. And now, here is my finished term and phrase list. And so I've added these phrases. I care about. So "onion" came out the highest, then "salt," then "cumin," then "chili powder."   As you can guess, this would be really good to visualize and we do have a word cloud. So here is the word cloud for everything in this one. And again under my red triangle options here, I can change that to an order to make it crisper on what comes out.   If I keep it centered, something that's fun to do. You know you can add filters to your data and something you can do is, sometimes you can find your answer visually without having to do anything else. And so in that case, I'm going to go into this to my   red triangle. I'm going to get a local data filter. I'm just going to look at the type and I'm going to say, well, let's take a look at   what Grandma Lillian's chili looks like, you know "tomatoes" and "kidney beans" and "beef," that type of thing. And how would that compare to the Cajun chili?   Well they shared chili powder and beef, but, you know, there might be some different things on there. You know, how that compared to the chili verde,   you know, which is more of the green chili, you know. And they've got raw chilies in there and jalapenos, chicken stock, all that type of thing, chicken broth.   So this is really interesting, but probably not enough for me to figure out what's going on. So I did go   (and this is another...you've got all the options here under your red triangle) I did go make sure (and I probably need to make sure here) that I turn off my local data filter, make sure everything selected, that I'm looking at all my terms.   I got 299 here. 
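Editor's note: A hedged JSL sketch of the Text Explorer launch Scott performs from the Analyze menu. The Text Explorer platform and Text Columns role are standard; the stemming option name is an assumption and may differ slightly from the script JMP saves.

dt = Current Data Table();
te = dt << Text Explorer(
	Text Columns( :Ingredients ),
	Language( "English" ),
	Stemming( "Stem All Terms" )   // assumed option name
);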
I'm going to right click and I'm going to say "save document term matrix."   And when I do that, it asked me what kind of way I want to save it, with with with with weights, what kind of weighting. It's basically binary, there's frequencies I can use, how many terms, you know, the minimum term frequency to actually get a place in your data table. And I have already done that.   So if I slide across and look...   Aactually   I'll go ahead and do that and show you what it looks like. So I'll just say "save document term matrix" and say "okay."   And now, as I showed you on the slide, now I get all the terms I care about. And there's that first one for "onion." Here's the one for "chili powder," and as it relates to their respective   Recipe, you know, rows.   So I know where there's a one, I know this Cajun chili had chili powder.   When there's zero,   this one here looks like Oklahoman chili or no, I'm sorry, New Mexican chili did not have chili powder, so that's just, that's just how that works. And this can be used in modeling. So you can go to Analyze fit model in JMP and take...and actually apply this to some type of model.   But what I'm going to do is, since I'm not really trying to predict, you know, what ingredients will give me a higher score. I don't have any like, you know,   you know output data here. I really want to group them together and I heard in JMP Pro, I can actually do this. So I'm going to go right back to my slides here.   And I'm going to talk about analyzing the data. So to do further analysis   in JMP Pro, it enables you to do some really good grouping techniques and these are multivariate methods and their specialized for handling text analytics and working with those document term matrix is about text.   And it uses something called latent class analysis. It's one of the terms. And this is similar to the principal components, if you're used to doing that technique. But basically it's going to   ask us how many groupings or clusters of data do we want to look at. It's going to look across that multidimensional space   between everything that's in those columns and your document term matrix for the important terms in your model here, the important terms we got on the word   word list, right? And it's going to group them. And in my case, I was able to get it down to three groups.   So there's a cluster one, which seemed to have a lot of chili recipes with ground beef, tomato sauce, chili powder and beans.   There's a cluster two, which had a lot to do with chicken and green chilies, raw chilis here.   And then there's cluster three, which had a lot of chilies again, but they were more of the red chillies and they were kind of pork based and this made a lot of sense.   Okay, so when I created these clusters, I was able to use a cluster probability by row, this kind of gave me how strong those individual   recipes, my rows in my document, right, these these original...my control recipe, where did they fall and how strong did they belong. Why did I assign them into whatever cluster? And when I did this,   22 was my grandma's cluster...was my grandma's control. And I found that she clustered in cluster one along with some other recipes, including those that came from Kansas City, Missouri.   There is one on number 24 which was very close. Now the Texas recipes for cluster three, they had hot chilis, spice...a lot of spice in them, and often often pork and no beans, right? Beans were something that showed up in cluster one, but not in cluster three and then...   
The cluster two   was more for, you know, it's more for green chili,   more for those things you see in New Mexico, you know, chicken-based chili, things with green chili.   Alright, so what happened was I was able to make the match, and I found a recipe. And one of the three representative ones from Missouri   that actually was called Kansas City chili, and it almost matched exactly Grandma Lillian's chili.   So when I asked my mother about this, I said, "Well, why could this happen. I didn't think it came from Missouri." And she said, "All this makes sense." She said, "Grandma Lillian,   she grew up on a farm in St. Joseph, Missouri, and she was the only girl and she had like 11-12 brothers. So she did a lot of the cooking."   By the way, she was the only one to get a college education and so she was quite progressive for for the time that she lived and was one of my favorite relatives, but her recipe was very indicative of this. So let's show you what this looks like live.   So if I go to...   at this point,   I go back to the data I had made and under that hotspot, I'm going to ask for these additional models that JMP Pro can give. There's a latent class analysis clusters documents using that method,   based on the binary way to document term matrix. So it does use a doc...does use that document term matrix, yeah so you don't even have to save it out, it automatically generates it.   There's also a latent semantic analysis, which does...which does a little more math, a little more advanced method, but both of these are basically doing the same thing, and I particularly liked this latent class analysis. So that's the one I selected.   And I asked for three clusters, you can play with it to see if it makes a difference. And I did try more clusters and I broke back to three.   And within its   options, you can look at the cluster probabilities by row.   And of all the output, this is the one that made the most sense. So remember, back to my slide, this helped me look at where my grandma's chili fell,   Which was 22,   row 22. And then what else it combined well with. And so that's how I was able to do this analysis.   It's that simple.   So,   that was a really quick run through the capabilities of doing JMP text exploration in JMP and then how I was able to use JMP Pro and   find these clusters and place my grandma's chili and find a matching recipe. So if you're hungry for more, I do have a link to the blog   in the presentation that you can...you can go click on or you can just go right to the Community and you can just type in "grandma's chili" and you can find that blog. And I also will give you along with that, we will give you as well   the recipe. So you too can make Grandma Lillian's chili.   So we appreciate being able to show this to you today.   Please be sure to leave any questions in the Q&A that we can answer. And try this, try something   that you have around you at work, at home, wherever that has some unstructured text data, where you would like to explore and ask the question, and you'll find it's a fantastic, fantastic method, very powerful and really helps you attack that third dimension of data.
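Editor's note: A hedged JMP Pro JSL sketch of the last two steps described above: saving the binary document term matrix and requesting the latent class analysis with three clusters. Both message names mirror the menu items used in the demo, but their exact arguments are assumptions; verify against the script saved by the platform.

dt = Current Data Table();
te = dt << Text Explorer( Text Columns( :Ingredients ) );
te << Save Document Term Matrix( Weighting( "Binary" ) );   // message and argument names are assumptions
te << Latent Class Analysis( Number of Clusters( 3 ) );     // assumed message and argument names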
The tutorial session "Special Tutorial: Introduction to Statistical Machine Learning with JMP," which drew many requests for a more detailed follow-up at Discovery Summit Japan Online in November 2020, was held as a two-part series on Thursday, May 13 and Thursday, May 20.   The seminar video was made available for a limited time, until 5:00 p.m. on Tuesday, June 15, 2021. *The viewing period has now ended.   Seminar title: Special Tutorial: Introduction to Statistical Machine Learning with JMP (two sessions)     Overview: With JMP (Pro), analysis is easier to get into than with R or Python. It is not fully made-to-order analysis, but it handles semi-custom work more than well enough. With JMP, the following are simple to do:   (1) Analyze with the mouse alone, without typing commands; (2) Graphs and statistics presented as a set; (3) The analysis process can be kept as a script; (4) Reports can be output that follow the flow of the analysis; (5) Because it is grounded in statistical thinking, it is ideal for systematic understanding and learning; and more.   This talk uses numerical examples to discuss the prediction and classification that can be done in JMP. The methods covered include kernel smoothing, SVM, and neural discriminant analysis. Understanding is deepened by contrasting these methods with traditional statistical multivariate analysis.   About the instructor: Motohisa Hirono (廣野 元久). He joined Ricoh Co., Ltd. in 1984 and has since worked on in-house quality management and reliability engineering and on promoting statistics. After serving as head of the QM Promotion Office in the Quality Division and as director of the SF Business Center, he is now in the Biomedical Business Center, Healthcare Business Support Office, Regulatory Affairs and Quality Assurance Group (ethics review committee member), providing education and lectures widely both inside and outside the company.   Part-time lecturer, Faculty of Engineering, Tokyo University of Science (1997-1998) and Faculty of Policy Management, Keio University (2000-2004). His main specialties are statistical quality control and reliability engineering. His major books include 「グラフィカルモデリングの実際」, 「JMPによる多変量データの活用術」, 「アンスコム的な数値例で学ぶ統計的計算方法23講」, 「JMPによる技術者のための多変量解析」, and 「目からウロコの多変量解析」.   Session titles: Part 1: JMP graph features that help with big data; Part 2: Supervised classification in practice
Level: Intermediate (machine-translated abstract)   It is hard to overstate the value of good visualization when summarizing research results, but choosing the right medium for sharing with colleagues, industry peers, and the larger community is just as important. This presentation reviews the various formats used to disseminate data, results, and visualizations, and discusses the advantages and limitations of each. A brief overview of JMP Live functionality sets the stage for an exciting range of potential applications. We show how to publish JMP graphics to JMP Live using both the rich interactive interface and scripted methods, with examples and guidance for choosing the best approach. The presentation closes with a showcase of a custom JMP Live publishing interface for JMP Clinical results, covering the considerations behind the dialog design, how the publishing framework works, the structure of JMP Live reports and their relationship to JMP Clinical client reports, and a discussion of potential consumption patterns for published reviews.   To view this presentation with a Japanese transcript, click here. You will be taken to the session page under Discovery Summit Americas. (Please note that you will be asked to log in with your SAS Profile.)   English subtitles can be selected in the video below.