Anabolic, Aphrodisiac or Analgesic?
As we develop analytical tools in JMP, we inevitably must make decisions about how to prioritize:
- Should we make the product more powerful by adding more muscle?
- Should we make it sexier, more exciting?
- Should we focus on pain relief, making it less frustrating, less burdensome?

In the language of long A-words, should we go for anabolic, aphrodisiac or analgesic?
John Sall thinks the answer is analgesic. Pain relief should be the central motivating force for development. Of course, the three aren’t mutually exclusive. Adding an exciting power feature could also relieve pain. But pain relief is central, because pain is the condition that can really freeze us, demotivate us, and make us settle for a less-than-full picture of what our data can tell us.
Auto-generated transcript...
Speaker | Transcript |
Jeff Perkinson | When it comes to developing JMP features, John Sall, our next speaker, has an interesting way of thinking about how we prioritize. We do have to prioritize |
which features we want to invest in. So how do we think about what's most important? Are features that make our product more powerful what we should work on? | |
How about features that make it more attractive or sexier or features that focus on pain relief, easing the burden of analysis? | |
Well, John thinks that pain relief should be the main motivator for our R&D team. John is the co-founder of SAS. He's the lead architect of JMP, which he created more than 30 years ago | |
for scientists and engineers who have data and need a way to explore it easily, visually, on the desktop. | |
He's going to talk about the driving forces behind JMP development and the ways that we come across features we want to work on. So with that, I'd like to welcome you, John. Thank you. | |
John Sall | Thank you, Jeff. And here we are live from the new digital conference center in Cary, North Carolina, where we can fit a lot more people. |
So let's dim the lights a little. | |
Well, I'm not really here in the digital conference center. I'm really at home. So I've got a home office. Whoops. Not that one. | |
Like all of us, I'm at home, delivering this conference. But let's switch to a corporate background and get started. | |
So what should I talk about? | |
Well, when we have JMP releases, in those years when the conference aligns with JMP releases, that's what we talk about. But in the alternate years I always just pick a topic. So I picked big statistics one year, ghost data the next year, or secret features | |
two years ago. Today my topic is big words that start with the letter A. | |
Well, these are words that characterize themes for developing JMP. So let me share my screen. | |
And minimize here. Share. | |
So which theme is most important? And here are the three big words that start with the letter A. | |
As Jeff just mentioned, first is anabolic. Well, anabolic is all about growing muscle, making JMP more powerful. | |
Anabolic processes build organs and tissues and these processes produce growth and differentiation of cells to increase body size. | |
So increasing the power of JMP would be the opportunity, and that's of course very important. But the next is | |
aphrodisiac, making JMP more sexy, more exciting, the thrill of discovery. So we want that to be an attribute too. | |
But we also care about your progress during the work stream, so analgesic is the third word, and that means pain relief. | |
So I want people to express things like the following imaginary quotes: "Version 15 has just saved me hours or even days of work that I used to have to labor over." That's a lot of pain relief. | |
Or "I used to have to do each one individually and now it is all in one swoop." A lot of pain relief there. Or "I used to have to use several other tools to get everything done, but now I can do everything in JMP." | |
Or "I showed my colleagues how I did it so easily. And they were amazed." Or "I used to have to program all this manually, and now it's just a few mouse clicks and it's done." | |
Or "JMP keeps me in flow, keeping my attention on analyzing the data rather than diverting it too much to operational details." These are the expressions of pain relief that we want to hear. | |
Now why? Well, pain and frustration are real demotivators. Flow is important. You're undistracted when in flow. When in flow, you're more likely to learn something from the data and make discoveries. | |
Now power is important too, but new features tend to be used by only a few and old features by many. So we want to make the old features work better | |
to reduce the pain in using them. And of course, productivity is hugely dependent on how effective our tools are with respect to the use of time. So if we can make it easier and faster to get to the results, that is a huge win. So let's go through a lot of examples. | |
First, the scripts that come with the data tables. So here you see, side by side, an example from an earlier release of JMP, I think JMP 14 or 13, | |
where you used to have to hold down the mouse key on this button, and then it brought up a menu item, and then you found a Run button to run that script. | |
But the new way just has a play button right next to the script. So you just click on that play button and it saves pulling down a menu. | |
Well, that's not a huge win, you think; just the difference between clicking a button and pulling down a menu is not very much. But it's also a big win for new users, because new users no longer have to understand | |
that a hot button will have a Run command underneath. Now there's the play button right there with the script, and so there's more understanding that these are scripts. | |
And the ability to store scripts with your data is a big deal in JMP and we want people to learn that right away | |
and not have to fish around or look at a tip to learn that that's the way it works. And it's a great convenience to be able to store all your scripts | |
within the data table itself and not find other places to store them. So for new users, the play button is some pain relief just in that simple change. | |
And this is a change that we regret not making much earlier in JMP. | |
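For readers who script, attaching such a table script is one JSL message; a minimal sketch, assuming the Big Class sample table and a made-up script name:

```jsl
// Open a sample table and attach a saved script to it
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );

// The script appears in the table panel, now with its own play button
dt << New Script( "Weight Distribution",
	Distribution( Column( :weight ) )
);
```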
Let's talk about preparing data. And one of the big things that you have to do when you prepare data is join data from multiple sources. | |
Now JMP, even from version one, had a join utility. But this still involves a complex dialog and working with multiple data tables. | |
So here's the classic example where I'm renting movies and we have our movie rentals transaction table | |
that has the customer, the item number, the order date, the order year, and then we have our inventory of movies to rent, details about each movie. And then we have the customers...information about the customers and so on. And we want to ask questions, involve all three tables like | |
which genders are prevalent in which genres of movies. Well that involves | |
going across all three tables, and of course, the join command is the way to do it. I take the transactions data and I | |
merge it with customers. And of course, customer ID is the matching | |
column, and I want to include the non-matches from the transaction side. So if I don't have customer data, I still don't lose my transaction data. And that's called the left outer join, and so that makes a new data table. And now to that data table, I want to join that | |
to the inventory data. So I join that by the item number, which I match, and I also want a left outer join there, and now I have | |
another table. And so now, instead of having three tables, I have five tables, the results of that join, but now I can look at answering the question of gender by genre, which | |
gender tends to rent more of which type of movie. | |
And it involved making all these new intermediate data tables as well. | |
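In JSL, that classic sequence looks roughly like the sketch below, with hypothetical table and column names matching the demo; note that every Join call creates another intermediate table:

```jsl
// Classic join: every step produces a new data table
transactions = Data Table( "Movie Rentals" );

// Left outer join with customers: keep transactions with no customer match
step1 = transactions << Join(
	With( Data Table( "Customers" ) ),
	By Matching Columns( :Customer ID = :Customer ID ),
	Include Nonmatches( 1, 0 )  // keep nonmatches from the main (left) table
);

// Left outer join with the inventory table by item number
step2 = step1 << Join(
	With( Data Table( "Movie Inventory" ) ),
	By Matching Columns( :Item Number = :Item Number ),
	Include Nonmatches( 1, 0 )
);
```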
It turns out that there's a lot easier way to go, and the pain of just these extra steps and extra tables can be vastly reduced. So let me close this table and this table. | |
And I want to point out, under the Tables menu there is a command called JMP Query Builder, and this brings up the SQL facility | |
for joining tables that's much easier than all those join dialogs, but it still also makes a new data table. | |
But a couple releases ago, we came out with something even better that reduces the pain even further, and that's virtual join. So if I look at this transaction table, | |
I noticed that it's already joined. | |
So what's happened is that when I prepared the movie inventory table, it's uniquely identified by item number. And so I give item number | |
a link ID, saying it's uniquely identified by that, and that's the way I can index that table. And similarly, I have a customer ID to identify the customer data. And when I prepare my transaction data, these two have, in a sense, foreign keys: | |
a link reference, referencing a different data table by the identifying ID across them. And so it automatically links it, and I already have everything I need to answer that question. So I can look, | |
for each gender, at the genres of the movies they rent. And I can see here that action movies are liked more by the males and family movies by the females and | |
rom com more by the females and so on. So I've answered my question involving three tables without the need of going through all the joins, because it's virtually joined by matching up those ID columns across the tables. | |
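The setup behind that is a pair of column properties; a JSL sketch with the demo's table and column names (hypothetical file paths):

```jsl
// In the lookup table: mark the key column with a Link ID
Column( Data Table( "Movie Inventory" ), "Item Number" )
	<< Set Property( "Link ID", 1 );

// In the transaction table: point the matching column at the lookup table
Column( Data Table( "Movie Rentals" ), "Item Number" )
	<< Set Property( "Link Reference",
		Reference Table( "Movie Inventory.jmp" ) );
// Columns of the inventory table now show up virtually in the rentals table
```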
So I've made a lot of pain relief from that. So let's close that, in fact. | |
So data table joining has seen, I think, a lot of pain relief over the releases. | |
Recoding, that's one of the other main things we do when we're preparing data sets. | |
We started with a very simple recode many releases ago, where we just had a field for the old value and the new value, and the new value started out the same as the old value, and then I could cut and paste or edit those values in order to recode | |
into a new column or the existing column. | |
And recode has gone through dramatic improvements with every release, and we're very excited with the current state of it. So let's look at this data set of hospitals, district hospitals. And now recode is right on the | |
right-click menu in the column header. And I can go through and just | |
select a bunch of things and group them to new value, like clinic. | |
But I think there are lots of shortcuts as well. And one of my favorite shortcuts is | |
to group similar values. And so I can give some threshold of the number of character differences and so on and it will automatically group all those things. So it'll group all the district hospitals with slightly different spellings into one. | |
And I can still... | |
looks like it missed one. But that's okay, I can now group that one with the others, and | |
all the rural hospitals have been automatically grouped by that matching, and it looks like I can match an extra one here. | |
And now all of a sudden I have something. I've saved a lot of effort in doing all that recoding. | |
And of course, in JMP 15 we added new features, which I won't show here, but I want to point them out. If you have another data set with all the | |
approved...the best category names for those categories, I can go through a matching process and have it choose the closest match when it finds a match to all those things. | |
So recode has gone through a lot of evolution over the releases to reduce the pain in that operation. And of course, think about it: how much time do we spend preparing the data versus analyzing it? | |
Often anywhere from 70 to 90% of our time is spent preparing the data. And so reducing the pain of that and reducing the time of that become major wins. So, pain relief on recoding. | |
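Under the hood, a recode into a new column amounts to a Match formula; a sketch with made-up hospital spellings:

```jsl
// Map misspelled categories onto clean values (hypothetical data)
dt << New Column( "Type Cleaned", Character,
	Formula(
		Match( :Type,
			"district hosptial", "District Hospital",
			"Dist. Hospital",    "District Hospital",
			"rural hosp.",       "Rural Hospital",
			:Type  // default: leave everything else unchanged
		)
	)
);
```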
Clicking on values. Here's | |
the cities data set. | |
And let's suppose, when I'm looking at the details for some of these things, like it might be an outlier for something, I want to look at it in greater detail. And so I can turn that into a link. So | |
let's turn it into a Wikipedia page link. I'm going to copy that text there and go to column info and then turn that into a link, which is the event handler. | |
And now I can take this table and convert it into a web address. So | |
I'm going to go to wikipedia.org and then take the name of that city and change it to title case, and I can test it on that row and it brings up Albany. | |
And so now I'll say... | |
Oh, OK. | |
And now I can click on any city and then get the Wikipedia article on that city. Or I can change it to map coordinates. So let's copy that text and go to the column info, and now instead of that, I'll | |
do title case on that. | |
Let's see if that parses OK. And now I can click on Denver and it will query Denver in Google Maps. | |
Okay, so I've made things into links...I can do searches, I can do map requests and so on. I can even | |
paste this Speak call into it and click on that and have it speak the name of the city. And so this, I think, gives great power, to be able to store links in our data tables | |
that link out to web pages or do other things; anything you can express with JSL, you can do with those event handlers. So that can reduce a lot of pain. | |
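A sketch of such a link in JSL, assuming the Event Handler column property with a Click function as in the demo:

```jsl
// Clicking a cell in the City column opens its Wikipedia page
Column( Data Table( "Cities" ), "City" ) << Set Property( "Event Handler",
	Event Handler(
		Click(
			Function( {thisTable, thisColumn, iRow},
				Web( "https://en.wikipedia.org/wiki/"
					|| Titlecase( thisColumn[iRow] ) )
			)
		)
	)
);
```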
So one of my favorite pain reduction techniques is broadcasting, and this is where you hold down the command key or the control key to do multiples, where you have a lot of analyses that are similar. Okay, so | |
this works not just for menu commands, but for many buttons and for resizing graph frames and doing other things, pasting things into graphs and so on. | |
So let's do an example of that. And let's click on...or just do a distribution on all these columns. | |
And let's say I wanted to get rid of | |
this box plot. | |
And if I just unchecked | |
that menu item, I would eliminate the box plot for that one item. But what I want to do is eliminate the box plot for all items. So what I'm going to do is hold down the command key and uncheck the box plot, and now it's unchecked for all of them. I can now uncheck the quantiles | |
and it will uncheck for everything, because it has taken that command and broadcast it to all the active objects in that window, and for those active objects that understand that command, the quantiles command for example, it will then obey that command. | |
Now that even works for some things that have to do with prompting dialogues. So let's look at summary statistics. Let me hold down the command key and customize summary statistics. | |
And let's say I don't want the confidence limits, but I do want the number of missing values. And I'll say, okay, and it's going to apply that to all... | |
all the things. In order to do that, it's had to | |
take the results of that dialogue, make a script out of it and then broadcast that script to all the other places in that window. | |
Now that doesn't happen all the time. Sometimes when you get a prompt, you will get a prompt for every analysis. With every release we try | |
to implement a few more details, where the dialog is done before the broadcast, so it'll broadcast the same dialog to every place available. And of course, for everything else, I can broadcast a change and it will change, you know, the size of all these items, or I can broadcast | |
changing the background color, and I can change all the background colors to orange in all the plots | |
that seem relevant for that frame box. So this broadcasting becomes a very powerful tool. | |
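In JSL, the equivalent of that gesture is simply sending one message to the platform object; a sketch using the Big Class sample data:

```jsl
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );

// One launch, several analyses in one window
dist = dt << Distribution( Column( :height ), Column( :weight ), Column( :age ) );

// Messages sent to the platform reach all the analyses,
// like holding command/control while picking the menu item
dist << Outlier Box Plot( 0 );  // remove the box plots everywhere
dist << Quantiles( 0 );         // remove the quantiles everywhere
```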
So | |
that has saved a lot of pain for most of us, but there are still cases. One came in earlier this year: a user with 40,000 by groups wants to do an inverse prediction on all 40,000, and it prompts you 40,000 times. | |
That will be fixed for version 16. That is a lot of pain relief for that one user. | |
Saving formula columns. | |
I love formula columns. When you fit a model, you can save a column of predictions. And in that prediction there's a live formula, so you can examine that formula, you can modify it, you can apply it to new data that comes in. You can profile the saved formula. And | |
if you have | |
JMP Pro, you can go to the formula depot and do a lot of other things. You can generate code from it and so on. So the ability to save formulas is an important thing, but sometimes | |
if you have by groups, for example, saving formulas has a lot of extra subtlety. | |
For example, let's fit...this is the diabetes data, where we fit the response | |
against all these predictors, but we want to do it separately for each gender. | |
So let's do that. | |
And so we have two by groups for gender one and gender two. And now let's say we want to save | |
the predicted value, in fact, for both of these. Now | |
in the old days, before we subjected this to a lot of pain relief methods, | |
what would happen when you saved it is that it would go to a local data table. If you look in the data table window under redo, it shows the by group table. | |
So really, for gender=1, there are 235 rows in this virtual data table here, which you can show for gender=1. There's a table variable here that shows the by group for it. | |
And it would save it to this temporary table instead of the real table that you have that is really saved. | |
But | |
some time ago we changed that: when you save the prediction formula, we save it to the real data table. And if I look at this, I just saved it for gender=1, | |
and if I look at the formula for that, it shows that if gender=1, that was the by group clause, then it has this linear combination that forms a prediction for that variable. | |
And if I then do gender=2, or if I held down the command key to broadcast the prediction formula, now it has both of them, and I have both clauses available for gender 1 and gender 2. So it's adding a clause each time for that output by group. | |
By the way, the same thing happens when you save other things, when I save a column of residuals, and let's hold down the command key this time. | |
If I save a column of residuals, it will save it for each by group; it will save the residual appropriate for that by group. It's as if you subtracted the prediction formula from the original response, but this time without being a formula column. | |
In almost all places in JMP (we haven't done it everywhere, but we've done it in most places), saving into a by group, | |
or in some cases through a where clause, will save to the original data table with the condition in the If statement. Okay, so that has | |
saved a lot of effort. If that hadn't been available, it was saved in a temporary table, and then you'd have to cut and paste that formula and add If clauses to get it into the original table. So we've solved these problems. Now let's suppose I saved it again, did that whole process again, | |
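The saved column ends up holding one clause per by group, something like this sketch (coefficients made up):

```jsl
// Shape of a prediction formula saved across two by groups
If( :Gender == 1,
	12.3 + 0.57 * :Age + 1.9 * :BMI,   // clause saved for gender 1
:Gender == 2,
	10.8 + 0.49 * :Age + 2.2 * :BMI    // clause appended for gender 2
)
```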
and | |
and did another thing, say with different variables, some of them removed, | |
and now saved it again. | |
So | |
when I save the prediction formula, this time holding down the broadcast key, it actually makes a new one instead of saving into the old one, and appends a 2 after it. Well, how does it know not to save it into the old one? | |
Well, each fit has an ID to it. And if I look at the properties, | |
the By ID has a unique number, which is regenerated for each fit. | |
The different by groups within one by partition will have the same ID. And as long as it has the same By ID, it will save into the same columns; otherwise, it will make a new place to save it. | |
Also, whenever we save predicted values, | |
especially prediction formulas, | |
it will create an attribute, the creator, and in some cases other information. In this case it has the target variable Y, that's what it's predicting, and the creator, Fit Least Squares. | |
And then when you do other platforms, such as model comparison if you have JMP Pro, that model comparison will understand which predicted values refer to which creator and which target. And so it can keep track of all that information for those added-value platforms. | |
So, | |
formula columns work by adding new clauses, and other properties are used, including the By ID. | |
And the prediction clauses, if you save them for categorical responses, save a whole range of columns holding the probabilities for each response level, and all of that comes with all the metadata it needs. And that should save a lot of pain. | |
So, | |
removing effects from models. | |
Well, let's look at an example | |
of just fitting height by weight. | |
Well, here's an example of a high-degree polynomial. This is a seventh-order polynomial | |
that fits better than sixth-order polynomials and so on. But would you trust that fit? | |
It turns out that if you give the model a lot of flexibility, by introducing high polynomial terms or just by introducing more variables, it gives a lot more opportunities to fit. And in this case, | |
it's allowed the flexibility to make a deep dive between the previous data and the last data point, just to fit that last data point better. | |
And so it's overfitting. It's using that parameter to fit noise instead of fitting the | |
data itself. | |
And so overfitting is a problem anytime you have big models. So | |
you want to fit the signal, not the noise. Okay, large models tend to introduce more variation into the prediction, because the prediction, after all, is | |
a function of the y variable you're using. But that y variable is a systematic part plus an error part, and if that error part | |
is random, then your prediction involves that randomness. And if you allow it too much flexibility, it's going to end up with | |
an overfitting problem, where you're going to predict much worse by including all the variables in that model. | |
So the cure for that is to shrink the model, to reduce the size of the model or reduce the size of the coefficients in the model, | |
so that less of that variation from the random term of the model is transmitted into the predicted value. And so | |
in small DOE problems, this is not an issue, because the data is small and well-balanced to fit exactly the model you're going for. But for observational data and any large models, overfitting is a real problem. | |
So users often didn't appreciate that until we introduced cross validation in JMP Pro. So if you have JMP Pro, you can set up a validation column, | |
which will hold back some of the data in order to estimate the error on that hold back data set. And here's an example where I have...I'm trying to predict the concrete properties | |
depending on all these ingredients in the concrete, and I have a huge model for it. And if I just run that model, but hold back some of the data and look at the R square on that, | |
For SLUMP, I have a great fit; I have an R square of .79. But on the validation set, I have an R square that's negative. | |
Any R square that's negative means it fits worse than just fitting the mean. So if I'd fit the mean, I'd have an R square of zero. If I fit this whole model, I have an R square that's much worse than that. So with large models, you can do worse than just fitting the mean. | |
It's kind of anti-informative, because the model is so big and we made no effort at cutting down the size of the model. The model has been dominated by the noise and not the signal. So this is a problem that you should pay attention to. | |
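The arithmetic behind a negative R square is worth one line. With the standard definition

$$ R^2 \;=\; 1 \;-\; \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}, $$

fitting the mean gives $\hat y_i = \bar y$ and hence $R^2 = 0$; on held-back validation data, an overfit model can have prediction error exceeding the total variation around the mean, pushing the ratio above 1 and $R^2$ below zero.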
So, the important part is to be able to reduce the size of the model. | |
Now we did introduce a model dialog command to do that. | |
Let's go into diabetes with the model. | |
I can run this model, and let's say I want to take out a lot of these things that are not very significant. You know, age | |
has a totally non-significant contribution to the model, and so I want to eliminate age. Well then, I could go back to Fit Model and | |
recall it and then eliminate age. Or I can just go back to the model dialog directly here, find age and remove it. But I may have a long list here, and going back and forth to do these things is a fair amount of work. | |
And so rather than do that... | |
let's see, what am I doing here. | |
Several releases ago, we introduced a new report called the Effects Summary. | |
And with the Effects Summary, it makes it trivially easy to subset the model to make it predict better, | |
give it less flexibility. | |
So I can say, well, age, I can remove that. | |
Or, well, there's lots of variables. Let's remove three more. | |
LDL, well, that looks more significant. And so I can save that model. | |
Let's just save it to the data table. | |
And I come back later and say, well, let me remember that model that has this. | |
I can actually undo the previous thing. And so I can undo that and it brings back | |
three of the variables, and I can undo that and it brings back age. So when I use the Effects Summary to edit the model, it actually stores a memory of all those things. And if I look at that script, | |
I can see this history thing. So every time I edited the model, it's storing a clause of history. And when I do the undos, it's undoing back to that history. | |
So removing is easy. I can also add things. If I subtracted these two things, I could add them back. I could add back, say, age. | |
Not a good thing to do, but I can do it. I can also edit the model by bringing back a small version | |
of the model dialog and adding compound effects and so on. | |
Now another thing that happens with large models | |
(let me undo this). | |
With large models, | |
you're doing a lot of hypothesis tests. | |
So if you have a large number of hypothesis tests, there are some adjustments that you should consider, and one of them is called the false discovery rate. So instead of treating all these p-values as if they were independent tests, | |
I want to apply an adjustment so that those p-values are adjusted for the multiple-testing bias, for the selection bias involved in subsetting the model. And so I can apply a false discovery rate | |
correction to it. So instead of the regular p-value, I have the false-discovery-adjusted p-value for that. And now I'm being more realistic. So this is going to help me with the overfitting problem and the selection bias problem in doing a lot of multiple | |
tests. | |
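For reference, the usual Benjamini-Hochberg form of that adjustment (the standard formula; JMP's exact implementation may differ in details): with $m$ tests and ordered p-values $p_{(1)} \le \dots \le p_{(m)}$,

$$ \tilde p_{(i)} \;=\; \min_{j \ge i} \, \min\!\left( \frac{m}{j}\, p_{(j)},\; 1 \right). $$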
So now, let's do the next topic, and that's transforms. | |
Suppose you want to do a model, but instead of y, you want to fit the log of y. | |
Well, before, what you would do is create a new column in the data set, Log y, | |
and then for Log y, you specify a formula. And I can take a log of it. | |
And now I can go back to my model specification and use that Log y, | |
and then I've got my fit. | |
Oh, it's missing, what did I do? I forgot to enter the argument. | |
And so now I can fit the log of Y. | |
And now I've done it. But let's do the profiler. | |
My profiler is in terms of the log of Y. Let me save the predicted value. My prediction formula is in terms of the log of y, | |
and now I'm going to have to go and hand-edit that formula | |
and take the exponential of it | |
to bring it back to the original scale. So that's a lot of pain, doing transforms that way. | |
Well, several releases ago | |
we introduced transforms. | |
So I can take that variable and | |
transform it to the log, | |
and now use it directly there as a transform; it's not part of the data table. It's a virtual variable with a formula, but not added to the data table. | |
And now I can fit my model to the Log y. Let's | |
remove age so it fits a little better. | |
Whoops. | |
Had that selected too. | |
And now I've fit the response Log y. | |
So if I profile that, what do I get? | |
Factor Profiling, Profiler. Instead of profiling the log of y, it profiles y. The profiler looks at that as a transform and says, well, I can invert that transform and go back to the original scale. And that's what it does. | |
It | |
untransforms, back-transforms, through the log of y to take the prediction and put it on the original scale. | |
And it will do that for most transforms that it can unwind. If a transform involves multiple variables and so on, it can't do it, and it will just keep the transform. | |
Same thing when I save a column. When I save the prediction formula, | |
it saves a prediction formula on the scale of y rather than the log of y. | |
And so these things are incredibly time-saving and save a lot of effort in using transforms. | |
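Scripted, such a virtual transform shows up as a Transform Column in the launch; a sketch assuming the diabetes sample table's column names:

```jsl
// Fit the log of Y without adding a column to the data table
Fit Model(
	Y( Transform Column( "Log[Y]", Formula( Log( :Y ) ) ) ),
	Effects( :Age, :BMI, :LDL ),
	Personality( "Standard Least Squares" ),
	Run
);
```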
So let me go to diabetes again and consider another transform. | |
When you're modeling a variable, | |
there's a Box-Cox transformation that you can get. | |
And | |
among the factor profiling options, the Box-Cox option | |
tells you, if you transformed over a whole range of power functions, | |
what would be the best to do. Should it be just untransformed? That would be a Box-Cox lambda value of one. | |
If you did zero, that would be equivalent to taking the log of it. If you did -1, that would be equivalent to taking the reciprocal of it, | |
the power to the -1. If you did around .5, that would be equivalent to taking the square root of it. | |
And it's telling you that this model would fit better on a transformed scale, adjusted for that transform, if it was more along the square root transformation, where lambda was .453. That's the optimal value in the Box-Cox transformation. | |
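That whole range of power functions is the Box-Cox family,

$$ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[4pt] \log y, & \lambda = 0, \end{cases} $$

so $\lambda = 1$ is essentially no transform, $\lambda = 0$ the log, $\lambda = -1$ the reciprocal, and $\lambda = 0.5$ the square root. (JMP rescales the family by the geometric mean of $y$ so the fits are comparable across $\lambda$, but the shape is the same.)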
And you can zoom in on this with a magnifier to get it more precisely. | |
So now I can transform. And several releases ago, | |
we added several commands. One is refit with transform, which will make a new window with the transformed response. | |
And another is replace with transform, and I'm going to do that. And rather than .453, I'm going to just take the square root transformation (.5), and now I fit the model with that transform. | |
And now lambda best is around 1, which is where it should be, because it's already been transformed once by the Box-Cox transformation. And now I can save the predicted value of that and | |
profile it and so on. And I can even undo it. So if I don't like that transform and I want to go back to the original, I can go back to the original and it refits. | |
So we've done a lot of pain saving in transforming responses. | |
So now, there's a special pain, a special place of pain, when you have a lot of data. | |
And we've gone to a lot of effort to try to solve big problems with less pain. | |
Whether you have lots of rows, lots of columns, lots of groups, many models to try, | |
in today's world we live in a world of big data with big problems. | |
So if we have analyses that were originally designed to handle small problems, they may not be appropriate for large ones. And the central problem with big problems is that there's just too much output to sort through. | |
If I fit the same model to 1,000 variables, I have 1,000 reports to sort through. If I'm looking for outliers among 1,000 columns, there are separate reports for each column, and so on. I want to be able to | |
more efficiently get through a lot of large data sets. So we developed the screening menu, | |
which is meant to solve these large kinds of problems. And plus, there are lots of places throughout JMP that solve large problems better as well. For example, time series: the new Time Series Forecast platform will forecast many time series, instead of just one. | |
So the items on the screening menu: Explore Outliers, Explore Missing Values, Explore Patterns. These are for doing checks of data. And then there are things looking for associations: Response Screening, | |
Process Screening, Predictor Screening. And of course, Time Series Forecast is a new item; it's not in the screening menu, but it's organized for handling large problems. | |
And all these things take advantage not just of more compact ways to represent the results, so you don't have the thousands of reports, but they're also computationally efficient. They use multithreading, so they take advantage of the multiple cores in your CPU to make it very fast. | |
So let's say you've got a lot of process data; you have 568 variables. So which of these variables looks healthy? Well, Process Screening is designed to answer that. And so it can sort by the stability, or sort by the capability (Ppk), or by which ones are badly off, or sort by | |
control chart measures, out-of-control counts and so on. | |
But what I like even better are some of the tools that show all the processes in one graph. And there are two of them that I love. One is | |
the goal plot, which shows | |
how each process behaves with respect to the spec limits, and if it's a capable process, it's in this green zone here. | |
If it's marginal, it's in the yellow zone. If it's not, it's in the red zone. And if it's high up, it has too much variance. If it's to one side or the other, then there's a problem. | |
It's off center, it's off target. | |
But if it's in the green zone, then it's a good process. And with version 15, we introduced | |
the graphlets, the hover help, | |
so you can see each process as you hover over it. And then the other | |
plot that I love, in summarizing all these things and reducing the pain of looking through all these reports, is this process performance graph. So the vertical axis tells you whether you're within spec limits, by the capability Ppk. | |
So if you're above this line at 1.33, then you're looking fine as far as the distribution of values with respect to the spec limits. You're well within the spec limits. | |
Then the stability index is on the x axis. So a process might be capable but unstable; if you look at that process, it might have something | |
that wanders around some. And if you're in the yellow zone, you're | |
capable but unstable, and so on. The red zone is the bad zone, where you're both incapable and unstable. And so looking through hundreds or thousands of processes is easy now, where it used to be a lot of pain. | |
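Both axes of that graph are simple ratios; hedging on JMP's exact internals, the standard definitions are

$$ P_{pk} = \min\!\left( \frac{USL - \mu}{3\,\sigma_{\text{overall}}},\; \frac{\mu - LSL}{3\,\sigma_{\text{overall}}} \right), \qquad \text{Stability Index} = \frac{\sigma_{\text{overall}}}{\sigma_{\text{within}}}, $$

so capability compares the process spread to the spec limits, and stability compares long-run variation to short-run variation (a stable process has a ratio near 1).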
The question is, what changed the most? Here's some survey data where, | |
over many different years, people were asked questions about their activities. Did you | |
go camping? Or gamble in a casino? Or did you have a cold? You know, all these activities and you want to know which among all these activities (and there's | |
I think, 95 different activities) which of these made the most difference. And so you're looking for, you know...one thing you could do is go through one at a time | |
and fit that activity by year of survey. | |
Instead of looking at them one at a time, I can use Response Screening | |
and just look at one chart to see what changed the most. | |
And so this is showing a chart for all 95 of those things. First, the significance in terms of the negative log of the p-value, | |
which we call the LogWorth, which is adjusted for the false discovery rate. And so it takes care of some of that selection bias, because you're sorting all those p-values | |
and selecting the low p-values. And I find that renting a videotape changed the most, video cassette tape, and of course they became obsolete. So, of course, it changed. | |
Another video cassette tape changed a lot, collected things to recycling changed a lot, entertained people in my home changed a lot. | |
And here's one that's less significant but has a big effect size: do you use the internet. And of course, this survey started before there was an internet, and so it changed a lot. So the question of what changed the most | |
is easy now, where it used to be hard. | |
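For reference, the LogWorth scale on that chart's vertical axis is

$$ \text{LogWorth} = -\log_{10}(p\text{-value}), $$

so a LogWorth of 2 corresponds to $p = 0.01$, and larger values are more significant.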
Another question you ask about big data: where are the missing values? So here are | |
280 variables. | |
Where are the missing values? Do I need to worry about them? And I can look at the explore missing values report and see that, well, there are only five of these variables with missing values. | |
And some of them only have one missing value, but some of them have a lot. So when I do an analysis, a multivariate analysis, I probably don't want to include | |
376 out of 452 variables. Or if I want to include them, I can go through | |
and impute those missing values by doing that, okay. | |
So, | |
Next question, does the data have outliers? | |
Well, I have 387 measurements in this process, and I want to find out if there are outliers in there. So now we have a facility to do it in the screening menu. I can | |
make this more or less sensitive. | |
Let's make it more sensitive | |
so there's fewer outliers and rescan. | |
I can look at | |
the nines. So often a string of nines is used to represent a missing value, and those nines may be real nines, but they may be just an indicator of a missing value. And so for those, I can say, well, add those nines to the missing value codes, and now that has changed. | |
And now I can | |
go back up (there are a lot of variables here) and rescan, and there are fewer missing values to worry about. | |
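Those nines land in the column's Missing Value Codes property; a JSL sketch with a hypothetical column name and code:

```jsl
// Treat 9999 as missing rather than as a real measurement
Column( Data Table( "Processes" ), "M001" )
	<< Set Property( "Missing Value Codes", {9999} );
```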
So exploring outliers used to be a pain. | |
It still can be a big job, but it's a lot easier than it used to be. | |
Now, in version 15, we added another screening platform: does the data have suspect patterns? And so here's some lab data from a clinical trial. This is the nicardipine lab patterns data. There are 27 laboratory results that I may want to look at. And so I invoke | |
the new platform, Explore Patterns. And this is going to show me, do I have a run of values? So I have the value .03536, but there are seven in a row, starting in row | |
2065. I can colorize those cells, and I can look at those values, and maybe it's the last value carried forward, which may not be suspicious. In this case, it's the... | |
it's the same person. So maybe last value carried forward is a reasonable thing to do, but it is a rare, rare event | |
if you assume a random distribution for these things. | |
Also, longest duplicated sequence. So for this variable, starting in row 2816 and also starting in row 3034, there are four in a row that had the same values. So if I colorize those things, there are four in a row there. And if I | |
go to the next sequence, there are four in a row that have the same values. So that might be a symptom of cutting and pasting the same values from one place to another. | |
So Explore Patterns is looking for those things. And there are many other things: you can look at the details on each of those 27 variables and look for symptoms of suspicious things, or | |
bad effects of the way you processed the data, and so on. | |
So | |
Explore Patterns is part of solving big problems. | |
So, | |
Pain Relief. Much of our development is focused on making the flow of analyses smoother, less burdensome, less time-consuming and less painful. Analgesics. | |
Now we don't always know what's painful, so we depend on feedback for what to focus on. When we get those emails saying I had to respond to a prompt hundreds of times, | |
we listen to that, and we feel the pain, and we fix it, so that now you can broadcast into a by group with thousands of things | |
and broadcast the results of one dialog rather than dialoguing many times. So sending it in made it better for everyone else, because we didn't catch it the first time around. So please give us your feedback. | |
With all the improvements we've already made, we think the process of data preparation and analysis has already become much smoother, much less interrupted, more in flow. | |
So, | |
instead of spending your time getting over obstacles, you spend your time learning from your data, understanding your data. | |
Analgesics and analytics. One would wonder if they came from the same root. Of course, we don't like to abbreviate those two words. | |
Analytics comes from the Greek analyein, and I don't know how to pronounce that. | |
But in Greek, it means to break up, a release, or set free. And it's taking something complicated and breaking it up into pieces so we can understand it. | |
And that's, of course, exactly what we do with analyzing data, data science. And analgesics comes from a different combination of words: "an-" meaning without, and "algesis," which is the sense of pain. So same prefix, a little bit different roots. | |
And don't abbreviate those. | |
Anabolic and aphrodisiac: we care about power. | |
Much of what we do is to give you more powerful tools for analyzing data, much of that in JMP Pro, as well as in JMP. | |
And we hope that data makes it exciting, you know, the thrill of discovery. | |
The thrill of learning how to use power features in JMP and we think it's exciting. You know, it's an aphrodisiac. | |
So power and excitement are also of value to us. It's not just pain relief. | |
So what are we going to do next year? | |
Well, big words starting with the letter B. | |
Start with A, next is B, right? | |
Well, next year we have JMP 16 coming. And so that's what we're going to talk about. Who knows what we're going to talk about the year after. So thank you much. And we hope you suffer very little pain in analyzing your data. | |
Jeff Perkinson | Thank you very much, John. We appreciate that. It was a fantastic talk and having been around to witness a lot of the pain over the years, |
it is, it is...everything you say is absolutely true. Pain relief is an important thing for us. | |
We did have one question that came in, actually, a number of questions; we've answered some of them in the Q&A. But what I wanted to throw to you is this: what feature has both relieved pain and provided some attraction and made JMP more powerful? | |
John Sall | Well, I think everyone's big delight is Graph Builder. |
It feels incredibly powerful to just drag those variables over and do a few other clicks and you have the graph that you want, and you can change it so easily. | |
So it's, it's a thrill. It's a powerful feature and it's pain relief; it used to be harder to do. So I think that's everyone's favorite thing. | |
But of course there's...JMP is a rich product and we're proud of everything in it. Design of experiments, all | |
the great power involved in there and we've tried to make that process easy as well. And so many things come to mind. | |
Jeff Perkinson | Very good. Thank you very much, John, I appreciate it. If you have enjoyed this talk, I have two suggestions for you. One, we will be posting this |