
XGBoost Add-In for JMP Pro (2020-US-45MP-540)

Russ Wolfinger, Director of Scientific Discovery and Genomics, Distinguished Research Fellow, JMP
Mia Stephens, Principal Product Manager, JMP

 

The XGBoost add-in for JMP Pro provides a point-and-click interface to the popular XGBoost open-source library for predictive modeling with extreme gradient boosted trees. Value-added functionality includes:

•    Repeated k-fold cross validation with out-of-fold predictions, plus a separate routine to create optimized k-fold validation columns

•    Ability to fit multiple Y responses in one run

•    Automated parameter search via JMP Design of Experiments (DOE) Fast Flexible Filling Design

•    Interactive graphical and statistical outputs

•    Model comparison interface

•    Profiling

•    Export of JMP Scripting Language (JSL) and Python code for reproducibility

 

Click the link above to download a zip file containing the journal and supplementary material shown in the tutorial.  Note that the video shows XGBoost in the Predictive Modeling menu but when you install the add-in it will be under the Add-Ins menu.   You may customize your menu however you wish using View > Customize > Menus and Toolbars.

 

The add-in is available here: XGBoost Add-In for JMP Pro 

 

 

Auto-generated transcript...

 


Speaker

Transcript

Russ Wolfinger Okay. Well, hello everyone.
  Welcome to my home here in Holly Springs, North Carolina.
  With the Covid craziness going on, this is kind of a new experience to do a virtual conference online, but I'm really excited to talk with you today and offer a tutorial on a brand new add in that we have for JMP Pro
  that implements the popular XGBoost functionality.
  So today for this tutorial, I'm going to walk you through what we've got. What I've got here is a JMP journal
  that will be available in the conference materials. And what I would encourage you to do, if you'd like to follow along yourself,
  you could pause the video right now and go to the conference materials, grab this journal. You can open it in your own version of JMP Pro,
  and there's also a link to install the add-in. If you go ahead and install it, you'll be able to reproduce everything I do here exactly at home, and even do some of your own playing around. So I'd encourage you to do that if you can.
  I do have my dog Charlie, he's in the background there. I hope he doesn't do anything embarrassing. He doesn't seem too excited right now, but he loves XGBoost as much as I do so, so let's get into it.
  XGBoost is a pretty incredible open-source C++ library that's been around now for quite a few years.
  And the original theory was actually done by a couple of famous statisticians in the '90s, but then the University of Washington team picked up the ideas and implemented it.
  And it...I think where it really kind of came into its own was in the context of some Kaggle competitions.
  Where it started...once folks started using it and it was available, it literally started winning just about every tabular competition that Kaggle has been running over the last several years.
  And there's actually now several hundred examples online if you want to do some searching around, you'll find them.
  So I would view this as arguably the most popular and perhaps the most powerful tabular data predictive modeling methodology in the world right now.
  Of course there's competitors and for any one particular data set, you may see some differences, but kind of overall, it's very impressive.
  In fact, there are competitive packages out there now that do very similar kinds of things: LightGBM from Microsoft and CatBoost from Yandex. We won't go into them today, but they're pretty similar.
  Let's, uh, since we don't have a lot of time today, I don't want to belabor the motivations. But again, you've got this journal if you want to look into them more carefully.
  What I want to do is kind of give you the highlights of this journal and particularly give you some live demonstrations so you've got an idea of what's here.
  And then you'll be free to explore and try these things on your own, as time goes along. You will need...you need a functioning copy of JMP Pro 15
  at the earliest, but if you can get your hands on JMP 16 early adopter, the official JMP 16 won't ship until next year, 2021,
  but you can obtain your early adopter version now and we are making enhancements there. So I would encourage you to get the latest JMP 16 Pro early adopter
  in order to obtain the most recent functionality of this add-in. Now, this is kind of an unusual new setup for JMP Pro.
  We have written a lot of C++ code right within Pro in order to integrate with the XGBoost C++ API. We do most of our work in C++,
  but there is an add-in that accompanies this that installs the dynamic library and does a little menu update for you, so you need both Pro and the add-in installed in order to run it.
  So do grab JMP 16 Pro early adopter if you can, in order to get the latest things, and that's what I'll be showing today.
  Let's dive right in and do an example. This is a brand new one that just came to my attention. It's got a very interesting story behind it.
  The researcher behind these data is a professor named Jaivime Evaristo. He is an environmental scientist, a PhD, an assistant professor
  now at Utrecht University in the Netherlands, and he's kindly provided this data with his permission, as well as the story behind it, in order to help others. There was a bit of a drama around these data. Jaivime and his colleague Jeffrey McDonnell
  collected all this data. The purpose is to study water streamflow runoff in deforested areas around the world.
  And you can see, we've got 163 places here, at least half in the US and then around the world, different places where they were able to find
  regions that have been cleared of trees and then they took some critical measurements
  in terms of what happened with the water runoff. And this kind of study is quite important for studying the environmental impacts of tree clearing and deforestation, as well as climate change, so it's quite timely.
  And happily for Jaivime at the time, they were able to publish a paper in the journal Nature, one of the top science journals in the world, a really nice
  experience for him to get it in there. Unfortunately, what happened next, though, was that a competing lab became very critical of what they had done in this paper.
  And it turned out after a lot of back and forth and debate, the paper ended up being retracted,
  which was obviously a really difficult experience for Jaivime. He's been very gracious in sharing his story, in the hope of helping others
  avoid this. And it turns out that what's at the root of the controversy, there were several other things, but what the main beef was from the critics is this:
  Jaivime had done a boosted tree analysis. And it's a pretty straightforward model. We've only got a handful of predictors, each of which is important, and one of their main objectives was to determine which ones were the most important.
  He ran a boosted tree model with a single holdout validation set and published a validation holdout R square of around .7. Everything looked okay,
  but then the critics came along, reanalyzed the data with a different holdout set, and got a validation holdout R square
  of less than .1. So quite a huge change, going from .7 to less than .1, and the critics really jumped all over this and tried to discredit what was going on.
  Now, this happened last year, and the paper was retracted earlier here in 2020.
  Jaivime shared the data with me this summer and my thinking was to do a little more rigorous cross validation analysis and actually do repeated K fold,
  instead of just a single holdout, in order to try to get to the bottom of this discrepancy between two different holdouts. And what I did, we've got a new routine
  that comes with the XGBoost add-in that creates K fold columns. And if you'll see the data set here, I've created these. For sake of time, we won't go into how to do that. But
  there is a new module now that comes with the add-in called Make K Fold Columns that will let you do it. And I did it in a stratified way. And interestingly, it calls JMP DOE under the hood.
  And the benefit of doing it that way is you can actually create orthogonal folds, which is not very common. Here, let me do a quick distribution.
  This was the holdout set that Jaivime did originally, and he did stratify, which is a good idea, I think, as the response is a bit skewed. And then this was the holdout set that the critic used,
  and then here are the folds that I ended up using. I did three different schemes. And then the point I wanted to make here is that these folds are nicely orthogonal,
  where we're getting kind of maximal information gain by doing K fold three separate times with three orthogonal sets.
  So, and then it turns out, because he was already using boosted trees, so the next thing to try is the XGBoost add in. And so I was really happy to find out about this data set and talk about it here.
  Now what happened...let me do another analysis here where I'm doing a one way on the validation sets. What I'm doing here is taking the response, this water yield corrected,
  and plotting that versus the validation sets. It turned out that with Jaivime's holdout,
  the top four or five measurements all ended up in his training set, which I think is kind of at the root of the problem.
  Whereas in the critics' set, it was balanced a little bit more, and in particular the highest scoring location was in the validation set. And so this is a natural source of error, because it's going beyond anything that was seen in training.
  And I think this is really a case where a K fold type analysis is more compelling than just doing a single holdout set.
  I would argue that both of these single holdout sets have some bias to them and it's better to do more folds in which you stratify...distribute things
  differently each time and then see what happens after multiple fits. So you can see how the folds that I created here look in terms of distribution and then now let's run XGBoost.
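As a rough illustration of the repeated, stratified K fold idea described above, here is a minimal Python sketch. It is not the add-in's Make K Fold Columns routine (which, per the talk, uses JMP DOE to get orthogonal folds); it simply sorts the rows by the response and deals fold labels out within each block of neighboring ranks, so the largest values cannot all land in one fold. The function names and the simulated response are assumptions made for illustration.

```python
import random

def stratified_fold_column(y, k, seed):
    """Return one fold id (1..k) per row, stratified on the ranked response."""
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])  # row indices, smallest y first
    folds = [0] * len(y)
    # Take the rows in blocks of k consecutive ranks and deal the k fold
    # labels out in random order within each block.
    for start in range(0, len(order), k):
        block = order[start:start + k]
        labels = list(range(1, k + 1))
        rng.shuffle(labels)
        for row, label in zip(block, labels):
            folds[row] = label
    return folds

def repeated_fold_columns(y, k=5, repeats=3, seed=0):
    """Build several independent fold columns (repeated K fold)."""
    return [stratified_fold_column(y, k, seed + r) for r in range(repeats)]

# Simulated skewed response; only the row count (163) comes from the talk.
rng = random.Random(1)
y = [rng.expovariate(1.0) for _ in range(163)]

fold_columns = repeated_fold_columns(y, k=5, repeats=3, seed=42)
for column in fold_columns:
    print([column.count(f) for f in range(1, 6)])  # rows per fold
```

Each column plays the role of one fold column in the data table; three columns of five folds each give the 15 fits mentioned later in the talk.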
  So the add-in actually has a lot of features and I don't want to overwhelm you today, but again, I would encourage you to pause the video at places if you are trying to follow along yourself,
  just to make sure you keep up. But what we did here, I just ran a script. By the way, JMP tables are nice in that you can save scripts to them, and everything in the journal has them. So what I did here was run XGBoost from that script.
  Just for illustration, I'm going to rerun this right from the menu. This will be the way that you might want to do it. So when you install the add-in,
  you hit Predictive Modeling and then XGBoost. So we added it right here to the Predictive Modeling menu. And so the way you would set this up
  is to specify the response. Here's Y. There are seven predictors, which we'll put in here as X's, and then you put the fold columns in as validation.
  I wanted to make a point here about those of you who are experienced JMP Pro users, XGBoost handles validation columns a bit differently than other JMP platforms.
  It's kind of an experimental framework at this point, but based on my experience, I find repeated K fold to be a very compelling way to do it, and I wanted to set up the add-in to make it easy.
  And so here I'm putting in these fold columns again that we created with the utility, and XGBoost will automatically do repeated K fold just by specifying it like we have here.
  If you wanted to do a single holdout like the original analyses, you can set that up just like normal, but you have to make the column
  continuous. That's a gotcha. And I know some of our early adopters got tripped up by this, and it's a different convention than
  other predictive modeling routines within JMP Pro, but this to me seemed to be the cleanest way to do it. And again, the recommended way would be to run
  repeated K fold like this, or at least a single K fold and then you can just hit okay. You'll get this initial launch window.
  And the thing about XGBoost, is it does have a lot of tuning parameters. The key ones are listed here in this box and you can play with these.
  And then it turns out there are a whole lot more, and they're hidden under this advanced options, which we don't have time at all for today.
  But these are the most important ones, and for most cases you can just worry about them. So let's go ahead and run this again. Just from here you can click the Go button and then XGBoost will run.
  Now I'm just running on a simple laptop here. This is a relatively small data set. And so we just did repeated fivefold,
  three different times, just in a matter of seconds. XGBoost is pretty well tuned and will work well for larger data sets, but for this small example, let's see what happened.
  Now it turns out, this initial graph that comes out raises an immediate flag.
  What we're looking at here is, over the number of fitting iterations, we've got a training curve, which is basically the loss function that you want to go down.
  But then the solid curve is the validation curve. And you can see what happened here. Just after a few iterations occurred this curve bottomed out and then things got much worse.
  So this is actually a case where you would not want to use this default model. XGBoost has already overfitted,
  which often will happen for smaller data sets like this, and it does call for tuning.
  There's a lot of other results at the bottom, but again, I wouldn't consider them trustworthy at this point. You would want to do a little tuning.
  For today, let's just do a little bit of manual tuning, but I would encourage you. We've actually implemented an entire framework for creating a tuning design,
  where you can specify a range of parameters and search over the design space and we again actually use JMP DOE.
  So it's a...we've got two different ways we're using DOE already here, both of which have really enhanced the functionality. For now, let's just do a little bit of manual tuning based on this graph.
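To give a feel for what a tuning design over a range of parameters looks like, here is a small Python sketch that draws a Latin hypercube sample. This is a stand-in, not the Fast Flexible Filling design that the add-in obtains from JMP DOE. The parameter names `max_depth`, `learning_rate`, and `subsample` are standard XGBoost parameters, but the ranges and run count are made up for illustration.

```python
import random

def latin_hypercube(ranges, n_runs, seed=0):
    """ranges maps a parameter name to (low, high); returns n_runs settings."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in ranges.items():
        # One random point inside each of n_runs equal-width strata,
        # then shuffle so strata are paired at random across parameters.
        points = [low + (high - low) * (i + rng.random()) / n_runs
                  for i in range(n_runs)]
        rng.shuffle(points)
        columns[name] = points
    return [{name: columns[name][i] for name in ranges} for i in range(n_runs)]

# Hypothetical search ranges for three common XGBoost parameters.
ranges = {
    "max_depth": (2, 8),
    "learning_rate": (0.01, 0.3),
    "subsample": (0.5, 1.0),
}

design = latin_hypercube(ranges, n_runs=10, seed=7)
for run in design:
    run["max_depth"] = int(round(run["max_depth"]))  # tree depth must be an integer
    print(run)
```

Each row of the design would be one model fit; the run with the best validation metric is the candidate to keep.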
  If we zoom in on this graph, we see that the curve is bottoming out literally just after three or four iterations. So one real easy thing we can do is just stop training after four steps
  and see what happens. By the way, notice what happened with our overfitting: our validation R square was actually negative, so quite bad. Definitely not a recommended model. But if we run this new model where we're only going to take four steps,
  look at what happens.
  Much better validation R square. We're now up around .16, and in fact, let's try three just for fun. See what happens.
  Little bit worse. So you can see this is the kind of thing where you can play around. We've tried to set up this dialogue where it's amenable to that.
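The manual tuning above amounts to picking the iteration where the validation curve bottoms out. A minimal Python sketch of that logic follows. The loss values are invented to mimic the shape described in the talk (a minimum after about four iterations, then getting worse), and the `patience` rule is a common early stopping convention rather than anything taken from the add-in.

```python
def best_iteration(validation_loss):
    """Return the 1-based iteration with the lowest validation loss."""
    best = min(range(len(validation_loss)), key=lambda i: validation_loss[i])
    return best + 1

def early_stop(validation_loss, patience=2):
    """Stop once the loss has not improved for `patience` iterations in a row.

    Returns the 1-based iteration of the best loss seen before stopping.
    """
    best_loss, best_iter = float("inf"), 0
    for i, loss in enumerate(validation_loss, start=1):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            break
    return best_iter

# Made-up curves: training loss keeps falling while validation loss turns up.
training_loss = [1.00, 0.80, 0.64, 0.51, 0.41, 0.33, 0.26, 0.21, 0.17]
validation_loss = [1.00, 0.93, 0.90, 0.89, 0.92, 0.97, 1.05, 1.16, 1.30]

print(best_iteration(validation_loss))          # iteration with the minimum
print(early_stop(validation_loss, patience=2))  # where early stopping would land
```

With repeated K fold there is one validation curve per fit, so in practice the curve being inspected would be an aggregate over the folds.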
  And you can do some model comparisons. This table here at the beginning helps you: you can sort by different columns and find the best model, and then down below, you can drill down on various modeling details.
  Let's stick with Model 2 here, and what we can do is...
  Let's only keep that one and you can clean up...you can clean up the models that you don't want, it'll remove the hidden ones.
  And so now we're back down to just the model that we want to look at in more depth. Notice here our validation R square is .17 or so,
  which, remember, is actually falling in between what Jaivime got originally and what the critic got.
  And I would view this as a much more reliable measure of R square because, again, it's computed over all the folds. We actually ran 15 different modeling fits,
  fivefold three different times, so this is an average over those. So I think it's a much cleaner and more reliable measure of how the model is performing.
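Here is a small Python sketch of a validation R square averaged over repeated K fold fits. How the add-in aggregates its 15 fits is not spelled out in the talk, so averaging the per-fit R square values is an assumption, and a least squares line on simulated data stands in for the boosted tree model. The sketch also shows why a validation R square can be negative: predictions that are worse than the mean give a value below zero.

```python
import random

def r_square(actual, predicted):
    """1 - SSE/SST; negative when predictions are worse than the mean."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def fit_line(x, y):
    """Least squares slope and intercept (stand-in for a real model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def repeated_kfold_r2(x, y, fold_columns):
    """Fit on all folds but one, score on the held-out fold, for every column."""
    scores = []
    for folds in fold_columns:
        for held_out in sorted(set(folds)):
            train = [i for i, f in enumerate(folds) if f != held_out]
            valid = [i for i, f in enumerate(folds) if f == held_out]
            slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
            preds = [slope * x[i] + intercept for i in valid]
            scores.append(r_square([y[i] for i in valid], preds))
    return scores

# Simulated data and three balanced fivefold columns (all hypothetical).
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(60)]
y = [2.0 * xi + rng.gauss(0, 2) for xi in x]
fold_columns = []
for _ in range(3):
    folds = [i % 5 + 1 for i in range(len(x))]
    rng.shuffle(folds)
    fold_columns.append(folds)

scores = repeated_kfold_r2(x, y, fold_columns)
mean_r2 = sum(scores) / len(scores)
print(len(scores), round(mean_r2, 3))
```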
  If you scroll down, for any model that gets fit there's quite a bit of output to go through. Again, JMP is very good about this: we always try to have graphics near statistics so that you can
  both see what's going on and attach numbers to it, and these graphs are live as normal, nicely interactive.
  But you can see here, we've got a training actual versus predicted and validation. And things almost always do get worse for validation, but that's really what the reality is.
  And you can see again where the errors are being made. This is that really high point, and this is the 45 degree line.
  So that high measurement, and all the high ones, tend to be under predicted, which is pretty normal. I think any kind of method like this is going to tend to want to shrink
  extreme values down, just to be conservative. And so you can see exactly where the errors are being made and to what degree.
  Now for Jaivime's key interest, he was mostly interested in seeing which variables were really driving
  this water yield corrected response. And we can see the one that leaps out as number one is this one called PET.
  There are different ways of assessing variable importance in XGBoost. You can look at the straight number of splits, as well as a gain measure, which I think is
  maybe the best one to start with. It's kind of how much the model improves each time you split on a certain variable. There's another one called cover.
  In this case, for any one of the three, this PET is emerging as the most important. So based on this quick analysis, that seems to be where the action is for these data.
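The three importance measures mentioned above (number of splits, gain, and cover) can all be tallied from the list of splits in the fitted trees. The Python sketch below does this for a handful of invented split records; only the predictor name PET comes from the talk, and `X2` and `X3` are placeholders. It reports totals, whereas XGBoost's own `gain` and `cover` importance types are averages per split, so treat the exact scaling as an assumption.

```python
from collections import defaultdict

# Hypothetical split records: one entry per internal tree node.
# "gain" is the loss reduction from the split; "cover" is roughly how
# many rows (weighted) pass through that node.
splits = [
    {"feature": "PET", "gain": 12.0, "cover": 100},
    {"feature": "PET", "gain": 8.0, "cover": 60},
    {"feature": "PET", "gain": 4.0, "cover": 40},
    {"feature": "PET", "gain": 2.0, "cover": 30},
    {"feature": "X2", "gain": 3.0, "cover": 50},
    {"feature": "X2", "gain": 2.0, "cover": 30},
    {"feature": "X2", "gain": 1.0, "cover": 20},
    {"feature": "X3", "gain": 5.0, "cover": 80},
]

def importance(split_records):
    """Total number of splits, gain, and cover for each feature."""
    table = defaultdict(lambda: {"splits": 0, "gain": 0.0, "cover": 0.0})
    for record in split_records:
        row = table[record["feature"]]
        row["splits"] += 1
        row["gain"] += record["gain"]
        row["cover"] += record["cover"]
    return dict(table)

table = importance(splits)
for measure in ("splits", "gain", "cover"):
    ranked = sorted(table, key=lambda name: table[name][measure], reverse=True)
    print(measure, ranked)
```

The three rankings need not agree in general, which is why it helps to look at more than one of them.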
  Now with JMP, there's actually more we can do. And you can see here under the modeling red triangle we've embellished quite a few new things. You can save predicted values and formulas,
  you can publish to model depot or formula depot and do more things there.
  We've even got routines to generate Python code, which is not just for scoring, but actually does all the training and fitting, which is kind of a new thing, and will help those of you that want to transition from JMP Pro over to Python. For here though, let's take a look at the profiler.
  And I have to offer a quick word of apology to my friend Brad Jones. In an earlier video, I had forgotten to acknowledge that he was the inventor of the profiler.
  So this is actually a case, and credit to him, where we're using it now in another way, which is to assess variable importance and how each variable works. So to me it's a really compelling
  framework where we can...we can look at this. And Charlie...Charlie got right up off the couch when I mentioned that. He's right here by me now.
  And so, look at what happens. The interesting thing with this PET variable is that we can see the key moment seems to be as soon as PET
  gets right above 1200 or so; that is when things really take off. And so it's a really nice illustration of how the profiler works.
  And as far as I know, this is the only software that offers
  plots like this, which go beyond just these statistical measures of importance and show you exactly what's going on and help you interpret
  the boosted tree model. So I think it's a really nice way to do the analysis, and I'd encourage you to try this out with your own data.
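The idea behind a profiler trace can be sketched in a few lines of Python: hold every predictor at a baseline value, sweep one predictor over a grid, and record the model's prediction at each grid point. This is only a conceptual sketch, not JMP's profiler. The `toy_model` function is a made-up stand-in for a fitted tree model, with a step placed near PET = 1200 purely to echo the behavior described in the talk; `X2` is a placeholder predictor.

```python
def profile_trace(predict, baseline, feature, grid):
    """Predictions as one feature varies while the others stay at baseline."""
    trace = []
    for value in grid:
        row = dict(baseline)   # copy so the baseline itself is not modified
        row[feature] = value
        trace.append((value, predict(row)))
    return trace

def toy_model(row):
    """Hypothetical stand-in for a fitted boosted tree model."""
    step = 50.0 if row["PET"] > 1200 else 10.0   # tree-like jump in the response
    return step + 0.5 * row["X2"]

baseline = {"PET": 1000, "X2": 2.0}
trace = profile_trace(toy_model, baseline, "PET", [800, 1000, 1200, 1400, 1600])
for value, prediction in trace:
    print(value, prediction)
```

Because trees predict with piecewise constant functions, a trace like this typically shows flat stretches and jumps rather than a smooth curve.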
  Let's move on now to another example and back to our journal.
  There's, as you can tell, a lot here. We don't have time naturally to go through everything.
  But just for sake of time, I wanted to show you what happens when we have a binary target. What we just looked at was continuous.
  For that we'll use the old diabetes data set, which has been around quite a while and is available in the JMP sample library. This data set is the same data, but we've embellished it with some new scripts.
  And so if you get the journal and download it, you'll, you'll get this kind of enhanced version that has
  quite a few XGBoost runs with both a binary and an ordinal target. And as you remember, here we've got low and high measurements, which are derived from this original Y variable,
  looking at a response for diabetes. And we're going to go even a little bit further here. Let's imagine we're in a kind of medical context where we actually want to use a profit matrix. Our goal is to make a decision. We're going to predict, for each person,
  whether they're high or low, but then, thinking about it, we realize that if a person is actually high, the stakes are a little bit higher.
  And so we're going to double our profit or loss, depending on whether the actual state is high. And of course,
  the way this works is, typically, correct decisions are here and here.
  And then incorrect ones are here, and those are the ones...you want to think about all four cells when you're setting up a matrix like this.
  Here we're going to do a simple one, and it's just doubling. I don't know if you can attach real monetary values to these or not. That's actually a good thing if you're in that kind of scenario.
  Who knows, maybe we can consider these each to be a Bitcoin, to be maybe around $10,000 each or something like that.
  Doesn't matter too much. It's more a matter of, we want to make an optimal decision based on
  our predictions. So we're going to take this profit matrix into account when we do our analysis. Now, it's actually only applied after the model fitting. It's not directly used in the fitting itself.
  So we're going to run XGBoost now here, and we have a binary target. If you'll notice,
  the objective function has now changed to a logistic log loss, and that's what is reflected here: this is the logistic log likelihood.
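For reference, the logistic log loss for a binary target is the average negative log likelihood of the observed classes under the predicted probabilities. A minimal Python sketch follows, coding High as 1 and Low as 0 (that coding and the example probabilities are assumptions). In XGBoost's own parameter naming this corresponds to the `binary:logistic` objective with the `logloss` evaluation metric.

```python
import math

def sigmoid(margin):
    """Map a raw model score (log odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))

def log_loss(actual, p_high, eps=1e-15):
    """Average negative log likelihood; actual is 1 for High, 0 for Low."""
    total = 0.0
    for y, p in zip(actual, p_high):
        p = min(max(p, eps), 1 - eps)   # keep log() away from 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actual)

actual = [1, 0, 1, 0]
vague = [0.6, 0.4, 0.6, 0.4]       # right direction, low confidence
confident = [0.9, 0.1, 0.9, 0.1]   # right direction, high confidence

print(log_loss(actual, vague))
print(log_loss(actual, confident))  # smaller, since confident and correct
```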
  And you can see, with a binary target, the metrics that we use to assess it are a little bit different.
  Although if you notice, we do have profit columns which are computed from the profit matrix that we just looked at.
  But if you're in a scenario, maybe where you don't want to worry about so much a profit matrix, just kind of straight binary regression, you can look at common metrics like
  just accuracy, which is one minus the misclassification rate. F1 or Matthews correlation are good to look at, as well as an ROC analysis, which helps you balance specificity and sensitivity. So all of those are available.
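The metrics named above all come from the four cells of a confusion matrix. Here is a short Python sketch using the standard definitions; the counts are invented for illustration, and the function does not guard against empty rows or columns (which would cause a division by zero).

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Common binary classification metrics from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)    # also called recall
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "misclassification": 1 - accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical counts, treating High as the positive class.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 4))
```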
  And you can drill down. One brand new thing I wanted to show, that we're still working on a bit, is we've got a way now for you to play with your decision thresholding.
  And you can actually do this interactively now. We've got a new thing which plots your profit by threshold.
  So this is a brand new graph that we're just starting to introduce into JMP Pro and you'll have to get the very latest JMP 16 early adopter in order to get this, but it does accommodate the decision matrix...
  or the profit matrix. And then another thing we're working on is you can derive an optimal threshold based on this matrix directly.
  I believe, in this case, it's actually still .5. And so this adds extra insight into the kind of things you may want to do if your real goal is to maximize profit.
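A profit-by-threshold curve can be sketched in Python as follows: for each candidate threshold, classify every row as High or Low, look up the profit for each (actual, decision) pair, and add them up. The profit values below follow the "double when the actual state is high" description, but the exact matrix in the journal is not given in the talk, so these numbers, the predicted probabilities, and the threshold grid are all assumptions. The best threshold on this toy data says nothing about the diabetes example itself.

```python
# Hypothetical profit matrix keyed by (actual, decision).
PROFIT = {
    ("High", "High"): 2.0,   # correct, high stakes
    ("High", "Low"): -2.0,   # missed a High, high stakes
    ("Low", "High"): -1.0,   # false alarm
    ("Low", "Low"): 1.0,     # correct
}

def total_profit(actual, p_high, threshold, profit=PROFIT):
    """Sum of profits when rows with p_high >= threshold are called High."""
    total = 0.0
    for a, p in zip(actual, p_high):
        decision = "High" if p >= threshold else "Low"
        total += profit[(a, decision)]
    return total

def profit_by_threshold(actual, p_high, thresholds, profit=PROFIT):
    """The points of a profit-by-threshold curve."""
    return [(t, total_profit(actual, p_high, t, profit)) for t in thresholds]

# Made-up actual classes and predicted probabilities of High.
actual = ["Low", "Low", "Low", "High", "Low", "High", "High", "High"]
p_high = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
thresholds = [i / 10 for i in range(1, 10)]

curve = profit_by_threshold(actual, p_high, thresholds)
best_threshold, best_profit = max(curve, key=lambda point: point[1])
for t, value in curve:
    print(t, value)
print("best:", best_threshold, best_profit)
```

Because the profit matrix is applied only after fitting, the same fitted probabilities can be rescored under a different matrix without refitting the model.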
  Otherwise, you're likely to want to balance specificity and sensitivity given your context, but you've got the typical confusion matrices, which are shown here, as well as up here along with some graphs for both training and validation.
  And then the ROC curves.
  You also get the same kind of things that we saw earlier in terms of variable importances. And let's go ahead and do the profiler again since that's actually a nice...
  it's also nice in this, in this case. We can see exactly what's going on with each variable.
  We can see for example here LTG and BMI are the two most important variables and it looks like they both tend to go up as the response goes up so we can see that relationship directly. And in fact, sometimes with trees, you can get nonlinearities, like here with BMI.
  It's not clear if that's a real thing here. We might want to do more analyses or look at more models to make sure. Maybe there is something real going on here with that little
  bump that we see. But these are kind of things that you can tease out, really fun and interesting to look at.
  So that's the diabetes data set, there for you to play with. The journal has a lot more to it. There's two more examples that I won't show today for sake of time, but they are documented in detail in the XGBoost documentation.
  This is just a PDF document that goes into a lot of written detail about the add-in and walks you step by step through these two examples. So, I encourage you to check those out.
  And then the journal also contains several different comparisons that have been done.
  You saw this purple matrix that we looked at. This was a study that was done at the University of Pennsylvania,
  where they compared a whole series of machine learning methods to each other across a bunch of different data sets, and then counted how many times one outperformed the other. And XGBoost
  came out as the top model in this comparison. It wasn't always the best, but on average it tended to outperform all the other ones that you see here. So, yet some more evidence of the
  power and capabilities of this technique. Now there's a few other things here that I won't get into.
  This Hill Valley one is interesting. It's a case where the trees did not work well at all. It's kind of a pathological situation but interesting to study, just to help understand what's going on.
  We also have done some fairly extensive testing internally within R&D at JMP and a lot of those results are here across several different data sets.
  And again for sake of time, I won't go into those, but I would encourage you to check them out. They do...all of our results come along with the journal here and you can play with them across quite a few different domains and data sizes.
  So check those out. I will say, just for fun, our conclusion in terms of speed is summarized here in this little meme. We've got two different cars.
  Actually, this really does scream along, and it's tuned to utilize all the threads that you have in your CPU.
  And if you're on Windows, with an NVIDIA card, you can even tap into your GPU, which will often offer maybe another fivefold increase in speed. So a lot of fun there.
  So let me wrap up the tutorial at this point. And again, encourage you to check it out. I did want to offer a lot of thanks. Many people have been involved
  and I worry that I probably have overlooked some here, but I did at least want to acknowledge these folks. We've had a great early adopter group.
  And they provided really nice feedback from Marie, Diedrich and these guys at Villanova have actually already started using XGBoost in a classroom setting with success.
  So that was really great to hear about. And a lot of people within JMP have been helping.
  Of course, this is building on the entire JMP infrastructure, so I'd pretty much need to list the entire JMP division at some point for help with this. It's been so much fun working on it.
  And then I want to acknowledge our Life Sciences team that have kind of kept me honest on various things. And they've been helping out with a lot of suggestions.
  And Luciano actually has implemented an XGBoost add in, a different add in that goes with JMP Genomics, so I'd encourage you to check that out as well if you're using JMP Genomics. You can also call XGBoost directly within the predictive modeling framework there.
  So thank you very much for your attention, and I hope you can give XGBoost a try.