You have a business or research question, you’ve collected or found appropriate data, and you are ready to analyze. But which analytical methods should you try? And how will you choose a final model?
Full Transcript (Automatically Generated)
So I'm Ruth Hummel. I work as a technical person helping out people in the academic area, and Mary is a manager of folks like me for a lot of the United States. We get a lot of questions from people: Which model should I use? How do I decide which model to use? And I'm going to warn you now, we're not going to answer all your questions in 20 minutes. But what we will focus on in the next 20 to 25 minutes is the idea that when you're asking what model to use, you really need to start by asking: What are you trying to do? What is your goal? What are you trying to accomplish in your data analysis? So to motivate things as we talk about this, we're going to mostly stick with this example about housing prices. We have a data set with information like the price of a home, and other information about features of that home, like the number of bedrooms, the number of bathrooms, the lot size, the year it was built, and so on.
Well, Ruth. So when Ruth and I think about which model, we ask: What's your goal? What are the questions you're trying to answer? And so we put up these four quadrants, not to have you guess, but to guide you along in our thinking and how we break out the information based on the questions that you want your data to answer. So once again, what is the goal of your modeling? The first one, which Ruth handily put up for me, is segmentation, also known as clustering. It's just a way to group like things, similar things, and we take a look at that to find out behavior. We're looking to see who might be wanting to purchase, and whether there are similarities in those purchases. So, segmentation or clustering: you'll mostly hear us refer to it as clustering, because that's what we have in JMP. And then the second quadrant. We're focusing on four big areas. There are lots of goals you might have outside of these, but these are four main areas.
The second of our main areas is explaining. Oh, I should tuck these guys away, shouldn't I? You're all seeing this. You might need to explain a relationship, where you specifically want the mathematical formula so that you can pull out those numbers, those estimates, and interpret them. For example, you might want to know how much you should expect to pay for an extra bedroom in a home. So you'd be fitting a slope to that, and you'd pull out that slope and interpret that number as how much an extra bedroom costs when you're determining house price.
So, this is if you want to use existing data to predict a future outcome. You might be interested in predicting housing price, right? So you might think: I want a three-bedroom house, I want five bathrooms for when company comes in, I want a huge lot for my dog, and I want it in this neighborhood. So I want to predict the outcome: could I afford that home? What would the price be for that home? So this is the area where we want to predict.
And our fourth quadrant here is the idea of identifying: you don't care as much about explaining the relationship or predicting an outcome, but you care more about identifying which factors are important. You might care about this when you have to keep collecting that information and you'd like to know what you could stop collecting and still get a good predictive model. So for example, which of the many characteristics of a house are important to whether the house gets an offer? If we're trying to predict an offer, which variables are important in that prediction process? You might have multiple goals, or you might iterate between them all, and you'll probably touch all four quadrants. But we felt this was the best way to break up the information that we wanted to share with you today about which model when.
Awesome. So hey, Mary, you want to play a game? Oh, yeah, you know I love games. Okay, so, analysis number one: we would like to identify which features of a home are the most important in determining the price. Say I am a new housing price website, so I'm going to collect information from MLS listings, and I'm going to go out and look at homes and take pictures and put this stuff on my website. What's the important information that I need to include on my website so that I can build good pricing models?
For example, maybe I would find out that square footage and the number of bathrooms are the two most important factors. So we might want to get something like what this little picture of column contributions is showing us: the idea that square footage explains a lot of what's going on with price, bathrooms explains a lot more, and then everything else is kind of diminishing returns.
So Mary, which quadrant? What kind of goal do you think we have here? Which goal? Hmm. I think, column contributions, square footage, number of bathrooms: important factors. I think you're right. I think we're looking for important factors. So in our vocabulary here, we call that the identify goal. Oh, identify. 50% on that one? Okay, I'll take it. We're trying to identify the important factors with this goal.
Yeah, we're kind of in the process of making a prediction equation, possibly; that might be part of our goals, so we might have a secondary goal here as well. But what we're really asking for here is identifying which factors are important. So there are three tools that we want to recommend in JMP if this is your goal or one of your goals. One is from the Analyze menu, the Screening option, and Predictor Screening. Predictor Screening will actually run a bootstrap forest and give you this column contributions kind of idea.
Another option is to just run the bootstrap forest from the Predictive Modeling option under Analyze. And the third option is to use Analyze, Fit Model and the Generalized Regression personality, because you get a slightly different sort of answer to which factors to include. There you could also try stepwise selection as another option. So if you use Predictor Screening, you're going to get something like this top left corner output, which is saying, here's the rank.
This is what's most important: square footage is most important, baths is next most important. We see something really similar from the bootstrap forest output: again, square footage super important, baths pretty important, and then things lower than that are ranked lower. If we use generalized regression, instead of focusing only on those full factors, we can put things like two-way interactions and three-way interactions in the model. So the predictor screening and bootstrap forest, they are considering nonlinear relationships.
They're letting us have lots of splits in lots of different places. But generalized regression ties back more to a regression framework, where we might build the model with interaction effects. So here we can actually narrow down even among interaction effects. So again, we're seeing the same kind of concept. The generalized regression also tells us square footage is really important, baths is really important, and so on.
Alright, ready, Ruth? So we want to build a model to predict housing prices based on any other important predictor variables we have in our data. For example, we want to get the predicted price when we input the specifics for the house. Well, I got an easy one, because we wrote the word "predict" in here about 1000 times. I wanted to make sure I left breadcrumbs for you, Ruth. I'm gonna guess this is a predict goal. Yes, it is.
Okay, awesome. So with predictions, under Analyze, Predictive Modeling, we have Partition, our tree methods, and of course neural nets. And we can't leave out generalized regression. So here are the dialog boxes. We want to predict price, right? What I really want to know is, with beds, baths, lot size, and year built, can I figure out the price, and can I afford that with that combination? So we have the bootstrap forest, which is a tree method where we build many trees and average them, like model averaging, on samples drawn with replacement.
And then we have Fit Model generalized regression, which we use to do the random sampling. And it also can handle correlated variables. Let's look at the generalized regression lasso. What we find with that is our prediction profiler. And if you bring up the other one, Ruth, that's the prediction profiler for the bootstrap forest. And what those let us do is basically turn the dials and see, for a given combination, what the predicted price might be. So if I had the opportunity to move the lot size and pick the largest lot size, and I wanted to be in the suburbs, and I wanted five bedrooms and at least 4500 square feet, I think my prediction would be that I couldn't afford it.
And when you run these models, like bootstrap forest and generalized regression, we have the ability to compare them. What's neat here with Model Comparison is I can take all the models that I've run and see them in a tabular output. The key thing is I want to look at the R-square, which is the percent of variation in price explained by our model. And then we have two other results, RASE and AAE, which are our error terms. We want those to be small, and we always want our R-square to be large.
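As a rough illustration of the fit statistics described above, here is a minimal Python sketch of R-square and two error measures. It assumes the usual definitions (in JMP's Model Comparison report the error measures are labeled RASE, root average squared error, and AAE, average absolute error). The function name `fit_statistics` and the prices are made up for illustration and are not from the talk's data.

```python
import math

def fit_statistics(actual, predicted):
    """Return R-square, RASE, and AAE for one model's predictions."""
    n = len(actual)
    mean_actual = sum(actual) / n
    residuals = [a - p for a, p in zip(actual, predicted)]
    sse = sum(r * r for r in residuals)                  # unexplained variation
    sst = sum((a - mean_actual) ** 2 for a in actual)    # total variation
    return {
        "r_square": 1 - sse / sst,                       # share of variation explained
        "rase": math.sqrt(sse / n),                      # root average squared error
        "aae": sum(abs(r) for r in residuals) / n,       # average absolute error
    }

# Hypothetical home prices (in thousands of dollars) and one model's predictions.
actual = [200.0, 250.0, 300.0, 350.0]
predicted = [210.0, 240.0, 310.0, 340.0]

stats = fit_statistics(actual, predicted)
print(stats)
```

Running the same function on each candidate model's predictions gives a small comparison table: larger R-square and smaller RASE and AAE are better, ideally measured on data held out from fitting.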
So if we look at this table, you can see here that the neural net, which I chose to run, is the best model for this particular situation. Now, from Model Comparison, which I think is awesome: Ruth said that she's really into C code, and she would like to have this model scoring in C code. So if I go through Model Comparison and bring up the Formula Depot, I can save this model's scoring code in C, and then I can make Ruth happy. Thank you, Mary, you've made me very happy. One other thing I want to point out here: because we're talking about which model to choose, we're not focusing really closely on how to select within this set of possible predictive models.
But comparing between the bootstrap forest and the generalized regression with the lasso, you can see some of the reasons you might choose one over the other as far as output. With the lasso, that regression-based model, you can see those straight-line effects of bathrooms and lot size and year built. And if we put in more interactions, certainly we can get curvature in those, but we're going to get that regression-style output. For the bootstrap forest, we can actually capture all kinds of nonlinearity.
So if you want to be able to interpret the coefficients in addition to having the predictive model, generalized regression gives you that extra benefit. If you want to be able to capture all kinds of regions of nonlinearity, a tree-based method might be your better choice. So that's another point about how to choose within predictive modeling. Good point, Ruth. Good point. Thanks, Mary.
Okay, analysis number three: we want to quantify the effect on home prices of additional bedrooms. So, that idea of how much I should expect to pay for an extra bedroom. I'm home shopping. I see two homes in a similar area, with lots of similarities between them, similar square footage, similar lot size, but one has an extra bedroom. So how much extra should I expect to pay? Maybe I'll find that every additional bedroom adds $97,644 to the total home cost.
Now, I'll tell you, I'm super interested in housing data, which is why we're using this as an example. I've looked at housing data from lots of different cities. I live in Orlando, Florida, and in Orlando this estimate is actually closer to about $20,000: you should expect to pay about $20,000 extra for an extra bedroom. This estimate of $97,644 is from Cincinnati. So just in case you were curious, if you live in Cincinnati, expect an extra hundred thousand dollars per bedroom. Oh, it's gotta be more than that for Boston, where I live. I guess some of the reason for the lower price in Orlando is that homes are cheaper here. Okay, so Mary, it's on you.
Which quadrant are we in here? Ah, man, this is a tough one, Ruth. You didn't leave me any breadcrumbs. So I'm gonna say explain. You want to explain. I agree with you. So again, there's some overlap between predictive models and explanation models, but we're talking about where the emphasis is: do we want to predict really well, where our metrics are about how well we predict? Or do we want to explain, using slopes and estimates in the model to see a meaning associating the factor with the response?
So explanation is exactly right in this case. We're going to be in the Analyze menu, in the Fit Model options, for the explanation type models, the regression kinds of models. Standard Least Squares is the default if you've got something continuous, like home price, that you're modeling. So that's going to be ANOVA and regression and that style of analysis. If we had something categorical that we were predicting, like if we wanted to predict getting an offer on the home versus not getting an offer, it would default to logistic regression for that.
Or we could switch over to generalized regression again, for more options. So we could combine identifying factors with the modeling in generalized regression. In this case, because we've got continuous home price as our response and we've got something continuous we're predicting with, a slope, we're actually doing a regression. You don't even have to know that to use JMP; you just use this Standard Least Squares personality for any of those options. So here, I fill out price as the response and the number of bedrooms as the effect, either alone or with lots of other variables, but that's the one that I'm focusing on.
So I might want to pull out that one slope at the end. I can either use that default Standard Least Squares personality, or I can switch over to generalized regression. All right, so Ruth, can we conclude that generalized regression is a one-stop tool for many things? I think we can. For any of you watching who haven't used generalized regression before, generalized regression is only available in JMP Pro, so we're trying to give you options that don't use that as well. But yeah, Mary, I think it's a pretty good tool.
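To make the "pull out that one slope" idea concrete, here is a minimal Python sketch of an ordinary least squares fit of price on number of bedrooms alone. The helper `least_squares_line` and the four homes are hypothetical, chosen only so the arithmetic is easy to follow; they are not the Cincinnati or Orlando data mentioned in the talk.

```python
def least_squares_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up listings: number of bedrooms and price in thousands of dollars.
bedrooms = [2, 3, 4, 5]
price = [160, 190, 260, 290]

intercept, slope = least_squares_line(bedrooms, price)
print(f"Each extra bedroom is associated with about ${slope:.0f}k more in price")
```

With bedrooms as the only effect, the slope also absorbs anything that tends to come with more bedrooms, such as square footage. Fitting bedrooms together with the other variables, as the talk suggests, is what gives the "similar homes, one extra bedroom" interpretation.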
Great. Great. All right, you ready? I'm ready. So far you're ahead of me, because I'm half a point behind. All right, we want to identify groups of homes that are similar, based on a list of possible characteristics. For example, we want to identify market segments: home listings from a database might have certain segments, or characteristics that are similar, or groupings that we want to look at. So what's your call? Well, again, you gave me lots of breadcrumbs.
So I think, with "segments" in here, we're doing that segmentation: finding things that are similar across lots of variables, with no response variable in mind. So I think we're in clustering. Ding, ding, ding. Yes. Yes. So the clustering menu in JMP is under Analyze. We have various methods, different algorithms for clustering, and the one we're going to talk about today is hierarchical clustering. We also could use the multivariate methods if we chose to. Alright, so basically, I'm just going to put in the variables, or characteristics of the home, that I'm interested in, where I want to see relationships and similarities. And we come out with the output of the clustering platform. On my left here, which I hope is your left, is the dendrogram.
It shows us how things have been grouped together. We don't have the labels on here, but you can see there are probably six clusters, in different colors. So we can begin to look at certain similarities that are grouped together. To the right is another view, a parallel plot. There, every single line is a home that's being represented, so we have a lot of homes. The way that you read the parallel plot is that the bottom of the y axis is the lowest value and the top of the y axis is the highest, so it goes low to high. And if we look across, it's another way to look at this hierarchical data. The one that caught my eye was cluster three: I looked at year built, and that's pretty high, and I thought, geez, those must be newer homes. Right?
And I looked at cluster two, and for that one I looked at year built and I said, those must be older homes. So you can go across and look at how each home is represented in each cluster, across the characteristics that you chose to compare and group together. Yeah, and if we were doing this with some kind of market segmentation, so not looking at homes that are similar, but at people whose buying behaviors are similar, we'd maybe see how likely they are to buy certain products or how often they shop at certain stores. So that would help us see, oh, these are the people who really love grocery shopping but don't do any clothes shopping, something like that. So we'd be able to identify what these market segments are, the patterns popping out.
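For readers who want to see the mechanics behind the dendrogram, here is a small Python sketch of agglomerative (bottom-up) hierarchical clustering. It is a simplification and not JMP's implementation: it uses average linkage on standardized variables, which is an assumption made for brevity, and the six homes (square footage, year built) are invented.

```python
import math

def standardize(rows):
    """Put each column on a mean-0, standard-deviation-1 scale."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def average_linkage(points, a, b):
    """Average Euclidean distance between all pairs across clusters a and b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerate(points, n_clusters):
    """Start with one cluster per row; repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(points, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

# Hypothetical homes: [square footage, year built].
homes = [
    [1200, 1955], [1300, 1960], [1250, 1958],   # smaller, older
    [2800, 2015], [3000, 2018], [2900, 2012],   # larger, newer
]

clusters = agglomerate(standardize(homes), n_clusters=2)
print(clusters)  # lists of row indices, one list per cluster
```

Standardizing first keeps square footage from dominating year built purely because of its larger units. The sequence of merges is what a dendrogram draws, and stopping at a chosen number of clusters corresponds to cutting the tree.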
Yeah, yeah. Okay. So in conclusion, we said we had our four quadrants: What's your goal? What are you going after? What information are you trying to extract? So we have segmentation, or clustering, which we shared with you, and that's once again grouping. Yeah, and the Fit Model platforms that we talked about for the explain goal were Standard Least Squares, logistic regression, and generalized regression. Again, those are the personalities within the Fit Model platform that we suggested for the explain goal, where you want to actually interpret those coefficients.
And the predict goal, or predictive modeling, is using tree methods, Partition, random forest, boosted trees, and of course neural nets, which is our black-box approach that gives us lots of really good information. And of course, you know, as the backseat driver, GenReg can always play in this predictive area. Yeah, absolutely. And then for the identify goal, we mentioned three different platforms. As a quick pass, Predictor Screening is a really good tool; it just tells you what things are very important. Again, it's using a bootstrap forest in the background. So another option is to go to Predictive Modeling, Bootstrap Forest, or to the Fit Model platform and use generalized regression, or possibly stepwise selection, if you want to sort of select out things like two-way interactions and three-way interactions that aren't important. Great. Well, this is which model when.
About five more minutes, if anyone has any questions. So Julian, if you'd like to let us know if there are any questions; we've got two hidden examples that we can pull up while we're waiting for questions, or we can take some questions, whatever you prefer. Yeah, we can give a two-minute pause if you like and let people ask questions. Or if you prefer, you can talk through some things while people are asking questions. Your preference. Mary, let's go through one extra example while we let people get their questions in, and then we'll take them. All right. Okay. This is the tiebreaker. Prepared for this? All right, awesome. Go for it. Okay, this is your chance to win back your points. I've got to guess right, though. All right. So we want to know how home prices might differ based on their location. We have three locations: city, rural, and suburban.
And we want to find out something like how much the difference is on average. So for example, being right downtown in the city is $20,000 more expensive than being in the rural community. We want to actually get that estimate and be able to interpret it. So which scenario do you think this is? Hmm. Predict? So we are going to be predicting home prices, but I tried to write this question from the perspective where the goal is interpreting the actual estimate. So I was aiming for explanation. Explanation. Well, I guess you're gonna give me that one. I'm gonna give it to you. Yeah. Well, I mean, half a point here, half a point there. You're still two points to my one. You're right, it is explaining. Yeah. And again, as we mentioned, there's a lot of interplay between these things. You might have multiple goals.
If you really want to be able to pull out those coefficients and interpret them, then those statistically based models like ANOVA and regression are going to be good tools for you. Julian, do we have any questions? Yeah, one came in: where did you acquire your residential home data set? Great question. Ruth, you're the guru on that one. So I joined JMP about four years ago and was so fortunate to join a team with Julian Parris, who is hosting us, and Mia Stephens, who has certainly been on broadcasts in this past week talking about statistics. Two fantastic people, and I saw them give a talk about using housing data probably four years ago.
They're the ones who passed this on to me, but redfin.com is where the data is scraped from. So if you do want housing data, and you want to grab this and analyze it yourself, and you're in a US city, redfin.com will give you easy access to download data for a US city. And I remember a JMP Discovery talk that Julian gave using Redfin, where he did mapping and all kinds of exploratory data visualization using the Redfin data, and I was just awestruck. And then with Mia, when they were putting in the casino in Boston, we looked at the area around Boston to see where we should invest.
So yes, yeah, very cool. Julian, any other questions for us? Yeah, here's one: do you have a rule of thumb for dealing with the magnitude of random error compared with the prediction results? Hmm. I would think that's a personal preference, right? It depends on the question you're asking and what you're willing to accept. Yeah, I don't have a religion on that. If somebody else does and wants to post something in the questions, you're welcome to. It's going to have to do with what you're comfortable with, like what amount of risk you're allowing.
So here's some output from, for example, the Model Comparison. The R-square is our first fit statistic there; that just tells us how much variation we're explaining based on the factors we're putting in our model. The RASE and AAE are telling us, essentially, the average error. So this is where, if you had a rule of thumb, you could apply it, but it's going to be very specific to your scenario, just as R-square criteria are specific to your scenario. If you've got experimental data where you can control a lot of it, then you expect to be able to fit it really well with very little extra error.
So an R-square of 0.99 is reasonable in that scenario, whereas if you have just observational data, and there are a lot of sources of variation that you can't control, you're going to expect a much higher amount of error. So this R-square of 0.7 is actually pretty good in an observational data scenario. So I don't have a rule of thumb for you. I think it's going to have to do with your specifics, how comfortable you are with those prediction intervals. But if you use generalized regression, you actually get those mean confidence intervals. That's one place where you can see: Am I comfortable with this? Is this where I'm okay? So sorry, I don't have a rule of thumb; it's kind of scenario specific. Yeah, it's more subjective.