<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Conceptual question about bagged predictions in Discussions</title>
    <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457806#M70287</link>
    <description>&lt;P&gt;I believe the prediction confidence intervals for a multiple regression are derived via formulae that are derived from assumptions about the error structure and random sampling (ultimately derived from logic that underlies the Central Limit Theorem).&amp;nbsp; On the other hand, I believe the bagged predictions are derived from bootstrapping - a nonparametric empirical approach to deriving confidence intervals.&amp;nbsp; I also believe these two approaches are generally close, unless the underlying data has unusual distributions and/or insufficient sample sizes are used for the bootstrap.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the files I attached, I don't think any of these issues arise.&amp;nbsp; The mean predictions are almost identical from the 2 approaches, but the standard errors from the profiler are much much smaller than from the theoretically derived values.&amp;nbsp; This is what makes me think I am misunderstanding what the profiler standard errors mean.&amp;nbsp; Otherwise, why would anyone ever use the theoretically derived standard errors?&lt;/P&gt;</description>
    <pubDate>Wed, 02 Feb 2022 15:31:48 GMT</pubDate>
    <dc:creator>dale_lehman</dc:creator>
    <dc:date>2022-02-02T15:31:48Z</dc:date>
    <item>
      <title>Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457314#M70248</link>
      <description>&lt;P&gt;I have struggled for a long while looking for uncertainty measures from machine learning models that are comparable to the standard errors that you routinely get from regression models.&amp;nbsp; Only recently have I become aware of some of the capabilities of the profiler - in particular, the bagged predictions.&amp;nbsp; But I don't really understand how to (or if I should) interpret those bagged predictions.&amp;nbsp; When I run a machine learning model (for example, a neural net) and save the bagged predictions from the profiler, I get a Bagged mean, the standard error of the bagged mean, and the bagged standard deviation.&amp;nbsp; Comparing these with a regression model (multiple regression, for example), I've observed the following relationships:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The predicted bagged mean from the NN is very similar to the prediction formula from the multiple regression.&lt;/LI&gt;&lt;LI&gt;Mean prediction intervals from the multiple regression model are much narrower than the individual prediction intervals, as expected (in the example I am looking at, the standard error for the mean prediction is about 1/10 the size of the standard error of the individual predictions).&lt;/LI&gt;&lt;LI&gt;The standard error of the bagged mean from the NN is much smaller than the bagged standard deviation (about 1/10 the size in the example I am looking at).&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;These observations tempt me to think of the standard error of the bagged mean from the NN as analogous to the standard error of the mean predictions from the regression model.&amp;nbsp; Similarly, the bagged standard deviation may be similar to the standard error of the individual predictions from the regression model.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, the standard errors from the NN and the regression models do not resemble each other at all!&amp;nbsp; So, my question is whether my
interpretation makes any sense - or, exactly how the standard errors from the bagged mean can be interpreted or used.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks in advance for any insights.&amp;nbsp; I have attached a concrete example in case it helps with my question (this is the validation data set from my modeling example - with the predictions from the multiple regression model and NN included).&lt;/P&gt;</description>
      <pubDate>Fri, 09 Jun 2023 00:45:03 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457314#M70248</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2023-06-09T00:45:03Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457383#M70252</link>
      <description>&lt;P&gt;Addendum - in the file I attached, I (over)estimated the standard errors from the multiple regression.&amp;nbsp; I estimated them by using the saved 95% confidence intervals.&amp;nbsp; Attached is a revised version where I saved the standard errors directly.&amp;nbsp; My question remains, as the standard errors from the NN seem to be less than half as large as from the multiple regression (on average) with little apparent correlation within the validation set.&amp;nbsp; Both models fit the data quite well.&amp;nbsp; So, I am wondering whether the standard errors of the bagged mean can appropriately be interpreted as standard errors associated with the predictions, in the same way that the standard errors of the predictions from the multiple regression are interpreted.&lt;/P&gt;</description>
      <pubDate>Mon, 31 Jan 2022 21:29:29 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457383#M70252</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2022-01-31T21:29:29Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457397#M70255</link>
      <description>&lt;P&gt;Yes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The Neural platform uses a complex model, even with only a handful of hidden nodes. Imagine adding lots of terms to the regression model. What happens to the RMSE? What, in turn, happens to the SEs? Is your regression model complexity comparable to that of the NN?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am not suggesting that you are misusing the NN or other ML methods. Just remember that they are ALL about prediction, NOT AT ALL about inference. So uncertainty in estimates is unimportant; accuracy, reproducibility, and generalization are everything. Prediction models (not explanatory models) therefore use measures of the total MSE (bias + variance) to select the model, and the 'honest assessment,' or cross-validation, to confirm the model that was selected.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But the main point is that there is no reason for the SEs from different models to be the same, even if their mean response predictions are similar. This difference might give one model an advantage in your application.&lt;/P&gt;</description>
      <pubDate>Mon, 31 Jan 2022 22:22:15 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457397#M70255</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2022-01-31T22:22:15Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457400#M70257</link>
      <description>&lt;P&gt;Mark, thanks.&amp;nbsp; That helps somewhat.&amp;nbsp; It might explain why the NN gives smaller standard errors (though I'm still surprised at the size of the difference when both models have such good fits to the data).&amp;nbsp; But it really doesn't seem to explain why there is almost no correlation in the standard errors associated with each prediction.&amp;nbsp; The data I posted shows virtually no meaningful relationship between each observation's standard error of prediction between the two models.&amp;nbsp; Now, for a multiple regression model I have some sense of what determines the standard errors associated with different observations - but for the NN, I really don't.&amp;nbsp; Perhaps that is the reason they are not related to each other?&amp;nbsp; Is this a dimension related to the lack of interpretability of NN models?&lt;/P&gt;</description>
      <pubDate>Tue, 01 Feb 2022 00:23:43 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457400#M70257</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2022-02-01T00:23:43Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457602#M70263</link>
      <description>&lt;P&gt;The NN is an ensemble model that is highly non-linear. Compare that with your regression model that might have second-order terms. The standard errors are very different for these two kinds of models.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I personally disagree with the notion that NNs are not interpretable or that linear regression models are. (I think it is silly.) The effect of predictor X is a linear combination of all the terms that include it. For example, it is nonsense to talk about a 'quadratic effect.' There is only a quadratic term in the model. So how do we interpret the effect of X when it appears in the model as X + X*X2 + X*X + X*X*X? We are just used to thinking in these terms - we have had many years of exposure to them and time to think about them. A NN is more of the same (a linear predictor) put through a non-linear activation function, and added to more of the same for each node. We just have to think harder.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;No, we don't. We have a profiler that works with any kind of function (model).&lt;/P&gt;</description>
      <pubDate>Tue, 01 Feb 2022 17:02:44 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457602#M70263</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2022-02-01T17:02:44Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457633#M70269</link>
      <description>&lt;P&gt;I'm following what you say - but I think my question has become something different.&amp;nbsp; Let's leave NN out of it.&amp;nbsp; Since the profiler is available from the multiple regression platform as well, I did some experimenting to see how the standard errors of the bagged predictions compare with those saved from the regression model.&amp;nbsp; They are correlated, though far from perfectly.&amp;nbsp; The individual prediction standard deviation is much larger than the mean prediction standard deviation, as it should be, and this also applies to the two standard errors you get when you save the bagged means.&amp;nbsp; However, what surprises me, and what I don't understand, is why the standard errors from saving the bagged predictions (either the individual or mean version) are an order of magnitude smaller than the standard errors from the regression model.&amp;nbsp; My understanding (which could be wrong) is that the standard errors of the predictions are theoretically derived in the regression model and are the result of bootstrapping in the bagged predictions.&amp;nbsp; In theory, those should be similar (at least with enough bootstrap samples - I used 100 and 1000 and both give similar results) - but they are not.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, it appears that the standard errors from the profiler are qualitatively different than the standard errors from the regression model.&amp;nbsp; Why is that the case?&lt;/P&gt;</description>
      <pubDate>Tue, 01 Feb 2022 19:39:03 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457633#M70269</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2022-02-01T19:39:03Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457634#M70270</link>
      <description>&lt;P&gt;Trying to be more complete, attached is a simple little file with a single predictor.&amp;nbsp; I built the regression model and saved the predictions and standard errors (both mean and individual).&amp;nbsp; I used the profiler to save the bagged predictions.&amp;nbsp; There are 3 xy plots showing the equivalence (perfect in this case) of the predicted values from the model and from the profiler.&amp;nbsp; The standard error plots show strong correlations, but also show that the standard errors from the profiler are considerably smaller than from the model.&amp;nbsp; I strongly suspect the standard errors from the bagging procedure mean something different.&amp;nbsp; I hope you can clarify this for me.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Feb 2022 19:53:45 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457634#M70270</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2022-02-01T19:53:45Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457805#M70286</link>
      <description>&lt;P&gt;"&lt;SPAN&gt;My understanding (which could be wrong) is that the standard errors of the predictions are theoretically derived in the regression model and are the result of bootstrapping in the bagged predictions.&amp;nbsp; In theory, those should be similar (at least with enough bootstrap samples - I used 100 and 1000 and both give similar results) - but they are not.&lt;/SPAN&gt;"&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Which theory are you referring to?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 15:18:04 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457805#M70286</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2022-02-02T15:18:04Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457806#M70287</link>
      <description>&lt;P&gt;I believe the prediction confidence intervals for a multiple regression are derived via formulae that are derived from assumptions about the error structure and random sampling (ultimately derived from logic that underlies the Central Limit Theorem).&amp;nbsp; On the other hand, I believe the bagged predictions are derived from bootstrapping - a nonparametric empirical approach to deriving confidence intervals.&amp;nbsp; I also believe these two approaches are generally close, unless the underlying data has unusual distributions and/or insufficient sample sizes are used for the bootstrap.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the files I attached, I don't think any of these issues arise.&amp;nbsp; The mean predictions are almost identical from the 2 approaches, but the standard errors from the profiler are much much smaller than from the theoretically derived values.&amp;nbsp; This is what makes me think I am misunderstanding what the profiler standard errors mean.&amp;nbsp; Otherwise, why would anyone ever use the theoretically derived standard errors?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 15:31:48 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457806#M70287</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2022-02-02T15:31:48Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457812#M70290</link>
      <description>&lt;P&gt;"&lt;SPAN&gt;I believe the prediction confidence intervals for a multiple regression are derived via formulae that are derived from assumptions about the error structure and random sampling (ultimately derived from logic that underlies the Central Limit Theorem).&amp;nbsp; On the other hand, I believe the bagged predictions are derived from bootstrapping - a nonparametric empirical approach to deriving confidence intervals.&amp;nbsp; I also believe these two approaches are generally close, unless the underlying data has unusual distributions and/or insufficient sample sizes are used for the bootstrap.&lt;/SPAN&gt;"&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Yes! That is, if you use the theoretical expression for the CI and the bootstrap CI for the SAME linear regression model, the CI ESTIMATES should agree.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 16:14:14 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457812#M70290</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2022-02-02T16:14:14Z</dc:date>
    </item>
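The claim in the reply above - that the theoretical CI and a bootstrap CI for the SAME linear model should agree - can be checked directly. The following is a minimal Python sketch with simulated data (hypothetical values, not the attached file, and not JMP's implementation): it compares the textbook OLS standard error of the mean prediction at a point with the standard deviation of predictions obtained by refitting on bootstrap resamples. In standard bootstrap practice it is this standard deviation itself, not that value divided by sqrt(M-1), that estimates the standard error.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated simple-regression data (hypothetical; stands in for the attached file)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, n)

# Theoretical SE of the MEAN prediction at x0 (standard OLS formula)
b, a = np.polyfit(x, y, 1)                       # slope, intercept
resid = y - (a + b * x)
s = np.sqrt((resid ** 2).sum() / (n - 2))        # residual standard error
x0 = 5.0
xbar = x.mean()
sxx = ((x - xbar) ** 2).sum()
se_theory = s * np.sqrt(1.0 / n + (x0 - xbar) ** 2 / sxx)

# Bootstrap SE: resample rows, refit, take the SD of the refit predictions
B = 2000
preds = np.empty(B)
for i in range(B):
    idx = rng.integers(0, n, n)                  # sample rows with replacement
    bb, aa = np.polyfit(x[idx], y[idx], 1)
    preds[i] = aa + bb * x0
se_boot = preds.std(ddof=1)
```

With enough resamples the two estimates agree closely, which is consistent with the reply: the two approaches should match when applied to the same linear model.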
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457816#M70291</link>
      <description>&lt;P&gt;Mark&lt;/P&gt;&lt;P&gt;Then I would ask you to look at the last example dataset I posted.&amp;nbsp; The confidence intervals (standard errors) are not even close to agreeing.&amp;nbsp; That is for a simple linear regression and comparing the prediction standard errors with those that come from saving the bagged predictions from the profiler.&amp;nbsp; The latter are much much smaller than the former.&amp;nbsp; That is why I am confused.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 16:19:42 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457816#M70291</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2022-02-02T16:19:42Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457843#M70293</link>
      <description>&lt;P&gt;Let's review how the bagging works in the Profiler. Note that it is based on the fitted model. It uses the fitted model and resamples the data to inflate the data set or sample size but does not alter the model. Here is an excerpt from the JMP Help that covers &lt;A href="https://www.jmp.com/support/help/en/16.1/#page/jmp/bagging.shtml#ww484393" target="_self"&gt;Profiler bagging&lt;/A&gt;:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;"Bagging automatically creates new columns in the original data table. All&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="EquationVariables"&gt;M&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;sets of bagged predictions are saved as hidden columns. The final prediction is saved in a column named “Pred Formula &amp;lt;colname&amp;gt; Bagged Mean”. The standard deviation of the final prediction is saved in a column named “&amp;lt;colname&amp;gt; Bagged Std Dev”. The standard error of the bagged mean is saved in a column named “StdError &amp;lt;colname&amp;gt; Bagged Mean.” The standard error is the standard deviation divided by&lt;SPAN&gt;&amp;nbsp;Sqrt( M-1 ).&amp;nbsp;&lt;/SPAN&gt;Here, &amp;lt;colname&amp;gt; identifies the column in the report that was bagged.&lt;/P&gt;
&lt;P class="body"&gt;The standard error gives insight about the precision of the prediction. A very small standard error indicates a precise prediction for that observation. For more information about bagging, see Hastie et al. (&lt;SPAN class="link"&gt;&lt;A title="References" href="https://www.jmp.com/support/help/en/16.1/jmp/references-4.shtml#ww130941" target="_blank"&gt;2009&lt;/A&gt;&lt;/SPAN&gt;)."&lt;/P&gt;
&lt;P class="body"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="body"&gt;So you are not comparing the SEs from a regression analysis and NN. You are comparing SEs from any model and the SEs from bagging in the Profiler with the same model. The difference will be a factor of Sqrt( M-1 ).&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 16:52:40 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457843#M70293</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2022-02-02T16:52:40Z</dc:date>
    </item>
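The column definitions quoted from the Help can be mimicked outside JMP. Here is an illustrative Python sketch with made-up data; the comments map each quantity to the column named in the Help excerpt. This reconstructs only the stated formulas, it is not JMP's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y = 2 + 3x + noise
n = 50
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 2, n)

def fit_predict(xs, ys, x0):
    """Ordinary least squares fit; return the prediction at x0."""
    b, a = np.polyfit(xs, ys, 1)                 # slope, intercept
    return a + b * x0

M = 100                                          # number of bagged samples
x0 = 5.0                                         # point at which to predict
preds = np.empty(M)
for m in range(M):
    idx = rng.integers(0, n, n)                  # resample rows with replacement
    preds[m] = fit_predict(x[idx], y[idx], x0)

bagged_mean = preds.mean()                       # "Pred Formula <colname> Bagged Mean"
bagged_sd = preds.std(ddof=1)                    # "<colname> Bagged Std Dev"
se_bagged_mean = bagged_sd / np.sqrt(M - 1)      # "StdError <colname> Bagged Mean"
```

Because of the division by sqrt(M-1), the "StdError ... Bagged Mean" shrinks as more bagged samples are taken: it reflects the Monte Carlo precision of the bagged mean itself rather than the sampling uncertainty of a prediction, which is one way to see why it can be an order of magnitude smaller than a regression standard error.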
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457845#M70295</link>
      <description>&lt;P&gt;We are almost on the same page.&amp;nbsp; Ignoring NN for the moment, and using the simple regression example I provided, the standard error from the bagged mean is about 10% of that from the regression model:&amp;nbsp; this matches the square root of M-1 (M=100 here).&amp;nbsp; That is true for the mean predictions - for the individual predictions it is about 20% (I'm not sure why that sqrt(M-1) applies to the former but not the latter, but I don't think that is very important).&amp;nbsp; So, the question is:&amp;nbsp; if I want a confidence interval for the mean prediction, which standard error do I use?&amp;nbsp; The difference is an order of magnitude!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Assuming for the moment that I should use the standard error from the regression model (which is the larger of the two) - and that is what I suspect is the case - then it raises the question of what the bagged predictions are good for.&amp;nbsp; Here, I think, is where the NN (and other machine learning models) come in - there is no standard error from many of these models without some type of empirically derived one, such as provided by the profiler.&amp;nbsp; So, it would seem very useful to use that standard error from the bagged predictions to construct confidence intervals for those predictions.&amp;nbsp; However, from the regression example, I am wondering if this might underestimate the degree of uncertainty by an order of magnitude - which would not be so useful.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 17:47:00 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457845#M70295</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2022-02-02T17:47:00Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457847#M70296</link>
      <description>&lt;P&gt;"&lt;SPAN&gt;So, the question is:&amp;nbsp; if I want a confidence interval for the mean prediction, which standard error do I use?&amp;nbsp; The difference is an order of magnitude!&lt;/SPAN&gt;" Use the SE from the model that was used to predict.&amp;nbsp;If you use the saved regression model, then use the saved SE or CI. If you are using the bagged predictions from the Profiler, then use the bagged SE. Here is the result using :weight versus :height in the Big Class data set. The highlighted pairs of prediction and SE data columns would be used together.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="bagged.PNG" style="width: 616px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/39671iF35911A3C0BBF3EF/image-size/large?v=v2&amp;amp;px=999" role="button" title="bagged.PNG" alt="bagged.PNG" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 18:02:23 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457847#M70296</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2022-02-02T18:02:23Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457867#M70298</link>
      <description>&lt;P&gt;Surely that can't be right!&amp;nbsp; Your example looks just like mine - the means are almost identical but the standard deviations differ by a factor of 10 (due to the sqrt(M-1)), depending on which set of columns you use.&amp;nbsp; So, while I see the logic of pairing the mean prediction with its associated standard deviation (either from the model or from the bagging), the practical effect is to have roughly the same mean predictions, but one confidence interval ends up being 10% as wide as the other.&amp;nbsp; Which one is the appropriate measure of variability for the mean prediction?&amp;nbsp; It can't be both - unless they answer different questions.&amp;nbsp; And, if that is the case, can you tell me what question each answers?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 18:16:26 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457867#M70298</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2022-02-02T18:16:26Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457890#M70302</link>
      <description>&lt;P&gt;"&lt;SPAN&gt;Surely that can't be right&lt;/SPAN&gt;"&lt;/P&gt;
&lt;P&gt;OK, you got me. It is all made up, faked. I was just seeing how far I could string you along.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;(Serious discussion resumes...)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;"&lt;SPAN&gt;Which one is the appropriate measure of variability for the mean prediction?&lt;/SPAN&gt;"&lt;/P&gt;
&lt;P&gt;The one that was estimated for the prediction you will use - that is the pairing I described.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Bagging is more about predictive modeling than explanatory modeling, as I explained. Bagging decreases the uncertainty in the prediction.&amp;nbsp;The use of bagging in this case relies on your belief in the quality and validity of the data and the model. It is not cheating. If the model fails, it is because of a problem with the quality or validity of the data or the model.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;"&lt;SPAN&gt;It can't be both - unless they answer different questions.&lt;/SPAN&gt;"&lt;/P&gt;
&lt;P&gt;That is exactly what I have been saying.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;"&lt;SPAN&gt;And, if that is the case, can you tell me what question each answers?&lt;/SPAN&gt;"&lt;/P&gt;
&lt;P&gt;I can. I actually did: again, the first pair answers the question about the uncertainty in the prediction of the original model (e.g., linear regression, neural network, partition). The second pair answers the question about the uncertainty in the prediction using bagging.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 19:36:21 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457890#M70302</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2022-02-02T19:36:21Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457892#M70304</link>
      <description>&lt;P&gt;We are converging.&lt;/P&gt;&lt;P&gt;"And, if that is the case, can you tell me what question each answers?"&lt;/P&gt;&lt;P&gt;"I can. I actually did: again, the first pair answers the question about the uncertainty in the prediction of the original model (e.g., linear regression, neural network, partition). The second pair answers the question about the uncertainty in the prediction using bagging."&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This says that I have more uncertainty about my model predictions than I have about the bagged predictions - a lot more.&amp;nbsp; Why would anyone use the model predictions and their confidence intervals then?&amp;nbsp; I realize that a smaller standard error is not always good - only if the underlying model is good.&amp;nbsp; But in the case we are looking at, the same model underlies both measures and the mean predictions are almost identical.&amp;nbsp; So, why would I choose the much wider confidence interval?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I suppose there is the issue of coverage - the narrow interval might not provide enough coverage of the true value.&amp;nbsp; I will try some simulations to see if I can shed any light on that - but do you know of any references that speak to the accuracy of the two standard error measures relative to each other?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 19:57:23 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457892#M70304</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2022-02-02T19:57:23Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457893#M70305</link>
      <description>&lt;P&gt;Attached is a simulated example.&amp;nbsp; The untitled dataset contains simulated data and a regression model based on a random sample of 100 of the 1000 rows.&amp;nbsp; I saved the standard errors from the model and from the bagged predictions.&amp;nbsp; The subset of untitled data set then contains the 900 rows not in the random sample.&amp;nbsp; Using approximate 95% confidence intervals (2 standard errors around the corresponding mean prediction, using the standard errors for the individual predictions), the coverage of the actual Y value was 866 out of 900 rows (around 95%) for the model predictions, but only 222 out of 900 for the bagged predictions.&amp;nbsp; I couldn't figure out how to generate a comparison of coverage of mean confidence intervals since I only have a single Y observation on each row.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Given how extreme the results are, this suggests to me that the standard errors from the model (at least for this well behaved model) are accurate measures of the uncertainty in the predictions.&amp;nbsp; But the bagged standard errors are too small.&amp;nbsp; Given the simplicity of this example, it sure seems like I wouldn't want to rely on the standard errors from the bagged predictions.&amp;nbsp; Do you think that is a reasonable conclusion here?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now, to the real potential uses.&amp;nbsp; If I run a classification model using NN, random forests, boosted trees, etc., one shortcoming compared with logistic regression is that these machine learning models do not provide a measure of uncertainty in the predictions (without invoking another procedure such as conformal prediction, which I have been playing with).&amp;nbsp; The profiler could readily provide me with bagged predictions of the mean probabilities of classifications along with their standard errors.&amp;nbsp; As useful as that would be, I am inclined to say that I can't really 
use those standard errors to represent the uncertainty in these machine learning models.&amp;nbsp; Is that correct?&amp;nbsp; Perhaps a more general question is, what can I use the bagged standard errors for?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 20:24:55 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457893#M70305</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2022-02-02T20:24:55Z</dc:date>
    </item>
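The coverage experiment described in the post above is straightforward to replicate. The sketch below is a hypothetical Python version: the data are simulated (y = 1 + 0.5x + N(0, 3) is an assumed model, and only the 100-row training sample and 900-row holdout mirror the post; nothing comes from the attached files). It builds approximate 95% intervals (prediction ± 2 SE) from (a) the regression's individual-prediction standard error and (b) a bagged standard error computed as SD/sqrt(M-1), then counts how often each interval covers the holdout Y:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population; train on the first 100 rows, hold out the other 900
N, n_train = 1000, 100
x = rng.uniform(0, 20, N)
y = 1 + 0.5 * x + rng.normal(0, 3, N)
tr, ho = np.arange(n_train), np.arange(n_train, N)

b, a = np.polyfit(x[tr], y[tr], 1)               # slope, intercept
s = (y[tr] - (a + b * x[tr])).std(ddof=2)        # residual standard error

# (a) Theoretical SE of an INDIVIDUAL prediction (simple linear regression)
xbar = x[tr].mean()
sxx = ((x[tr] - xbar) ** 2).sum()
se_ind = s * np.sqrt(1 + 1 / n_train + (x[ho] - xbar) ** 2 / sxx)

# (b) Bagged SE: bootstrap the training rows, refit, then SD / sqrt(M - 1)
M = 100
boot = np.empty((M, ho.size))
for m in range(M):
    idx = rng.integers(0, n_train, n_train)      # resample rows with replacement
    bb, aa = np.polyfit(x[tr][idx], y[tr][idx], 1)
    boot[m] = aa + bb * x[ho]
se_bag = boot.std(axis=0, ddof=1) / np.sqrt(M - 1)

# Coverage of the holdout Y by prediction +/- 2 SE, for each SE
pred = a + b * x[ho]
cover_model = np.mean(np.abs(y[ho] - pred) <= 2 * se_ind)
cover_bag = np.mean(np.abs(y[ho] - pred) <= 2 * se_bag)
```

Under these assumptions the model-based intervals cover roughly 95% of the holdout values while the bagged-SE intervals cover very few, qualitatively in line with the 866/900 versus 222/900 counts reported in the post.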
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457894#M70306</link>
      <description>&lt;P&gt;"&lt;SPAN&gt;So, why would I choose the much wider confidence interval?&lt;/SPAN&gt;"&lt;/P&gt;
&lt;P&gt;Because you question the quality or the validity of the data or the model.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I would likely not use bagging and its predictions with a screening experiment because the model is likely biased. I would likely not use bagging and its predictions with a model based on a small sample of observational data.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let me be clear. Bagging is valid. It is not cheating. But it is not always appropriate or beneficial.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Unfortunately, bagging in the profiler is something JMP developed. I do not have external references.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 20:37:39 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457894#M70306</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2022-02-02T20:37:39Z</dc:date>
    </item>
    <item>
      <title>Re: Conceptual question about bagged predictions</title>
      <link>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457895#M70307</link>
      <description>&lt;P&gt;In my simulated example, I have no reason to question the quality of the data or appropriateness of the model.&amp;nbsp; It is not observational data, nor is it a small sample (true, n=100 out of a population of 1000 is not large, but the confidence interval coverage is so disparate that I think the example shows us something).&amp;nbsp; I won't say bagging is cheating (a loaded term).&amp;nbsp; But I don't feel like I can trust the standard errors that it provides even for my simulated case.&amp;nbsp; I am very reluctant to use it for a case where the data and model are more suspect.&amp;nbsp; Can you provide any guidance for where it can be used?&amp;nbsp; I don't mean to be antagonistic:&amp;nbsp; I love JMP and love the profiler, I'm just trying to see whether the confidence intervals it can provide are useful.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Feb 2022 21:13:14 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Conceptual-question-about-bagged-predictions/m-p/457895#M70307</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2022-02-02T21:13:14Z</dc:date>
    </item>
  </channel>
</rss>

