dale_lehman
Level VII

Conceptual question about bagged predictions

I have struggled for a long while looking for uncertainty measures from machine learning models that are comparable to the standard errors you routinely get from regression models.  Only recently have I become aware of some of the capabilities of the profiler - in particular, the bagged predictions.  But I don't really understand how to (or whether I should) interpret those bagged predictions.  When I run a machine learning model (for example, a neural net) and save the bagged predictions from the profiler, I get a bagged mean, the standard error of the bagged mean, and the bagged standard deviation.  Comparing these with a regression model (multiple regression, for example), I've observed the following relationships:

 

  • The predicted bagged mean from the NN is very similar to the prediction formula from the multiple regression.
  • As expected, mean prediction intervals from the multiple regression model are much narrower than the individual prediction intervals (in the example I am looking at, the standard error for the mean prediction is about 1/10 the size of the standard error of the individual predictions).
  • The standard error of the bagged mean from the NN is much smaller than the bagged standard deviation (about 1/10 the size in the example I am looking at).

These observations tempt me to think of the standard error of the bagged mean from the NN as analogous to the standard error of the mean predictions from the regression model.  Similarly, the bagged standard deviation may be similar to the standard error of the individual predictions from the regression model. 

 

However, the standard errors from the NN and the regression models do not resemble each other at all!  So, my question is whether my interpretation makes any sense - or, if it does not, exactly how the standard errors from the bagged mean can be interpreted or used.

 

Thanks in advance for any insights.  I have attached a concrete example in case it helps with my question (this is the validation data set from my modeling example, with the predictions from the multiple regression model and NN included).

26 REPLIES

Re: Conceptual question about bagged predictions

"Using approximate 95% confidence intervals (2 standard errors around the corresponding mean prediction, using the standard errors for the individual predictions), the coverage of the actual Y value was 866 out of 900 rows (around 95%) for the model predictions, but only 222 out of 900 for the bagged predictions."

Are you confusing the confidence interval for the mean with the prediction interval for individual observations? The former is an interval in which we expect the mean response to be located, assuming that the model is unbiased. Why would you expect individual observations to fall in that interval?

 

"Given how extreme the results are, this suggests to me that the standard errors from the model (at least for this well behaved model) are accurate measures of the uncertainty in the predictions."

No, I do not think your conclusion is reasonable. You are comparing 'coverage' based on individual responses, not the mean. The model is about location: it predicts the mean response, and the confidence interval is about the uncertainty in that mean. The confidence interval says that if you were to sample over and over again, 95% of the intervals would contain the true mean response. It says nothing about individual observations.
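
To see the distinction numerically, here is a minimal Python sketch (simulated data and statsmodels, not JMP or the attached file; the coefficients and noise level are made up) that computes both intervals and checks their coverage of individual observations:

```python
# Minimal sketch: coverage of individual y values by the mean confidence
# interval vs. the individual prediction interval, on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 900
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 2, n)          # assumed true model plus noise

X = sm.add_constant(x)
frame = sm.OLS(y, X).fit().get_prediction(X).summary_frame(alpha=0.05)

in_mean_ci = (y >= frame["mean_ci_lower"]) & (y <= frame["mean_ci_upper"])
in_pred_pi = (y >= frame["obs_ci_lower"]) & (y <= frame["obs_ci_upper"])
print(in_mean_ci.mean())   # far below 0.95: the mean CI is not built for this
print(in_pred_pi.mean())   # close to 0.95: the prediction interval is
```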

 

"If I run a classification model using NN, random forests, boosted trees, etc., one shortcoming compared with logistic regression is that these machine learning models do not provide a measure of uncertainty in the predictions"

Prediction is about the future. Performance in predicting future observations is the basis for 'honest assessment.' Cross-validation is one of the best ways to honestly assess the model during selection, and the final model after selection. We examine the validation hold-out set or the test hold-out set to assess the uncertainty. These models are all about prediction, not inference. That is not to say that they are not useful as explanatory models, but they might not completely satisfy you that way. So, for example: what is the R square on the same validation hold-out set across all the models? (Note that I am not using R square to select a model. I am using R square to evaluate selected models of various types.)
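
As an illustration of that last point, here is a small sketch (sklearn on simulated data; the models and hyperparameters are placeholders, not anything from the attached example) that scores already-fitted models of different types on the same hold-out set:

```python
# Sketch: compare selected models of different types on one hold-out R square.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, (900, 3))
y = X @ [2.0, -1.0, 0.5] + rng.normal(0, 2, 900)   # assumed data-generating model

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
models = {
    "regression": LinearRegression(),
    "random forest": RandomForestRegressor(random_state=0),
    "neural net": MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    print(name, r2_score(y_val, m.predict(X_val)))   # same hold-out set for all
```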

 

dale_lehman
Level VII

Re: Conceptual question about bagged predictions

My coverage data was from the individual prediction standard errors - I saved them from the regression model and used the bagged standard deviation from the profiler.  To look at the mean, I would have used the Std Err Predicted and the Std Error for the bagged mean.  As I said (read what I wrote), I don't know how to compute coverage for the mean prediction intervals, since I only have one observation for each row - if there is a way to do that, please tell me.

 

So, I used the individual predictions and JMP provides standard errors for the individual predictions as well as mean predictions (of course, the former are much larger than the latter).

 

Your other comment does not address my issues.  I know several ways to compare and evaluate the models.  What I am interested in is ways to examine the uncertainty associated with the models - not their predictive accuracy, but the variability of their predictions.  The only method I know of is conformal prediction.  It looked like the profiler provides such a capability by producing confidence intervals for the effects of the features in the model.  In fact, you appear to have suggested this yourself by objecting to my claims about the lack of interpretability of machine learning models, such as NN.

 

I think the profiler makes a big step towards interpretability.  It shows how varying the features affects the response variable.  If it can also provide standard errors (confidence intervals) for these effects, then it essentially provides just as much interpretation as regression models.  However, that is what I am struggling with - I am not convinced I can use those confidence intervals in that way.

dale_lehman
Level VII

Re: Conceptual question about bagged predictions

Trying to be more complete, attached is a simple little file with a single predictor.  I built the regression model and saved the predictions and standard errors (both mean and individual).  I used the profiler to save the bagged predictions.  There are 3 xy plots showing the equivalence (perfect in this case) of the predicted values from the model and from the profiler.  The standard error plots show strong correlations, but also show that the standard errors from the profiler are considerably smaller than from the model.  I strongly suspect the standard errors from the bagging procedure mean something different.  I hope you can clarify this for me.

peng_liu
Staff

Re: Conceptual question about bagged predictions

Let me try to explain what the Standard Error of the Bagged Mean is and why it appears to be smaller.

First, I save bagged predictions with just 3 bootstrap samples.

[screenshot: saving bagged predictions with 3 bootstrap samples]

Here is the result. There are 6 new columns.

[screenshot: the data table with the 6 new columns]

Let's look at what these columns are.

Each of the first three columns is a prediction column from a bootstrap sample of the data. We have three bootstrap samples, so we have three of them. Here are the three prediction formulas, and they are different:

(A) [screenshot: prediction formula from the first bootstrap sample]

(B) [screenshot: prediction formula from the second bootstrap sample]

(C) [screenshot: prediction formula from the third bootstrap sample]

Now look at the 4th column: Pred Formula Sales Bagged Mean 2.

(D) [screenshot: formula for Pred Formula Sales Bagged Mean 2]

So it is the row-wise average of those three predictions.
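
A tiny sketch of that computation, with made-up numbers standing in for the three saved prediction columns:

```python
# Sketch of what column (D) computes: the row-wise average of the three
# bootstrap prediction columns (values here are illustrative, not JMP's).
import numpy as np

# one row per observation, one column per bootstrap sample
boot_preds = np.array([
    [10.2, 10.6,  9.9],
    [11.5, 11.1, 11.8],
    [ 9.8, 10.3,  9.5],
])
bagged_mean = boot_preds.mean(axis=1)   # average across the 3 samples, per row
print(bagged_mean)                      # -> [10.233..., 11.466..., 9.866...]
```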

Now look at the 5th column: StdError Sales Bagged Mean 2.

(E) [screenshot: formula for StdError Sales Bagged Mean 2]

What is it? It looks like the standard error of some sort of mean. The mean of what? The mean of the bootstrapped predictions. In other words, this is not the same as the StdError of the predicted mean from the model; if you change perspective, each bootstrapped prediction is just one of many draws, and this column is the standard error of their average, so it shrinks like 1/sqrt(B) with the number of bootstrap samples B. That should explain why the StdErr of the Bagged Mean is smaller, and why it gets much smaller if you increase the number of bootstrap samples.
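
Here is a small numerical sketch of that argument (simulated predictions for a single row, with an assumed spread of 1): the bagged standard deviation stabilizes while the standard error of the bagged mean shrinks like 1/sqrt(B):

```python
# Sketch: column (E) is the standard deviation of the bootstrapped
# predictions divided by sqrt(B), so it shrinks as B grows.
import numpy as np

rng = np.random.default_rng(0)
for B in (3, 100, 10000):
    preds = 10 + rng.normal(0, 1, B)          # B bootstrapped predictions, one row
    bagged_sd = preds.std(ddof=1)             # stays near 1 for any B
    se_bagged_mean = bagged_sd / np.sqrt(B)   # ~ 1/sqrt(B): about 0.58, 0.10, 0.01
    print(B, bagged_sd, se_bagged_mean)
```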

Now look at the last column; here is its formula:

(F) [screenshot: formula for the bagged standard deviation column]

This is the standard deviation of the bootstrapped predictions. This value, indeed, should be close to "StdErr Pred Sales", and it won't get smaller as the number of bootstrap samples gets large. This is the bootstrapped version of your "StdErr Pred Sales". If you suspect the model is not a good fit to your data, e.g. assumptions might be violated, such that "StdErr Pred Sales" no longer has the textbook asymptotic distribution, the bootstrapped version might give you something more honest, if you have to use the wrong model for prediction.
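
To illustrate, here is a sketch (simulated data and statsmodels, not JMP; the model and noise level are made up) that bootstraps a simple regression and compares the row-wise standard deviation of the bootstrapped predictions to the analytic standard error of the predicted mean from a single fit:

```python
# Sketch: the Bagged Std Dev column tracks the analytic standard error of
# the predicted mean, and it does not shrink as B grows.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, B = 200, 500
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 2, n)
X = sm.add_constant(x)

boot_preds = np.empty((B, n))
for b in range(B):
    idx = rng.integers(0, n, n)              # resample rows with replacement
    boot_preds[b] = sm.OLS(y[idx], X[idx]).fit().predict(X)

bagged_sd = boot_preds.std(axis=0, ddof=1)   # the Bagged Std Dev column
analytic_se = sm.OLS(y, X).fit().get_prediction(X).se_mean
print(np.median(bagged_sd / analytic_se))    # close to 1
```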

So why should you care about "StdError Sales Bagged Mean 2" (E) up there? This quantity is there to assess whether you have enough bootstrap samples. Ideally, you want these values to be very small; then you may trust your bagged values.

dale_lehman
Level VII

Re: Conceptual question about bagged predictions

Thank you.  I have reached the same conclusion.  The remaining question, then, is which should be used to produce confidence intervals for the model effects.  I believe it is the Bagged Standard Dev and not the Standard Error of the Bagged Mean - and I also believe JMP has been using (at least in the instructions) the latter to produce confidence intervals rather than the former.

 

Please either confirm this - or not.  To elaborate further: I think the Standard Error of the Bagged Mean is useful for gauging whether you have enough bagged (bootstrap) samples to be satisfied.  But for gauging the uncertainty in the effect of your factors on your response variable, the Standard Error of the Bagged Mean does not provide any useful information; it is the Bagged Standard Dev that does.  So, to produce the confidence intervals, after saving the bagged means, open the Profiler, select the bagged means and bagged standard deviations, check the box to expand intermediate formulas, and the result will be the desired confidence intervals.  Note that this differs from the JMP help:

https://www.jmp.com/support/help/en/16.1/?os=win&source=application&utm_source=helpmenu&utm_medium=a...
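
To make the difference concrete, here is a sketch with illustrative numbers (B = 100 bootstrap samples, a made-up bagged mean and bagged standard deviation for one row) comparing the two intervals:

```python
# Sketch: intervals from the Bagged Std Dev vs. the StdError of the Bagged
# Mean. With B = 100, the latter is 10x narrower, since SE = SD / sqrt(B).
import numpy as np

bagged_mean, bagged_sd, B = 10.0, 1.2, 100   # assumed saved values, one row
se_bagged_mean = bagged_sd / np.sqrt(B)

print(bagged_mean - 2 * bagged_sd, bagged_mean + 2 * bagged_sd)            # (7.6, 12.4)
print(bagged_mean - 2 * se_bagged_mean, bagged_mean + 2 * se_bagged_mean)  # (9.76, 10.24)
```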

peng_liu
Staff

Re: Conceptual question about bagged predictions

Yes, I confirm that in the Profiler platform (under the Graph menu), when a Prediction Bagged Mean formula, accompanied by a StdError of the Bagged Mean formula, enters the launch dialog, the Profiler platform uses the "StdError of the Bagged Mean" to produce intervals.

I agree that some things are not clear enough and could be improved. What I see as unclear is the use of "Confidence Intervals" in the documentation on the page that you pointed to: Example of Confidence Intervals Using Bagging. It is not clear what these are confidence intervals of, and the wording is probably misleading.

Meanwhile, your request to use the Bagged Std Dev to produce intervals is a good one. I tried to put that column into the Profiler platform, and it won't recognize it.

Would you please reach JMP Tech Support at support@jmp.com and report your concerns and request?

Thank you very much for looking into the software so closely!

 

dale_lehman
Level VII

Re: Conceptual question about bagged predictions

Thanks.  I think these confidence intervals will be extremely useful - after they are corrected.  At present, they are quite misleading: by an order of magnitude if you use 100 bootstrap samples, since the standard error of the bagged mean is the bagged standard deviation divided by sqrt(100) = 10.