Hi @ChristerMalm,
Yesterday's unsession on predictive modeling was very informative. I don't recall the exact discussion on that topic, but I would be cautious about using a prediction formula for an outcome as one of the inputs to a model for that same outcome. My understanding of ensemble modeling is that you build several different models, e.g., a neural network, a boosted tree, XGBoost, etc., and then average their predictions. You can do this through the Model Comparison platform using the red triangle menu, or you can write a column formula that averages the saved prediction columns for you. The ensemble should fit the data better than any single model, but that's not always the case. One thing to keep in mind: it really depends on what the end goal is and how important interpretability is for the model. If understanding the model equation and how inputs translate to outputs matters, then model ensembles and complicated non-linear models might not be the best way to go. A rough sketch of the averaging idea follows.
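Outside of JMP, that averaging step looks roughly like this in Python with scikit-learn. It's a minimal sketch: the synthetic data, the three model choices, and the 0.5 cutoff are all placeholder assumptions, not anything from your project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for your table.
X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

models = [
    GradientBoostingClassifier(random_state=1),    # boosted tree
    RandomForestClassifier(random_state=1),        # bagged trees
    MLPClassifier(max_iter=1000, random_state=1),  # neural network
]

# Fit each model and average the predicted probabilities -- the same
# thing a column formula over saved prediction columns would do in JMP.
probs = [m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in models]
ensemble_prob = np.mean(probs, axis=0)
ensemble_pred = (ensemble_prob >= 0.5).astype(int)
print("ensemble accuracy:", np.mean(ensemble_pred == y_test))
```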
I think one of the things Russ Wolfinger talked about that's really important is finding the right validation scheme for your data. If the wrong split of training and validation data is chosen, you can get terrible (or misleadingly good) fits even when the data are good, like the Nature paper example he talked about. It would probably be best to spend time determining the right validation scheme first, and only then work on generating the model. The sketch below shows one common way a split goes wrong.
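Here's a generic Python illustration of that failure mode (not the actual setup from the Nature example, and the data is synthetic): each row belongs to a group, the label is a property of the group, and the features carry a group "fingerprint". A plain random split lets the model memorize fingerprints it will see again in validation; a group-aware split shows there is nothing generalizable to learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
n_groups, per_group = 40, 10
groups = np.repeat(np.arange(n_groups), per_group)

# The label is a fixed property of each group, unrelated to any
# feature pattern that generalizes across groups.
group_label = rng.integers(0, 2, size=n_groups)
y = group_label[groups]

# Features carry a group-specific offset (a "fingerprint") plus noise.
group_offset = rng.normal(size=(n_groups, 5))
X = group_offset[groups] + 0.1 * rng.normal(size=(n_groups * per_group, 5))

model = RandomForestClassifier(random_state=0)
naive = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0))
honest = cross_val_score(model, X, y, cv=GroupKFold(5), groups=groups)

print(f"random split accuracy:   {naive.mean():.2f}")   # optimistically high
print(f"group-aware CV accuracy: {honest.mean():.2f}")  # near chance
```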
It also sounds like you might have an imbalanced data set, meaning you have many more observations of one class than the other. In that case, you'll definitely want to explore changing the logistic probability threshold to see if that improves your predictions. Also consider the profit matrix approach, which assigns costs to false positives and false negatives so the cutoff reflects their real-world consequences. Chris Gotwalt also demonstrated an interesting approach of using decision trees to better understand cutoffs with imbalanced data. A sketch of threshold tuning against a profit matrix follows.
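As a sketch of those two ideas together (synthetic imbalanced data, and the profit matrix numbers are made up; substitute your own costs):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced placeholder data: ~90% negatives, ~10% positives.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Hypothetical profit matrix: a missed positive (FN) costs 50,
# a false alarm (FP) costs 5, and a caught positive (TP) earns 20.
COST_FN, COST_FP, GAIN_TP = 50, 5, 20

def profit(threshold):
    pred = prob >= threshold
    tp = np.sum(pred & (y_te == 1))
    fp = np.sum(pred & (y_te == 0))
    fn = np.sum(~pred & (y_te == 1))
    return GAIN_TP * tp - COST_FP * fp - COST_FN * fn

# Scan cutoffs instead of assuming the default 0.5.
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=profit)
print(f"profit at default 0.50 cutoff: {profit(0.50)}")
print(f"best cutoff {best:.2f}, profit: {profit(best)}")
```

With heavy imbalance and an expensive false negative, the profit-maximizing cutoff usually lands well below 0.5, which is exactly the kind of adjustment the threshold exploration is after.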
Hope this helps,
DS