Mathej01
Level III

Statistical Significance

Hello,

 

So I fitted a regression model, and one of the predictors has a p-value of 0.0538 in the full model. Should I completely remove this predictor and use the reduced model for predictions, or is it OK to keep it?

And what is the terminology for p-values that are around 0.05? I saw in some web sources that p-values between 0.05 and 0.1 can be called marginally significant. Does this term make any sense when explaining statistical models? Please share your insights.

 

Thanks in advance

Jerry


4 REPLIES 4
P_Bartell
Level VIII

Re: Statistical Significance

Here's my suggestion with regard to whether or not to keep the effect with a p-value of 0.0538: try it both ways. Then answer the question 'Which model helps me solve my practical question?' and go with that model. There have been numerous papers and discussions around leaving terms in a model or taking them out. Ultimately it's up to you; there is no 'one size fits all' solution to this particular question. There is nothing sacred or sacrosanct about a significance threshold of 0.05. It's just a value many people use by rote rather than for any other reason. You could just as easily select 0.01, 0.1, or any other value. As for the terminology, I wouldn't worry about it... all I'd say is something like 'Effect x is not significant at a p-value of 0.05' and leave it at that. Significance is not a cliff... and p-values tend to push people's thinking into that cliff mentality.
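To make "try it both ways" concrete, here is a minimal sketch (in Python with statsmodels rather than JMP; the file name and the column names y, x1, x2, x3 are placeholders for your own data, with x3 standing for the p = 0.0538 term) that fits both models and compares them on the numbers you would actually act on:

```python
# A minimal sketch of "try it both ways": fit the full and the reduced model,
# then compare them on the quantities that drive the practical decision.
# File and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("my_data.csv")

full    = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
reduced = smf.ols("y ~ x1 + x2", data=df).fit()

for name, m in [("full", full), ("reduced", reduced)]:
    print(f"{name:8s} AIC={m.aic:8.1f}  BIC={m.bic:8.1f}  adj R2={m.rsquared_adj:.3f}")

# If the two models lead to essentially the same predictions (and the same
# decision), the disputed term is not the deciding factor either way.
print((full.predict(df) - reduced.predict(df)).abs().describe())
```

Whichever model you end up keeping, the comparison itself is usually more informative than the 0.05 cutoff.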

dlehman1
Level V

Re: Statistical Significance

Let me add to P_Bartell's sound advice. Not only is nothing sacred about p = 0.05, I would advise against any threshold value without doing some kind of decision analysis. If you use a threshold to make binary decisions (such as keeping a variable in your model or not), realize that either way there is some possibility of error involved; and realize that perhaps no threshold is necessary at all - usually I'd advise building the model and reporting the results, regardless of p-values. You should think about the relative costs of those errors before deciding what cutoff to use.

 

Beyond that, cutoffs are generally a bad idea.  I don't advise dropping "insignificant" variables from regression models - ever.  The variables presumably were there for a reason.  You thought they should matter.  If the data is insufficient to reveal their influence, then so be it.  Dropping the variable amounts to deciding the variable has no influence, something you don't have evidence for.  Further, all the other variables in your model will then have their coefficients changed - are you prepared to say these will now be more accurate than they were before?

 

Despite the widespread use of p = 0.05, you should check the ASA controversy over p-values and some of its recent statements on the subject. Some psychology journals have gone so far as to ban p-values, while many other journals will still reject papers whose p-values are not below 0.05. In the midst of such confusion, here is my own opinion. I want to know the p-value - I believe it is an indication of how strong the signal is for that variable in the data I have (note that even that statement is open to considerable criticism). I prefer not to see any binary choices made: I try never to say that variable X does or does not influence outcome Y. It is always a matter of uncertainty and strength of evidence. The p-value is part of that evidence, and when it is high it is telling you something. Omitting a variable because of its p-value is deciding something that is not evidence-based (it is tantamount to saying that variable X does not matter). If you have few observations relative to the number of potential factors, you can use Predictor Screening to assist with choosing which ones to use, but nothing will change the fact that your data may not be sufficient for the models you would like to build.
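One way to report strength of evidence rather than a binary keep/drop decision is to show each estimate together with its confidence interval. A minimal sketch (Python/statsmodels rather than JMP; file and column names are hypothetical placeholders):

```python
# Report each coefficient with its uncertainty instead of a binary keep/drop call.
# File and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("my_data.csv")
m = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

ci = m.conf_int()                      # 95% intervals by default
report = pd.DataFrame({
    "estimate": m.params,
    "p_value": m.pvalues,
    "ci_low": ci[0],
    "ci_high": ci[1],
})
print(report.round(4))
```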

P_Bartell
Level VIII

Re: Statistical Significance

Thanks for echoing my sentiments exactly. When I used to teach hypothesis testing, I would introduce the concept of p-values and statistical significance with a little exercise. I'd take out a coin and tell the students, 'We're gonna do a little test. I'll start by telling you that I believe I have a fair coin. I'm gonna flip the coin and tell you the result of each toss, heads or tails. I want you to write down the toss on which you are willing to challenge my belief that the coin is fair.' Then I'd start tossing the coin, each time telling the students it was heads, even if it wasn't. My point was to get them to determine when the 'significance threshold' is crossed for them and them alone. Usually by the eighth toss or so, everybody has 'rejected' my belief. Then I start with this line of questioning: 'Tell me, by show of hands, the toss on which you were finally convinced the coin was indeed not fair. One toss? (Usually nobody raises a hand.) Two tosses? (Rarely.) Three tosses? (Might get a few.) Four tosses? (Now I've got many.) Five tosses? (Mostly everyone.) Six tosses? (Only a few holdouts.)' Then we have a conversation about how the notion of 'significance' is really in the eyes of the decision maker, not some magical probabilistic number. I tell them that hopefully they've now learned all there is to know about statistical significance, these things we'll be calling p-values, and how decision making is actually done by people: it's a degree of belief, and that, my friends, is not a statistical value.
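If you want to attach numbers to the coin story: the chance that a fair coin comes up heads on every one of k consecutive tosses is 0.5^k, so the toss at which each student challenges the claim corresponds to a personal p-value threshold. A tiny sketch:

```python
# The probability of k heads in a row from a fair coin is 0.5**k; the toss at
# which a student challenges the "fair coin" claim is their personal threshold.
for k in range(1, 9):
    print(f"toss {k}: P(all heads so far | fair coin) = {0.5 ** k:.4f}")
```

Challenging at the fifth toss corresponds to a threshold of about 0.03, and the eighth toss to about 0.004.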

Victor_G
Super User

Re: Statistical Significance

Hi @Mathej01,

 

You might be interested in the answers provided by other members of this Community on a very closely related topic: removing terms from a model following a designed experiment.

There is not enough context or information here to give a definitive answer, but here are some questions, remarks and considerations about model evaluation, comparison and selection:

  • What is your data source and how was the data collected? A designed experiment, a historical dataset, ...?
    Depending on your answer, you may have more or less confidence in the p-values found. For a historical dataset, I would run the Evaluate Design platform (used for DoE) to assess the power available for the different terms in the complete assumed model. This can give you an indication of the "reliability" of the p-values found, and in particular of those for non-significant terms (which may simply reflect very low available power!). A rough simulation analogue is sketched just after this list.

  • What is your objective with this model? Causal explanation, prediction & optimization, or both (this is also linked to the available dataset and collection method)?
    When comparing models, statistical significance may be one interesting metric, but it is not the only one (always add domain expertise!). Complementary estimation and evaluation metrics, such as log-likelihood, information criteria (AICc, BIC) or model metrics (explanatory power through R2 and adjusted R2, predictive power through MSE/RMSE/RASE, ...), offer different perspectives and may highlight different models.
    You can then select, based on domain expertise and statistical evaluation, the model(s) most appropriate and relevant for your topic, and choose to estimate individual predictions with the different models (to see how and where they differ), and/or use a combined model to average out the prediction errors.
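If the Evaluate Design platform is not convenient for your dataset, a rough simulation can give the same kind of power indication. The sketch below is in Python rather than JMP, and the effect size, noise level, sample size and threshold are assumptions to replace with values plausible for your own study:

```python
# Rough simulation-based power check for a single regression term, analogous in
# spirit to what Evaluate Design reports. The effect size, noise level, sample
# size and threshold below are assumptions, not values from the original post.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, beta, sigma, alpha = 30, 0.5, 1.0, 0.05
n_sim, hits = 2000, 0

for _ in range(n_sim):
    x = rng.uniform(-1, 1, size=n)
    y = beta * x + rng.normal(0, sigma, size=n)
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    hits += fit.pvalues[1] < alpha   # did the term reach significance this time?

print(f"Estimated power: {hits / n_sim:.2f}")
# Low power means a non-significant p-value is weak evidence that the effect is absent.
```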

 

In general, avoid the "cult of statistical significance", for several reasons:

  1. There are other available and complementary metrics to help you compare and choose the most relevant model(s), and there are many situations in which p-values can be distorted and should not be blindly trusted (for example with mixture designs and the use of models with no intercept).
  2. Removing a term because of a specific threshold can be a dangerous practice. In your example, should the predictor with a p-value of 0.0538 really be discarded because it didn't reach 0.05? It seems too close to simply reject it, even though it is above 0.05 (and I'm not even questioning the specific threshold value here...). And depending on the power available for this predictor, low power could explain why you didn't reach the significance threshold. I would rather keep non-significant terms in my model than discard a possible (but not yet detected) significant effect.
  3. "All models are wrong, but some are useful" is an apt quote when dealing with multiple possible models and trying to evaluate and select some of them to answer a question. There might be several equally "statistically good" models in your case, so you should always use domain expertise as well to sort the models and guide the selection. You can also select several models and average/combine them.
  4. Don't confuse statistical significance with practical significance, and use both to choose your model(s). Since you want your model to be useful, you should also look at the practical significance of this predictor: how much does it influence the response (effect size)? If this predictor has a strong effect size, it may be wiser to keep it in your model (see the sketch just after this list).
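To make item 4 concrete: one simple way to gauge practical significance is to look at how much the predicted response changes when the predictor moves across its observed range, holding the other predictors fixed. A minimal sketch, again with hypothetical file and column names (x3 standing for the disputed term):

```python
# Practical significance: predicted change in the response when x3 moves from
# its observed minimum to its maximum, with the other predictors at their means.
# File and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("my_data.csv")
m = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

lo = pd.DataFrame({"x1": [df["x1"].mean()], "x2": [df["x2"].mean()], "x3": [df["x3"].min()]})
hi = lo.assign(x3=df["x3"].max())
effect = (m.predict(hi) - m.predict(lo)).iloc[0]
print(f"Predicted change in y over the observed range of x3: {effect:.3f}")
```

Comparing that change to a practically meaningful difference in the response is often more informative than the p-value alone.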

 

Hope this complementary answer will help you,

Victor GUILLER
L'Oréal Data & Analytics

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)