ron_horne
Super User

Parameterization choice and model comparison in logistic regression.

Dear members of the forum,

I am analyzing a real-life data set with a binary dependent variable (Target), using logistic regression for hypothesis testing rather than prediction.

I would like to use the model parameterization to assess whether a steady trend over time (Year) explains the data better, or whether a one-time change is more meaningful as a breaking point.

When the two variables are used separately, the models are similar, with perhaps a better fit using the continuous Year variable. On the other hand, the technical change is the reason for modeling in the first place, so it can’t be ignored.

Once both are put together with their interaction, all parameters are insignificant (perhaps because of multicollinearity, small sample size, too much variance, or all of these), which gives the false impression of no change in propensity over time.

Please do let me know if you have any suggestions in terms of model specification or comparisons for reaching a clearer conclusion.

Attached is the data table with a script for the alternative models.
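
One possible contributor to the instability described above is collinearity between the Change indicator and the Change-by-Year product when Year is left on its raw scale. The sketch below is only an illustration: the years (2000 to 2009) are made up, the break at 2005 is the one discussed later in the thread, and the variable names are hypothetical. It compares the correlation between the indicator and the product term with and without centering Year at the break.

```python
import math

def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

BREAK_YEAR = 2005                   # assumed break point
years = list(range(2000, 2010))     # made-up example years, one row per year
change = [1 if y >= BREAK_YEAR else 0 for y in years]  # 0 = pre, 1 = post

# Product term built from raw Year versus Year centered at the break.
raw_product = [c * y for c, y in zip(change, years)]
centered_product = [c * (y - BREAK_YEAR) for c, y in zip(change, years)]

r_raw = pearson(change, raw_product)
r_centered = pearson(change, centered_product)
print(f"raw Year:      corr(Change, Change*Year)          = {r_raw:.4f}")
print(f"centered Year: corr(Change, Change*(Year - 2005)) = {r_centered:.4f}")
```

Centering does not change the fitted probabilities, since both versions span the same model. It only changes what the individual coefficients mean and how strongly their estimates are tied together.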

 

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Parameterization choice and model comparison in logistic regression.

Use the Logistic Regression platform. Use date, not year. Create a binary variable for period (before, after). Fit a model where the linear predictor is date, period, and period*date. Change the period values based on date. (Imagine that the observations are sorted by date in ascending order. Set all rows to period = after for the first fit. Now advance period to the second date, so period = before for row 1 and the remaining rows stay at period = after. Progress until period = before for all rows.) Collect the AICc criterion for each fit. Plot the criterion versus date to see where the change occurred. Evaluate that model as you normally would.

 

Note that you might want to avoid the boundary issue with all rows having the same level for period.
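
The scan described above can be sketched in a few lines. This is a simplified illustration, not the full procedure: to stay free of any fitting library it uses a step-only model (intercept plus period, no date or period*date terms), whose logistic fit has a closed form because the fitted probabilities are just the proportions in each period. The data, the function names, and the choice of k = 2 parameters are assumptions made for the example. The earliest date is skipped as a cut so that neither period is empty, which is the boundary issue noted above.

```python
import math

def bernoulli_loglik(successes, n):
    """Maximized log-likelihood of n binary outcomes sharing one probability."""
    if successes in (0, n):
        return 0.0  # fitted probability is 0 or 1, so every term is log(1)
    p = successes / n
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

def aicc(loglik, k, n):
    """AICc = -2 logL + 2k + 2k(k+1)/(n-k-1)."""
    return -2 * loglik + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def scan_change_points(dates, y):
    """Return [(cut, AICc)]; rows with date < cut are 'before', the rest 'after'."""
    n = len(y)
    results = []
    for cut in sorted(set(dates))[1:]:  # skip earliest date: 'before' would be empty
        before = [v for d, v in zip(dates, y) if d < cut]
        after = [v for d, v in zip(dates, y) if d >= cut]
        ll = (bernoulli_loglik(sum(before), len(before))
              + bernoulli_loglik(sum(after), len(after)))
        results.append((cut, aicc(ll, k=2, n=n)))
    return results

# Made-up example data: one observation per year.
dates = list(range(2000, 2010))
target = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1]

scan = scan_change_points(dates, target)
best_cut, best_aicc = min(scan, key=lambda r: r[1])
for cut, value in scan:
    print(f"cut at {cut}: AICc = {value:.3f}")
print(f"lowest AICc at cut {best_cut}")
```

For the full model with date, period, and period*date, the closed-form step would be replaced by an iterative logistic fit for each cut, and k would grow to match the number of estimated parameters.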

 

I am not sure if this approach will help. Just an idea.

Learn it once, use it forever!


6 REPLIES 6
dale_lehman
Level VI

Re: Parameterization choice and model comparison in logistic regression.

When I look at your models, it looks like year and change are highly significant individually. When you add the cross (interaction) effect, none of the parameter estimates look significant, but the overall model is. But I don't understand the sense of putting an interaction between year and change in your model: the change variable just reports whether the year is before or after some intervention time, so the interaction doesn't make much sense to me. I suppose what you are asking is whether the year has a different effect on the target in the Pre and Post periods. I think the more direct way to examine this is to Fit Y (target) by X (year) and put the change (Pre and Post) in the By box. When you do this you will see that year is far from significant in the Pre period but quite significant in the Post period. I think that answers your question. If you need a test of whether the slope (nonlinear here) is significantly different in the two periods, I'm not exactly sure what the appropriate test is, but I think it is evident that the effect of year differs significantly in the two periods.
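
One standard option for the question left open above is a likelihood-ratio test between two nested logistic models: Year + Change (one common slope) versus Year + Change + Year*Change (a slope per period). When Year is the only other term, the log-likelihood of the larger model equals the sum of the log-likelihoods from the two By-group fits. The sketch below assumes those log-likelihoods have already been read off the reports. The numbers are placeholders, not results from the attached data, and the function name is hypothetical.

```python
import math

def lr_test_1df(loglik_reduced, loglik_full):
    """Likelihood-ratio test for nested models that differ by one parameter.

    Returns (statistic, p_value). For 1 degree of freedom the chi-square
    upper tail is erfc(sqrt(x / 2)).
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Placeholder log-likelihoods (NOT from the thread's data):
ll_reduced = -60.0   # Year + Change
ll_full = -57.5      # Year + Change + Year*Change (or sum of the two By-group fits)

stat, p = lr_test_1df(ll_reduced, ll_full)
print(f"LR statistic = {stat:.3f}, p = {p:.4f}")
```

A small p-value here would indicate that allowing a separate slope per period improves the fit beyond what a common slope achieves.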

ron_horne
Super User

Re: Parameterization choice and model comparison in logistic regression.

Thank you @dale_lehman  for the advice.

 

That is the thing: ideally, I would be able to assess whether there is an overall trend over time as well as a breaking point at 2005. The use of both the year and the dummy variable was supposed to allow exactly that. This way I would have an estimate for the slope in each period as well as a dummy for continuity between them.

I can estimate the slope for the post period directly by reversing the value ordering of “Change”, as in the attached output. I get the same insignificance. Splitting the data, I get an overall significant model with an insignificant slope, which to me shows this trend is not very robust. Am I right?

From all the alternative models, could you suggest a logical path to determine whether there is a trend in each period and whether there is a breaking point (change in trend)?
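
To make the "slope in each period plus a dummy" reading concrete, the sketch below writes out the linear predictor with Year centered at 2005 and a 0/1 indicator for the post period. The coefficient values are invented for illustration, and the 0/1 coding is an assumption: software that codes nominal effects differently (for example with -1/+1 levels) will report different-looking coefficients for the same fitted model. The last step shows why reversing the coding of Change turns the Year coefficient into the post-period slope without changing the model.

```python
BREAK_YEAR = 2005  # break point discussed in the thread

def log_odds(year, post, b0, b_year, b_change, b_inter):
    """Linear predictor: b0 + b_year*t + b_change*post + b_inter*post*t, t = year - 2005."""
    t = year - BREAK_YEAR
    return b0 + b_year * t + b_change * post + b_inter * post * t

# Invented coefficients, for illustration only.
coef = dict(b0=-1.0, b_year=0.1, b_change=0.5, b_inter=0.2)

pre_slope = log_odds(2004, 0, **coef) - log_odds(2003, 0, **coef)   # b_year
post_slope = log_odds(2007, 1, **coef) - log_odds(2006, 1, **coef)  # b_year + b_inter
jump = log_odds(2005, 1, **coef) - log_odds(2005, 0, **coef)        # b_change

# Same model with the indicator reversed (1 = pre): the Year coefficient
# becomes the post-period slope and the other terms adjust accordingly.
rev = dict(
    b0=coef["b0"] + coef["b_change"],
    b_year=coef["b_year"] + coef["b_inter"],
    b_change=-coef["b_change"],
    b_inter=-coef["b_inter"],
)
same_model = all(
    abs(log_odds(y, post, **coef) - log_odds(y, 1 - post, **rev)) < 1e-12
    for y in range(2000, 2010)
    for post in (0, 1)
)

print(f"pre slope = {pre_slope:.2f}, post slope = {post_slope:.2f}, jump = {jump:.2f}")
print("reversed coding gives the same predictions:", same_model)
```

In this parameterization the interaction coefficient is the change in slope and the Change coefficient is the jump in log-odds at 2005, so the two questions (trend in each period, break between them) map onto separate terms.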

dale_lehman
Level VI

Re: Parameterization choice and model comparison in logistic regression.

I think you are too focused on finding a low p-value.  It looks like your data shows what you are looking for - the year has no clear effect prior to 2005, but a significant effect after.  This is readily seen by running two separate logistic regressions - target as a function of year, in the pre- and post- periods.  What more do you need?


Re: Parameterization choice and model comparison in logistic regression.

I hope that this reply follows the first three!

 

The problem sounds like 'change point analysis' and the approach is very similar to those used in this technique. You have a sliding definition of period 1 and period 2 and use a metric like minimum AICc to decide the point where the behavior (e.g., mean, slope, et cetera) changed. That is, as the change point varies, you re-fit the model and capture the metric, then plot the metric versus the change point.

Learn it once, use it forever!
ron_horne
Super User

Re: Parameterization choice and model comparison in logistic regression.

@dale_lehman 

Thank you for your answer. I was hoping there was another way of introducing the Year and Change variables in a model, since this is not the classical ANCOVA where the two independent variables represent completely different dimensions (e.g., age and gender).

 

@markbailey 

Thanks for the lead, but I am not at all sure my data is appropriate. Could you tell me how to go about it in terms of platform and variable roles?

My doubt about the data is this: it is real data from political science (the numbers are real, the variable names changed), so the observations are not exactly sequential. They do have an exact date, but it is not meaningful within the year. I am also not sure I should be looking for the best-fit split in the data. The technical change may have taken time to “kick in”, but there is no interest in estimating that lag (which would be very important in process monitoring). I just need a way of estimating the probability of target over time and comparing the two periods (pre and post) while controlling for the other variables.

I was hoping I could crudely estimate the trend using the Year variable and test whether there is a change in it, or perhaps gain insight that it is just a drop between pre and post.

 

This discussion is helping me go through the whole thinking process of the modeling and the context. Thanks a lot!

