Can Logistic Regression estimate original model parameters from simulated data?

ranjan_mitre_or

Community Trekker

Joined:

Oct 16, 2013

I am new to JMP and logistic regression, and I would like to test how well logistic regression estimates model parameters. I simulated categorical response data from model parameters (betas) that I chose for the simulation. The model includes all types of independent variables: nominal, ordinal, and continuous. I generated each independent variable with a random number generator, calculated the probability of the response for each randomly generated instance of the independent variables, and then generated the response variable (1 or 0) according to that probability. I repeated this until I had a large number of data points (each data point being one instance of all the independent variables plus the corresponding response), which I fed into the logistic regression analysis. My questions are:

1. Can I expect the logistic regression to estimate the original betas?

2. Should the estimates asymptotically approach the true values of betas (that I used to generate the data) as I increase the number of data points?

3. What happens with nominal variables, which logistic regression estimates through dummy variables? I have an independent variable that takes 3 nominal values, and I used 3 corresponding betas to generate the simulated data. The regression output reports only two betas, and the third value is the negative of the sum of these two (if I understand the method correctly).
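Questions 1 and 2 can be checked with a small simulation. The following is a minimal sketch, not JMP's implementation: it assumes a single continuous predictor, hypothetical true values b0 = -0.5 and b1 = 1.2, and a hand-written Newton-Raphson fit (the helper names `simulate` and `fit_logistic` are illustrative only).

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def simulate(n, b0, b1, rng):
    # Draw x ~ N(0, 1), then y ~ Bernoulli(sigmoid(b0 + b1 * x)).
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(b0 + b1 * x) else 0 for x in xs]
    return xs, ys

def fit_logistic(xs, ys, iters=15):
    # Maximum likelihood for intercept and slope via Newton-Raphson.
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

TRUE_B0, TRUE_B1 = -0.5, 1.2   # assumed values, chosen only for illustration
for n in (200, 2000, 20000):
    xs, ys = simulate(n, TRUE_B0, TRUE_B1, random.Random(1))
    est0, est1 = fit_logistic(xs, ys)
    print(n, round(est0, 3), round(est1, 3))
```

With a correctly specified model, the printed estimates should generally move closer to the assumed true values as n grows, although any single run still carries sampling error.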

Any help in developing this understanding will be appreciated.


Thanks in advance.

7 REPLIES
reeza

Community Trekker

Joined:

Jun 23, 2011

I'd like to say no, because you can't account for the covariance between the variables, but my stats knowledge is very fuzzy today.


Generally, though, if your model estimates multiple parameters at once or includes any interaction terms, you may not be able to recreate your betas.

If you're doing it one variable (univariate) at a time perhaps.

ranjan_mitre_or

Community Trekker

Joined:

Oct 16, 2013

Thanks. I did not use any interactions between the independent variables. Each IV is independent of the others.

Your answer gives me one thing to try. I will simulate the data with one variable and see if at least that allows me to estimate my "true" beta from the simulated data.

Any suggestions on my third question, about interpreting the dummy variables used in place of the nominal variables?

reeza

Community Trekker

Joined:

Jun 23, 2011

Re: third question, you're trying to determine the coefficients of 2 variables together, so it comes back to the initial problem of covariance.

rick_sas

Staff

Joined:

Jun 23, 2011

1) Yes, generically.

2) Yes.

3) It depends on the parameterization, but the reference level is usually combined with the intercept estimate.

For more on this topic, see Chapters 11-12 in Simulating Data with SAS, particularly Section 12.2.2.

For a discussion of the effect parameterizations, see SAS/STAT(R) 12.3 User's Guide
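To illustrate point 3, here is a small sketch of reference-level (dummy) coding, assuming a hypothetical intercept of -0.4 and level coefficients 0.8, -0.3, and 1.0 with A3 as the reference. The helper `to_reference_coding` is an illustrative name, not a SAS or JMP function. It folds the reference level's coefficient into the intercept and shows that the fitted probabilities are unchanged.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def to_reference_coding(intercept, level_betas, ref):
    # Absorb the reference level's coefficient into the intercept;
    # every other level is reported as a contrast against the reference.
    b_ref = level_betas[ref]
    new_intercept = intercept + b_ref
    contrasts = {k: b - b_ref for k, b in level_betas.items() if k != ref}
    return new_intercept, contrasts

true_intercept = -0.4                            # assumed for illustration
true_betas = {"A1": 0.8, "A2": -0.3, "A3": 1.0}  # assumed for illustration
ref_intercept, contrasts = to_reference_coding(true_intercept, true_betas, "A3")

for level, b in true_betas.items():
    p_original = sigmoid(true_intercept + b)
    # The reference level has no contrast term, so it uses 0.0.
    p_reference = sigmoid(ref_intercept + contrasts.get(level, 0.0))
    print(level, round(p_original, 6), round(p_reference, 6))
```

Under this coding, only the intercept plus two contrasts are identifiable, which is why a fit reports fewer coefficients than the three values used in the simulation.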

ranjan_mitre_or

Community Trekker

Joined:

Oct 16, 2013

Thanks. I looked into Section 12.2.2 of your book. The simulation in that example contains only continuous variables, for which the betas are estimated. My question is about nominal variables, which are implemented as dummy variables. If a variable takes A1, A2, and A3 as possible values, with coefficients b1, b2, and b3, for example, JMP outputs only the intercept plus b1 and b2 for the dummy variables. How do I convert these estimates so I can compare them with the original b1, b2, and b3 that I used to simulate the data?

rick_sas

Staff

Joined:

Jun 23, 2011

Section 11.5.2 discusses this a little, especially p. 217, where I note that the parameter estimates are not unique. However, if you use a simulation model that matches the parameterization of the analysis (as I do on pp. 216-217), you can often verify that your simulation is correct.
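As a rough sketch of matching the simulation to the analysis parameterization (this is not the code from the book), the example below simulates a single three-level nominal predictor whose assumed effects sum to zero. Because a model with one nominal predictor is saturated, its maximum likelihood fit reduces to the logit of each level's observed proportion, so no iterative solver is needed. All names and values are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def simulate_nominal(n, intercept, effects, rng):
    # Pick a level uniformly, then draw y ~ Bernoulli(sigmoid(intercept + effect)).
    levels = list(effects)
    data = []
    for _ in range(n):
        level = rng.choice(levels)
        y = 1 if rng.random() < sigmoid(intercept + effects[level]) else 0
        data.append((level, y))
    return data

def fit_effect_coded(data):
    # Per-level MLE of the linear predictor is the logit of the observed proportion.
    counts, ones = {}, {}
    for level, y in data:
        counts[level] = counts.get(level, 0) + 1
        ones[level] = ones.get(level, 0) + y
    etas = {k: logit(ones[k] / counts[k]) for k in counts}
    # Sum-to-zero coding: the intercept is the unweighted mean of the level predictors.
    intercept = sum(etas.values()) / len(etas)
    return intercept, {k: eta - intercept for k, eta in etas.items()}

TRUE_INTERCEPT = -0.4                                # assumed for illustration
TRUE_EFFECTS = {"A1": 0.5, "A2": -0.9, "A3": 0.4}    # assumed; sums to zero
data = simulate_nominal(30000, TRUE_INTERCEPT, TRUE_EFFECTS, random.Random(2))
est_intercept, est_effects = fit_effect_coded(data)
print(round(est_intercept, 3), {k: round(v, 3) for k, v in sorted(est_effects.items())})
```

When the simulated effects already satisfy the sum-to-zero constraint, the estimates can be compared with the simulation values directly, level by level.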

Solution

Here I am replying to myself, but I have run a number of simulations, and here is what I found. JMP assumes N-1 degrees of freedom for a categorical variable with N levels, so to match it I must choose the N simulated coefficients a1, a2, ..., aN so that the last one satisfies the constraint aN = -(a1 + a2 + ... + a(N-1)). In that case, logistic regression on the simulated data correctly estimates the original values. If I ignore the constraint and assign an arbitrary value to aN, the estimated coefficients differ from the original values. However, the estimated betas still yield the same probabilities as those computed from the original betas.
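The last observation follows from simple algebra, sketched below with hypothetical values (intercept -0.4 and level coefficients 0.8, -0.3, 1.0, which do not sum to zero). Subtracting the mean of the level coefficients and adding it to the intercept gives the equivalent sum-to-zero values that a fit of this kind would be expected to target. The function name `to_sum_to_zero` is illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def to_sum_to_zero(intercept, level_betas):
    # Move the mean of the level coefficients into the intercept,
    # leaving centered effects that satisfy aN = -(a1 + ... + a(N-1)).
    mean_b = sum(level_betas) / len(level_betas)
    return intercept + mean_b, [b - mean_b for b in level_betas]

sim_intercept = -0.4            # assumed for illustration
sim_betas = [0.8, -0.3, 1.0]    # assumed; violates the sum-to-zero constraint
eq_intercept, eq_effects = to_sum_to_zero(sim_intercept, sim_betas)

print("equivalent intercept:", round(eq_intercept, 6))
print("equivalent effects:", [round(a, 6) for a in eq_effects])
for b, a in zip(sim_betas, eq_effects):
    # Both parameterizations give the same probability for every level.
    print(round(sigmoid(sim_intercept + b), 6), round(sigmoid(eq_intercept + a), 6))
```

To compare a fit against an unconstrained simulation, center the simulated coefficients this way first, then compare the centered values and the shifted intercept with the estimates.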