I have to do logistic regression, but I'm having trouble with my explanatory variable: it's a continuous variable with one group outside the continuum.
age at exposure:
I would like to see if there is a link between age at exposure and disease state, but because of the "not exposed" group I have a categorical variable. The only way around this seems to be to remove the unexposed group, but that doesn't seem right. I've considered assigning a number to the "not exposed" group, but that doesn't seem right either, because they were never exposed.
What is the best way to proceed?
One thought is to set up the predictor variable with Data type = Numeric and Modeling Type = Ordinal. You can then set the "No exposure" level as a zero level, with the 1, 2, 3, etc. levels keeping the intuitive interpretation you have articulated for them.
An alternative approach could be to model age at exposure as a numeric predictor variable for the exposed animals only, leaving the unexposed completely out of the analysis. This answers the question "If an individual IS exposed, what is the relationship?" and would allow for interpolation (if it makes sense) for individuals exposed at, say, 1.5 years.
A third approach is to lump all the exposed individuals into one group, call them "Exposed", and include the unexposed in the data set.
There are probably other solutions as well, but these come to mind off the top of my head.
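The three codings above can be sketched roughly as follows. This is only an illustration in Python with made-up ages; the variable names and the use of `None` to mark unexposed animals are assumptions, not part of the original data.

```python
# Hypothetical ages at exposure in years; None marks an animal that was never exposed.
age_at_exposure = [1, 3, None, 2, None, 5]

# Option 1: ordinal levels, with "no exposure" as level 0 and exposure years as 1, 2, 3, ...
ordinal_level = [0 if age is None else age for age in age_at_exposure]

# Option 2: numeric predictor for exposed animals only; unexposed rows are dropped.
exposed_only = [age for age in age_at_exposure if age is not None]

# Option 3: a single binary indicator, exposed (1) versus unexposed (0).
exposed_flag = [0 if age is None else 1 for age in age_at_exposure]

print(ordinal_level)
print(exposed_only)
print(exposed_flag)
```

In option 1 the levels are treated as ordered categories, so the 0 is a label for "never exposed" and should not be read as an exposure age of zero.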
Thank you for your answer. I've already done solution 2 and 3.
The thing with number 2: Don't you run into an immortality bias with that solution? I mean, we see an increased risk as the animals get exposed at a more advanced age, but the risk of this disease goes up as they age anyway, so I don't really know how much information about the exposure we get from that.
The 3rd works well; I even tried lumping it into 3 groups (early/late/unexposed), and I completely get how to interpret that. My issue is mostly that I want a continuous x while still including that stupid unexposed group, but I think solution 1 might fix that!
This sounds like a survival problem - and your "not exposed" would be a censoring variable. In other words, the animal has not been exposed but might still become exposed. But if that approach would be valid here, I have a related question: survival models are continuous models for the time to event but with a censoring variable. Another approach to such a problem (suggested above as well as used frequently in modeling things like insurance problems) is to use a two-step modeling procedure. First, model the discrete event/no event frequency using a classification model (such as logistic regression), and then model the "severity" using only the data where the event has happened (this would be a continuous dependent variable). Can anybody shed light on the relative merits of using these two alternative ways to model such a problem?
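To make the two-step idea concrete, here is a deliberately minimal sketch in Python. The records are made up, and each step is reduced to an intercept-only estimate (a proportion and a mean) standing in for the logistic regression and the severity regression; it shows the structure of the procedure, not a full model.

```python
# Hypothetical records: (event_happened, severity); severity is None when there was no event.
records = [(1, 4.0), (0, None), (1, 6.0), (0, None), (1, 5.0)]

# Step 1: frequency of the event (intercept-only stand-in for a logistic regression).
p_event = sum(event for event, _ in records) / len(records)

# Step 2: mean severity using only the rows where the event happened
# (intercept-only stand-in for a regression on the continuous outcome).
severities = [severity for event, severity in records if event == 1]
mean_severity = sum(severities) / len(severities)

# The two parts combine as E[severity] = P(event) * E[severity | event].
expected_severity = p_event * mean_severity
print(p_event, mean_severity, expected_severity)
```

Note that the second step never sees the no-event rows, which is the main structural difference from a survival model, where those rows still contribute as censored observations.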
Dale Lehman's idea of a censored approach was one I was thinking of as well. This is precisely why I was somewhat uncomfortable suggesting the kitchen-sink, ordinal-only approach. The unexposed group is clearly different from the exposed group from this censoring perspective, and censoring/survival models are ideally suited for just this scenario.
This is a great discussion: the mix of methods plus the questions to be answered, the discrete y/n vs. the severity, and how to treat the n category that has no severity. I don't have a good answer and would love to hear some other thoughts. However, I do think one has to be careful about the context of the questions to be answered. For example, are you trying to diagnose? That is, separate the y from the n. Or are you trying to manage? That would be: within the y group, can you separate the severity levels? And if that is the case, should the n's be included? Have they just not entered the y group yet, or are they really a control group that would never be in the y group? Then there are questions such as whether you are in discovery mode (i.e., R&D) or some other mode, and the risk/cost of decisions made on the analysis.
I had not thought about censoring in these types of problems, will have to give that some more thought. However, as Dale mentioned, the survival type models are used for time to event, so you would have to think about how to treat severity as a time to event?
I'm officially way out of my league at this point, but I'll try to explain.
My core data is:
Date of birth, date at exposure, gender, and date at diagnosis. All animals had the disease.
From this we can calculate: age at exposure, age at disease and time between exposure and disease (to try to
From this we tried to do a t-test, where we looked at the mean diagnosis age in relation to gender and exposure date. However, this only allowed us to see whether animals in a specific group would get the disease earlier than the others. Maybe we should have done a proportional hazards model; we didn't.
We then collected a control group (not originally a part of the project). For those we have date of birth, date at exposure and gender.
We want to know if there is a different risk of disease for the different groups
Is there a significant OR between the groups:
I have done this both with 2x2 tables and by log reg. By 2x2 I have ORs for all groups against the unexposed group; this has been done both for early exposure/late exposure and for exposure in year 1, 2, 3, etc.
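For reference, the 2x2 odds ratio against the unexposed group can be sketched like this in Python. The function name and the counts are hypothetical, and the interval shown is the standard Woolf (log-scale) approximation, which may differ from what a given statistics package reports.

```python
import math

def odds_ratio(cases_exposed, controls_exposed, cases_unexposed, controls_unexposed, z=1.96):
    """Odds ratio and approximate 95% CI (Woolf method) from a 2x2 table."""
    ratio = (cases_exposed * controls_unexposed) / (controls_exposed * cases_unexposed)
    # Standard error of log(OR): square root of the sum of reciprocal cell counts.
    se_log = math.sqrt(
        1 / cases_exposed + 1 / controls_exposed
        + 1 / cases_unexposed + 1 / controls_unexposed
    )
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper

# Hypothetical counts: one exposure group compared against the unexposed group.
ratio, lower, upper = odds_ratio(30, 20, 10, 40)
print(ratio, lower, upper)
```

The sketch divides by the cell counts, so it fails if any cell is zero; small or empty cells need a different approach, such as an exact method.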
The log reg has been done in the same way, but only the early/late grouping is intelligible to me (the other has too many different ORs).
I'm still trying to figure out how to interpret the log reg: how do I get the OR when both explanatory variables are taken into account? Right now I know the OR for exposure group 1 against 2, but I want to know the OR for females in exposure group 1 vs. females in exposure group 2, that kind of thing. That is what we agreed on with our advisor.
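One way to think about the gender-specific OR, sketched in Python under stated assumptions: suppose the model is logit(p) = b0 + b_group2*[group 2] + b_female*[female] + b_interaction*[group 2 and female], with 0/1 indicator coding and males in group 1 as the reference. The coefficient values below are invented for illustration, and the function name is hypothetical.

```python
import math

def group_or_by_gender(b_group2, b_interaction=0.0):
    """ORs for exposure group 2 vs. group 1, separately for males and females.

    Assumes 0/1 indicator coding with male and group 1 as reference levels.
    """
    return {
        "male": math.exp(b_group2),                    # interaction term is zero for males
        "female": math.exp(b_group2 + b_interaction),  # main effect plus interaction
    }

# Hypothetical coefficients from a fitted model with a group-by-gender interaction.
ors = group_or_by_gender(b_group2=0.7, b_interaction=0.3)
print(ors)

# The OR for group 1 vs. group 2 is simply the reciprocal.
or_females_group1_vs_group2 = 1 / ors["female"]
```

Without an interaction term the model assumes the same group OR for both genders, so a separate OR for females only exists if the interaction is included. Some packages parameterize categorical effects differently (for example with sum-to-zero coding), in which case the coefficients combine differently from this sketch.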
We have talked a lot about the whole survival issue, but haven't really settled on anything. I've tried to calculate a fit proportional hazard with age at diagnosis as y and gender and age at exposure as explanatory variables, but I'm unsure what the model does with the healthy control group (their age at diagnosis is an empty field right now). I'm also unsure about how to input the censor variable. Right now it's a part of the age at exposure column and whether I'm supposed to use it as both a censor and model effect is not clear to me.
I'm sorry it's so rambly, but I'm not a statistician or anything close to one and what I'm trying to do is probably way over my level of understanding.
A lot of details and options there, but I have a few ideas. If you are solely trying to model how long it takes an animal to get the disease - from your first data set where all the animals had the disease - then I think it is simply a regression model with whatever explanatory factors you have available. Once you introduce the control group, I think it is a survival model. Your control group has not yet been diagnosed with the disease, but may still get it. So, those are censored observations. The variable indicating disease or no disease (once you have combined the two sets of data) is the censor variable, and you would censor the no disease values. You can then include whatever explanatory variables you want to explore. The dependent variable would be time - for the control group it would be whatever time you have them observed for, but the censoring takes care of the fact that they have not gotten diagnosed with the disease as of the time you conducted the study (so if they've been observed for 8 months, you would not be saying it took 8 months for them to get sick, but rather that as of 8 months they have not gotten sick yet, whereas the animals in the first sample all have a time until they got sick).
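The data layout described above might look like this in Python. The records, the `STUDY_END` date, and the column names are all hypothetical; in particular, the sketch assumes that controls were observed from birth until a single known study end date.

```python
from datetime import date

# Hypothetical end of observation for animals that were never diagnosed.
STUDY_END = date(2020, 1, 1)

# Hypothetical records: (date of birth, date at diagnosis or None for healthy controls).
animals = [
    (date(2010, 1, 1), date(2015, 1, 1)),  # diseased animal
    (date(2012, 6, 1), None),              # healthy control
]

rows = []
for birth, diagnosis in animals:
    # Cases are followed until diagnosis; controls until the end of observation.
    end = diagnosis if diagnosis is not None else STUDY_END
    rows.append({
        "time_days": (end - birth).days,            # time variable (the y of the model)
        "event": 1 if diagnosis is not None else 0, # 1 = diagnosed, 0 = censored
    })

for row in rows:
    print(row)
```

The empty diagnosis field for a control is therefore replaced by its observed time, with the separate event column marking it as censored. Some software expects the opposite coding for the censor column (1 = censored), so it is worth checking which value your package treats as censored.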