Continuous Variables Nested within a Categorical Variable?

angie_newbie · Apr 3, 2020 01:46 PM

Hello, I am new to JMP and am not sure how to model my data so any suggestions on the steps I should take would be much appreciated!! I apologize in advance for my very limited knowledge of statistics.

I have a data set with a categorical variable (donor) and two parameters unique to each donor that are continuous (age and BMI). I would like to see how different response variables (also continuous - e.g. cell growth, etc.) are affected by a donor's age and BMI, and whether there is any interaction between age and BMI in determining the responses. How would you go about doing this? I have been using fit model to do this, at first with donor, age, and BMI all as separate inputs. It seemed like the results explained differences among the donors only, and did not elucidate the effects of age and BMI. When I used fit model with just age and BMI (excluded donor as a factor), the results showed regressions with the effects of both age and BMI. So it seems like there are two ways to model this; one where I can report which donors are significantly different from one another for each response, and one where I can report the effects of age and BMI on each response. I am not sure if I need to use the cross button in the second model to observe the interaction effects of age and BMI. I also don't know how to model all 3 independent variables at the same time. Do I need to nest age and BMI within donor? The support page for the nest option says that nested terms must be categorical.

Thank you so much, any feedback is much appreciated!!

statman · Apr 3, 2020 06:30 PM

Hi, welcome to the community. Is it possible for you to attach your JMP data table? You have an interesting situation... While you suggest that age is nested in donor, I don't think that is the only interpretation. When I think of nesting, I think of hierarchy. A variable nested inside another is contingent upon that variable. For example, you might have a component of variation which is batch-to-batch variation and another which is within-batch variation. The within-batch component is contingent upon batch (you can't get multiple batches from a within-batch component). I don't think age is contingent upon donor... Another item you need to think about is multicollinearity. BMI and age might be collinear. In any case, when developing models to explain the variation in cell growth, you need to realize the analysis depends upon what terms are in the model. Change the terms in the model and the analysis likely changes. To combat this dilemma, I suggest you start with subject matter expertise, develop hypotheses (scientific, not statistical) as to why the terms in the model would affect cell growth, and then build the model from there. If you don't have hypotheses, then saturate the model and remove terms to get an appropriate model.

"All models are wrong, some are useful" G.E.P. Box

statman · Apr 4, 2020 11:17 AM

Thought about your situation a bit more...seems to me age and BMI are confounded with donor, so you can't have all terms in the model.
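A minimal sketch of why this happens, written in Python with NumPy rather than JMP. The ages and BMIs are made-up values, and the layout (5 donors, 3 measurements each) is an assumption borrowed from later in the thread. Because age and BMI are constant within a donor, their columns are exact linear combinations of the donor indicator columns, so adding them to a model that already contains donor adds no new information.

```python
import numpy as np

# Hypothetical layout: 5 donors, 3 measurements each.
# Age and BMI are fixed per donor (values below are invented for illustration).
donor_age = np.array([25.0, 32.0, 41.0, 50.0, 63.0])
donor_bmi = np.array([21.5, 27.0, 24.2, 30.1, 22.8])
n_donors, n_reps = 5, 3

donor_idx = np.repeat(np.arange(n_donors), n_reps)  # 0,0,0,1,1,1,...
intercept = np.ones(n_donors * n_reps)
# Indicator columns for donors 2..5 (donor 1 is the reference level).
donor_dummies = (donor_idx[:, None] == np.arange(1, n_donors)).astype(float)
age = donor_age[donor_idx]
bmi = donor_bmi[donor_idx]

# Model 1: donor only. Model 2: donor + age + BMI. Model 3: age, BMI, age*BMI.
X_donor = np.column_stack([intercept, donor_dummies])
X_full = np.column_stack([intercept, donor_dummies, age, bmi])
X_cov = np.column_stack([intercept, age, bmi, age * bmi])

rank_donor = np.linalg.matrix_rank(X_donor)
rank_full = np.linalg.matrix_rank(X_full)
rank_cov = np.linalg.matrix_rank(X_cov)

print("donor only:        columns", X_donor.shape[1], "rank", rank_donor)
print("donor + age + BMI: columns", X_full.shape[1], "rank", rank_full)
print("age, BMI, age*BMI: columns", X_cov.shape[1], "rank", rank_cov)
```

The donor + age + BMI matrix has more columns than its rank, which is the numerical signature of the confounding: the age and BMI effects cannot be estimated separately from the donor effect. Dropping donor leaves a model whose columns are all estimable.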

"All models are wrong, some are useful" G.E.P. Box

angie_newbie · Apr 4, 2020 04:06 PM

Hi @statman. Thank you for the clarification on the nature of nesting and your thoughts on my problem. I cannot share the data since it is proprietary information. However, I can share a completely fabricated dataset that is totally unrelated but conveys the same problem I'm grappling with. You are definitely right about confounding variables - it is impossible to uncouple age and BMI from donor, and I think that is why the output was strange when I added all three to the model. I ran the fit model again with just age and BMI (excluded donor), and age crossed with BMI, and saturated the model like you suggested with all of the other inputs that might garner a response. I removed each one that was not significant and was able to observe the interaction between age and BMI in determining their effects on each response. This seems appropriate in observing the relationship between age and BMI for each response, but now I am unsure if I need to run donor vs. each response in a separate model to determine which donors are significantly different from one another for each response. The first model with all 3 inputs gave me the results for an ANOVA, but does not give me the option to run a Tukey test, whereas running donor as a sole input does. Do you think running two separate analyses to examine the effects of all 3 inputs is necessary? Please let me know if you'd like a silly fake dataset. Thanks so much for all your help!

angie_newbie · Apr 4, 2020 04:34 PM

On another note...I'm also having trouble determining the appropriateness of a model. If the residuals for a regression do not appear to have a normal distribution, should the model be discounted? Thanks.

statman · Apr 4, 2020 06:47 PM

Angie,

I did not see your "fake" data table attached. There is much to discuss. Some things to consider:

What questions you can answer, what conclusions you can draw, what tools you should use for analysis, and your ability to extrapolate the results are ALL completely dependent on how you got your data. I don't know this from your posts (this is part of why I wanted to see the data...you can code the data however you want).

Since you have multiple donors (I don't know how many?) you could do the fit model of age and BMI BY donor. This will at least get you to think about whether your results can be extrapolated to a wider inference (more donors). How well do the results of the analysis repeat across donors?

I'm not sure how you removed terms from the model. As you do this, there are some statistics that can be helpful (general guidance).

The delta between R-square and adjusted R-square: if this is large, you have over-specified the model.
RMSE: the smaller, the better the model.
Residuals (as you have observed): we "like" NID(0, variance), that is, normally and independently distributed with mean zero and constant variance. If they are not normally distributed, you should question why. Plot them in time series. Hypothesize why they are not.
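For anyone who wants to see these three statistics outside JMP, here is a rough sketch in Python with NumPy. The data are simulated (not from this thread), the helper name fit_summary is made up for illustration, and the "big" model deliberately adds six pure-noise columns to show how the gap between R-square and adjusted R-square behaves with only 15 observations.

```python
import numpy as np

def fit_summary(X, y):
    """Ordinary least squares; X must include an intercept column and be full rank."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = float(resid @ resid)                      # error sum of squares
    sst = float(np.sum((y - y.mean()) ** 2))        # total sum of squares
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    rmse = float(np.sqrt(sse / (n - p)))            # root mean square error
    return {"r2": r2, "r2_adj": r2_adj, "rmse": rmse, "resid": resid}

# Simulated example: one real predictor, 15 observations.
rng = np.random.default_rng(0)
n = 15
x = rng.uniform(20, 60, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 3, size=n)

X_small = np.column_stack([np.ones(n), x])
junk = rng.normal(size=(n, 6))                      # columns unrelated to y
X_big = np.column_stack([X_small, junk])

small = fit_summary(X_small, y)
big = fit_summary(X_big, y)
for name, s in [("small", small), ("big", big)]:
    gap = s["r2"] - s["r2_adj"]
    print(f"{name}: R2={s['r2']:.3f} adjR2={s['r2_adj']:.3f} "
          f"gap={gap:.3f} RMSE={s['rmse']:.3f}")
```

R-square can only go up when terms are added, so the adjusted value and the RMSE (both of which divide by the error degrees of freedom) are the ones to watch. The resid array is what you would plot in run order to look for patterns.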

"All models are wrong, some are useful" G.E.P. Box

angie_newbie · Apr 5, 2020 12:02 PM

Hi @statman, thanks so much for all of the info! Attached is the table with fake data.

The scenario: There are 5 "donors", all birds. The donors are observed once a year for 3 years in a row, in an attempt to yield triplicate measurements for each of the 5 donors. Tail feather length, tail feather number, and birth weight are all unique to each donor. Tail feather length and tail feather number are assumed to have impacts on various mating success metrics, so the five birds are selected based on combinations of: short tail feathers and high feather number, short tail feathers and low feather number, medium length feathers and medium number of feathers, long tail feathers and a low number of feathers, long tails and high feather number. We want to see if number of offspring, number of mates, and egg size vary by donor, and how each variable varies by tail feather length, feather number, and birth weight. Also, if there is any interaction between the inputs. Hopefully this is an adequate hypothetical...

statman · Apr 5, 2020 12:34 PM

Angie,

Thanks for the explanation. Here are some comments/questions:

1. What is practical significance for each Y? How much of a change in each Y do you want/need to detect (or is of scientific interest)?

2. How confident are you in the measurement systems?

3. Tail feather length/count are consistent year-to-year?

4. You have the Y of Mean Egg Size, what is the variation in egg size? Is this a weight or a dimension?

5. You have purposely selected donors to be different based on the 3 characteristics you listed (feather length, feather count, and birth weight). These characteristics are confounded with donor. You do not have enough data to model the 3 characteristics within donor: the 3 years give only 2 degrees of freedom, and you would need at least 3 DFs to model just the main effects of the 3 characteristics.

6. You could look at the data graphically. Variability plots over time would be helpful.

7. You could look at the effect donor has on the Y's, but not enough data to model within donor factors.

"All models are wrong, some are useful" G.E.P. Box

angie_newbie · Apr 5, 2020 01:14 PM

Hi @statman,

1. My alpha value is 0.05

2. I am very confident in the measurement systems (all assays qualified in the lab)

3. For this fictitious example, I wasn't sure how to get replicates for each donor, so I just said 3 years. The real experiment was conducted in a lab, with all measurements taken at the same time and under the same conditions for each donor, so there is no variation due to time.

4. Maybe we can change that to mean egg count? The actual experiment has nothing to do with birds or eggs - I was just trying to come up with another dependent variable.

5. That sucks. We only had materials for 15 different samples. Is this a statistical rule that you should have at least 3 df?

6. Time wasn't really a factor (as in #3 - let's just pretend they were all true replicates)

7. Sounds good.

Thank you!!

statman · Apr 5, 2020 06:24 PM

Angie,

Sorry, I'm not sure I can help without understanding the situation. I thought you had sent fake data, but the situation was real...apologies. I have some further points regarding your replies (if you'll indulge me).

1. By "your alpha is .05" are you talking alpha risk? That is used for statistical significance. I am asking about practical significance. This is ALWAYS more important than statistical, because statistical significance is dependent on the data, how you acquired it, and what comparisons you make.

2. Not to be belligerent, but having the assay done in a lab, does not make it "good". I have many examples of lab measurements that were incapable of detecting that which they were intended to detect. Do you have data to support the precision of the assays? If so, by all means proceed.

3. The setting for the experiment is in the lab? What inference space do you hope to draw conclusions about? Will the conclusions be drawn on "conditions" outside of the lab? If so, proceed with caution. You'll want to make sure the results can be applied to the area of concern. Again, my advice is limited as it depends on the actual situation.

OK, you have 3 "data points" for each treatment combination. The question is how were they gotten? (I'm being careful about what to call those 3 measures as different folks use different language to describe them). Some guidance, if the "data points" are acquired without the treatment combinations changing, they are typically not considered independent events and therefore do not increase the degrees of freedom in the study (I call them repeats). The reason they vary cannot be assigned to the treatments (factors) because those were "constant" when the 3 data points were taken, so the reason for those values varying is due to variables that change in a short time period, short-term noise (e.g., measurement error). You can, in this case, look at the variation within treatment (graphically works) and then determine the proper summary statistics (e.g., mean, variance, range...) which become the Y's to model.
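As a small illustration of that last step, here is a sketch in plain Python (standard library only) that collapses three repeats per donor into summary statistics. The donor labels and measurement values are invented for the example; the resulting mean, standard deviation, and range per donor would then be the Y's to model, one row per donor.

```python
from statistics import mean, stdev

# Hypothetical repeats: 3 measurements per donor, taken without the
# treatment combination changing (values are made up).
repeats = {
    "donor_1": [10.0, 11.0, 12.0],
    "donor_2": [14.5, 15.0, 15.5],
    "donor_3": [9.0, 9.5, 13.0],
    "donor_4": [20.0, 20.0, 21.5],
    "donor_5": [17.0, 18.0, 16.0],
}

# One summary row per donor: these become the responses to model.
summary = {
    donor: {
        "mean": mean(values),
        "sd": stdev(values),               # sample standard deviation
        "range": max(values) - min(values),
    }
    for donor, values in repeats.items()
}

for donor, stats in summary.items():
    print(donor, {k: round(v, 3) for k, v in stats.items()})
```

Note that this leaves 5 rows, not 15, so the degrees of freedom available for modeling drop accordingly.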

If the 3 data points are indeed independent events, then they may be considered replicates, which increase the DF's available. Often the advice is to do randomized replicates as this provides, hopefully, an unbiased estimate of experimental error to be used in statistical tests. It also increases the inference space and perhaps provides some "unassigned" DF's to add covariates. In my opinion, there are other more efficient and effective ways to increase inference space, assign the effect of noise, partition the noise and thereby increase the precision of the design (detecting factor effects), create robust designs, etc. (RCBD, BIB, split-plots, etc.)

4. I'm not sure what you're saying here, but selection of response variables is critical. The more continuous the response variable, the more efficiently and effectively you can understand factor relationships. I always suggest to measure as much as is feasible, video the experiment, record sounds, etc. You never know when a new response variable leads to discovery. Certainly if the problem is one of variation, you should have a response variable that quantifies the variation (e.g., standard deviation, variance, range, etc.).

5. There is a host of advice regarding selection of the correct tool and number of samples needed. My advice is to start with a list of questions (e.g., What questions do you want to answer? What hypotheses, about dependent and independent variables, do you have? How will the phenomena be measured?, etc.). Once you understand the situation, what types of data are available, what effects you want to estimate, what resources are available and what is the sense of urgency, etc., you can design the study (sampling plan, experiment, etc.). I'm going to oversimplify... degrees of freedom is the amount of information you have in the data set. With the data set you sent me (where there was a time series), you had 15 data points, so you have 14 total degrees of freedom. Each factor has multiple degrees of freedom (e.g., Donor has 4, Tail feather length 4, Tail feather count 4) and then you have interactions (multiply the factor DFs). As you can imagine, the number of degrees of freedom for the factors alone adds up to 12 degrees of freedom, so there are not enough DFs to estimate the interactions. The rule is you need enough DFs to estimate or compare what you want to estimate or compare. (Sorry, the number of DFs you need is completely dependent on the situation.) I'll end with a paraphrase from one of my favorite DOE authors (Cuthbert Daniel):

The commonest of defects in DOE are:

Oversaturation: too many effects for the number of treatments. (this is your issue)
Overconservativeness: too many observations for the desired estimates
Failure to study the data for bad values
Failure to take into account all of the aliasing
Imprecision due to misunderstanding the error variance.
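The degrees-of-freedom bookkeeping in point 5 can be sketched in a few lines of plain Python. The counts follow the ones quoted above (15 observations, each factor treated as having 5 levels); the variable names are just for illustration. Keep in mind from the earlier reply that the feather characteristics are confounded with donor, so in practice these main-effect DFs are not separately estimable either.

```python
from itertools import combinations

n_obs = 15
total_df = n_obs - 1                      # 14 total degrees of freedom

# Levels per factor, as counted in the reply above (each treated as 5-level).
levels = {"Donor": 5, "Tail feather length": 5, "Tail feather count": 5}

# A factor with k levels uses k - 1 degrees of freedom.
main_df = {name: k - 1 for name, k in levels.items()}
main_total = sum(main_df.values())

# A two-factor interaction uses the product of the two factors' DFs.
interaction_df = {
    (a, b): main_df[a] * main_df[b] for a, b in combinations(main_df, 2)
}

remaining = total_df - main_total         # what is left after main effects

print("total DF:", total_df)
print("main-effect DF:", main_df, "sum =", main_total)
print("DF left after main effects:", remaining)
for pair, df in interaction_df.items():
    print("interaction", pair, "needs", df, "DF")
```

Each two-factor interaction asks for far more DFs than remain after the main effects, which is the oversaturation described in the first item of the list.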

That's all I got for now. Thanks for indulging me. Let me know if I can be of further assistance.

"All models are wrong, some are useful" G.E.P. Box
