Re: ANOVA vs. GLM for ecological field experiment - Page 2

cbhalpern · Oct 5, 2020 03:42 PM

I am analyzing data from an ecological field experiment, designed to test the effects of herbivore (limpet) removal and moisture addition on change in cover of a rocky intertidal seaweed. The experimental design is a randomized block, two-factor full-factorial design. I have attached a document that details the experimental design and lists the scripts and model outputs relating to three questions for the Community Discussion:

1. Which modeling approach, Standard Least Squares ANOVA or GLM, is preferable for modeling results of an ecological field experiment? See Background.
2. Are the scripts I used to run each type of model constructed properly? See Scripts.
3. Given that the GLM (normal, identity) and ANOVA approaches both assume normal distributions, why does the GLM yield consistently smaller p values than the ANOVA? See Model output.

cbhalpern · Oct 11, 2020 02:17 PM

We have a follow-up question regarding whether one tests the same assumptions regarding normality in GLM as in ANOVA.

Our understanding is that to test the assumption of normality in ANOVA, one tests the normality of residuals. To do this, we would save the residuals to the data table, then use the Distribution tool to test normality.

Is there a corresponding procedure for GLM? If so, which residual output should we use: Deviance residuals, Pearson residuals, Studentized Deviance residuals, or Studentized Pearson residuals?

statman · Oct 11, 2020 02:51 PM

I'm a bit confused by this: "Our understanding is that to test the assumption of normality in ANOVA". There is NO assumption of normality of the raw data analyzed via ANOVA. The only assumption is NID(0, variance); that is, the residuals are normally and independently distributed with a mean of 0 and a constant variance. There are a number of ways to look at residuals:

1. in a time series,

2. vs. predicted values,

3. as a distribution,

4. in leverage plots.

There are a number of types of residuals, some of which you have listed. There is no single right one to use. It is easy enough to look at them all.
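To make "look at the residuals" concrete for a randomized block design, here is a minimal Python sketch (not JMP's own computation). It assumes one plot per treatment combination in each block, in which case the model with block, both factors, and their interaction has the same fitted values as block + treatment combination. The data, block names, and treatment labels are invented for illustration.

```python
from statistics import mean

# Hypothetical data: change in seaweed cover (%), one plot per treatment
# combination in each block. All names and values are invented.
TREATMENTS = ["control", "limpet_removal", "moisture", "removal+moisture"]
cover_change = {
    "block1": [2.0, 9.0, 4.0, 15.0],
    "block2": [-1.0, 6.0, 3.0, 10.0],
    "block3": [5.0, 11.0, 6.0, 18.0],
    "block4": [0.0, 8.0, 1.0, 13.0],
}

def rcbd_residuals(data):
    """Residuals from the additive model: block + treatment combination.

    For a balanced design with one observation per cell, the residual is
    y - block mean - treatment mean + grand mean.
    """
    blocks = list(data)
    n_trt = len(next(iter(data.values())))
    grand = mean(v for row in data.values() for v in row)
    block_mean = {b: mean(data[b]) for b in blocks}
    trt_mean = [mean(data[b][j] for b in blocks) for j in range(n_trt)]
    return {
        b: [data[b][j] - block_mean[b] - trt_mean[j] + grand for j in range(n_trt)]
        for b in blocks
    }

residuals = rcbd_residuals(cover_change)
for block, row in residuals.items():
    print(block, [round(r, 2) for r in row])
```

These residuals can then be examined as a distribution, against the fitted values, or by block, along the lines listed above.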

"All models are wrong, some are useful" G.E.P. Box

Mark_Bailey · Oct 12, 2020 05:02 PM

@statman describes the assumption for the linear regression model well. GLMs do not have any such assumption about the errors, so your concept of residual analysis does not carry over to the GLM. We can use the various residuals from the fitted GLM to check for outliers, influential observations, and bias, though. The GLM assumes that the response follows the conditional distribution that you selected (e.g., Poisson, binomial, normal) and estimates the model parameters by maximizing the likelihood.

It's different.
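As a rough illustration of how two of the residual types mentioned in this thread are defined, here is a Python sketch of the standard textbook formulas for unscaled Pearson and deviance residuals. It is not taken from JMP; the function names and the example values are hypothetical, and the studentized versions (which additionally involve leverage and a dispersion estimate) are not shown.

```python
import math

def pearson_residual(y, mu, family):
    """(y - mu) / sqrt(V(mu)), where V is the family's variance function."""
    variance = {"normal": 1.0, "poisson": mu}[family]
    return (y - mu) / math.sqrt(variance)

def deviance_residual(y, mu, family):
    """sign(y - mu) * sqrt(unit deviance) for the chosen family."""
    if family == "normal":
        unit_dev = (y - mu) ** 2
    elif family == "poisson":
        # y * log(y / mu) is taken as 0 when y == 0
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        unit_dev = 2.0 * (log_term - (y - mu))
    else:
        raise ValueError(f"unsupported family: {family}")
    return math.copysign(math.sqrt(max(unit_dev, 0.0)), y - mu)

# Hypothetical observed value and fitted mean
y_obs, mu_hat = 7, 4
for fam in ("normal", "poisson"):
    print(fam, pearson_residual(y_obs, mu_hat, fam), deviance_residual(y_obs, mu_hat, fam))
```

Under these definitions, the two residuals coincide with the raw residual y - mu for the normal family, and they differ for the Poisson family because the variance depends on the mean.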

cbhalpern · Oct 12, 2020 09:28 PM

What would be the best way to learn about using the various residuals from the fitted GLM to check for outliers, influential observations, and bias?

statman · Oct 13, 2020 10:45 AM

Again, I'm not sure I understand the question. If it is about learning how to use residuals to determine the adequacy of your model, there are many books and papers on the subject. First, a residual is the difference between the prediction from the model and the actual value. Regardless of any assumptions required or implied, in order for the model to be useful, you would hope that it does a fairly good job of predicting the actual results. From a model development perspective:

1. you would hope the model isn't biased high or low (the residuals would be distributed around 0),

2. you would hope the residuals have about the same variation around the model and this variation isn't fluctuating greatly (the residuals have a constant variance),

3. you would hope there aren't any unusual data points not explained by the model (absence of outliers in the residuals),

4. you would hope the residuals don't form some pattern and aren't related to each other (independently distributed).

If some of these hopes are not satisfied, you should seek to understand why. You should challenge the effectiveness of your model and how you can modify it to make it more useful (and better yet arrive at a better understanding of what is actually going on).
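The four hopes above can be turned into quick numerical summaries. The Python sketch below is only one way to do it, under stated assumptions: the residuals and group labels are invented, the 2 SD outlier cutoff is an arbitrary choice, and the lag-1 autocorrelation is only meaningful if the residuals are listed in some natural order (time or spatial position).

```python
from statistics import mean, pstdev, stdev

def residual_checks(residuals, groups):
    """Summaries for the four hopes; groups[i] labels residuals[i]."""
    bias = mean(residuals)  # 1. should be close to 0
    sd = stdev(residuals)
    centered = [r - bias for r in residuals]

    # 2. Constant variance: compare the spread of residuals across groups.
    by_group = {}
    for r, g in zip(residuals, groups):
        by_group.setdefault(g, []).append(r)
    spread = {g: pstdev(v) for g, v in by_group.items()}

    # 3. Outliers: indices of residuals beyond 2 SD (arbitrary cutoff).
    outliers = [i for i, r in enumerate(residuals) if abs(r) > 2 * sd]

    # 4. Independence: lag-1 autocorrelation in the order given.
    lag1 = sum(a * b for a, b in zip(centered, centered[1:])) / sum(
        c * c for c in centered
    )
    return {"bias": bias, "spread": spread, "outliers": outliers, "lag1": lag1}

# Hypothetical residuals and treatment labels
residuals = [0.5, -0.5, 1.0, -1.0, -0.5, 0.5, -1.0, 1.0]
groups = ["A", "A", "B", "B", "A", "A", "B", "B"]
checks = residual_checks(residuals, groups)
print(checks)
```

Numbers like these are no substitute for the plots; they are a way to flag which plots deserve a closer look.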


Mark_Bailey · Oct 6, 2020 04:45 PM

Regarding point 1, the main benefit of the GLM is that you are not restricted to a normal distribution model of the response. If your response is (conditionally) normally distributed, then just use OLS.
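A small Python sketch of why the two approaches agree on the fit when the response is conditionally normal: the least squares fit also maximizes the Gaussian likelihood. The data and group coding are invented. The sketch also shows that the maximum likelihood variance estimate (RSS/n) is smaller than the residual mean square used by ANOVA (RSS/(n - p)); whether that, or the reference distribution used for the tests, accounts for the differing p values in the third question of the original post depends on what the software actually reports and is an assumption to verify, not something established in this thread.

```python
import math
from statistics import mean

def gaussian_loglik(y, fitted, sigma2):
    """Log-likelihood of independent normal responses with common variance."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

# Hypothetical responses for two treatment groups
y = [2.0, 4.0, 3.0, 9.0, 11.0, 10.0]
group = [0, 0, 0, 1, 1, 1]

# The least squares fit of a one-factor model is the set of group means.
group_means = [mean(v for v, g in zip(y, group) if g == k) for k in (0, 1)]
ls_fit = [group_means[g] for g in group]
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, ls_fit))

n, p = len(y), 2
sigma2_ml = rss / n         # maximum likelihood variance estimate
sigma2_ols = rss / (n - p)  # residual mean square used by ANOVA

# Moving the fit away from the least squares solution lowers the likelihood.
shifted_fit = [f + 0.5 for f in ls_fit]
ll_ls = gaussian_loglik(y, ls_fit, sigma2_ml)
ll_shifted = gaussian_loglik(y, shifted_fit, sigma2_ml)
print(ll_ls > ll_shifted)
print(sigma2_ml, sigma2_ols)
```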

statman · Oct 6, 2020 04:56 PM

Box also liked to do the Bayes plots. Cuthbert Daniel was the first to use half-normal plots for the analysis of experiments. Essentially, if the null is true for all treatment effects, then the estimated effects should vary randomly, following a normal distribution with a mean of zero. The normal plot scales the axes so that effects falling along a straight line are consistent with random variation, while points that depart from the line are assignable. JMP adds Lenth's PSE line to the plot, but you must be careful with the interpretation.
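For readers unfamiliar with these quantities, here is a Python sketch of Lenth's pseudo standard error and of the coordinates for a half-normal plot. The effect estimates are invented, the plotting-position formula is one common convention among several, and this is not a reproduction of how JMP draws its plot.

```python
from statistics import NormalDist, median

def lenth_pse(effects):
    """Lenth's pseudo standard error for a set of effect estimates."""
    abs_effects = [abs(e) for e in effects]
    s0 = 1.5 * median(abs_effects)
    # Drop apparently active effects before re-estimating the spread.
    trimmed = [a for a in abs_effects if a < 2.5 * s0]
    return 1.5 * median(trimmed)

def half_normal_points(effects):
    """Pairs (half-normal quantile, |effect|), sorted by |effect|."""
    ordered = sorted(abs(e) for e in effects)
    m = len(ordered)
    std_normal = NormalDist()
    return [
        (std_normal.inv_cdf(0.5 + 0.5 * (i - 0.5) / m), a)
        for i, a in enumerate(ordered, start=1)
    ]

# Hypothetical effect estimates from an unreplicated factorial
effects = [0.4, -0.6, 0.2, 5.0, -0.3, 0.5, -0.1]
pse = lenth_pse(effects)
points = half_normal_points(effects)
print(pse)
for q, a in points:
    print(round(q, 3), a)
```

Effects whose points sit well above the line through the smaller effects are the candidates for being "assignable". With only a handful of effects, as in a two-factor experiment, neither the plot nor the PSE has much to work with.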


P_Bartell · Oct 6, 2020 10:54 AM

To add to @Mark_Bailey's and @statman's comments/counsel: have you plotted the data in a meaningful fashion BEFORE modeling? Since you conducted a designed experiment, factor and block vs. response plots, a simple distribution of the responses, and a plot of response vs. experimental execution order are the minimum set of plots I'd suggest you create, with an eye towards answering the following questions, which ultimately are used to build process understanding AND help ensure we've got clean data for modeling:

1. Do the factor/block plots suggest something that flies in the face of known biological/physical phenomena? Back in the day, I'd suggest to the engineers that we look at these plots. If they show the moral equivalent of 'water runs uphill', well, we've got a problem.

2. Are there any response outliers that suggest unanticipated lurking variables that may have crept into the conduct of the experiment?

3. Does the plot of responses vs. experimental execution order suggest a trend or suspicious pattern, again hinting that a lurking variable may have crept into the conduct of the experiment?

cbhalpern · Oct 6, 2020 04:37 PM

In answer to your question, yes, we used JMP tools to look for outliers and heterogeneity of variances among the 2x2 treatment combinations. Responses vary among blocks (spatial units), as expected, hence the motivation for blocking. There is no replication of plots (treatment combinations) within blocks. Order of execution is not relevant, because treatments were applied at one time to all four plots within each block.

We also plotted means, SEMs, and individual block values for all response interval x block x treatment combinations (2 factors x 2 levels each). Our goal with these models was to determine the underpinnings of patterns readily visible in the data.
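For anyone wanting to reproduce this kind of summary outside JMP, here is a minimal Python sketch of the mean and SEM for each treatment combination, computed across blocks. The rows, factor level names, and values are invented for illustration and are not our data.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical rows: (block, herbivores, moisture, change in cover)
rows = [
    ("b1", "present", "ambient", 2.0), ("b1", "removed", "ambient", 9.0),
    ("b1", "present", "added", 4.0), ("b1", "removed", "added", 15.0),
    ("b2", "present", "ambient", 0.0), ("b2", "removed", "ambient", 6.0),
    ("b2", "present", "added", 3.0), ("b2", "removed", "added", 10.0),
    ("b3", "present", "ambient", 4.0), ("b3", "removed", "ambient", 12.0),
    ("b3", "present", "added", 5.0), ("b3", "removed", "added", 20.0),
]

def mean_and_sem(rows):
    """Mean and standard error of the mean for each treatment combination."""
    cells = {}
    for _block, herbivores, moisture, y in rows:
        cells.setdefault((herbivores, moisture), []).append(y)
    return {k: (mean(v), stdev(v) / sqrt(len(v))) for k, v in cells.items()}

summary = mean_and_sem(rows)
for combo, (m, sem) in summary.items():
    print(combo, round(m, 2), round(sem, 2))
```

Note that an SEM computed this way includes block-to-block variation, so it will generally be larger than the error term of a model that accounts for blocks.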