I have been given data from an observational study that has been running for the last three years, and I need some advice on how to proceed with testing assumptions. The study aims to determine whether vegetable quality differs significantly across four locations in WA state, each location representing a different climate in the state. My issue with the data is that the quality indicators were measured on composite samples at each location in each of the three years, meaning there were no replicates. For example, at a farm in Seattle 50 heads of lettuce were randomly selected and composited into one sample, from which quality indicators such as firmness and acidity were measured. So, before I run an ANOVA to compare quality across locations, I don't know exactly how to test the assumptions of normality or homogeneity of variance, as I only have one data point per location per year. I don't want to pool the data across years because environmental factors such as chilling hours, degree days, etc. have changed significantly over the last three years.
I would really appreciate some help.
Just so I understand the 'design' that you've been asked to analyze: if I were to put your experiment into a JMP data table, would it look like this?
And the 'Y' values don't really exist, but are some creation of the 'compositing' process? So maybe what was done was say, acidity was measured for all 50 samples and somebody decided the 'true' value for Year One, Location A is "Y"?
Thank you for replying. Attached below is a snapshot of my JMP file:
There are a total of 80 rows, 20 rows per year, 5 locations per year x 4 cultivars evaluated at each location. Again, I am looking at whether fruit quality, represented by the five quality indicator measurements, significantly differs across locations.
I can understand your desire to exclude as many environmental effects as possible from your statistical analysis. Unfortunately, with only a single data point within locations, cultivars, and years, there will not be enough data to characterize the variance without the effect of annual changes in environmental factors. However, even if you had enough data to characterize the effect of these annual changes, there would always be additional sources of variability that such an observational study would be unable to characterize. For example, if each farm ships samples to a common facility for testing, the shipping conditions would be different for a farm located further away from the facility versus one that is close by. Although such conditions may not have a significant impact on the results, characterizing their effects would be outside the scope of an observational study on vegetable quality.
In practice, any observational study is going to have difficulty characterizing variance and bias introduced by environmental effects because the data is not produced in a controlled environment. Although I understand your concerns about conflating environmental effects and location specific effects, distinguishing between them will be impossible without conducting a more rigorous observational study.
Hope this helps,
Thank you for getting back to me so promptly. I completely understand your comments on annual changes. I have already looked at annual factors such as chilling hours and growing degree days, and they appear to be quite variable, the former having decreased and the latter increased each year. Thus, I am not really comfortable using the years as replicates: my feeling is that if I find a significant difference in quality across locations, it won't really be due to location, as I am hoping to show, but to the variability in annual factors. I knew from the beginning that characterizing variances would be difficult in an observational study, and with my lack of replicates I don't think I have any flexibility. With your final comment, I believe you are implying what I have been hoping to clarify: the study at this point is too statistically weak to support the strong conclusions I would like to make for extension purposes. If I have inferred correctly, then please do not feel obliged to reply. I have appreciated all the help.
In my final comment I meant that it will be impossible to distinguish between the variability caused by environmental factors and location specific ones. This does not mean that you can't get meaningful information out of the data you already have. You just have to take this into consideration when you do your analysis.
The data you have is going to have a high degree of variability from many different sources, including the annual environmental factors. However, if the environmental factors affected all of the farms in the same way, e.g. chilling hours decreased by the same amount at each location, then the effect will bias all of the data for that year in a certain direction rather than increase the variability between locations within each year. The problem is that this yearly bias will increase the variability of data collected over different years. The more variability there is within the samples you want to compare, the more difficult it becomes to distinguish a difference between them.
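To make the point about a common yearly shift concrete, here is a minimal sketch in plain Python (not JMP), assuming one value per location per year. The firmness numbers, the location labels A to D, and the function name block_anova are all made up for illustration; they are not from the study. The sketch splits the total variation into year, location, and leftover parts, and compares the location F ratio when year is treated as a block against the F ratio when the years are simply pooled as replicates.

```python
# Hypothetical firmness values, one composite sample per location per year.
# These numbers are invented purely for illustration.
firmness = {
    "Year1": {"A": 5.0, "B": 6.0, "C": 7.0, "D": 8.0},
    "Year2": {"A": 6.0, "B": 7.0, "C": 8.0, "D": 9.0},
    "Year3": {"A": 7.5, "B": 8.0, "C": 9.5, "D": 10.0},
}


def block_anova(table):
    """Sums of squares for a years-by-locations table with one value per cell."""
    years = list(table)
    locations = list(table[years[0]])
    n_y, n_l = len(years), len(locations)

    grand = sum(table[y][l] for y in years for l in locations) / (n_y * n_l)
    year_mean = {y: sum(table[y][l] for l in locations) / n_l for y in years}
    loc_mean = {l: sum(table[y][l] for y in years) / n_y for l in locations}

    ss_total = sum((table[y][l] - grand) ** 2 for y in years for l in locations)
    ss_year = n_l * sum((m - grand) ** 2 for m in year_mean.values())
    ss_loc = n_y * sum((m - grand) ** 2 for m in loc_mean.values())
    ss_error = ss_total - ss_year - ss_loc  # what neither year nor location explains

    ms_loc = ss_loc / (n_l - 1)
    # Year as a block: error has (n_y - 1) * (n_l - 1) degrees of freedom.
    f_block = ms_loc / (ss_error / ((n_y - 1) * (n_l - 1)))
    # Years pooled as replicates: year-to-year variation lands in the error term.
    f_pooled = ms_loc / ((ss_year + ss_error) / (n_l * (n_y - 1)))

    return {
        "ss_year": ss_year,
        "ss_loc": ss_loc,
        "ss_error": ss_error,
        "f_block": f_block,
        "f_pooled": f_pooled,
    }


base = block_anova(firmness)

# Add the same shift to every location in Year3 (a common yearly effect).
shifted_table = {y: dict(vals) for y, vals in firmness.items()}
for loc in shifted_table["Year3"]:
    shifted_table["Year3"][loc] += 10.0
shifted = block_anova(shifted_table)

for name in ("ss_year", "ss_loc", "ss_error", "f_block", "f_pooled"):
    print(f"{name:9s} base={base[name]:10.4f} shifted={shifted[name]:10.4f}")
```

In this sketch the common shift changes only the year sum of squares. The location and leftover parts stay the same, so the blocked F ratio is unchanged, while the pooled F ratio shrinks because the year-to-year variation has been folded into the error term. This only holds if the yearly effect really is the same at every location, which the data alone cannot confirm with one value per cell.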
You are right to be concerned that any statistically significant difference you find between locations may be due to the annual environmental factors (Type I error). The reason for this is that an increase in overall variability can increase the sampling error, especially with only 3 samples, one per year, for each location. However, it is also likely that the increased variability will cause no statistically significant difference to be found when an actual difference exists (Type II error).
The data you have can reveal where the largest differences are observed and if these differences are consistent between locations then you can design future studies to investigate them further.
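As a rough illustration of checking whether differences are consistent, here is a small sketch in plain Python. The acidity values, the location labels, and the helper names rank_locations and consistent_pairs are hypothetical and only stand in for your own table. It ranks the locations within each year and lists the pairs of locations where one was higher than the other in every year.

```python
# Hypothetical acidity values, one composite sample per location per year.
# These numbers are invented purely for illustration.
acidity = {
    "Year1": {"A": 3.1, "B": 3.4, "C": 3.3, "D": 3.9},
    "Year2": {"A": 3.0, "B": 3.2, "C": 3.5, "D": 3.8},
    "Year3": {"A": 3.3, "B": 3.6, "C": 3.4, "D": 4.1},
}


def rank_locations(year_values):
    """Locations ordered from highest to lowest value within one year."""
    return sorted(year_values, key=year_values.get, reverse=True)


def consistent_pairs(table):
    """Map (higher, lower) to the smallest yearly gap, for pairs ordered the same way every year."""
    years = list(table)
    locations = list(table[years[0]])
    pairs = {}
    for hi in locations:
        for lo in locations:
            if hi == lo:
                continue
            gaps = [table[y][hi] - table[y][lo] for y in years]
            if all(gap > 0 for gap in gaps):
                pairs[(hi, lo)] = min(gaps)
    return pairs


for year, values in acidity.items():
    print(year, rank_locations(values))

pairs = consistent_pairs(acidity)
# Largest smallest-gap first: these are the most stable differences.
for (hi, lo), gap in sorted(pairs.items(), key=lambda item: item[1], reverse=True):
    print(f"{hi} > {lo} in every year, smallest gap {gap:.2f}")
```

This is purely descriptive and is not a significance test. With three years, an ordering that repeats every year is a candidate for a follow-up study rather than evidence in its own right.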
Pasted below is the crux of my dilemma. The assumption of normality is rejected no matter how I transform the data (transformations not shown). I also believe the assumption of homogeneous variances is rejected. I know my biggest issue is the incredibly small sample size of 3, which comes from our decision to count each year as a replicate. I am really looking for advice on whether there is anything I can do about the assumptions, or whether I have to wait for another year's data, i.e. another replicate. The ANOVA shows significant effects of location, but given the issues with the assumptions, I don't believe I can draw any conclusions or correlations from this observational study with any sort of confidence.
I'll try to keep this brief...you are hitting on many topics here...among them assumptions for various statistical hypothesis tests, alpha, beta risk, and such. These topics are all intertwined and very laborious to elaborate on here in a text based Forum discussion group.
With these very small sample sizes, have you tried some simple visualizations in the Fit Y by X platform or Graph Builder? At this stage I wouldn't get too hung up on assumptions. Plot the data and see what you can learn from there. If there are differences in population means, maybe they will leap off the page at you, or maybe not. But starting there is my recommendation.
Again, thank you for replying; it is much appreciated. I have tried Fit Y by X, and for the effect of Location I observe highly significant p-values for many of the quality indicators for a couple of the cultivars. I understand that ultimately it is up to the researcher to use his or her expertise to make judgment calls, for example using the years as replicates to increase sample size, even though this is not ideal given the variability in annual effects. However, the assumptions are strongly rejected, and based on the relatively little statistics I have learned thus far, it does not seem right to draw the correlations/conclusions that I can visualize.