I have been asked to analyze multiple main effects and their interactions, but I don't think I can, because I get "Lost DFs" error messages when I use the Fit Model analysis. In case it helps, here are the details of the data. Main effects (number of levels): Location (4), Cultivar (4), Region (2, with 2 Locations grouped per Region), and Year (3). I only get results (that is, I only appear to have enough DF) when I run at most two main effects, and never with an interaction.
Question 1) Do I have a degrees-of-freedom issue that keeps me from analyzing more than two main effects? I was hoping to look at all four.
Question 2) If I were to run only two main effects, is it correct to rerun the model when a main effect is not found to be significant? For example, I ran Year and Location and found neither to be significant, so I reran the model without Year, as I am interested in Location, and again found Location to be non-significant, but with a lower p-value.
Appreciate any advice
Everything I answer below is based on not having the data to take a quick look at. I say this because there are different possible reasons why you may or may not find more main effects to be significant.
So first of all, I guess you are using Standard Least Squares as the fitting method and looking at the resulting Effect Tests, right? If so, there are mainly two reasons why a main effect will be shown as not significant. Either it really has no significant effect on the response, or there are collinearities, e.g. Location (which is not significant in all three models) is correlated with either Year or Cultivar.
Again, without the full setup of the data I can only guess. But this is what I would do first, before modelling:
1. Open the Distribution platform and put all factors and response variables in as Y. Then click on any histogram bar, e.g. of Location, and see where the selected subset lies and how it differs from the overall distribution for each of the other variables. Do you see any differences between the selected and overall distributions? Or patterns, like one Location having lower values than another, and so on? This is a very easy way to see whether anything suspicious is going on that could point to a real effect (even if it is not statistically proven yet).
2. Take a look at Fit Y by X. I guess Location, Cultivar and Region are categorical, and maybe Year too. Use your response as Y and the others as X. You will get an ANOVA for each categorical X (if your Y is continuous; otherwise a contingency analysis). Either way you will see whether one X has a significant effect on Y. This can help you spot an effect that is significant here but not significant in your model. In that case it is likely due to some interaction effect.
3. Take a look at Fit Model. Fit an RSM model, i.e. all interaction and quadratic effects for all of your X variables. As they are probably all categorical you will not need the quadratic effects and may get a warning. Now do not run Standard Least Squares but Stepwise regression. Use the p-value threshold as the stopping criterion, and leave the rest at the defaults. Now check which effects are significant. Do they differ from your previous model?
Make the model with the selected effects only and run it again with Standard least squares. Do you still get a DoF issue?
In addition to the above, you may look at the Lack of Fit test in your models, usually found in the report below the Analysis of Variance. Is the F probability significant? If so, it tells you that the model is missing an effect: either an interaction is missing, or some other variable that you have not used or captured. Either way the model will not reflect your data well enough. You could also take a look at the Actual by Predicted plot, the residual plots, and the Profiler to get a better understanding of how your factors behave according to the model.
So using visuals from Graph Builder, Distribution or other platforms will give you a general understanding of the data. The modelling platforms will help you find a good model, but they can only work with what you provide. There are more sophisticated methods that could do some of that work for you algorithmically, and if you have JMP Pro you could use them (e.g. Bootstrap Forest, Generalized Regression ...).
Another thing I haven't asked about so far is your data. How many observations (rows) do you have? Is there missing data? Do you have (potential) outliers? Is the data from a designed experiment, from measurements, or from gathered sales data?
All this could lead to a slightly different approach that I would recommend, as each of these cases has its own challenges to overcome and take into account.
Hope this helps you make the next step.
Thank you very much for replying.
In terms of the questions you posed:
- I have 48 rows of observations per response/Y (Tannin, pH, TA, SSC, and SG), but these observations are split among 4 cultivars, 4 locations, and 3 years, so 48/4/4/3 = 1 observation per combination, i.e. no replication.
- To complicate the lack of replication, I do have missing data for some of the years. For example, for Cultivar A in 2012 I might have data for Locations A-C, but not D.
- It doesn't appear that I have an outlier issue, but again it is hard to tell in my situation.
- Data is from an observational study; fruit was collected from participating growers, juiced into a composite, and the five measurements/responses drawn from the composite. This process was done for each cultivar for the four locations over three years. The only randomization occurred in the selection of the fruit by the growers for us. But again we lacked replication, i.e. we had just one orchard representing each Location rather than several.
Here is some output from the Fit Model (Personality: Standard Least Squares, Emphasis: Effect Leverage) with a Full Factorial minus the three way interaction. Looks pretty good, just not able to test the three-way which may not be significant.
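The degrees-of-freedom arithmetic behind this can be sketched in a few lines of Python. This is only a counting exercise under the assumptions stated in the thread (4 Locations, 4 Cultivars, 3 Years, one observation per combination, nothing missing, Region left out); the function names are made up for illustration and are not part of JMP.

```python
from itertools import combinations
from math import prod

# Factor levels as described in the thread (Region is left out here).
levels = {"Location": 4, "Cultivar": 4, "Year": 3}
n_obs = 48  # assumes one observation per Location x Cultivar x Year cell, none missing


def term_df(factors):
    """DF of a main effect or interaction of categorical factors: product of (levels - 1)."""
    return prod(levels[f] - 1 for f in factors)


def model_df(max_order):
    """Total model DF (excluding the intercept) for all terms up to max_order."""
    return sum(
        term_df(combo)
        for order in range(1, max_order + 1)
        for combo in combinations(levels, order)
    )


full_factorial_df = model_df(3)  # main effects + two-way + three-way terms
two_way_df = model_df(2)         # main effects + two-way terms only

# Error DF = observations - 1 (intercept) - model DF
error_df_full = n_obs - 1 - full_factorial_df
error_df_two_way = n_obs - 1 - two_way_df

print("three-way term DF:", term_df(("Location", "Cultivar", "Year")))
print("full factorial: model DF =", full_factorial_df, "error DF =", error_df_full)
print("without three-way: model DF =", two_way_df, "error DF =", error_df_two_way)
```

With complete data the full factorial uses every available degree of freedom and leaves none for error, which is consistent with the three-way interaction not being testable without replication. Missing cells reduce `n_obs` and therefore the error DF further.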
A rule of thumb that I've always used is that you need, at a minimum, 1 degree of freedom for each term in the model you are trying to fit, plus 1 more for the intercept. That minimum leaves nothing to estimate the error term, so you will need additional degrees of freedom for that.
From the first table it appears that you are trying to describe a model with 10 terms (including the intercept) and only have 7 or 8 rows in the data table. This means that you are attempting to propose a model that would be supersaturated. In such cases, partial least squares can be helpful, and you might want to have a look at that option under Analyze > Multivariate Methods > Partial Least Squares. Otherwise, you don't appear to have a sufficient number of observations to do what you are proposing.
Thank you very much for taking the time to reply. I hope this makes sense:
I have no replications, a very weak observational study. For the response variable of Acidity for example, I have 48 observations but that is split among 4 locations, 4 cultivars, over 3 years (48/4/4/3 = 1). I did re-analyze the data with some changes (fixed some user errors) and have been able to test three of the four main effects and two way interaction as seen below, but not the three way interaction.
1) Am I able to test the three way interaction with the Partial Least Squares analysis?
2) Is the reason I have been unable to test the fourth main effect, Region (a grouping of the 4 Locations into 2 Regions), with or without interactions, that it is collinear with Location? If so, do I just proceed without it in the model? It does not really need to be in the model, but it would be nice to test if possible.
3) Assuming 1) and 2) are resolved, is it correct to remove a main effect from a model if it is found non-significant or is this only allowed for non-significant interactions?
Full Factorial: (Fit Model, Personality: Standard Least Squares, Emphasis: Effect Leverage)
Full Factorial minus three way interaction:
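On question 2, the linear dependence between Region and Location can be illustrated with a small rank check. This is a sketch only: the Location and Region labels are hypothetical (the thread only says that the 4 Locations are grouped 2 per Region), plain 0/1 indicator coding is assumed, and JMP's internal coding of nominal effects may differ, although the dependence itself does not rely on the coding.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it depends on earlier columns
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


# Hypothetical labels: two Locations per Region, as described in the thread.
region_of = {"A": "R1", "B": "R1", "C": "R2", "D": "R2"}
locations = list(region_of)


def location_columns(loc):
    # Intercept plus indicator columns for Locations B, C, D (A is the reference level).
    return [1] + [int(loc == other) for other in locations[1:]]


def region_column(loc):
    # Single indicator for Region R2 (R1 is the reference level).
    return [int(region_of[loc] == "R2")]


# 12 rows per Location (4 Cultivars x 3 Years), 48 rows in total.
rows_loc = [loc for loc in locations for _ in range(12)]
without_region = [location_columns(loc) for loc in rows_loc]
with_region = [location_columns(loc) + region_column(loc) for loc in rows_loc]

rank_without = rank(without_region)
rank_with = rank(with_region)
print("columns:", len(without_region[0]), "rank:", rank_without)
print("columns:", len(with_region[0]), "rank:", rank_with)
```

Adding the Region column increases the number of columns but not the rank, because the Region indicator is exactly the sum of the indicators for the Locations it contains. Once Location is in the model, Region has no degrees of freedom left to estimate.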
Have you tried stepwise regression as the personality?
Travis: I concur with all the advice and counsel you've received from Martin and Mike above. One other approach you might want to consider, if you are running JMP Pro, is the Generalized Regression personality within the Fit Model platform. Generalized Regression can be used with success in a wide variety of contexts, and the 'not enough DF' situation is one of them. Not a panacea, but it might help.