This is more "spitballing" than anything, so here goes. In a nutshell, I am trying to develop a diagnostic test for reef corals so that we may know which ones will likely die as a result of climate change and which will not. From a series of laboratory experiments, I now know the protein signatures associated with resilient and stress-sensitive corals, so, in theory, I may now have the capacity to go out into the ocean, sample new colonies that I have no a priori knowledge of, and make predictions about their health.
Analytically, I have tried quite a few things with the proteins I have profiled, a large proportion of which are collinear (making standard regression approaches poorly suited). It seems like a discriminant analysis of the "healthy" vs. "stressed" coral datasets could give me the answer I want. I played around with stepwise discriminant analysis and tried to identify the minimum number of proteins that would sort the samples by health status with 100% accuracy. Am I interpreting this correctly? It seems like discriminant analysis is basically a MANOVA with some sort of predictive function (though MANOVA cannot be used in this case, since I have 272 proteins measured in only 11 samples).
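For anyone who wants to poke at this outside JMP, here is a minimal sketch of regularized (shrinkage) discriminant analysis in Python/scikit-learn, which sidesteps the singular pooled covariance matrix you get when proteins far outnumber samples. The data below are synthetic stand-ins for the 11 × 272 protein table; the class labels, the number of "informative" proteins, and the effect size are all invented for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_samples, n_proteins = 11, 272          # mimic the dataset's shape
X = rng.normal(size=(n_samples, n_proteins))
y = np.array([0] * 6 + [1] * 5)          # 0 = healthy, 1 = stressed (made up)
X[y == 1, :5] += 2.0                     # plant 5 "informative" proteins

# 'lsqr' with shrinkage regularizes the pooled covariance matrix, which
# would be singular for plain LDA when features >> samples.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
lda.fit(X, y)
print("in-sample accuracy:", lda.score(X, y))
```

Note that the printed number is in-sample accuracy, which is exactly the optimistic, potentially overfit figure being worried about in this thread; a held-out or cross-validated score is what matters for prediction.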
I saw a great JMP webinar today on generalized regression, so I have played around with that. I set my Y as health status (binomial: healthy vs. stressed) and used the 272 proteins as X variables. In some cases, the final set of "most informative" proteins was similar to the one derived from DCA, though it tended to be larger (on the order of five proteins could be used to build a model that accurately predicts coral health). One issue, of course, is that I am using samples where I already know the answer (i.e., whether the corals lived or died); therefore, I risk overfitting. Discriminant analysis seems to fit the data entirely, perhaps not leaving any wiggle room for randomness or noise. Since I created a validation column for generalized regression, that should, in part, reduce the possibility of over-fitting, right?
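JMP Pro's generalized regression with a lasso penalty has a close analogue in L1-penalized logistic regression, and the validation column corresponds to a held-out split. A hedged sketch on synthetic data (the shift applied to the first five columns, the `C` value, and the 30% split fraction are all arbitrary choices for illustration, not anything from the actual study):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(11, 272))                # synthetic protein table
y = np.array([0] * 6 + [1] * 5)               # 0 = healthy, 1 = stressed
X[y == 1, :5] += 2.0                          # 5 truly informative proteins

# Hold out ~30% of samples, analogous to a JMP validation column.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# The L1 (lasso) penalty shrinks most coefficients to exactly zero,
# performing variable selection; C controls the penalty strength.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X_tr, y_tr)

selected = np.flatnonzero(model.coef_[0])     # proteins the lasso kept
print("proteins kept:", len(selected))
print("validation accuracy:", model.score(X_val, y_val))
```

With only 11 samples, a single held-out split is very noisy; repeating the split (or using leave-one-out) gives a more honest picture of the selected proteins' stability.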
And then there is partial least squares, but every time I've used PLS with omics (transcriptome or proteome) datasets, the R² is always pitifully low, like 0.05-0.07. I saw a really excellent webinar last week on generalized regression in which the presenter used Formula Depot to compile the models derived from a number of approaches, which gave me the idea to do this sort of analysis. I guess, in closing, my main question is: when you are trying to make predictive models for a binomial response (healthy vs. sick, in this case) with hundreds to thousands of predictors, is any one of these methods considered the "gold standard"? It might also be worth mentioning that the data are typically non-normally distributed. This dataset is very small, and, if you click on the PCA, the unhealthy samples are very easily distinguished.
I clearly do not understand your problem, so this comment may not be useful. If you really have over 800 features and 11 observations, I can't imagine being able to conclude anything. It shouldn't be hard to distinguish between healthy and unhealthy observations (literally, you should be able to do this with almost any handful of those features, though it would largely be a meaningless model). More importantly, I'm not sure about your dichotomous response variable at all. Are coral reefs really either healthy or unhealthy, with no gradations? Usually, health studies like this would treat this as censored data: the observations are taken at a point in time, and at that time the unhealthy specimens can be observed, while those that are healthy must be treated as healthy only thus far. In other words, the time to a bad outcome is the continuous response variable, and the data are censored. Or are you proposing to only take observations at a time when you can declare the reef healthy or unhealthy definitively?
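To make the censored-data framing concrete: if the response were time-to-bleaching, with still-healthy colonies right-censored at the end of observation, a Kaplan-Meier estimate of the survival curve would be the natural first summary. A toy example in plain numpy; the times and censoring flags below are invented for illustration:

```python
import numpy as np

# days until bleaching was observed (event = 1) or until the study ended
# with the colony still healthy (event = 0, i.e., right-censored)
time = np.array([12, 20, 20, 31, 45, 45, 45, 60, 60, 60, 60])
event = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Kaplan-Meier: at each event time, multiply by (1 - deaths / at-risk).
surv = 1.0
for t in np.unique(time[event == 1]):
    at_risk = np.sum(time >= t)              # colonies still under observation
    deaths = np.sum((time == t) & (event == 1))
    surv *= 1 - deaths / at_risk
    print(f"day {t}: estimated survival = {surv:.3f}")
```

The censored colonies still contribute information (they were at risk up to day 60), which is exactly what a dichotomous healthy/unhealthy label throws away.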
Thanks for having a look at this. I was deliberately vague about my actual biological question, so maybe expanding on that will make my questions and needs clearer. I want to predict which corals will bleach (a process in which they turn white due to loss of symbionts and later die). Basically, I monitored the molecular biology of corals in the days to weeks preceding bleaching so that I can relate biomarker concentrations to the likelihood of bleaching. The various multivariate platforms yield similar results with respect to candidate biomarker selection (i.e., the proteins that best distinguish bleaching-prone from bleaching-resistant corals). What I wonder, though, is whether these results will have any predictive value in the real world when I test them with corals not yet sampled. I clearly need a model with some flexibility, and I think some of the approaches I've tried are not predictive modeling approaches per se, but explanatory; in that case, they fit 100% of the data and would therefore be prone to over-fitting. Maybe this is a testament to having only 11 corals and 272 features, i.e., the resulting model will inherently be overfit.
https://www.jmp.com/en_us/events/ondemand/mastering-jmp/using-jmp-pro-to-build-generalized-regressio... This webinar is trying to achieve something similar, albeit with humans (linking the likelihood of getting diabetes to biomarker signatures). However, in that case, the Y is a continuous variable, so maybe I need to figure out how to rank corals, as you mentioned, along a continuum from 0 to 10 or something like that.
@abmayfield: I'll go back to your '...main question...' with respect to a 'gold standard' modeling methodology. IMO, the notion of a 'gold standard' is just impossible to specify. If I can get a bit philosophical, there's a reason why so many modeling methods are 'out there': the growth and proliferation of modeling methods really came about because no one method works best for all types of problems and data. So the idea I always preached to practitioners is: try LOTS of them for YOUR problem and data, and then see which works 'best.' For binary categorical responses, a few key model diagnostics are the misclassification rate, ROC curves, and such. JMP Pro's Formula Depot and Model Comparison platforms can make the business of comparison efficient.
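The "try lots of methods and compare" advice is easy to sketch outside JMP as well: fit several candidate classifiers and compare a cross-validated diagnostic such as ROC AUC, which is roughly what the Model Comparison platform automates. The models and synthetic data below are illustrative only, not a recommendation of these three methods in particular:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(11, 272))            # synthetic protein table
y = np.array([0] * 6 + [1] * 5)           # 0 = healthy, 1 = stressed
X[y == 1, :5] += 2.0

models = {
    "lasso logistic": LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
    "shrinkage LDA": LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto"),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=3),
}

# Same folds for every model so the comparison is apples-to-apples.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=3)
results = {}
for name, m in models.items():
    results[name] = cross_val_score(m, X, y, cv=cv, scoring="roc_auc").mean()
    print(f"{name}: mean ROC AUC = {results[name]:.2f}")
```

With 11 samples, the fold-to-fold variation in AUC will be large, which is itself useful information: it quantifies how little the data can distinguish between methods.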
I'll give my meager feedback... It seems to me one of the questions you are trying to answer is: "Do the studies I have performed in a laboratory setting have any capability of supporting conclusions in the 'real world'?" I don't think this can be answered through data analysis. What I suggest is that you focus on two things:
1. Develop better response variables (vs. discrete ones). This may take the form of some continuous variable (e.g., gradation of bleaching, percentage of protein x, etc.) or, if you can't create a continuous response, at least create an ordinal response.
2. Design sampling plans. Sampling plans are created as a function of your hypotheses. The plans may be hierarchical (nested), systematic, or crossed. They are meant to separate and assign different sources of variation and to increase the inference space. It seems that you are interested in what causes the degradation of coral reefs, so you can sample actual reefs as a function of your hypotheses. Let's say you hypothesize that ambient salinity has some effect. You would take multiple samples from within the same salinity and then sample from a location that has a different salinity. Let's say your hypothesis has something to do with ambient contaminants. The same idea applies: take multiple samples from within an area of "equal" contamination and then select a different location where the contamination differs. Of course, this can be accelerated with experimental design ideas, but the experiments must be conducted in the real world, not the lab. See Fisher et al.
Thanks, everyone, for your suggestions and feedback. I do realize that trying to predict the fate of corals in the wild based on 11 samples from a single experiment carried out in 2010 is precarious, to say the least! Fortunately, I have compiled tons of field data from different corals collected at different temperatures, salinities, light levels, countries, latitudes, etc., and am hoping to merge these data with those from the lab experiments to start to piece together some sort of coral diagnostic system. I like the human blood biomarker approach, where they show you your concentration as well as the normal "healthy" range. We don't yet know what the healthy ranges are for corals, so that might be something that needs to be worked out first.
In terms of the best modeling approach for this type of endeavor, I guess I am glad to hear that no one said "you should absolutely be using this approach, which you totally neglected to mention," since I have been investigating all of the various options out there. In the end, I will probably use a mix of approaches, though I am becoming more and more interested in these generalized regression models because they seem to strike a good balance between not over-fitting and not making a model so complex that it would have no practical application. Then again, this might only be the case with my dataset, rather than an overall property of this approach!
In any event, I am glad that so many such algorithms and platforms are available in JMP Pro. I don't think I'll ever be able to go back to regular JMP now!
One other thing to keep in mind is your overall approach, because with respect to your global problem, it really sounds like a two-step approach is called for:
Step 1: Explanatory modeling to try to characterize the system; hence, modeling methods whose strengths are variable identification/reduction should be looked at first.
Step 2: Predictive modeling to make statements about the health of future or new coral systems and their prospects; hence, models with strong predictive capability are the most desirable.
A nice webinar that helps differentiate the approaches is here: