Hi @jlmbs,
It is hard to tell if there is one best method for your use case, as it may depend on several parameters:
- Your objective: are you more interested in prediction (predicting a protein property from structural descriptors, as in QSPR (Quantitative Structure-Property Relationships) models) or in explainability (for example, grouping/clustering similar proteins to better understand structure/property similarities)?
- The dataset characteristics: missing values, outliers, degree of multicollinearity, etc. Depending on these characteristics, some methods may be more appropriate than others.
- The type of modeling (linked to your objective): some models have built-in feature selection (for example Random Forest, or LASSO/Elastic Net in Generalized Regression), some are better suited for explainability/interpretability (among tree-based models, Decision Trees and Random Forests are interesting for explainability, while Boosted Trees are harder to interpret but may have very good predictive performance), and models behave differently when facing multicollinearity (Ordinary Least Squares is not appropriate at all in the presence of multicollinearity, while Partial Least Squares easily handles a large number of highly correlated explanatory variables, which is often the case in chemical/chemoinformatics use cases).
There are other considerations that could influence the model choice, such as the expected "prediction profile" or the model's complexity (which may be linked to the previous points and informed by domain expertise): some methods (like tree-based methods) will create prediction profiles with "steps", others will create smooth profiles with curvature (Neural Networks, Support Vector Machines, ...), and some will create linear, straight-line relationships between inputs and output(s) (PLS, some other regression models, ...). A small sketch illustrating these profile shapes is shown below.
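To make that last point concrete, here is a minimal Python/scikit-learn sketch (outside JMP, purely illustrative, with made-up toy data and arbitrary settings) that fits a decision tree, an RBF-kernel SVM, and an ordinary linear model to the same one-dimensional data; plotting their predictions over the grid shows the step-wise, smooth, and straight-line profiles respectively:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression

# Toy 1-D data (made up for illustration): a noisy nonlinear response
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 60)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.2, size=60)

grid = np.linspace(0, 10, 200).reshape(-1, 1)

# Tree-based model -> piecewise-constant ("step") prediction profile
tree = DecisionTreeRegressor(max_depth=3).fit(x, y)

# SVM with RBF kernel -> smooth, curved prediction profile
svm = SVR(kernel="rbf", C=10).fit(x, y)

# Ordinary linear regression -> straight-line profile
lin = LinearRegression().fit(x, y)

for name, model in [("tree", tree), ("svm", svm), ("linear", lin)]:
    print(name, np.round(model.predict(grid[:5]), 2))
```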
As you can see, there are a lot of options (some of which depend on whether you are using JMP or JMP Pro).
Before jumping directly into the analysis, I would visualize the correlations of the chemical descriptors with the response(s), as this can give an indication of correlated variables and influential inputs for your analysis (platform "Multivariate" in Analyze -> Multivariate Methods).
You can also spend some time checking for missing values and outliers, and computing VIF (Variance Inflation Factor) scores on a baseline model (not used for predictive/explanatory purposes, but simply to access the VIF calculation and get a feel for the magnitude of multicollinearity in your dataset).
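If you want to reproduce these checks outside JMP, here is a minimal Python sketch of the same ideas, assuming (hypothetically) that your descriptors sit in a pandas DataFrame X and the response in a Series y. As a reminder, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing descriptor j on all the other descriptors; values well above 5-10 are commonly taken as a sign of problematic multicollinearity.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X: DataFrame of chemical descriptors, y: Series with the response
# (hypothetical names -- replace with your own columns)

# 1) Correlation of each descriptor with the response, and with each other
print(X.corrwith(y).sort_values(key=abs, ascending=False).head(10))
print(X.corr().round(2))          # descriptor-descriptor correlation matrix

# 2) Missing values per column
print(X.isna().sum().sort_values(ascending=False).head(10))

# 3) VIF scores on a baseline model: VIF_j = 1 / (1 - R_j^2),
#    where R_j^2 comes from regressing descriptor j on all the others
X_const = sm.add_constant(X.dropna())
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
    name="VIF",
)
print(vif.sort_values(ascending=False).head(10))
```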
Then, some options to consider:
- JMP:
- A first approach could be to run a Principal Component Analysis (PCA), and then try to model the response with these principal components (or try a clustering method on the principal components). The drawback of this approach is that you lose some explainability about which individual descriptors drive the response, as they are combined into the PCA variables.
- Another option would be to try Partial Least Squares (PLS), a very efficient approach when you deal with many correlated variables and very few observations. It is used for example in spectral analysis (chemometrics) but also in some QSPR models (chemoinformatics), where you have few observations but a lot of input data (transmittance/intensity at each wavelength, or many highly correlated chemical/molecular descriptors).
- Predictor Screening is also a good tool to get an overview of which factors may be the most influential on the response, and it handles multicollinearity "natively" (since the model behind it is a Random Forest).
- JMP Pro:
- Instead of Predictor Screening, you can try "Random Forest", where you'll get a lot more information about the model and the possibility to save the formula. Since a Random Forest samples a random subset of features for each split/tree, even with multicollinearity and more parameters than observations every descriptor gets the same chance of being selected in a tree, so you'll get a good view of which factors impact the response.
- You can also look at Generalized Regression models with penalized estimation methods like Lasso, Ridge or Elastic Net.
These would be my first models to try.
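If it helps to see what these methods do outside of JMP, here is a minimal scikit-learn sketch (not the JMP platforms themselves, just the underlying methods) that compares the cross-validated performance of a PCA-then-regression pipeline, PLS, a penalized Elastic Net, and a Random Forest, and then prints the forest's variable importances as a rough analogue of Predictor Screening. X and y are the same hypothetical descriptor matrix and response as above; the numbers of components/trees are arbitrary starting values, not recommendations.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression, ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

models = {
    # PCA first, then ordinary regression on the principal components
    "PCA + regression": make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression()),
    # PLS: latent components chosen to maximise covariance with the response
    "PLS": make_pipeline(StandardScaler(), PLSRegression(n_components=5)),
    # Penalized regression (Elastic Net) with built-in variable selection
    "Elastic Net": make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=5)),
    # Random Forest: tolerant of multicollinearity, gives variable importances
    "Random Forest": RandomForestRegressor(n_estimators=500, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:>16}: mean CV R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")

# Variable importances from the Random Forest, as a rough "predictor screening"
rf = models["Random Forest"].fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
```

Comparing the cross-validated R² of these families is usually enough to tell whether a linear/latent-variable approach or a tree-based approach suits your data better.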
In the clustering phase, check if you have:
- "contextual"/multidimensional outliers : points that are not outliers in all the dimensions, for which K-Nearest Neigbors, Mahalanobis/Jackknife Distances or PCA might be an interesting approach.
- collective outliers, for which K-Means may be useful; you can also compare the results with Gaussian Mixtures, which may be more "flexible" regarding the shape of the clusters (not necessarily as spherical as the clusters K-Means tends to produce).
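Again purely as an illustration outside JMP (same hypothetical descriptor matrix X, with arbitrary choices of 5 components and 3 clusters), a minimal sketch of these ideas could look like this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# X: descriptor matrix as before (hypothetical name).
# With many correlated descriptors, working on a few principal
# components (as mentioned above) is often more stable.
Z = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X))

# Mahalanobis distance to the centroid: flags "contextual"/multidimensional
# outliers that no single descriptor would reveal on its own
cov_inv = np.linalg.pinv(np.cov(Z, rowvar=False))
centered = Z - Z.mean(axis=0)
mahal = np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))
print("Most extreme observations:", np.argsort(mahal)[-5:])

# K-Means (roughly spherical clusters) vs Gaussian Mixture (ellipsoidal clusters);
# the number of clusters (3) is arbitrary here
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)
gm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(Z)
print("K-Means cluster sizes:         ", np.bincount(km.labels_))
print("Gaussian Mixture cluster sizes:", np.bincount(gm.predict(Z)))
```

Looking at the high-Mahalanobis observations and comparing the two cluster assignments against your domain knowledge is usually more informative than any single metric.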
Hope this (long) answer will help you,
Victor GUILLER
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)