Re: PCA to separate two outcomes

Jun 8, 2023 5:28 PM

Hi,

I have a binary outcome for a set of patients, and a long list of measured variables (i.e. potential predictors/characteristics). I can do a PCA on the whole set, or a PCA on each group separately (i.e. positive patients and negative patients). But what I really want is the following: what are the two principal components that best separate the positives from the negatives? I've played a bit with discriminant analysis but haven't gotten very far. Any help or suggestions would be welcome.

Thanks!

uriel

Thierry_S · Feb 16, 2021 11:34 AM

Hi,
Have you considered the Discriminant Analysis platform, or the Response Screening platform if you need to filter down to the variables that are most relevant to your classification?
Best,
TS

Thierry R. Sornasse

SDF1 · Feb 16, 2021 11:37 AM

Hi @utkcito ,

PCA might not be the right tool for you. Since you have a binary outcome (on/off, 1/0, dead/alive), it sounds more like a logistic regression problem, much like the Titanic survivors example in the sample data in JMP. The purpose of PCA isn't necessarily the prediction of an outcome, i.e. separating out 0/1 outcomes. You might be better off with a decision tree, a neural net, or boosted tree methods like XGBoost. You could even try support vector machines. Many of those options depend on what version of JMP you have (e.g. Pro or standard).

PCA is really designed to find a set of orthogonal principal components, computed from the predictor data only (you would not feed it the response data), that successively maximize the variance explained in the data. You can read up on it in JMP's help here. The principal components won't necessarily be good predictors to separate the positives from the negatives. This problem might be much better suited to the SVM platform. If you don't have Pro, you'll probably want to try either the partitioning decision tree method or the neural net.
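To illustrate the point above, here is a minimal Python sketch (not JMP code; the toy data and the function name `first_pc` are made up for illustration). It finds the first principal component by power iteration on the covariance matrix and never sees the outcome column. Note the assumption that the data are left unscaled (PCA on covariances); PCA on correlations would weight the variables differently.

```python
import math

def first_pc(rows, iters=200):
    """Return the unit-length first principal component of `rows` (a list of equal-length lists)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p), computed from the predictors only
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        # Power iteration: multiply by the covariance matrix, then renormalize
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical toy data: x1 varies a lot but is unrelated to the outcome,
# x2 varies little but separates the two classes.
X = [[-10.0, 0.9], [-5.0, 1.1], [5.0, 1.0], [10.0, 1.2],
     [-10.0, -1.1], [-5.0, -0.9], [5.0, -1.0], [10.0, -1.2]]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # the outcome is never passed to first_pc

pc1 = first_pc(X)
print(pc1)  # the loading on x1 dominates, although x2 is what separates the classes
```

In this toy case the first component points almost entirely along x1, the high-variance variable, so a score plot of PC1 would show no separation between positives and negatives.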

Another thing: depending on how many rows you have, you might want to generate a validation column stratified on the outcome column and use that for a validation of your model. You might also consider adding a null factor column (see the autovalidation add-in for JMP) to see if the set of possible predictors actually shows up more often in the model than a completely random and orthogonal null factor.
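As a rough sketch of those two ideas outside JMP (plain Python; the function name `stratified_validation`, the 25% holdout fraction, and the toy outcome column are assumptions for illustration, and this is not what the autovalidation add-in does internally): the validation column holds out the same fraction of each outcome class, and the null factor is simply a random column unrelated to the outcome.

```python
import random

def stratified_validation(outcome, holdout_frac=0.25, seed=1):
    """Label each row 'Training' or 'Validation', holding out the same
    fraction of rows within each outcome class."""
    rng = random.Random(seed)
    labels = ["Training"] * len(outcome)
    for cls in sorted(set(outcome)):
        idx = [i for i, v in enumerate(outcome) if v == cls]
        rng.shuffle(idx)
        n_hold = round(len(idx) * holdout_frac)
        for i in idx[:n_hold]:
            labels[i] = "Validation"
    return labels

# Hypothetical unbalanced outcome: 8 positives, 32 negatives
outcome = [1] * 8 + [0] * 32
validation = stratified_validation(outcome)

# Null factor: pure noise, generated without looking at the outcome
rng = random.Random(2)
null_factor = [rng.gauss(0.0, 1.0) for _ in outcome]

for cls in (1, 0):
    held = sum(1 for o, v in zip(outcome, validation) if o == cls and v == "Validation")
    print(cls, held)
```

If a real predictor is selected by the model no more often than `null_factor` across repeated fits, that is a hint it carries little signal.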

Hope this helps.

Good luck!

DS

utkcito · Feb 16, 2021 01:07 PM

Hi TS and DS,

I am not looking for predictors with the PCA. I want to characterize and describe. I think discriminant analysis should do it under the wide linear method, as it shows principal components, but I can't find the details of the principal components it generates to assess whether that really is the case.

Uriel.

ih · Feb 16, 2021 11:44 PM

You might consider PLS-DA instead of PCA. PLS ensures that the first components relate to the most variation in your target; in other words, for a single Y you should see the best separation in score plots of components 1 and 2. You could then look at loadings, VIP vs Coefficients, or use methods you are used to in PCA to assess variable importance.

You can launch PLS-DA either using the PLS personality of the Fit Model platform, or by creating a variable that is either 1 or 0 depending on the category, and then including that in the 'Y' section of the PLS platform.
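For anyone curious what the 0/1 dummy approach does under the hood, here is a minimal sketch in plain Python (not JMP code; the function name `pls_da_scores` and the toy data are made up for illustration). It runs the standard single-response NIPALS PLS steps on a centered 0/1 response and returns the score vectors. As an assumption for brevity, it centers but does not scale the columns.

```python
import math

def pls_da_scores(X, y, n_components=2):
    """Single-response PLS (NIPALS) on a 0/1 dummy y; returns one score vector per component."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered predictors
    f = [v - y_mean for v in y]                                # centered dummy response
    scores = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with the response
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p_load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y so the next component is orthogonal to this one
        E = [[E[i][j] - t[i] * p_load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        scores.append(t)
    return scores

# Hypothetical toy data: x1 has large variance unrelated to the class, x2 separates the classes
X = [[-10.0, 0.9], [-5.0, 1.1], [5.0, 1.0], [10.0, 1.2],
     [-10.0, -1.1], [-5.0, -0.9], [5.0, -1.0], [10.0, -1.2]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

t1, t2 = pls_da_scores(X, y)
print(t1)  # positives and negatives fall on opposite sides of zero
```

Because of the deflation step, the score vectors are mutually orthogonal, and the first one is built from the class-related direction rather than from the highest-variance direction.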

SDF1 · Feb 17, 2021 08:21 AM

Hi @utkcito ,

I agree with @ih , the PLS platform might be better suited to what you're wanting to do, or the SVM platform too.

One reason why the discriminant analysis might not be working is that your data set might not entirely fit the format DA expects. In the DA platform, the X role takes the categorical variable and the Y role takes the continuous variables, so it is kind of the inverse of the usual setup: it uses the continuous Y's to ultimately predict the X, i.e. it predicts a classification variable based on known continuous Y variables.

If you want to use the X's to characterize/describe the outcome Y's, then PLS, partitioning, or SVM might be better platforms, I think. And as @ih mentioned, you can use the VIP plots together with the VIP threshold to generate a simplified model that doesn't use every single X column. Clustering could also be an option, but it might not make as much sense as PLS when trying to interpret the results.

Hope you can use one of those platforms to help your work.

Good luck!

DS

Bill_Worley · Feb 17, 2021 02:02 PM

Hello @utkcito,

While I fully agree with the others on potential options for analyzing your data, I would like to offer another JMP Pro platform that is well worth your time to try out. Generalized Regression (GR) is extremely well suited to binomial/binary outcomes. The LASSO and Elastic Net algorithms will allow you to find the best overall combination of variables in the factor space you are working in, rather than in a principal component or latent factor (PLS) space. With both PCA and PLS models, it can be difficult to determine which variables are most important without a lot of extra effort. You also have many options for cross-validation to avoid overfitting.

Neural Nets (NN) might be useful to you. The TanH (hyperbolic tangent) activation function is very well suited to binomial outcomes, and you will be automatically directed to use cross-validation to avoid overfitting. (JMP and JMP Pro)

Bootstrap Forest as a partitioning method will also allow you to find the most significant factors in your factor space. (JMP Pro)

One last thought: if you have access to JMP 16 EA, you can try Model Screening under Predictive Modeling to help guide you to the best overall model for your data.

Model Screening will be one of the great new features in JMP Pro 16.

utkcito · Feb 20, 2021 11:12 AM

Thanks to all who responded, your comments are helpful. I've in fact used all these options in the recent past to analyze the data. I was particularly interested in what an orthogonal discrimination would look like, but if that is not directly possible I'll move on.

Uriel.