
Help with PCA analysis

Hello everyone,
I am trying to compare different populations of an animal according to a variety of morphological features. I have numerical data (such as length, weight, etc.) and binary data. Not every individual has a value for every characteristic, so each characteristic has a different number of individuals with data available. I am trying to run a PCA on the data in order to separate the individuals into distinct populations according to differences in my many parameters. I would be happy to receive an explanation of how I can get a result in two or three dimensions that gives me a convenient division into populations.
Thanks in advance

1 REPLY
SDF1
Super User

Re: Help with PCA analysis

Hi @yoram_frumkin ,

 

  First of all, welcome to the community, this is a great place to bounce ideas off of people and get help with all things data related (not necessarily just JMP). There are a lot of experts here.

 

  PCA is a great tool, and one of many you should consider when analyzing your data; it is often beneficial to examine the same data with several different platforms in JMP. One downside of PCA is that it prefers continuous numerical data: it can't handle nominal data, and it will work with ordinal data but complain about it.
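
To make the idea of a two-dimensional PCA result concrete, here is a minimal Python sketch (not JMP, and not how JMP computes it internally). It assumes complete, continuous data only, standardizes each column, and extracts the two largest components of the correlation matrix by power iteration. The measurements and all function names are made up for illustration.

```python
import math
import random

def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def correlation_matrix(z):
    """Correlation matrix of rows that are already standardized."""
    n, p = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)]
            for a in range(p)]

def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def top_components(mat, k, iters=500, seed=0):
    """Power iteration with deflation: k largest (eigenvalue, eigenvector) pairs."""
    rng = random.Random(seed)
    m = [row[:] for row in mat]
    p = len(m)
    found = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(p)]
        for _ in range(iters):
            w = mat_vec(m, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, mat_vec(m, v)))
        found.append((lam, v))
        # Deflate: subtract the component just found before looking for the next.
        m = [[m[a][b] - lam * v[a] * v[b] for b in range(p)] for a in range(p)]
    return found

# Hypothetical measurements (length, weight, tail length) for two populations.
data = [
    [10.0, 5.0, 3.0], [11.0, 5.5, 3.2], [10.5, 5.2, 2.9], [9.8, 4.9, 3.1],
    [20.0, 12.0, 6.0], [21.0, 12.5, 6.3], [19.5, 11.8, 5.9], [20.5, 12.2, 6.1],
]

z = standardize(data)
components = top_components(correlation_matrix(z), k=2)

# Each row becomes a point (PC1, PC2) that can be drawn on a 2-D scatter plot.
scores = [[sum(x * w for x, w in zip(row, vec)) for _, vec in components]
          for row in z]
# Fraction of total variance carried by each component (trace of a correlation matrix = p).
explained = [lam / len(data[0]) for lam, _ in components]
```

Plotting the `scores` pairs and coloring the points by known or suspected population is the usual way to see whether the populations separate. The sign of each component is arbitrary, so only the relative positions of the points matter.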

 

  Anyway, to address the question of not always having all the data in each attribute for a given row (sample), one thing you can do (and I say this with more than a grain of salt for caution) is to impute the missing data. You can do this under Analyze > Screening > Explore Missing Values and then use one of the several imputation methods offered there. This sometimes works well and sometimes doesn't, which is why I caution against relying on it too much. You might try creating duplicate data tables, applying a different imputation method to each, and checking how well each works and whether the results make sense. It's also a good idea to color, highlight, or change the markers for the rows where you imputed values, because it can be hard to go back and find them once JMP has filled in the data.
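
The bookkeeping part of that advice (remembering which values were imputed) can be sketched in a few lines of Python. This example uses plain column-mean imputation purely for illustration; it is a stand-in and not necessarily one of the methods Explore Missing Values offers. Missing values are represented as `None`, the data are made up, and the function name is hypothetical.

```python
def impute_column_means(rows):
    """Fill None entries with the column mean and record which cells were filled.

    Assumes every column has at least one observed value.
    """
    p = len(rows[0])
    means = []
    for j in range(p):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))

    filled, imputed_cells = [], set()
    for i, row in enumerate(rows):
        new_row = []
        for j, value in enumerate(row):
            if value is None:
                new_row.append(means[j])
                imputed_cells.add((i, j))  # remember where data was invented
            else:
                new_row.append(value)
        filled.append(new_row)
    return filled, imputed_cells

# Hypothetical table: (length, weight), with two missing measurements.
rows = [[10.0, 5.0], [12.0, None], [None, 7.0], [14.0, 9.0]]

filled, imputed_cells = impute_column_means(rows)
# Rows that contain at least one imputed value; these are the ones to mark in plots.
imputed_rows = sorted({i for i, _ in imputed_cells})
```

Keeping `imputed_cells` alongside the filled table plays the same role as coloring or marking the imputed rows: any pattern in a later plot that is driven mostly by flagged rows deserves extra suspicion.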

 

  There are some other platforms you should look into that might actually work better without needing to impute missing data and still get you what you're after. For example, you could use hierarchical clustering, k-means clustering, K-nearest neighbors (JMP Pro), support vector machines (JMP Pro), Partition, or partial least squares (sometimes called projection to latent structures). All of these, in one form or another, essentially do dimension reduction to simplify the data. Of course, the algorithms are all different, and you might get different results, but it's often good practice to compare results across platforms to gain better consistency and understanding of the classification.
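
As an illustration of how a clustering method can avoid imputation altogether, here is a small Python sketch of k-means that measures distance only over the features both points have, and averages only observed values when updating centers. This is one possible approach under stated assumptions, not a description of how JMP's clustering platforms treat missing values. The data, starting centers, and function names are all hypothetical, and in practice the features would normally be standardized first so that no single measurement dominates the distance.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over features observed in both rows,
    rescaled as if all features had been compared."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # nothing in common, so the rows cannot be compared
    squared = sum((x - y) ** 2 for x, y in pairs)
    return math.sqrt(squared * len(a) / len(pairs))

def kmeans_with_missing(rows, init_centers, iters=20):
    """k-means that tolerates None entries in rows (centers must be complete)."""
    centers = [c[:] for c in init_centers]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest center by partial distance.
        labels = [min(range(len(centers)),
                      key=lambda c: partial_distance(row, centers[c]))
                  for row in rows]
        # Update step: per-feature mean over the observed values only.
        for c in range(len(centers)):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            for j in range(len(centers[c])):
                observed = [row[j] for row in members if row[j] is not None]
                if observed:
                    centers[c][j] = sum(observed) / len(observed)
    return labels, centers

# Hypothetical measurements (length, weight, tail length) with gaps.
rows = [
    [10.0, 5.0, 3.0], [11.0, None, 3.2], [None, 5.2, 2.9],
    [20.0, 12.0, None], [21.0, 12.5, 6.3], [19.5, None, 5.9],
]

# Start from two complete rows so every center has a value for every feature.
labels, centers = kmeans_with_missing(rows, [rows[0][:], rows[4][:]])
```

Comparing cluster labels like these with the groups suggested by a PCA plot is one way to do the cross-platform check described above.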

 

  If you have JMP Pro, there are several other platforms, such as boosted tree, bootstrap forest, and XGBoost, that are often quite powerful for classification modeling, which sounds like what you're trying to do. Plus, missing data doesn't affect their algorithms all that much.

 

  Can you anonymize the data table and share it (Tables > Anonymize)? This might give others a chance to play with the data and provide some more specific feedback about how you could tackle the problem on your own.

 

Hope this helps the discussion!

DS