PCA estimation method for missing values (log transformed)

schmops

Community Trekker

Joined:

Aug 23, 2016

Hi there,

I am a beginner in statistics and in JMP, so I hope someone can help me with the question below. I am trying to use JMP's PCA on covariances with a relatively small environmental data set of 80 rows x 10 columns (800 values). My goal is to reduce the data set so I can see whether the 10 variables (in the columns) co-vary across the 80 locations (the rows). The variables all have the same units, and I log10-transformed the data.

My problem is that I have missing values in some cells (i.e. locations where the variable was not detected at the LOD). When I use the default estimation method (REML), about 80% of the variance is explained by PC1. But when I use exactly the same data with the Row-Wise method, this value drops to 60%. I would like to understand why. I understand that REML uses all my data and that the Row-Wise method apparently omits the missing values. But what does "omit" mean here: is a zero used instead, which would decrease the calculated variance? My overall question is which method is more conservative for interpreting the data when the missing values are due to non-detects.

Also, I tried to add +1 to the log10-transformed data set to avoid the missing value problem (a tip from another forum), but JMP does not allow me to add +1 to the empty log10-transformed cells. I am using JMP Pro 12.

Thanks a lot for your help! Sascha

6 REPLIES
jiancao

Staff

Joined:

Jul 7, 2014

The Row-wise method excludes ENTIRE rows from estimation whenever a row contains at least one missing cell; it doesn't substitute a 0. REML uses all of the non-missing cells in its estimation; in other words, within a row only the missing cells themselves are left out. So when there are missing data, Row-wise uses less of the actual data than REML does.
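To make the difference in data usage concrete, here is a minimal sketch in plain Python (not JMP). The small table and the function names are made up for illustration, and the sketch only counts which cells each approach can draw on; it does not perform REML or PCA.

```python
# Hypothetical 5-location x 3-variable table; None marks a non-detect (missing cell).
data = [
    [1.2, 0.8, 1.5],
    [1.1, None, 1.4],
    [0.9, 0.7, None],
    [1.3, 0.9, 1.6],
    [None, 0.6, 1.2],
]

def rowwise_complete(rows):
    """Keep only rows with no missing cell (row-wise deletion)."""
    return [row for row in rows if all(v is not None for v in row)]

def count_observed(rows):
    """Count every non-missing cell, i.e. what an all-available-data method can use."""
    return sum(v is not None for row in rows for v in row)

kept = rowwise_complete(data)
cells_rowwise = sum(len(row) for row in kept)   # cells left after dropping whole rows
cells_observed = count_observed(data)           # cells available when only missing cells are skipped

print(len(kept), cells_rowwise, cells_observed)  # 2 6 12
```

In this toy table three missing cells cost the row-wise approach three whole rows, so it works from 6 values while 12 values are actually observed.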

It is hard to say for certain which method is more conservative or better; it depends on the nature of the missingness. Are those non-detects missing completely at random? If so, the Row-wise method may be OK to use. Otherwise use REML, as it is essentially a missing data imputation method based on fitting a mixed model, but be aware that it assumes a multivariate normal distribution. Personally I would go for REML.

schmops

Community Trekker

Joined:

Aug 23, 2016

Thank you so much, Jiancao!

REML was my gut feeling as well since the non-detects are not completely random (e.g. some variables were not detected in a certain group of all my 80 locations).

You mentioned I should be aware of the multivariate normal distribution that I would be assuming when using REML. What exactly do you mean by this, and how can I check whether my data follow this distribution? What I have done so far is check the distribution of each variable using the Distribution-Normal Quantile Plot in JMP. Just by eye I can see that all points for each variable follow a linear pattern within the confidence intervals. Is that a legitimate way to check?

Thanks again,

Sascha

jiancao

Staff

Joined:

Jul 7, 2014

REML assumes a linear multivariate normal model, which implies that every variable is linearly related to every other variable and that each variable is normally distributed. What you did is legit. Just make sure your variables are not highly skewed due to extreme outliers, etc.
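As a rough numeric companion to the normal quantile plots, one could also compute the skewness of each variable. The sketch below is plain Python rather than JMP; the function name and the two example columns are hypothetical, and the moment-based skewness used here is only one of several common definitions.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5, computed over the non-missing values."""
    xs = [v for v in values if v is not None]
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / len(xs)  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / len(xs)  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical columns: one symmetric, one dominated by a single extreme value.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = [1.0, 2.0, 3.0, 4.0, 50.0]

print(sample_skewness(symmetric))     # 0.0 for perfectly symmetric data
print(sample_skewness(with_outlier))  # clearly positive (right-skewed)
```

A value near zero is consistent with a symmetric distribution; a large positive or negative value points to the kind of skew or outlier worth a closer look in the quantile plot.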

schmops

Community Trekker

Joined:

Aug 23, 2016

Great, thank you so much, Jiancao! That helps a lot!

Dan_Obermiller

Joined:

Apr 3, 2013

jiancao answered the modeling question, but you stated that JMP would not allow you to substitute a 1 in place of the missing values in the log-transformed column. Actually, you can; you just need to add a conditional statement like this:

[Image: 12652_pastedImage_0.png, screenshot of the conditional formula]
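The screenshot itself is not reproduced in the text, so here is a minimal sketch of the same idea in plain Python rather than JMP's formula editor. The function name, the `fill` parameter, and the sample values are assumptions for illustration only: if the raw value is missing, return the substitute, otherwise return the log10 of the value.

```python
import math

def log10_or_fill(value, fill=1.0):
    """Return log10(value), or `fill` when the raw value is missing (None)."""
    if value is None:
        return fill
    return math.log10(value)

# Hypothetical raw concentrations; None marks a non-detect.
raw = [100.0, None, 10.0]
transformed = [log10_or_fill(v) for v in raw]
print(transformed)  # [2.0, 1.0, 1.0]
```

Note that the substitute is on the log scale, so a fill of 1 is indistinguishable from a measured raw value of 10; the fill value should be chosen with that in mind.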

Dan Obermiller
schmops

Community Trekker

Joined:

Aug 23, 2016

Thank you DanO, I will try using this statement with my data!