Principal components or factor analysis?

LauraCS · Apr 25, 2017 02:11 PM

Should I use principal components analysis (PCA) or exploratory factor analysis (EFA) for my work? This is a question that analysts working with multivariate data, such as social scientists, consumer researchers, and engineers, face on a regular basis.

In this post, I share my favorite example for explaining a key difference between PCA and EFA. This distinction opens the door to explaining other important differences and is helpful when figuring out which technique is most appropriate for a given application. Choosing the wrong technique can lead to misleading results or an incorrect understanding of the data.

An Illustrative Example

Let’s start by creating some data that follow a standard normal distribution (a JSL script for all analyses in this post is attached if you want to follow along). Specifically, I create a data table with 1,000 observations on four variables that are uncorrelated with each other.
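For readers following along without JMP, the simulation can be sketched in Python with NumPy. This is not the attached JSL script; the seed, the variable names, and the choice of NumPy are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2017)         # arbitrary seed (assumption), for reproducibility
X = rng.standard_normal((1000, 4))        # 1,000 observations on four independent N(0, 1) variables

R = np.corrcoef(X, rowvar=False)          # 4 x 4 sample correlation matrix
off_diagonal = R[~np.eye(4, dtype=bool)]  # the 12 off-diagonal entries

print(np.round(R, 3))
print("largest |off-diagonal correlation|:", np.abs(off_diagonal).max())
```

With 1,000 rows the off-diagonal correlations should land close to zero, though not exactly at zero, because of sampling error.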

We can use the Multivariate platform in JMP to look at the correlations between variables and confirm they are independent. I particularly like to use the Color Map on Correlations to illustrate the null correlations in the off-diagonal:

Figure 1. Correlation coefficients and corresponding heat map for four simulated variables.

Now, we can ask ourselves an important question: What would the results look like if I use PCA on these data? And what would the results look like if I use EFA instead? If you’re not sure, keep reading.

Let’s use the Factor Analysis platform in JMP to perform a PCA and an EFA on these data at the same time. I’ll retain only one component/factor because we have a small number of variables. This also eliminates the need for any rotation.

The component or factor loadings from the analyses are critical for understanding what the component or factor represents. Variables with high loadings are the most representative of the component or factor; "high" is usually defined as .4 or higher in absolute value, because a loading of .4 suggests that at least 16% (.4² = .16) of the measured variable’s variance overlaps with the variance of the factor. Below, we can compare the resulting component loadings (displayed first) against the factor loadings (displayed second).

Figure 2. Component and factor loadings from a principal components analysis and exploratory factor analysis on four simulated variables that are uncorrelated.

We can see the results are strikingly different! PCA gave us three loadings greater than .4 in absolute value, whereas EFA didn’t give any. Why? Because when we do EFA, we are implicitly requesting an analysis of a reduced correlation matrix, in which the ones on the diagonal have been replaced by squared multiple correlations (SMCs). Indeed, a quick look at the Color Map on Correlations of the reduced correlation matrix sheds light on why one would obtain such different results:

Figure 3. Heat map of reduced correlation matrix, where unities in the diagonal have been replaced by squared multiple correlations.

In this example, every entry in the reduced correlation matrix is nearly zero (the actual values are 0.002, 0.002, 0.004, and 0.001). PCA performs an eigenvalue decomposition of the full correlation matrix (Figure 1), whereas EFA performs the eigenvalue decomposition on the reduced correlation matrix (Figure 3). The difference in the matrix being analyzed helps explain the difference in results, but none of this tells us what the differences mean from a practical point of view.
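The contrast between the two decompositions can be sketched in Python with NumPy. This is a simplified, non-iterated principal-axis calculation, not a reproduction of JMP's Factor Analysis platform; the seed and all names (first_loadings, R_reduced, and so on) are assumptions for illustration, so the numbers will not match the figures in this post.

```python
import numpy as np

rng = np.random.default_rng(2017)         # arbitrary seed (assumption)
X = rng.standard_normal((1000, 4))        # four uncorrelated standard normal variables
R = np.corrcoef(X, rowvar=False)          # full correlation matrix (ones on the diagonal)

def first_loadings(M):
    """Loadings on the first component/factor: eigenvector times sqrt(eigenvalue)."""
    eigvals, eigvecs = np.linalg.eigh(M)  # eigh returns eigenvalues in ascending order
    top = max(eigvals[-1], 0.0)           # guard against a tiny negative value
    return eigvecs[:, -1] * np.sqrt(top)

# Squared multiple correlations: SMC_i = 1 - 1 / [R^-1]_ii
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))

# Reduced correlation matrix: same off-diagonals, SMCs on the diagonal
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)

pca_loadings = first_loadings(R)          # decomposition of the full matrix
efa_loadings = first_loadings(R_reduced)  # decomposition of the reduced matrix

print("SMCs:        ", np.round(smc, 3))
print("PCA loadings:", np.round(pca_loadings, 3))
print("EFA loadings:", np.round(efa_loadings, 3))
```

Because the largest eigenvalue of a correlation matrix is at least 1 and the first eigenvector has unit length, the largest PCA loading on four variables cannot fall below .5 in absolute value. The EFA loadings, by contrast, are scaled by the square root of a much smaller eigenvalue, so they should stay well below the PCA loadings for data like these.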

Practical Meaning of Analyzing a Full vs. Reduced Correlation Matrix

PCA and EFA have different goals: PCA is a technique for reducing the dimensionality of one’s data, whereas EFA is a technique for identifying and measuring variables that cannot be measured directly (i.e., latent variables or factors). Thus, in PCA all of the variance in the data, as reflected by the full correlation matrix, is used to attain a solution, and the resulting components are a mix of what the variables were intended to measure and other sources of variance such as measurement error (see left panel of Figure 4).

By contrast, EFA assumes that not all of the variance in the data comes from the underlying latent variable (see right panel of Figure 4). This feature is reflected in the EFA algorithm by “reducing” the correlation matrix with the SMC values. This is appropriate because an SMC is an estimate of the variance that the underlying factor(s) explain in a given variable (aka communality). If we carried out EFA with unities in the diagonal, then we would be implicitly saying the factors explain all the variance in the measured variables, and we would be doing PCA rather than EFA.
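An SMC can also be read as the R² obtained by regressing one variable on all the others, that is, the share of its variance that it has in common with the rest. A minimal Python sketch, assuming NumPy and simulated one-factor data (the .7 weight, the seed, and the helper r_squared are illustrative assumptions), checks that the regression route and the inverse-correlation-matrix route give the same numbers:

```python
import numpy as np

rng = np.random.default_rng(7)                        # arbitrary seed (assumption)
f = rng.standard_normal((500, 1))                     # one simulated common factor
X = 0.7 * f + rng.standard_normal((500, 4))           # four variables that share the factor
R = np.corrcoef(X, rowvar=False)

# Route 1: SMC_i = 1 - 1 / [R^-1]_ii
smc_from_inverse = 1.0 - 1.0 / np.diag(np.linalg.inv(R))

def r_squared(y, Z):
    """R^2 from an ordinary least-squares regression of y on Z (with intercept)."""
    Z1 = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    resid = y - Z1 @ beta
    return 1.0 - resid.var() / y.var()

# Route 2: regress each variable on the remaining three
smc_from_regression = np.array(
    [r_squared(X[:, j], np.delete(X, j, axis=1)) for j in range(X.shape[1])]
)

print(np.round(smc_from_inverse, 4))
print(np.round(smc_from_regression, 4))
```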

Figure 4. Graphic comparison of principal components analysis and exploratory factor analysis.

Figure 4 also illustrates another important distinction between PCA and EFA. Note that the arrows in PCA point from the measured variables to the principal component, and in EFA it’s the other way around. The arrows represent causal relations, such that in PCA the variability in the measured variables causes the variance in the principal component. This is in contrast to EFA, where the latent factor is seen as causing the variability and pattern of correlations among measured variables (Marcoulides & Hershberger, 1997).
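In equation form, the two directions of the arrows can be written as follows. The symbols are assumptions introduced here for illustration: x_j are the p measured variables (standardized), c is a principal component with weights w_j, f is a single standardized latent factor with loadings λ_j, and u_j is the unique part of x_j (specific variance plus measurement error), taken to be uncorrelated with f and with the other unique parts.

```latex
% PCA: the component is built from the measured variables
c = w_1 x_1 + w_2 x_2 + \dots + w_p x_p

% EFA: each measured variable is generated by the factor plus a unique part
x_j = \lambda_j f + u_j, \qquad j = 1, \dots, p

% Consequences of the one-factor model for standardized variables
\operatorname{Corr}(x_j, x_k) = \lambda_j \lambda_k \quad (j \neq k), \qquad
\operatorname{Var}(x_j) = \lambda_j^2 + \operatorname{Var}(u_j) = 1
```

Under these assumptions the communality of x_j is λ_j², which is the quantity the SMC on the diagonal of the reduced correlation matrix is meant to estimate.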

In the interest of clarity, it’s important that I outline a few more observations. First, most multivariate data are correlated to some degree, so differences between PCA and EFA don’t tend to be as marked as the ones in this example. Second, as the number of variables involved in the analysis grows, results from PCA and EFA become more and more similar. Researchers have argued that analyses with at least 40 variables lead to minor differences (Snook & Gorsuch, 1989). Third, if the communality of the measured variables is high (i.e., approaches 1), then the results from PCA and EFA are also similar. Finally, this favorite example of mine relies on the “Principal Axis” factoring method, but other estimation methods exist for which results would vary. All of these observations must be taken into account when analysts choose between EFA and PCA. But perhaps most important for psychometricians (those who developed EFA in the first place) is the fact that EFA posits a theory about the variables being analyzed, a theory that dates back to Spearman (1904) and suggests unobserved factors determine what we’re able to measure directly.
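The third observation can be checked with a small simulation. The sketch below, in Python with NumPy, generates four variables from one factor and compares the average absolute loading from PCA with that from a non-iterated principal-axis EFA (SMCs on the diagonal). The function names, sample size, seed, and the true loading values of .5 and .9 are assumptions chosen for illustration; they are not part of the original analysis.

```python
import numpy as np

def first_loadings(M):
    """Loadings on the first component/factor: eigenvector times sqrt(eigenvalue)."""
    eigvals, eigvecs = np.linalg.eigh(M)       # ascending eigenvalues
    return eigvecs[:, -1] * np.sqrt(max(eigvals[-1], 0.0))

def compare(true_loading, n=2000, p=4, seed=0):
    """Mean |loading| from PCA and from EFA for one-factor data with equal true loadings."""
    rng = np.random.default_rng(seed)
    f = rng.standard_normal((n, 1))            # latent factor
    u = rng.standard_normal((n, p))            # unique parts
    X = true_loading * f + np.sqrt(1.0 - true_loading**2) * u
    R = np.corrcoef(X, rowvar=False)
    R_reduced = R.copy()
    np.fill_diagonal(R_reduced, 1.0 - 1.0 / np.diag(np.linalg.inv(R)))  # SMCs
    pca = np.abs(first_loadings(R)).mean()     # absolute value removes the sign ambiguity
    efa = np.abs(first_loadings(R_reduced)).mean()
    return pca, efa

pca_low, efa_low = compare(0.5)                # low communality (.25)
pca_high, efa_high = compare(0.9)              # high communality (.81)

print(f"true loading .5: PCA {pca_low:.3f}  EFA {efa_low:.3f}")
print(f"true loading .9: PCA {pca_high:.3f}  EFA {efa_high:.3f}")
```

With true loadings of .5, the PCA loadings should come out noticeably higher than the EFA loadings; with true loadings of .9, the two sets should be much closer to each other.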

I list some key points below, but note that an excellent source for continuing to learn about this topic is Widaman (2007).

Key Points

PCA is useful for reducing the number of variables while retaining the most information in the data, whereas EFA is useful for measuring unobserved (latent), error-free variables.
When variables don’t have anything in common, as in the example above, EFA won’t find a well-defined underlying factor, but PCA will find a well-defined principal component that explains the maximal amount of variance in the data.
When the goal is to measure an error-free latent variable but PCA is used, the component loadings will most likely be higher than they would have been had EFA been used. This would mislead analysts into thinking they have a well-defined, error-free factor when in fact they have a well-defined component that’s an amalgam of all the sources of variance in the data.
When the goal is to get a small subset of variables that retains the most variability in the data but EFA is used, the factor loadings will likely be lower than they would have been had PCA been used. This would mislead analysts into thinking they kept the maximal amount of variance in the data when in fact they kept only the variance that’s common across the measured variables.

References

Marcoulides, G. A., & Hershberger, S. L. (1997). Multivariate statistical methods: A first course. Psychology Press.

Snook, S. C., & Gorsuch, R. L. (1989). Component analysis versus common factor analysis: A Monte Carlo study. Psychological Bulletin, 106, 148-154.

Spearman, C. (1904). "General intelligence," objectively determined and measured. The American Journal of Psychology, 15, 201-293.

Widaman, K. F. (2007). Common factors versus components: Principals and principles, errors and misconceptions. Factor analysis at 100: Historical developments and future directions, 177-203.

shannon_conners · 04-28-2017

Laura, I really appreciate that you took the time to explain the difference between these techniques. This was very well written and informative and I loved your use of graphs and diagrams for illustrating your points. Thank you!

LauraCS · 04-28-2017

Thanks Shannon!

Young · 02-05-2018

Hi Laura,

The last diagram comparing EFA and PCA is very helpful for understanding the difference. If it comes from somewhere else, could you let me know the source so I can use it in my presentation?

Thank you!

Best,

Young

LauraCS · 02-05-2018

Young,

I'm glad you found the last diagram helpful. I created that image for this post and am not aware of other sources that have something similar.

Best,

~Laura

Young · 02-05-2018

Hi Laura,

Thank you for the reply. I will then cite this website. Thanks for the very useful diagram.

Best,
Young

abmayfield · 08-19-2018

Super helpful. In fact, so helpful that I'll be incorporating these ideas into an upcoming grant on coral work!

LauraCS · 08-20-2018

So glad to hear! =)

Peter_Hersh · 08-02-2019

Hey Laura,

I love this example comparing PCA to EFA. I was wondering if you could comment on the differences between Latent Class Analysis (LCA) and EFA?

LauraCS · 08-08-2019

Hi Peter (@phersh),

The key difference is the conceptualization of the latent variable. That is, in EFA the factors are continuous whereas in LCA they are categorical. Moreover, the variables used to extract the factors in EFA are also continuous but in LCA they're categorical. Because LCA estimates categorical latent variables, it's used for classification. I made the image below using the measurement level icons in JMP to drive the point further. Hope this helps!

Peter_Hersh · 08-14-2019

Thanks Laura,

I appreciate the answer. Is it correct to say that MDS is similar to PCA for categorical variables and LCA would be comparable to EFA for categorical variables?

LauraCS · 08-19-2019

@Peter_Hersh I think it would be more appropriate to say that Multiple Correspondence Analysis (MCA) is similar to PCA for categorical variables. Also, Item Response Theory (IRT) is comparable to [Confirmatory] Factor Analysis (CFA) for categorical [observed] variables, and LCA is like IRT but with categorical latent variables.

To clarify further, in EFA one doesn't know a priori the "configural" structure of the data, that is, which observed variables are caused by which latent variables. In contrast, this is known in CFA, and the analyst specifies the model according to that knowledge. The latter is also true for IRT and LCA.

We can also distinguish all these techniques based on their ability to estimate latent variables. From this perspective, EFA, CFA, IRT, and LCA are all models for estimating "legitimate" latent variables. I use the word legitimate because the latent variables in these models cause the observed variables. On the other hand, MCA and PCA are models for dimension reduction, without the understanding that an unobserved/latent variable causes the observed variables.

DirtyDataCoLLC · 04-02-2022

@LauraCS, thank you. I reread your explanation. I find reading a thorough explanation on complex topics more than once is beneficial.