Principal Component Analysis (PCA) is a traditional method in data analysis and, more specifically, in multivariate analysis. PCA was developed by Karl Pearson in 1901.

The goal of PCA is to reduce the dimensionality of a set of correlated variables by transforming them into a smaller set of uncorrelated variables that explain most of the variation in the original variables. It is a helpful technique for exploring the correlation structure in a set of predictor variables, and in predictive modeling such as Principal Component Regression (PCR), where the principal components are used as predictors in regression models.
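To make the idea concrete, here is a minimal sketch in Python with NumPy (not the JMP workflow used in this post). It assumes PCA is run on the correlation matrix, that is, on standardized variables, and it uses synthetic data. The function name `pca_correlation` and the generated data are illustrative assumptions, not part of the cars example.

```python
import numpy as np

def pca_correlation(X):
    """PCA on the correlation matrix of X (rows = observations, columns = variables)."""
    # Standardize each column so every variable has mean 0 and variance 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    # eigh is appropriate because R is symmetric; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # sort components by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                       # coordinates of the rows on the components
    explained = eigvals / eigvals.sum()        # share of total variance per component
    return scores, eigvecs, explained

# Synthetic data (assumption): three variables driven by one shared latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[1.0, 0.8, 0.6]]) + 0.3 * rng.normal(size=(200, 3))

scores, eigvecs, explained = pca_correlation(X)
print(np.round(explained, 3))                  # variance share of each component
print(np.round(np.cov(scores, rowvar=False), 3))  # off-diagonal terms are zero
```

The covariance matrix of the scores is diagonal, which is what "uncorrelated variables" means here, and the first component carries most of the variance because the three columns are strongly correlated.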

Here is a simple example. Let's consider this data set about cars from *Probabilités, Analyse de données et Statistique*.

The data set has 18 rows with eight numerical columns. We could investigate relationships between pairs of variables. For example, maximum speed is correlated with power.

But with a large number of variables, it is more convenient to summarize the pairwise relationships in a correlation matrix. Here we observe a large number of positive correlations. We also see that maximum speed is most strongly correlated with cylinder and power.
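As a rough illustration of reading a correlation matrix programmatically, the sketch below builds one with NumPy and looks up the variable most strongly correlated with a chosen target. The data are synthetic and the column names only echo the cars example. They are assumptions, not the values from the book, and `strongest_partner` is a hypothetical helper.

```python
import numpy as np

def strongest_partner(R, names, target):
    """Return the variable most strongly correlated (in absolute value) with `target`."""
    i = names.index(target)
    r = np.abs(R[i]).copy()
    r[i] = -np.inf                      # ignore the trivial self-correlation of 1
    j = int(np.argmax(r))
    return names[j], float(R[i, j])

# Synthetic stand-in data (assumption): speed depends strongly on power, length only weakly.
rng = np.random.default_rng(1)
power = rng.normal(size=100)
max_speed = 0.9 * power + 0.2 * rng.normal(size=100)
length = 0.3 * power + rng.normal(size=100)

names = ["power", "max_speed", "length"]
R = np.corrcoef(np.column_stack([power, max_speed, length]), rowvar=False)

print(np.round(R, 2))
partner, r = strongest_partner(R, names, "max_speed")
print(partner, round(r, 2))
```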

Because of the correlation structure in this set of variables, we can expect to reduce the number of dimensions using PCA.
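One common heuristic for deciding how far the dimensions can be reduced is to keep the smallest number of components whose cumulative share of variance reaches a chosen threshold. The sketch below assumes a threshold of 80% and a small made-up correlation matrix. Both are illustrative choices, not values taken from the cars data, and `n_components_for` is a hypothetical helper.

```python
import numpy as np

def n_components_for(R, threshold=0.8):
    """Smallest number of components whose cumulative variance share reaches `threshold`."""
    eigvals = np.linalg.eigvalsh(R)[::-1]            # eigenvalues, largest first
    cumulative = np.cumsum(eigvals) / eigvals.sum()  # cumulative proportion of variance
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return k, cumulative

# Made-up correlation matrix (assumption): two strongly correlated variables and a third one.
R = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])

k, cumulative = n_components_for(R, threshold=0.8)
print(k, np.round(cumulative, 3))
```

The stronger the correlations among the variables, the faster the cumulative proportion rises and the fewer components are needed.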

Along the second component, we see two distinct groups of variables. One group is related to the size characteristics of the car (length, width, weight), and the other is related to motor characteristics (power, maximum speed, cylinder).

**Illustrative Variables**

Illustrative variables have no impact on the construction of the new components. They are used to help interpret the components and relate them to other variables (in this case, price and weight to power ratio).
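The sketch below shows one way this can work outside JMP, using Python with NumPy: the components are computed from the active variables only, and each illustrative variable is then placed on the correlation circle through its correlations with the component scores. This is an assumption about the general technique, not a description of how the JMP add-in is implemented. The data, the `price` column, and the function names are hypothetical.

```python
import numpy as np

def fit_pca(active):
    """PCA on the correlation matrix of the active variables only."""
    Z = (active - active.mean(axis=0)) / active.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(active, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return Z @ eigvecs, eigvals, eigvecs

def illustrative_correlations(scores, illustrative):
    """Correlation of each illustrative variable with each component.

    The illustrative variables are never passed to fit_pca, so they cannot
    influence the components; they are only projected afterwards.
    """
    k = scores.shape[1]
    C = np.corrcoef(np.column_stack([scores, illustrative]), rowvar=False)
    return C[k:, :k]                    # one row per illustrative variable

# Synthetic data (assumption): four active variables and one hypothetical "price".
rng = np.random.default_rng(2)
active = rng.normal(size=(50, 4))
active[:, 1] += active[:, 0]            # make two active variables correlated
price = 2.0 * active[:, 0] + rng.normal(size=50)

scores, eigvals, eigvecs = fit_pca(active)
price_coords = illustrative_correlations(scores, price)
print(np.round(price_coords, 2))        # coordinates of price on the four components
```

As a sanity check, projecting an active variable this way gives back its usual loading (eigenvector entry times the square root of the eigenvalue), so active and illustrative variables can be drawn on the same plot.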

You can download the JMP add-in for PCA with illustrative variables from the JMP File Exchange. (Download requires a free SAS profile.)

This add-in allows you to add illustrative variables to a PCA.

Using our previous example, PCA with illustrative variables allows us to visualize the relationship between the price and the weight to power ratio (the orange variables in the previous data set) and the principal components, without using these additional variables to build the PCA.

We can see that price is clearly correlated with all variables and is grouped with the number of cylinders, power and maximum speed. So an expensive car tends to be a big car or a powerful car.

The second illustrative variable, the weight to power ratio, is negatively correlated with the power of the car on both components and negatively correlated with weight on the first component. This is due to the fact that weight can have a negative impact on the power of a car.

**References**

G. Saporta, *Probabilités, Analyse de données et Statistique*, Technip, 2006 (in French).
