kjwx109
Level II

DoE using principal components as variables

Hi,

I have been working on an optimisation problem and have performed PCA on the dataset.  The top seven principal components explain about 80% of the variance.  I was hoping to plan a DoE using these top seven principal components as variables.  However, a colleague has indicated that this is rubbish, as the later principal components explain comparatively little variance versus the earlier ones, and their inclusion would therefore lead to an inefficient use of experimental resource.  (For info, he instead advocated k-nearest-neighbour analysis, which I don't know much about.)  My query is whether it is still valid to do a DoE with variables (in this case, principal components) that vary widely in their ability to explain the data.  (In my area, I often see DoEs proposed based on just the top three principal components, even when they explain only a fraction of the overall variability.)
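For reference, a minimal sketch (assuming Python with scikit-learn, and a random placeholder matrix standing in for my real dataset) of how the per-component and cumulative explained variance can be checked:

```python
# Minimal sketch: cumulative explained variance of the principal components.
# X is a placeholder random matrix; substitute the real dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))              # placeholder (rows = runs, cols = variables)

X_std = StandardScaler().fit_transform(X)   # standardize so PCA reflects the correlation structure
pca = PCA().fit(X_std)

cumvar = np.cumsum(pca.explained_variance_ratio_)
for k, cv in enumerate(cumvar, start=1):
    print(f"top {k} PCs explain {cv:.1%} of the variance")

# smallest number of PCs reaching 80% cumulative variance
k80 = int(np.searchsorted(cumvar, 0.80)) + 1
print("components needed for 80%:", k80)
```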

4 REPLIES
Victor_G
Super User

Re: DoE using principal components as variables

Hi @kjwx109,

 

It might be difficult to help you without further details; for example, I don't know:

  • Would you like to analyze the dataset you already have, or plan new experiments based on it?
  • What is the goal of the analysis?
  • Do you want to create a predictive model, an explanatory one, or both?
  • Do you have a high-dimensionality issue? How many original factors/variables do you have?
  • Do you want to group/cluster individuals/observations, or variables?

 

From a general point of view, it may be valid to create a DoE on principal component variables, but it's better if these components make sense from a domain-expertise point of view: perhaps certain factors are "mixed" in specific PCs, and you can find a term that includes these correlated factors and explains why they might be linked together.

 

There are numerous options to reduce dimensionality if this is a problem you want to tackle, and you can do it on variables or on observations/individuals; these may lead to different visualizations and results. I answered a similar topic about clustering here: JMP Community - Columns for clustering , and there are other ways to reduce dimensionality: Principal Components Analysis, multivariate embeddings (UMAP, t-SNE, ...), etc.

I don't see how K-Nearest Neighbors would be useful in this context: it is more often used for prediction or as a pre-processing step for missing values, and it is not a variable selection method. You can find more info here : K Nearest Neighbors (jmp.com)

And you can see how it works in an interactive way (and the influence of dataset size, number of neighbors, ...) here : K-Nearest Neighbors (KNN) - Visualizing the Variance of the Decision Boundary / Antoine Broyelle | O...
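As a minimal sketch of those two usual roles of k-NN (prediction and missing-value pre-processing), assuming Python/scikit-learn rather than JMP and made-up toy data:

```python
# Sketch of the two common k-NN roles mentioned above; toy data only.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # a missing value to impute
              [3.0, 6.0],
              [4.0, 8.0]])

# 1) Pre-processing: fill the missing value from the 2 nearest rows.
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# 2) Prediction: estimate a response as the average of the 2 nearest neighbours.
y = np.array([10.0, 20.0, 30.0, 40.0])
model = KNeighborsRegressor(n_neighbors=2).fit(X_filled, y)
print(model.predict([[2.5, 5.0]]))
```

Note that neither role selects variables, which is why k-NN does not obviously fit the problem here.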

 

I hope this first response helps; I look forward to helping you further with more context.

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics
statman
Super User

Re: DoE using principal components as variables

Victor asks some great questions.  I would also like to know how you would actually manipulate a principal component, and how you would set its levels?

"All models are wrong, some are useful" G.E.P. Box
ih
Super User (Alumni)

Re: DoE using principal components as variables

Are all of your variables independent, meaning you can directly set the value of each?  If so, the space defined by those 7 components might be much smaller than the potential design space, and you risk missing optimal points just because you never ran there in the past.  Example: if the plant never modified temperature X, then a DOE using PCA components will also not move X.
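A minimal sketch of this point, assuming Python/scikit-learn and a toy dataset in which one column (a hypothetical temperature, essentially never varied in the historical data) gets near-zero loadings, so any design point chosen in PC space maps back to its one historical value:

```python
# Sketch: a factor the plant never varied gets ~zero PCA loadings,
# so designs built in PC space never move it. Toy data only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n = 200
temp_x = 80.0 + rng.normal(scale=1e-6, size=n)     # temperature essentially held at 80
others = rng.normal(size=(n, 4))                   # factors that did vary
X = np.column_stack([temp_x, others])

pca = PCA(n_components=2).fit(X)                   # covariance-scale PCA (no standardization)
print(np.round(pca.components_[:, 0], 6))          # loadings on temp_x: ~0

# Any "design point" chosen in PC-score space reconstructs temp_x at ~80:
scores = np.array([[3.0, -2.0]])
back = pca.inverse_transform(scores)
print(back[0, 0])                                  # ~80.0 regardless of the scores
```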

 

On the other hand, that unknown space might make really bad product or just not work.  If you can effectively analyze and visualize performance at the limits of all of those components, and are confident that the optimal point is inside the design space, then starting with that space might be the most efficient approach.  Use preliminary results to verify this, and augment the design if you determine the design space is too small or too large.

 

Regarding the number of components, I think some analysis of those components is in order.  If you really have 7 factors that you can control, and the potential benefit from optimizing the 7th is worth the extra runs it will take to explore that space, then go for it.  In some cases the first 4 components could be factors outside the facility's control at any given time, so the 5th component is the first one that is an actual lever that can be controlled.  In other cases the difference between the best and worst settings for the 7th component might yield savings that do not even pay for the experiment.

 

Processes and designs are full of nuances.  I hesitate to generalize too much; the answer to a question like this really depends on your process and your goals.

Phil_Kay
Staff

Re: DoE using principal components as variables

Hi @kjwx109,

 

I am assuming that you have carried out a PCA on a large number of continuous descriptor variables for a library of different "candidate" materials, in order to arrive at a smaller number of covariate factors with which to design an experiment that explores a diverse range of the candidates.

 

If that is the case, then I think the attached example might help to answer your questions. The goal here is to select a diverse range of solvents as part of optimising a chemical reaction process.

 

We have 9 descriptors of the solvents, and there are strong correlations between many of them. Hence, PCA can be used to reduce the size of the problem: 2 PCs capture >70% of the variation in the solvent descriptors. (If there were no strong correlations between descriptors, there would be no point in PCA; we would just use the descriptors as covariate factors in our experiment.)

 

[Image: PCA results for the 9 solvent descriptors]

 

Looking at the top 2 PCs, they make sense as "real" chemical properties of the system: the first seems to describe how "polar" or "water-like" the solvent is, and the second describes how "bulky" the solvent is. It is much more difficult to interpret the meaning of the third PC, which accounts for only 9% of the variation.
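As a sketch of how this kind of interpretation can be done programmatically (assuming Python/scikit-learn and hypothetical descriptor names, since the real descriptor table is in the attachment): large absolute loadings in a component's column suggest what that PC "means".

```python
# Sketch: inspecting loadings to interpret PCs; descriptor names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

descriptors = ["dielectric", "dipole_moment", "logP", "h_bond_donor",
               "h_bond_acceptor", "molar_volume", "density", "viscosity",
               "boiling_point"]
rng = np.random.default_rng(2)
X = rng.normal(size=(30, len(descriptors)))        # placeholder for the real solvent table

pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X))
loadings = pd.DataFrame(pca.components_.T, index=descriptors,
                        columns=["PC1", "PC2", "PC3"])
print(loadings.round(2))   # e.g. polarity descriptors dominating PC1 would
                           # support the "polar / water-like" reading above
```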

 

How many PCs should we use as covariate factors in our designed experiment? Certainly we would not use all 9 - the whole point is to reduce the size of the problem from the large number of correlated measurable descriptors to a smaller number of meaningful, underlying "latent" components that we believe are describing important properties of the system. 

 

A general rule of thumb is not to consider PCs with eigenvalues less than 1, as they describe less variation than one of the original variables. In this case, that would mean choosing only the first 2 PCs, which are also the ones that make the most sense chemically. In many cases like this, people use 3 PCs, but I suspect that is largely because it looks good on a 3D scatterplot!
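A minimal sketch of this rule of thumb (the Kaiser criterion), assuming Python/scikit-learn with standardized data so the eigenvalues are on the correlation scale (up to a small n/(n-1) factor):

```python
# Sketch: keep only PCs whose eigenvalue is at least 1, i.e. PCs that
# explain at least one original variable's worth of variation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 9))                 # placeholder for the 9 descriptors

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

eigenvalues = pca.explained_variance_        # eigenvalues of the (scaled) correlation matrix
keep = int(np.sum(eigenvalues >= 1.0))
print("eigenvalues:", np.round(eigenvalues, 2))
print("PCs kept by the eigenvalue-1 rule:", keep)
```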

 

We should use a combination of the statistics from the PCA and our domain understanding of the system to inform the choice of the number of PCs to carry forward as covariate factors.  

 

There are alternatives like clustering, which would give us groups of similar candidates from the library.  We could then choose a representative from each to give a diverse sample of candidates. However, that would not give us the insight that we get by using meaningful PCs as factors. And it would not enable us to design an experiment that best explores the properties of the candidates alongside other process factors that we might be considering. I am not sure how we would use k nearest neighbours but I imagine it would have the same disadvantages.
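For illustration, a minimal sketch of the clustering alternative (assuming Python/scikit-learn and made-up PC scores): cluster the candidates and take the one nearest each cluster centre as a representative.

```python
# Sketch: pick a diverse subset by clustering candidates and taking the
# candidate nearest each cluster centre. Toy PC scores only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(4)
pc_scores = rng.normal(size=(50, 2))         # candidates in PC-score space

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(pc_scores)
reps = pairwise_distances_argmin(km.cluster_centers_, pc_scores)
print("representative candidate indices:", reps)
```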

 

I hope this helps.

Phil