Hi @kjwx109,
I am assuming that you have carried out a PCA on a large number of continuous descriptor variables for a library of different "candidate" materials, in order to arrive at a smaller number of covariate factors with which to design an experiment that explores a diverse range of the candidates.
If that is the case, then I think the attached example might help to answer your questions. The goal here is to select a diverse range of solvents as part of optimising a chemical reaction process.
We have 9 descriptors of the solvents and there are strong correlations between many of them. Hence, PCA can be used to reduce the size of the problem: 2 PCs capture >70% of the variation in the solvent descriptors. (If there were no strong correlations between the descriptors there would be no point in PCA - we would just use the descriptors themselves as covariate factors in our experiment.)
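To make the idea concrete, here is a minimal Python sketch of a PCA on correlations. It is not the attached example: the descriptor table is synthetic (9 correlated columns generated from 2 hidden properties), and the names `make_synthetic_descriptors` and `pca_on_correlations` are my own for illustration. With a real library you would replace `X` with your own descriptor table.

```python
import numpy as np

def make_synthetic_descriptors(n_candidates=30, noise=0.3, seed=0):
    """Synthetic stand-in for a descriptor table (NOT real solvent data).

    Nine correlated columns are generated from two hidden properties:
    columns 0-4 follow the first, columns 5-8 follow the second.
    """
    rng = np.random.default_rng(seed)
    hidden = rng.normal(size=(n_candidates, 2))
    loadings = np.zeros((2, 9))
    loadings[0, :5] = 1.0
    loadings[1, 5:] = 1.0
    return hidden @ loadings + noise * rng.normal(size=(n_candidates, 9))

def pca_on_correlations(X):
    """Return eigenvalues (largest first), eigenvectors and PC scores."""
    # Standardise each descriptor so that every column has unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = (Z.T @ Z) / (len(Z) - 1)           # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                      # candidate coordinates on each PC
    return eigvals, eigvecs, scores

X = make_synthetic_descriptors()
eigvals, eigvecs, scores = pca_on_correlations(X)
share = eigvals / eigvals.sum()
for k, (ev, s, c) in enumerate(zip(eigvals, share, np.cumsum(share)), start=1):
    print(f"PC{k}: eigenvalue={ev:.2f}  share={s:.1%}  cumulative={c:.1%}")
```

The columns of `scores` are what would be carried forward as covariate factors; the eigenvalues sum to the number of descriptors because each standardised descriptor contributes a variance of 1.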
Looking at the top 2 PCs, they make sense as "real" chemical properties of the system: the first seems to describe how "polar" or "water-like" the solvent is, and the second describes how "bulky" the solvent is. It is much more difficult to interpret the meaning of the third PC, which accounts for only 9% of the variation.
How many PCs should we use as covariate factors in our designed experiment? Certainly we would not use all 9 - the whole point is to reduce the size of the problem from the large number of correlated measurable descriptors to a smaller number of meaningful, underlying "latent" components that we believe are describing important properties of the system.
A general rule of thumb is to not consider PCs with eigenvalues less than 1, as each of them describes less variation than a single one of the original variables (this applies when the PCA is carried out on correlations, so that each standardised variable contributes a variance of 1). In this case, that would mean we would choose to use only the first 2 PCs. And the first 2 seem to make most sense chemically. In many cases like this, people use 3 PCs, but I suspect that is largely because it looks good on a 3D scatter plot!
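The eigenvalue rule of thumb is easy to apply in code. The sketch below is only an illustration under my own assumptions: the data are synthetic (two hidden properties driving 9 noisy descriptors, not the solvent data), and `kaiser_count` is a hypothetical helper name.

```python
import numpy as np

def kaiser_count(eigenvalues, threshold=1.0):
    """Number of PCs whose eigenvalue exceeds the threshold (1 by default)."""
    return int(np.sum(np.asarray(eigenvalues) > threshold))

# Synthetic descriptor table (NOT real data): 40 candidates, 9 descriptors.
rng = np.random.default_rng(1)
hidden = rng.normal(size=(40, 2))
X = np.hstack([
    hidden[:, [0]] + 0.3 * rng.normal(size=(40, 5)),  # 5 descriptors track property 1
    hidden[:, [1]] + 0.3 * rng.normal(size=(40, 4)),  # 4 descriptors track property 2
])

# Eigenvalues of the correlation matrix, largest first.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

n_keep = kaiser_count(eigenvalues)
share_kept = eigenvalues[:n_keep].sum() / eigenvalues.sum()
print(f"PCs with eigenvalue > 1: {n_keep}")
print(f"Share of variation they capture: {share_kept:.1%}")
```

The count is only a starting point: whether the retained PCs make sense as properties of the system still has to be judged by someone who knows the chemistry.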
We should use a combination of the statistics from the PCA and our domain understanding of the system to inform the choice of the number of PCs to carry forward as covariate factors.
There are alternatives like clustering, which would give us groups of similar candidates from the library. We could then choose a representative from each to give a diverse sample of candidates. However, that would not give us the insight that we get by using meaningful PCs as factors. And it would not enable us to design an experiment that best explores the properties of the candidates alongside other process factors that we might be considering. I am not sure how we would use k nearest neighbours but I imagine it would have the same disadvantages.
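For comparison, the clustering alternative could be sketched as follows. This is only an illustration under my own assumptions: I have used a plain k-means written in NumPy as a stand-in for whichever clustering method you prefer, the points are synthetic, and the function names are hypothetical. One candidate is picked per cluster, namely the member closest to the cluster centre.

```python
import numpy as np

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Basic Lloyd's k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:          # keep the old centre if a cluster empties
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(points, centroids), centroids

def pick_representatives(points, labels, centroids):
    """Index of the member nearest to the centre of each non-empty cluster."""
    reps = []
    for j in np.unique(labels):
        members = np.flatnonzero(labels == j)
        offsets = np.linalg.norm(points[members] - centroids[j], axis=1)
        reps.append(int(members[offsets.argmin()]))
    return reps

# Synthetic candidates (NOT real data): 24 points in a 2-D standardised space.
rng = np.random.default_rng(2)
centres = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
points = np.vstack([c + 0.5 * rng.normal(size=(8, 2)) for c in centres])

labels, centroids = kmeans(points, k=3)
reps = pick_representatives(points, labels, centroids)
print("Representative candidates (row indices):", reps)
```

Note that the output is just a list of candidates. Unlike the PC scores, it gives no continuous factors that could be modelled alongside other process factors.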
I hope this helps.
Phil