ehchandlerjr
Level V

Principal component regression

Hello - I have been working with and using PCA in JMP for about four years now. Nothing fancy, but I feel very comfortable with a lot of the functionality. I've been using the PCs as factors for DoE, thinking I was some kinda smartie who had come up with a new use case for PCA. Lo and behold, I stumbled across a Wikipedia page last week on principal component regression. Ha.

Anyway, I have looked and looked, and don't see any reference, much less functionality, in JMP for PCR (or integration with DoE for that matter). Am I missing something, and does this functionality exist in JMP, even if only in JSL form? If not, does anyone have experience with it? Would you just use the inverted correlation matrix on the PCs, basically the inverse of the PC formula column?

Thanks!
Edward Hamer Chandler, Jr.
9 REPLIES 9
Thierry_S
Super User

Re: Principal component regression

Hi,

I dabbled with the same idea and was told that was nonsense on this board. Still, there may be a labor-intensive way to get what you want.

  1. Calculate the Principal Components from your data set (Multivariate Methods > Principal Components)
  2. In the report, go to Save Columns > Save Principal Components Values. The number of PCs you save depends on your data, and I tend to use PCs that capture at least 80% of the data variation
  3. Then, treat the PCs as any variables in your model (e.g., Response Screening, Fit Model,..)

Note: This approach is directly inspired by the workstream in R as explained on this page (LINK)
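
For readers outside JMP, the three steps above can be sketched in Python with NumPy. This is a minimal illustration, not the JMP implementation: the function names `pcr_fit` and `pcr_predict`, the simulated data, and the choice to run PCA on standardized columns (that is, on correlations) are all assumptions made for the example. The 80% cutoff mirrors the rule of thumb in step 2.

```python
import numpy as np

def pcr_fit(X, y, var_target=0.80):
    """Hypothetical helper: PCA on standardized X, then least squares on the scores."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mean, scale = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mean) / scale                      # standardize, so PCA is on correlations
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)             # variance share of each PC
    # Smallest number of PCs whose cumulative share reaches var_target
    k = int(np.searchsorted(np.cumsum(explained), var_target) + 1)
    k = min(k, len(s))
    loadings = Vt[:k].T                         # columns are the kept eigenvectors
    scores = Z @ loadings                       # "Save Principal Components" values
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return {"k": k, "explained": explained, "loadings": loadings,
            "intercept": coef[0], "pc_coef": coef[1:],
            "mean": mean, "scale": scale}

def pcr_predict(model, X_new):
    """Project new rows onto the kept PCs and apply the fitted coefficients."""
    Z = (np.asarray(X_new, dtype=float) - model["mean"]) / model["scale"]
    return model["intercept"] + Z @ model["loadings"] @ model["pc_coef"]

# Simulated data (an assumption): six correlated columns driven by two latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(50, 6))
y = 3.0 * latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=50)

model = pcr_fit(X, y)
y_hat = pcr_predict(model, X)
print("PCs kept:", model["k"])
print("Variance shares:", np.round(model["explained"], 3))
```

Note that the cutoff looks only at the X columns; the response plays no part in deciding how many PCs are kept.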

Best,

TS

 

Thierry R. Sornasse
ehchandlerjr
Level V

Re: Principal component regression

Thanks for the reply! So what is people's beef with PCR? Is it a fundamental statistical issue, or is it something else? For example, I have a paper I'm using that takes 700 solvents across 100 properties. It seems highly unlikely that there are 100 truly independent, fundamental aspects to the solvent space, and PCR seems like a reasonable way to reduce it to a manageable size. I'm sure there are many other places too, for zeolites, minerals, etc. And that's just chemical.

Edward Hamer Chandler, Jr.
ehchandlerjr
Level V

Re: Principal component regression

I guess I'm also asking: what's the best way to invert the regression model that you get back into terms of the original column space? Full disclosure: I'm using these PCs as factors in an experimental design. Do you make the model in terms of the PCs and apply some inverse correlation matrix to it, or do you then do some sort of fit to all of the factors and see which fit best, to get a more physically based model?
Edward Hamer Chandler, Jr.
Victor_G
Super User

Re: Principal component regression

Could a Partial Least Squares regression be a simpler solution?

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)

Re: Principal component regression

I'm not sure that people have "beef" with PCR, but it has some limitations. Ultimately, you must be certain about what you want from your modeling effort. 

PCR is a biased regression technique. This is not necessarily bad, but your model would be predictive only. It is not intended to provide any insights into what CAUSES things to happen like a designed experiment would. It is truly just built on correlations, with no claims on causation.

When you perform PCA on the X's as @Thierry_S suggests, the decision to keep only the first few principal components is based on the X's alone, not on the target or response variable, Y. So you may not end up with the best model. The last principal component might have the best predictive ability for the target, so you should keep ALL principal components when you build your PCR model. Once you fit a model with all PCs, you can then remove the insignificant ones to get a final model.
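
A small simulated example (Python with NumPy) illustrates why a low-variance PC can matter. The data are made up for the purpose: two nearly identical predictors, with the response depending on their difference. All variable names are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)          # x2 is almost a copy of x1
y = (x1 - x2) + 0.01 * rng.normal(size=n)   # response depends on the small difference

X = np.column_stack([x1, x2])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)             # variance share of each PC
scores = Z @ Vt.T                           # PC scores, largest variance first

# Correlation of each PC's scores with the response
corr_with_y = np.array(
    [np.corrcoef(scores[:, j], y)[0, 1] for j in range(scores.shape[1])]
)
for j in range(scores.shape[1]):
    print(f"PC{j + 1}: variance share {explained[j]:.4f}, "
          f"corr with y {corr_with_y[j]:+.3f}")
```

Here the first PC carries nearly all of the variance in X, yet the second PC is the one that tracks the response, so a variance-only cutoff would discard the useful direction.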

Your last post gets at the major issue with PCR: how do you translate back to the original X space? There are several possible approaches. One approach is to keep the factor that has the highest loading from each of the significant PCs. A second approach is to remove the factor that has the lowest loading from each of the significant PCs. And there are other possibilities. Which approach you choose will depend on what you want from a model.
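
Among the other possibilities is the purely algebraic one: because each PC is a linear combination of the standardized X's, a model that is linear in the PCs can be rewritten as a model that is linear in the original columns. The notation below is an assumption for this sketch: $V_k$ is the $p \times k$ matrix of eigenvectors for the $k$ retained PCs, $\gamma_0$ and $\gamma$ are the intercept and coefficients fitted on the PC scores, and $\bar{x}_j$, $s_j$ are the mean and standard deviation of column $j$.

```latex
\begin{aligned}
z_j &= \frac{x_j - \bar{x}_j}{s_j}, \qquad t = V_k^{\top} z, \\
\hat{y} &= \gamma_0 + \gamma^{\top} t = \gamma_0 + (V_k \gamma)^{\top} z, \\
\beta_j &= \frac{(V_k \gamma)_j}{s_j}, \qquad
\beta_0 = \gamma_0 - \sum_{j=1}^{p} \beta_j \bar{x}_j, \\
\hat{y} &= \beta_0 + \sum_{j=1}^{p} \beta_j x_j .
\end{aligned}
```

This gives a coefficient for every original column, but the coefficients inherit the bias from any dropped PCs, so they are a restatement of the predictive model and not evidence of which X's drive the response.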

Dan Obermiller
ehchandlerjr
Level V

Re: Principal component regression

Hmmmmmm. I think I understand your concerns, @Dan_Obermiller.

OK, so here's a thought. This is delving into stuff I'm really not familiar with, so let me know if this is absurd. Even though the experimental design using PCs results in a model in terms of PCs, the y vector is still in real units. So could one do what @Thierry_S suggested and generate the PCs? That way you can generate a design that at least tries to pick points along the directions of maximum variance. Load those into the DoE platform as covariates. And then (I was just looking through the PLS platform @Victor_G shared), instead of making a model in terms of the PCs, maybe do a PLS regression of the final data against the original data set.

That way you are just using the PCs to make sure the usage of the original column space is maximized in your design, but you're using a regression model that prioritizes the original, physically grounded column space.

Is this reasonable? My first worry is that this just relegates the optimality criterion to the proverbial back room, but from what I gather, PLS is designed to handle highly correlated variables, so is the need for an optimality criterion lessened? There might be other issues as well.

I'm just a lowly chemical engineer, so let me know if I'm not treating the statistics well here.
Edward Hamer Chandler, Jr.
ehchandlerjr
Level V

Re: Principal component regression

Also, @Victor_G, just saw you have L'Oreal in your tag line. I actually just applied to L'Oreal. No idea if I'll get it, but I was stressing about losing JMP Pro by moving to another job. Glad to know if I get a job there, I won't lose access!
Edward Hamer Chandler, Jr.
Victor_G
Super User

Re: Principal component regression

@ehchandlerjr, for the design creation, I see two options:

  1. Using PCs as the factors in the design (with the risks mentioned by @Dan_Obermiller in terms of predictivity, but you might have a good experimental space coverage),
  2. Directly use your factors as covariates. Depending on which factors you have, you might not be able to change the levels independently: for example, with chemical characteristics, you can't change molecular weight, topological surface area, and carbon chain length fully independently for a set of molecules. Preparing a candidate set with all possible combinations of your factors and using these factors as covariates could help you stay in the original input space, enable good coverage of your experimental space, and let you model your responses directly with these factors (through PLS or other models). More info here: What is a covariate in design of experiments?
    Developer Tutorial - Handling Covariates Effectively when Designing Experiments - JMP User Community
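
To make the candidate-set idea in option 2 concrete, here is a rough Python/NumPy sketch that picks runs from a fixed list of candidates by greedily increasing the determinant of the information matrix for a main-effects model. It is a simplified stand-in and not the algorithm JMP uses for covariates; the function name `greedy_d_optimal`, the small ridge term, and the simulated descriptors are all assumptions for illustration.

```python
import numpy as np

def greedy_d_optimal(candidates, n_runs, ridge=1e-6):
    """Hypothetical helper: pick n_runs candidate rows, one at a time,
    each time adding the row that most increases log det(F'F)."""
    C = np.asarray(candidates, dtype=float)
    Z = (C - C.mean(axis=0)) / C.std(axis=0, ddof=1)
    F = np.column_stack([np.ones(len(Z)), Z])   # intercept + main effects
    info = ridge * np.eye(F.shape[1])           # ridge keeps early steps non-singular
    chosen = []
    for _ in range(n_runs):
        best_row, best_logdet = None, -np.inf
        for i in range(len(F)):
            if i in chosen:
                continue                        # each candidate can be run only once
            _, logdet = np.linalg.slogdet(info + np.outer(F[i], F[i]))
            if logdet > best_logdet:
                best_row, best_logdet = i, logdet
        chosen.append(best_row)
        info += np.outer(F[best_row], F[best_row])
    return chosen

# Simulated candidate set (an assumption): 40 molecules, 3 correlated descriptors
rng = np.random.default_rng(2)
size = rng.normal(size=40)
descriptors = np.column_stack([
    size + 0.3 * rng.normal(size=40),   # stands in for molecular weight
    size + 0.3 * rng.normal(size=40),   # stands in for surface area
    size + 0.3 * rng.normal(size=40),   # stands in for chain length
])
runs = greedy_d_optimal(descriptors, n_runs=8)
print("Selected candidate rows:", sorted(runs))
```

Because the rows come from real candidates, every selected run is a combination of descriptor values that actually exists, which is the point of treating the factors as covariates.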

 

Good luck with L'Oreal, and keep me informed.
Best,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
statman
Super User

Re: Principal component regression

Perhaps my definition of an experiment is different than yours? I guess my practical question is:  What are you actually manipulating in the experiment?  I don't think the PC is necessarily directly translatable to a factor or even a set of factors that are or can be specifically and independently manipulated?

 

You can certainly regress on those, but I'm not sure this is an experiment?  I mean if all you have is covariates, that is not an experiment, that is regression.  Usually experiments include factors that are considered fixed effects.  Covariates are treated as random effects.  Together you get a mixed model.  You are typically limited in the number of random variables you can account for in the mixed model.

"All models are wrong, some are useful" G.E.P. Box