I ran PCA on the attached data set.
[PCA output table not shown.]
I am trying to increase the R-Square value, and in doing so I want to consider only the important variables.
So far I have cleaned the data by excluding outliers and by removing the variables GarageYrBlt, 1stFlrSF, TotRmsAbvGrd, and GarageCars, as they have very high correlations with other variables.
I also recalculated LotFrontage as UpdatedLotFrontage, since LotFrontage had a lot of missing values.
Now my PCA analysis shows that 12 (out of 26) components explain around 79.5% of the variability. So, am I correct to save the formulas for these 12 components and then exclude the original variables that are used to build them, such as LotFrontage once it is replaced by UpdatedLotFrontage?
So basically, instead of using 26 variables I will now use the 12 components. After this variable reduction, I plan to run stepwise regression, neural networks, etc., and see which method gives the best R-Square value.
Is this approach correct??
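For reference, the "keep enough components to explain ~80% of the variability" step described above can be sketched outside JMP. This is a minimal Python/NumPy illustration on made-up toy data (the data, sizes, and the 80% cutoff here are assumptions for the sketch, not the attached housing data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the housing data: 200 rows, 6 predictors,
# built so that pairs of columns are highly correlated.
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])

# Standardize first (i.e., PCA on the correlation matrix).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via SVD; squared singular values give component variances.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)
cum = np.cumsum(explained)

# Smallest number of components reaching ~80% of the variability.
k = int(np.searchsorted(cum, 0.80) + 1)

# Component scores: these are the "saved formulas" applied to the data.
scores = Z @ Vt[:k].T
print(k, cum[k - 1])
```

Because the toy columns come in correlated pairs, far fewer than 6 components are needed to reach the cutoff; the `scores` matrix is what would carry forward into a regression.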
Can you say a little more about what your objectives are? Given that you mention R-Squared, I guess that you want to build a model (possibly predictive, possibly explanatory or descriptive) relating at least one response ('Y') to some independent variables (Xs). If that's the case, which is the 'Y' variable in your table, please? Additionally, if you are wanting to use PCA, I'm not sure why you would remove variables 'as they have very high correlation' - PCA relies on such correlations to reduce the dimensionality of the problem.
My Y variable is SalePrice. It is a predictive model.
I am trying to get the best-fit model with the highest R-Square.
So, if I don't remove the highly correlated variables and then run PCA:
How do I decide which components to keep (by running stepwise regression?), and what should I do with the initial set of variables? Do I use the initial variables or just the new components when building the model?
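One common way to use the components directly is principal components regression: regress Y on the component scores instead of the original correlated variables. A minimal NumPy sketch on hypothetical toy data (the sizes, the choice of k=2, and the data itself are assumptions for illustration, not the housing data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y depends on 2 latent factors that drive 8 correlated predictors,
# roughly analogous to SalePrice depending on a few underlying house traits.
n, p = 300, 8
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.2 * rng.normal(size=(n, p))
y = latent[:, 0] * 3.0 + latent[:, 1] * 1.5 + 0.5 * rng.normal(size=n)

# Standardize and extract principal component scores.
Z = (X - X.mean(0)) / X.std(0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
k = 2                      # components kept (chosen by cumulative variance or CV)
T = Z @ Vt[:k].T           # scores replace the original 8 variables in the model

# Ordinary least squares of y on the k scores (plus an intercept).
A = np.column_stack([np.ones(n), T])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ coef
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)
```

The caveat, echoed in the reply below from the original thread, is that the components are chosen to explain variance in the Xs, not to predict Y, so the best-predicting components are not necessarily the first ones.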
Generally, PCA is a dimensionality-reduction and visualization method for correlation structures among variables, and not really intended as a variable identification/elimination method to use as a precursor to building predictive models. There are many techniques native to JMP and JMP Pro that can be used to build models directly, even in the face of suspected correlated predictor variables. Techniques such as partial least squares, or, in JMP Pro, many of the subplatforms in the Generalized Regression Fit Model personality, are tailor-made for variable identification in the face of correlated predictor variables.
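For readers working outside JMP: the key difference with partial least squares is that its components are built to maximize covariance with the response, not just variance among the Xs. A minimal NIPALS-style PLS1 sketch in Python on toy data (this is an illustration of the idea under assumed data, not JMP's implementation):

```python
import numpy as np

def pls1(X, y, n_components):
    """Minimal NIPALS-style PLS1: each component is the direction of
    maximum covariance with y, so correlated predictors are handled directly."""
    Xc = X - X.mean(0)
    yc = y - y.mean()
    scores, y_loadings = [], []
    for _ in range(n_components):
        w = Xc.T @ yc                 # weights: direction of max covariance with y
        w /= np.linalg.norm(w)
        t = Xc @ w                    # component scores
        p = Xc.T @ t / (t @ t)        # X loadings
        c = (yc @ t) / (t @ t)        # y loading
        Xc = Xc - np.outer(t, p)      # deflate X and y before the next component
        yc = yc - t * c
        scores.append(t)
        y_loadings.append(c)
    return np.array(scores).T, np.array(y_loadings)

rng = np.random.default_rng(2)
n, p = 250, 10
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
y = latent @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=n)

T, q = pls1(X, y, n_components=2)
yhat = y.mean() + T @ q
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)
```

With two latent factors driving both the predictors and the response, two PLS components recover most of the fit; in JMP, the Partial Least Squares platform performs this (with cross-validation for choosing the number of components) without hand-rolled code.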