JMP User Community > Discussions > PCA



Apr 19, 2017 2:48 PM
(1498 views)

Hi,

I ran PCA on the attached Data Set.

The output is:

**I am trying to increase the R-Square value. In doing so, I want to consider only the important variables.**

So far I have cleaned the data by excluding the outliers and by removing the variables GarageYrBlt, 1stFlrSF, TotRmsAbvGrd, and GarageCars, as they have very high correlation.

I recalculated Lot Frontage as Updated Lot Frontage, because Lot Frontage had a lot of missing values.

Now my PCA analysis shows that 12 (out of 26) components can explain around 79.5% of the variability. So, am I correct if I save the formulas for these components and then exclude the original variables that feed into them, such as LotFrontage (replaced by Updated Lot Frontage)?

So, basically, instead of using 26 variables I will now use 13. After this variable reduction, I plan to run stepwise regression, a neural network, etc., and see which method gives the best R-Square value.
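For readers outside JMP, the workflow described above (standardize, keep components covering roughly 80% of the variance, then model on the saved component scores) can be sketched with scikit-learn. Everything below is a synthetic stand-in for the poster's data, not the actual housing table:

```python
# Sketch of PCA-then-regression (principal components regression) on
# synthetic correlated predictors; dimensions mimic the post's 26 variables.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))                       # hidden structure
X = latent @ rng.normal(size=(5, 26)) + rng.normal(scale=0.3, size=(200, 26))
y = 2 * latent[:, 0] + rng.normal(scale=0.5, size=200)   # synthetic response

# Standardize, then keep enough components to cover ~80% of the variance.
Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.80).fit(Xs)
scores = pca.transform(Xs)            # component scores replace raw variables

model = LinearRegression().fit(scores, y)
print(pca.n_components_, round(model.score(scores, y), 3))
```

Because the 26 columns are driven by only a few latent factors, far fewer than 26 components clear the 80% cutoff, which mirrors the situation described in the post.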

Is this approach correct?

Thanks

4 REPLIES


Apr 20, 2017 6:40 AM
(1473 views)

Can you say a little more about what your objectives are? Given that you mention R-Squared, I guess that you want to build a model (possibly predictive, possibly explanatory or descriptive) relating at least one response ('Y') to some independent variables (Xs). If that's the case, which is the 'Y' variable in your table, please? Additionally, if you want to use PCA, I'm not sure why you would remove variables 'as they have very high correlation': PCA relies on such correlations to reduce the dimensionality of the problem.
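The point that PCA relies on correlation can be seen in a two-variable toy example: when two columns are nearly copies of each other, the first principal component absorbs almost all of their variance. A minimal sketch in Python/scikit-learn (synthetic data, not the poster's table):

```python
# Demonstrates that PCA exploits correlation between variables: with two
# strongly correlated columns, one component captures nearly all the variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)     # b is nearly a copy of a
X = StandardScaler().fit_transform(np.column_stack([a, b]))

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)        # first component dominates
```

Removing one of two highly correlated variables by hand does the same dimension reduction that PCA would have done automatically, which is why the two steps are usually not combined.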


Apr 20, 2017 7:53 AM
(1463 views)

My Y variable is SalePrice, and it is a predictive model.

I am trying to get the best-fit model with the highest R-Square.

So, if I don't remove the highly correlated variables before running the PCA, how do I decide which components to consider (by running a stepwise regression?), and what should I do with the initial set of variables? Do I use the initial variables or just the new components when building the model?
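One common, non-JMP-specific way to decide how many components to keep is a cumulative-variance cutoff; the component scores then replace the original variables in the model, since each score is already a weighted combination of all of them. A sketch on synthetic data (the 80% threshold is illustrative):

```python
# Choosing the number of components via a cumulative explained-variance cutoff.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 4))           # 4 hidden factors
X = latent @ rng.normal(size=(4, 12)) + rng.normal(scale=0.2, size=(300, 12))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.80)) + 1   # smallest k reaching 80%
print(k, round(cumvar[k - 1], 3))
```

Stepwise selection can also be run on the scores themselves, but the original variables would not normally be entered alongside them, because the scores already contain (rotated copies of) the same information.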

Regards,


Apr 22, 2017 6:18 PM
(1396 views)

What about outliers? Do we remove them before running a PCA, or does PCA take care of both the outliers and the missing values?
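As a point of reference outside JMP: PCA by itself handles neither problem. Missing cells must be imputed (or the rows dropped) before components can be computed, and extreme outliers can dominate the leading components, so they are usually screened beforehand. A sketch with scikit-learn; the synthetic data and the median-imputation choice are illustrative, not the poster's setup:

```python
# PCA cannot be fit on data containing NaN; impute first, then scale and reduce.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
X[rng.random(X.shape) < 0.05] = np.nan      # sprinkle in missing cells

pipe = make_pipeline(SimpleImputer(strategy="median"),
                     StandardScaler(),
                     PCA(n_components=3))
scores = pipe.fit_transform(X)
print(scores.shape)
```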


Apr 24, 2017 8:48 AM
(1364 views)

Generally, PCA is a dimensionality-reduction and visualization method for exploring correlation structures among variables; it is not really intended as a variable identification/elimination method or as a precursor to building predictive models. There are many techniques native to JMP and JMP Pro that can be used to build models directly, even in the face of suspected correlated predictor variables. Techniques such as partial least squares, or, in JMP Pro, many of the subplatforms in the Generalized Regression personality of Fit Model, are tailor-made for variable identification in the face of correlated predictor variables.