
Question about PCA

jy9550

New Contributor

Joined:

Apr 7, 2017

Hello,

I am wondering how to decide which continuous variables to include in a PCA.

I am analyzing house sale prices with about 80 variables, and there are too many continuous variables with wide ranges. I want to reduce the set of numerical variables to get a more concise model, but I don't know which variables I should apply PCA to.

Please help me figure this out. The file is attached below.

Thank you.

3 REPLIES
stephen_pearson

Community Trekker

Joined:

Oct 6, 2014

That is an interesting data set. You have many categorical descriptors, some of which could be treated as continuous. For example, the location could be given as latitude and longitude.

Is PCA the technique you require? The individual distributions of the variables can heavily influence the outcome of a PCA.
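
To illustrate that point outside JMP, here is a rough sketch in Python with NumPy. The column names and the synthetic data are made up for the example and are not taken from the attached file. It compares PCA on raw values (covariance matrix) with PCA on standardized values (correlation matrix). On raw values the variable with the largest variance tends to dominate the first component.

```python
import numpy as np

def explained_variance_ratio(X, standardize):
    """Share of total variance carried by each principal component."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                 # center each column
    if standardize:
        X = X / X.std(axis=0, ddof=1)      # unit variance per column
    cov = np.cov(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, largest first
    return eigvals / eigvals.sum()

# Hypothetical, independent columns on very different scales.
rng = np.random.default_rng(0)
n = 500
lot_area = rng.normal(10_000, 3_000, n)
living_area = rng.normal(1_500, 400, n)
quality = rng.normal(6, 1.5, n)
X = np.column_stack([lot_area, living_area, quality])

raw = explained_variance_ratio(X, standardize=False)
scaled = explained_variance_ratio(X, standardize=True)
print("raw:         ", np.round(raw, 3))
print("standardized:", np.round(scaled, 3))
```

In this toy data the columns are independent, so the standardized components share the variance roughly equally, while the raw analysis mostly reports the scale of the lot area column.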

Are you trying to predict future house prices or to understand which variables are important? A simple partition analysis gives a similar result to generalised regression: 90% of the variability in prices can be described using 4 variables, all of which are numeric but only two of which are continuous.
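
For readers unfamiliar with partition analysis, the idea behind its first step can be sketched as follows. This is a simplified Python/NumPy illustration, not the JMP Partition platform itself, and the variables and data are invented. For each candidate variable, find the single split that removes the largest share of the response's sum of squares, then rank the variables by that share.

```python
import numpy as np

def best_split_gain(x, y):
    """Largest fraction of y's sum of squares removed by one split on x."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    total_ss = np.sum((y - y.mean()) ** 2)
    best = 0.0
    for i in range(1, len(y)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no split point between equal values
        left, right = y_sorted[:i], y_sorted[i:]
        ss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        best = max(best, 1.0 - ss / total_ss)
    return best

# Hypothetical data: price depends strongly on quality, weakly on living area.
rng = np.random.default_rng(1)
n = 400
quality = rng.integers(1, 11, n)            # numeric but not continuous
living_area = rng.normal(1_500, 400, n)
unrelated = rng.normal(0, 1, n)
price = 50 * quality + 0.2 * living_area + rng.normal(0, 20, n)

gains = {
    "quality": best_split_gain(quality, price),
    "living_area": best_split_gain(living_area, price),
    "unrelated": best_split_gain(unrelated, price),
}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {gain:.3f}")
```

A full partition analysis repeats this search recursively within each branch. The sketch only shows why a few strong variables surface quickly.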
jy9550

New Contributor

Joined:

Apr 7, 2017

Thanks for the response.
I want to first understand which variables are important, and then run a regression with those variables to predict future house sale prices. I also want to run the regression with all variables, and then compare the results of the two regressions to arrive at the best model, which should be the simpler one with the least error.
Peter_Bartell

Joined:

Jun 5, 2014

Principal Components Analysis is not considered a modeling method but an exploratory data analysis method with a goal of variable reduction. PCA examines various aspects of the correlation structures that exist within a group of variables.

If variable identification for predictive models is your major goal, then the JMP or JMP Pro Fit Model platform is where you want to start. It supports multiple modeling personalities, from good old-fashioned ordinary least squares (called Standard Least Squares in JMP) to stepwise and general linear models, to name three. The Partition platform provides an alternative modeling method which can also be useful for variable identification.

If you are running JMP Pro, the Generalized Regression platform's penalized regression methods are tailor-made for predictive modeling where variable identification is a primary goal. In addition, you've got all sorts of flexible model cross-validation constructs within JMP Pro. Finally, the JMP Pro Model Comparison and Formula Depot platforms are great for comparing the performance of multiple models and, if needed, exporting a model to an alternative coding format like SQL, C, or SAS.
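
To give a feel for how a penalized regression performs variable identification, here is a minimal lasso sketch in Python with NumPy. It is an illustration under stated assumptions, not the algorithm the Generalized Regression platform uses: the data are synthetic, the penalty `lam` is picked by hand rather than by cross-validation, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    Assumes the columns of X are standardized and y is centered.
    """
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()                      # equals y - X @ beta for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            residual += X[:, j] * beta[j]    # remove column j's current fit
            rho = X[:, j] @ residual / n
            scale = X[:, j] @ X[:, j] / n
            beta[j] = soft_threshold(rho, lam) / scale
            residual -= X[:, j] * beta[j]    # add the updated fit back
    return beta

# Synthetic example: 10 candidate predictors, only columns 0 and 3 matter.
rng = np.random.default_rng(2)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0, 0.5, n)

X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize predictors
y = y - y.mean()                             # center the response

beta = lasso_coordinate_descent(X, y, lam=0.2)
selected = np.flatnonzero(beta != 0)
print("selected columns:", selected)
print("coefficients:", np.round(beta, 3))
```

The penalty sets the coefficients of unhelpful predictors to exactly zero, which is why these methods double as variable selection. In practice the penalty strength would be chosen with a validation or cross-validation scheme rather than fixed in advance.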