turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- JMP User Community
- :
- Discussions
- :
- Discussions
- :
- Identifying Columns Best Correlated to A Single Column

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Feb 5, 2013 11:43 AM
(2021 views)

Hi,

I have a table with a "result" column and about 100 process variable columns. I want to know which process variable affect my result most.

One simple way is to plot Each Process Variable Vs. Result and take a look at the box plots for correlations. But, visual inspection of 100 columns is pretty tough.

Is there a way to automate that process.

I appreciate your response.

Seshadri

5 REPLIES

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

If you want the highest correlation, this is simple to do, using PROC CORR. Output the correlations to a data set, take the absolute value of the correlations, and then sort.

However, I have worked with manufacturing process data for quite a while, let me urge you to modify the goal. Often, you will have several variables that have similar correlations, and the one that has the highest correlation could be "not statistically different" correlation-wise from the one that is in second place or third place. And it may turn out that the three correlations all relate to one single phenomenon, or that based upon first principle knowledge the third variable is most likely related to the true "cause". So selecting the highest correlation might not reveal this information.

So while it is tempting to look at the "highest correlation", you might really want to look at the several highest correlations. You might want to use a modelling technique that really accounts for these ideas, such as Partial Least Squares (PROC PLS), which provides you with a vector indicating what variables are most correlated with the response, and a bar chart of the loadings will provide you with a visual of what process variableS (plural) are highly correlated with your response. The use of PLS on manufacturing process variables is a large field of study, with many journal articles written about it, and many many successful applications.

The other problem here is that the presence of one or more outliers can greatly distort the correlations. How do you find outliers in a large data set like this? That's an entirely different issue, but one you should have in the back of your mind.

UPDATE: Okay, I gave the answer for SAS, I see you are using JMP, but the same ideas apply.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

what i do is use the pairwise correlations option and then sort the tablet. you can sort by pvlue or by the correlation itself to see which is most useful for you.

to get the pairwise correlation you can click on the red triangle and select it from the menu. if you want it on a permanent basis go to file>>preferences>>platforms>>multivariate and tick the box for "pairwise correlations.

to sort the table you just need to right click on the pairwise correlation table itself and choose "sort by column".

unfortunately, i do not know a way of getting the table sorted by default or scripting the sorting. this is also a change that when done is not recorded to the script if you "save script to data table"

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Seshardi writes: "I want to know which process variable affect my result most". This is not a correlation problem - this is Best Variables problem: searching through a large number of variables to determine a short list of the most influential factors to use in a model. There are several methods for addressing this question. I find a Bootstrap Forest to be very helpful for this purpose. George Hurley presented a paper on Bootstrap Forest in JMP at the 2012 NESUG conference - here is the Lex Jansen link: http://www.lexjansen.com/mwsug/2012/JM/MWSUG-2012-JM04.pdf. I did a presentation on Bootstrap Forest in Base SAS at NESUG but I understand you are working in JMP. My NESUG presentation was just a sub I wrote overnight to fill in for an author who unfortunately had to cancel; I don't think it's in the proceedings but I will be posting a slide show on it in the next few days.

If there is specific outcome you want to predict that is amenable to regression analysis, you will want to consider the XCSTAT macro by Raimi and Lund: http://www.mwsug.org/proceedings/2011/stats/MWSUG-2011-SA03.pdf. It's only available in SAS at this time, not JMP.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

I have a macro, %BESTCORR, that computes the set of variables that are highly correlated with a dependent (response) variable but minimally correlated with each other. The macro is parameterized, so you can supply the definition (minimum correlation for multicollinearity), lower threshold on correlations for variables considered, and number of variables to report.

If you want the code, it's free. Just send me a message and I will e-mail the macro to you (and anyone else who might find it to be useful).

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

I have a similar recurring challenge--except mine is not always posed with respect to a "result" column. The question I am frequently asked is, "What combination of columns [variables, attributes, whatever you want to call them] are most prevalent in the data?" I've considered the following techniques: decision trees, cluster analysis, and regression-type methods. All require a "target" or dependent variable of some sort. I'm interested in just how the data are distributed amongst a host of variables--for what combination of these variables is the density (number of cases) greatest?

I'm probably not helping you, seshadri, but need guidance also on this similar issue. I'm going to read the posts already suggested--but if somebody has some additional ideas, please post them!

Thanks.