
Exploring correlations in a large dataset

tnad

Community Trekker

Joined:

Jun 10, 2016

I'm using JMP Pro and working with a large data set of plant process variables. I want to explore how they might affect production or other performance parameters in the plant. Are there any guides or resources on how to get started? I've been going through JMP videos on YouTube, but they don't go very deep into analysis. So far, I've only figured out that I should "screen predictors" to reduce the number of factors I'm looking at. I'm not sure whether the next step is fitting regression models or exploring the data further with PCA, pairwise correlations, etc., or whether I need to be careful about data that is not normally distributed or has a lot of zeroes before choosing an analysis.

Also, I saw some nice plots using canonical correspondence analysis (CCA) to explore correlations with data like mine in R. Is there anything similar/better in JMP?

I'm new to this kind of analysis, so any ideas you have on how to get started will help. Thanks.

8 REPLIES
txnelson

Super User

Joined:

Jun 22, 2012

I typically take the following approach to screen large data:

1. I run the Distribution platform on all of the data to see how much of it is highly skewed. If I have a large number of rows, I tend to delete outliers in an attempt to normalize the data. At other times, or in addition, I transform the troublesome columns. I use scripts to make this less laborious.

2. The next step is to screen out redundant columns.  I typically use 3 different platforms for this.

a. PCA is great if the data is roughly normal, or has been transformed to be.

b. Response Screening gives a great output data table that really lets me get a feel for the data.

c. Bootstrap Forest partitioning shows which columns are the strongest potential contributors for the detailed analysis I will be doing.
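For readers who want to see the idea behind steps 1 and 2 outside JMP, here is a minimal Python sketch. It is not what the JMP platforms do internally. It simply flags columns with large moment-based skewness and lists column pairs whose Pearson correlation is close to +/-1 (a crude stand-in for "redundant"). The column names, the example data, and both thresholds are hypothetical.

```python
import math

def skewness(values):
    """Moment-based (population) skewness; returns 0.0 for a constant column."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def flag_skewed(table, threshold=1.0):
    """Names of columns whose absolute skewness exceeds the (assumed) threshold."""
    return sorted(name for name, col in table.items()
                  if abs(skewness(col)) > threshold)

def redundant_pairs(table, threshold=0.95):
    """Column pairs with |r| at or above the (assumed) threshold."""
    names = sorted(table)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(table[a], table[b])) >= threshold:
                pairs.append((a, b))
    return pairs

# Hypothetical process data: temp_f is just temp_c in other units,
# and flow has one extreme reading that makes it strongly right-skewed.
process = {
    "temp_c": [70.0, 71.0, 72.0, 73.0, 74.0, 75.0],
    "temp_f": [158.0, 159.8, 161.6, 163.4, 165.2, 167.0],
    "flow":   [1.0, 1.0, 1.0, 1.0, 1.0, 50.0],
    "ph":     [7.1, 6.9, 7.0, 7.2, 6.8, 7.0],
}

skewed = flag_skewed(process)
redundant = redundant_pairs(process)
```

Here `skewed` picks out `flow` and `redundant` picks out the `temp_c`/`temp_f` pair. Pearson correlation only catches linear redundancy, which is one reason a tree-based screen is a useful complement.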

Jim
tnad

Community Trekker

Joined:

Jun 10, 2016

Thank you all for the valuable information. I have some follow-up questions regarding transforming the data:

1. From what I understand, you transform some columns but not others? Shouldn't all data be transformed the same way for downstream analysis?

2. I have some columns with many zeroes. Log transformations seem to normalize some of them but convert zeroes into missing values. I'm not sure if there's an easy way to handle this, or whether there are guides on how to do so in JMP.

3. If normalizing the data is not possible or not essential, is there a preferred method or ones to avoid (PCA?) to further analyze such data?

I'm guessing the step following the transformations and screening would be to use the strongest contributors in the Fit Model platform and compare models?

txnelson

Super User

Joined:

Jun 22, 2012

1.  The transformation of the data needs to be handled on a column-by-column basis.  Analytics are performed on transformed data so that the parametric statistics can be calculated correctly.  One does not report the data using the transformed values.

2.  I find the best way to determine the correct transformation is to use the Distribution platform in JMP, and request under the red triangle,

     Continuous Fit==>All

It will show you which distributions best fit your data, and from there you will be able to save your data in a transformed state.  If the best-fitting distribution does not have a transformation available, then you might want to see if the GLog (Generalized Log) transform helps you out.

3.   If you have a targeted response variable, I have used Bootstrap Forest and Boosted Tree to find columns to use to pare down the data. They do not require normally distributed data.
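On the zeroes question: a plain log turns zeroes into missing values because log(0) is undefined. A generalized log stays finite at zero. The sketch below uses one common form, glog(y) = log((y + sqrt(y^2 + lambda^2)) / 2). Whether this matches JMP's GLog parameterization exactly is an assumption to check against the JMP documentation, and the lambda value and example readings are hypothetical.

```python
import math

def glog(y, lam=1.0):
    """Generalized log: log((y + sqrt(y^2 + lam^2)) / 2).

    For lam > 0 the argument of the log is positive for every real y,
    so zeroes (and negatives) stay defined. With lam = 0 and y > 0 it
    reduces to the ordinary natural log.
    """
    return math.log((y + math.sqrt(y * y + lam * lam)) / 2.0)

# Hypothetical column with many zeroes and a long right tail.
readings = [0.0, 0.0, 1.0, 10.0, 100.0]

# math.log(0.0) would raise ValueError here; glog does not.
transformed = [glog(v, lam=1.0) for v in readings]
```

The transform is monotonic, so the ordering of the rows is preserved, and for values much larger than lambda it behaves almost exactly like log(y).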

Jim
KarenC

Super User

Joined:

Feb 10, 2013

Jim has provided a number of good steps for you to consider. You might also use the Control Chart Builder (with such data I would use the Column Switcher so you can step through your variables). The reason for looking at control charts is that process data has a time element to it. Process data is great fun but is not always easy (the data itself is complex given time lags, missing data, etc.). I find that, for such data, value is created when you work as a team that includes at minimum an analytical expert (i.e., someone with a statistical background) and a process expert (someone who really knows the process).
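To make the control chart idea concrete, here is a minimal Python sketch of the standard individuals (I-MR style) limits: center line at the mean, limits at mean +/- 2.66 times the average moving range. It is only an illustration of the arithmetic and not a description of how Control Chart Builder is implemented. The readings are hypothetical.

```python
def individuals_limits(values):
    """Return (LCL, center, UCL) for an individuals chart.

    Uses the usual constant 2.66 (3 / d2 with d2 = 1.128 for
    moving ranges of span 2) times the average moving range.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    center = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    half_width = 2.66 * mr_bar
    return center - half_width, center, center + half_width

def out_of_control(values):
    """Indices (time order) of points outside the control limits."""
    lcl, _, ucl = individuals_limits(values)
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]

# Hypothetical time-ordered readings with one excursion at index 5.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 14.0, 10.1, 9.8, 10.0, 10.2]
flagged = out_of_control(readings)
```

Because the rows are in time order, a flagged index points to a moment in the process that the process expert can investigate, which a histogram alone would not show.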

rabelardo

Community Trekker

Joined:

Mar 30, 2016

Hi karen@boulderstats.

Domain (Process expert) + Method (Analytical expert) is a great partnership.

I enjoyed and learned a lot from your Analytically Speaking episode last week!

I'm also learning so much from the JMP community experts who are great in providing guidance.

Much appreciated.

- Randy, JMP newbie

Peter_Bartell

Joined:

Jun 5, 2014

I'm presuming you've inherited this data from some sort of data warehouse/historian system and you've got lots of columns and/or lots of rows. How much time have you spent looking at data quality from these perspectives:

1. Outliers? A good place to look at this issue is within the Cols -> Modeling Utilities -> Explore Outliers path.

2. Missing Values? A good place to look at this issue is within the Cols -> Modeling Utilities -> Explore Missing Values path.

3. If you've got nonsense values, things like '9999' codes...think about using JMP's Cols -> Utilities -> Recode path to fix/repair these.

4. As a last data quality act, and if you think you'll be proceeding to building predictive models, make sure to use JMP Pro's Cols -> Modeling Utilities -> Make Validation Column platform to create a Validation column containing, if appropriate, a Training, Validation, and Test construct.
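Items 3 and 4 above can be illustrated outside JMP with a small Python sketch: replace sentinel codes such as 9999 with a missing marker, then build a randomly assigned Training/Validation/Test column. This is not how Recode or Make Validation Column work internally. The sentinel list, the 60/20/20 split, and the seed are all hypothetical choices.

```python
import random

def recode_sentinels(values, sentinels=(9999, -9999)):
    """Replace nonsense codes with None so they are treated as missing."""
    return [None if v in sentinels else v for v in values]

def make_validation_column(n_rows, fractions=(0.6, 0.2, 0.2), seed=1):
    """Return a shuffled list of 'Training'/'Validation'/'Test' labels.

    The first two group sizes are rounded from the fractions; the Test
    group takes whatever rows remain so the total is always n_rows.
    """
    n_train = round(n_rows * fractions[0])
    n_valid = round(n_rows * fractions[1])
    n_test = n_rows - n_train - n_valid
    column = (["Training"] * n_train
              + ["Validation"] * n_valid
              + ["Test"] * n_test)
    random.Random(seed).shuffle(column)  # fixed seed for reproducibility
    return column

# Hypothetical historian export where 9999 marks a failed sensor read.
raw = [1.5, 9999, 2.0, 2.2, -9999, 1.8, 2.1, 1.9, 2.3, 2.0]
clean = recode_sentinels(raw)
validation = make_validation_column(len(clean))
```

A purely random split ignores time order. For process data with drift or lags, holding out a contiguous later period may be a more honest test, so treat the random split as a starting point.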

Once data quality/cleanup has been completed, there are several JMP Pro platforms that you may find helpful. Each has its place in the sun: Principal Components Analysis with Clustering, Fit Model -> Partial Least Squares, and Fit Model -> Generalized Regression (then pick your sub-personality based on the specifics of each situation). If you are building and evaluating models, make sure you invoke Missing Value imputation (if needed and appropriate) and leverage the Validation column you've created. Other JMP modeling platforms could also be valuable; I'm just focusing on the JMP Pro ideas in this post.

Steven_Moore

Super User

Joined:

Jun 4, 2014

There is an awesome JMP add-in available called Scagnostics.  I use it often to help see the structure of correlations within large data sets.

Steve
KarenC

Super User

Joined:

Feb 10, 2013

https://community.jmp.com/docs/DOC-9923


The above add-in is another option for looking at x-by-y correlation. I used it recently on a process data project.