Binary Data and Correlations

Dec 2, 2016 11:35 AM
(4244 views)

The other day, I was asking about binary data and correlations. I had been using the multivariate correlation tool which outputs the correlation coefficients into a matrix. However, from further research, I don't think using this correlation tool is meaningful for categorical data because it uses the Pearson Product Moment Correlation. In my data, each value is a 0 or 1 (pass or fail) which is binary data. I have read that I should use the Phi Coefficient to calculate the correlation for binary numbers.

I have two questions:

1) Can I interpret the coefficient the same way as the PPMC? I know that the closer it is to +1, the stronger the positive correlation. If I square this coefficient, can I still use it as a coefficient of determination?

2) Can I calculate this using JMP? Is there a way to change the correlation matrix so it calculates the phi coefficient instead of the PPMC?

Thanks, Natalie

Accepted Solutions

Dec 3, 2016 7:35 AM
(5281 views)

Solution

Here is a script that collects **RSquare (U)** and **Kappa** from all combinations of your binary columns:

```
Names Default To Here( 1 );

// Work on the active data table; stop if none is open.
dt = Current Data Table();
If( Is Empty( dt ),
	Throw( "Data table missing" )
);

// Ask the user which binary columns to compare (at least two).
dlg = Column Dialog(
	Title( "Binary Agreement" ),
	yCol = Col List( "Binary Columns",
		Min Col( 2 )
	),
	"Select columns for agreement"
);

// Quit if the user cancelled the dialog.
If( dlg["Button"] == -1,
	Throw( "User cancelled" )
);

// Drop the Button item, then evaluate the rest of the list to define yCol.
Remove From( dlg );
Eval List( dlg );
n cols = N Items( yCol );

// Both result matrices start as identity: the diagonal (a column with itself) stays 1.
r sqr u = agree = Identity( n cols );
measure = List();

// Visit each pair of columns once; col2 starts after col1 because the results are symmetric.
For( col1 = 1, col1 < n cols, col1++,
	Insert Into( measure, yCol[col1] << Get Name );
	For( col2 = col1 + 1, col2 <= n cols, col2++,
		ct = dt << Contingency(
			Y( yCol[col1] ),
			X( yCol[col2] ),
			Contingency Table( 0 ),
			Mosaic Plot( 0 ),
			Tests( 1 ),
			Agreement Statistic( 1 ),
			Invisible
		);
		ctr = ct << Report;
		// RSquare (U) is read from the 4th numeric column of the Tests table.
		r sqr u[col1, col2] = r sqr u[col2, col1] = ctr["Tests"][TableBox( 1 )][NumberColBox( 4 )][1];
		// Kappa is read from the 1st numeric column of the Kappa Coefficient table.
		agree[col1, col2] = agree[col2, col1] = ctr["Kappa Coefficient"][TableBox( 1 )][NumberColBox( 1 )][1];
		ct << Close Window;
	);
);
// The outer loop stops before the last column, so add its name here.
Insert Into( measure, yCol[col1] << Get Name );

// Report both matrices in one window, one outline per statistic.
New Window( "Binary Agreement",
	Outline Box( "RSquare (U)",
		tb1 = Table Box(
			String Col Box( "Measure", measure )
		)
	),
	Outline Box( "Kappa Coefficient",
		tb2 = Table Box(
			String Col Box( "Measure", measure )
		)
	)
);

// Append one numeric column per variable to each table.
For( col = 1, col <= n cols, col++,
	tb1 << Append(
		Number Col Box( measure[col], r sqr u[0, col], << Set Format( 7, 4 ) )
	);
	tb2 << Append(
		Number Col Box( measure[col], agree[0, col], << Set Format( 7, 4 ) )
	);
);
```

Learn it once, use it forever!

12 REPLIES

Dec 2, 2016 11:50 AM
(4238 views)

Categorical data is different from continuous data. There generally are analogous measures and methods, though. For example, instead of talking about correlation as we would with continuous variables, we talk about *association* or *agreement* with categorical variables. There are many such measures and JMP provides most of the important and popular measures.

The appropriate method and measure depend on the nature of the categories:

- number of levels (2, 3, or more)
- modeling type (nominal or ordinal)
- whether the same levels are used for both variables

It sounds like you might be interested in the **agreement** between two variables that capture the same **nominal**, **binary** levels. You can use the **Analyze** > **Fit Y by X** command to launch the **Contingency** platform. The platform menu includes a command to obtain the agreement and another command to obtain the association measures.

Remember that you can select **Tools** > **Help** and then click on a report to go directly to the specific Help for that information.

Generally these measures are scaled like correlation so the interpretation is the same.

I do not know why you selected the Phi measure. We can talk about that, too.

Dec 2, 2016 1:06 PM
(4223 views)

Hi Mark,

Thank you for this! I tried the Contingency Analysis on two variables.

My RSquare (U) is 0.9208, so if I understand this correctly, these two variables predict each other's response, since RSquare (U) is fairly close to +1.

My p-value is <.0001. Since these two variables are binary, either pass or fail (1 or 0):

Ho: The result of one variable has no effect on the second.

Ha: The result of one variable has an effect on the second.

Since my p-value is less than 0.05, I reject the null hypothesis in favour of the alternate hypothesis, correct?

I have 60 different binary variables. Is it possible to get the RSquare value into a matrix or data table? This would be similar to the multivariate correlation matrix, but a data table works well, too. I don't need to have all these different reports for what I am working on.

I was just researching the Phi Coefficient. It says that it quantifies the association between two binary variables. Is it not possible to make this calculation in JMP without writing a script?
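For reference, a common textbook definition of the phi coefficient for a 2×2 table is sketched below. The cell counts $n_{ab}$ (the number of rows where the first variable equals $a$ and the second equals $b$), the dot notation for marginal totals, and $n$ for the total count are notation assumed here, not anything taken from a JMP report. $\chi^{2}$ is the Pearson chi-square statistic for the same table.

```latex
\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}
            {\sqrt{n_{1\cdot}\,n_{0\cdot}\,n_{\cdot 1}\,n_{\cdot 0}}},
\qquad
\phi^{2} = \frac{\chi^{2}}{n}
```

When both variables are coded as numeric 0/1, this expression is algebraically the same as the Pearson product-moment correlation of the two columns, so a Pearson correlation matrix computed on numeric 0/1 columns should already contain the phi values.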

Dec 2, 2016 1:28 PM
(4217 views)

The R Square (U) is based on likelihood but it is used exactly the same as R Square with continuous variables.
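As a rough illustration of a likelihood-based R-square, the Python sketch below computes the uncertainty coefficient: 1 minus the ratio of the negative log-likelihood of Y given X to the negative log-likelihood of Y alone. This follows the way RSquare (U) is usually described and is not JMP's own code. The function name `rsquare_u` and the table layout (rows are X levels, columns are Y levels) are assumptions made for the example.

```python
import math


def rsquare_u(table):
    """Uncertainty coefficient for a contingency table of counts.

    table[i][j] is the number of rows with X at level i and Y at level j.
    """
    n = sum(sum(row) for row in table)
    y_totals = [sum(col) for col in zip(*table)]

    # Negative log-likelihood of Y using only its overall proportions.
    total = -sum(c * math.log(c / n) for c in y_totals if c > 0)
    if total == 0:
        raise ValueError("Y has a single level, so the ratio is undefined")

    # Negative log-likelihood of Y using the proportions within each X level.
    error = 0.0
    for row in table:
        row_n = sum(row)
        error -= sum(c * math.log(c / row_n) for c in row if c > 0)

    return 1 - error / total


perfect = rsquare_u([[20, 0], [0, 20]])       # X determines Y completely
unrelated = rsquare_u([[10, 10], [10, 10]])   # X tells nothing about Y
print(perfect, unrelated)
```

A value near 1 means that knowing X removes almost all of the uncertainty about Y, and a value near 0 means that it removes almost none.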

Yes, your interpretation of the p-value is correct.

A script could be written to launch the Contingency platform with each pair, collect the R Square (U) statistic, and report them in a custom window.
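The pairwise loop described here can be sketched outside JMP as well. The following Python is only an illustration of the idea: it uses the phi coefficient as a stand-in statistic instead of launching the Contingency platform, and `phi`, `pairwise_matrix`, and the sample columns are hypothetical names and data.

```python
from itertools import combinations
from math import sqrt


def phi(x, y):
    """Phi coefficient for two equal-length sequences of 0/1 values."""
    pairs = list(zip(x, y))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # a constant column has no variation to measure
    return (n11 * n00 - n10 * n01) / denom


def pairwise_matrix(columns, stat):
    """Fill a symmetric matrix by visiting each pair of columns once."""
    names = list(columns)
    size = len(names)
    # Start from an identity matrix: each column matches itself perfectly.
    matrix = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for i, j in combinations(range(size), 2):
        value = stat(columns[names[i]], columns[names[j]])
        matrix[i][j] = matrix[j][i] = value
    return names, matrix


columns = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 0, 0],
    "C": [1, 0, 1, 0],
}
names, matrix = pairwise_matrix(columns, phi)
for name, row in zip(names, matrix):
    print(name, row)
```

With 60 columns this visits 60 × 59 / 2 = 1770 pairs, which is why collecting the values into one matrix is more practical than reading individual reports.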

As to the Phi measure, you might instead research the ones that JMP offers to see if they might be useful. Again, since all of your variables measure 0 or 1, agreement might be a better measure than association.

Dec 2, 2016 2:05 PM
(4211 views)

Dec 7, 2016 10:43 AM
(4115 views)

Did my script work for you? Did you have any difficulty?

Dec 7, 2016 9:46 PM
(4104 views)

Thank you for the script. I am actually away from work this week, but I will certainly try it on Monday.

Thanks again!

Dec 12, 2016 1:53 PM
(4080 views)

Hi again Mark,

Thanks for this script; it's really great code! It is running well, and I spent some time learning how you wrote it (I think I learned some more about scripting, too)!

In your code, I understand how you got the kappa value or RSquare (U) from the report. If I manually do a Fit Y by X, I just don't see where to find the kappa coefficient. I would like to learn more about this coefficient and how it is calculated. Some of my data might have an entire column full of 1s, and when it is included in the contingency report, the report isn't made. From the log and from my debugging, I think this is because the kappa coefficient isn't being calculated or can't be. Why is this?
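For background on the calculation, Cohen's kappa is commonly defined as (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of rows on which the two variables agree and p_e is the proportion of agreement expected by chance from the marginal proportions. Below is a minimal Python sketch for 0/1 columns. `cohen_kappa` is a hypothetical helper written for this example, and JMP's report may differ in details such as how missing values are handled.

```python
def cohen_kappa(x, y):
    """Cohen's kappa for two equal-length sequences of 0/1 values."""
    n = len(x)
    # Observed agreement: share of rows where both variables match.
    p_o = sum(1 for a, b in zip(x, y) if a == b) / n
    # Chance agreement: both are 1 by chance plus both are 0 by chance.
    px1 = sum(x) / n
    py1 = sum(y) / n
    p_e = px1 * py1 + (1 - px1) * (1 - py1)
    if p_e == 1:
        raise ValueError("both columns are constant, so kappa is undefined")
    return (p_o - p_e) / (1 - p_e)


same = cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0])      # full agreement
chance = cohen_kappa([1, 1, 0, 0], [1, 0, 1, 0])    # agreement at chance level
opposite = cohen_kappa([1, 1, 0, 0], [0, 0, 1, 1])  # systematic disagreement
print(same, chance, opposite)
```

Kappa is 1 for perfect agreement, about 0 when agreement is no better than chance, and negative when the two variables disagree more often than chance would suggest.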

Dec 13, 2016 5:55 PM
(4055 views)

Ha, great question!

OK, the answer is that there must be variation to learn. If you always get the same answer no matter what the conditions are, then you can't determine what causes it to vary. The statistics can't provide answers that aren't there.
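The same point can be seen in a formula. Taking the phi coefficient as an example (an illustration chosen here, not one of the statistics the script collects), let $n_{ab}$ be the number of rows where the first variable equals $a$ and the second equals $b$, with dots marking marginal totals. If the first variable is always 1, then $n_{00} = n_{01} = 0$ and therefore $n_{0\cdot} = 0$:

```latex
\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}
            {\sqrt{n_{1\cdot}\,n_{0\cdot}\,n_{\cdot 1}\,n_{\cdot 0}}}
     = \frac{n_{11}\cdot 0 - n_{10}\cdot 0}
            {\sqrt{n_{1\cdot}\cdot 0\cdot n_{\cdot 1}\,n_{\cdot 0}}}
     = \frac{0}{0}
```

Both the numerator and the denominator vanish, so the measure is undefined rather than zero. A column with a single level also leaves the contingency table with one row or one column, which is likely why the report items the script looks for are missing.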

Dec 14, 2016 7:15 AM
(4034 views)