The other day, I was asking about binary data and correlations. I had been using the multivariate correlation tool which outputs the correlation coefficients into a matrix. However, from further research, I don't think using this correlation tool is meaningful for categorical data because it uses the Pearson Product Moment Correlation. In my data, each value is a 0 or 1 (pass or fail) which is binary data. I have read that I should use the Phi Coefficient to calculate the correlation for binary numbers.
I have two questions:
1) Can I interpret the coefficient the same way as the PPMC? I know that the closer it is to +1, the stronger the positive correlation. If I square this coefficient, can I still use it as the coefficient of determination?
2) Can I calculate this using JMP? Is there a way to change the correlation matrix so it calculates this phi coefficient instead of the PPMC?
Here is a script that collects RSquare (U) and Kappa from all combinations of your binary columns:
Names Default To Here( 1 );

dt# = Current Data Table();
If( Is Empty( dt# ), Throw( "Data table missing" ) );

// user choices.
dlg# = Column Dialog(
    Title( "Binary Agreement" ),
    yCol# = Col List( "Binary Columns", Min Col( 2 ) ),
    "Select columns for agreement"
);

// check if user decides to quit.
If( dlg#["Button"] == -1,
    Throw( "User cancelled" );
);

// process information returned from dialog.
Remove From( dlg# );
Eval List( dlg# );

n cols# = N Items( yCol# );
r sqr u# = agree# = Identity( n cols# );
measure# = List();

For( col1# = 1, col1# < n cols#, col1#++,
    Insert Into( measure#, yCol#[col1#] << Get Name);
    For( col2# = 2, col2# <= n cols#, col2#++,
        ct# = dt# << Contingency(
            Y( yCol#[col1#] ),
            X( yCol#[col2#] ),
            Contingency Table( 0 ),
            Mosaic Plot( 0 ),
            Tests( 1 ),
            Agreement Statistic( 1 ),
            Invisible
        );
        ctr# = ct# << Report;
        r sqr u#[col1#,col2#] = r sqr u#[col2#,col1#] = ctr#["Tests"][TableBox(1)][NumberColBox(4)];
        agree#[col1#,col2#] = agree#[col2#,col1#] = ctr#["Kappa Coefficient"][TableBox(1)][NumberColBox(1)];
        ct# << Close Window;
    );
);
Insert Into( measure#, yCol#[col1#] << Get Name );

New Window( "Binary Agreement",
    Outline Box( "RSquare (U)",
        tb1 = Table Box( String Col Box( "Measure", measure# ) )
    ),
    Outline Box( "Kappa Coefficient",
        tb2 = Table Box( String Col Box( "Measure", measure# ) )
    )
);

For( col# = 1, col# <= n cols#, col#++,
    tb1 << Append( Number Col Box( measure#[col#], r sqr u#[0,col#], << Set Format( 7, 4 ) ) );
    tb2 << Append( Number Col Box( measure#[col#], agree#[0,col#], << Set Format( 7, 4 ) ) );
);
Categorical data is different from continuous data. There generally are analogous measures and methods, though. For example, instead of talking about correlation as we would with continuous variables, we talk about association or agreement with categorical variables. There are many such measures and JMP provides most of the important and popular measures.
The appropriate method and measure depend on the nature of the categories:
It sounds like you might be interested in the agreement between two variables that capture the same nominal, binary levels. You can use the Analyze > Fit Y by X command to launch the Contingency platform. The platform menu includes a command to obtain the agreement and another command to obtain the association measures.
Remember that you can select Tools > Help and then click on a report to go directly to the specific Help for that information.
Generally these measures are scaled like correlation so the interpretation is the same.
I do not know why you selected the Phi measure. We can talk about that, too.
Thank you for this! I tried the Contingency Analysis on two variables.
My RSquare (U) is 0.9208, so if I understand this correctly, these two variables predict each other's response since RSquare is fairly close to +1.
My p-value is <.0001. Since these two variables are binary, either pass or fail (1 or 0):
Ho: The result of one variable has no effect on the second.
Ha: The result of one variable has an effect on the second.
Since my p-value is less than 0.05, I reject the null hypothesis in favour of the alternative hypothesis, correct?
I have 60 different binary variables. Is it possible to get the RSquare value into a matrix or data table? This would be similar to the multivariate correlation matrix, but a data table works well, too. I don't need to have all these different reports for what I am working on.
I was just researching the Phi Coefficient. It says that it quantifies the association between two binary variables. Is it not possible to make this calculation in JMP without writing a script?
The R Square (U) is based on likelihood, but it is used in exactly the same way as R Square with continuous variables.
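For reference, here is a sketch of how this statistic is usually defined. It follows the Tests report in the Contingency platform as I understand it, not anything stated earlier in this thread, and the subscripts are my own shorthand for the rows of that report.

```latex
% -log L_{C. Total}: negative log-likelihood using only the marginal proportions of Y
% -log L_{Error}:    negative log-likelihood using the proportions of Y within each level of X
% -log L_{Model}:    the difference between the two
\[
R^2(U) \;=\; \frac{-\log L_{\text{Model}}}{-\log L_{\text{C. Total}}}
       \;=\; 1 - \frac{-\log L_{\text{Error}}}{-\log L_{\text{C. Total}}}
\]
```

Read this way, a value of 0 means that knowing X does not reduce the uncertainty in Y at all, and a value of 1 means that X determines Y completely. Because the definition is in terms of Y given X, swapping the roles of the two variables can change the value somewhat.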
Yes, your interpretation of the p-value is correct.
A script could be written to launch the Contingency platform with each pair, collect the R Square (U) statistic, and report them in a custom window.
As to the Phi measure, you might instead research the ones that JMP offers to see if they might be useful. Again, since all of your variables measure 0 or 1, agreement might be a better measure than association.
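For anyone comparing measures, here is a sketch of the standard textbook definition of phi for a 2x2 table. The cell labels are my own notation, and the worked counts are made up purely for illustration.

```latex
% n_{11}, n_{10}, n_{01}, n_{00}: cell counts of the 2x2 table (first index = row variable, second = column variable)
% n_{1\cdot}, n_{0\cdot}: row totals;  n_{\cdot 1}, n_{\cdot 0}: column totals;  n: grand total
\[
\phi \;=\; \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}
                {\sqrt{n_{1\cdot}\,n_{0\cdot}\,n_{\cdot 1}\,n_{\cdot 0}}},
\qquad
\phi^2 \;=\; \frac{\chi^2}{n}
\]
% Illustrative (hypothetical) counts: n_{11}=40, n_{10}=10, n_{01}=5, n_{00}=45
\[
\phi \;=\; \frac{40\cdot 45 - 10\cdot 5}{\sqrt{50\cdot 50\cdot 45\cdot 55}}
     \;=\; \frac{1750}{\sqrt{6\,187\,500}} \;\approx\; 0.704
\]
```

When both variables are coded as numeric 0 and 1, this value is the same number that the Pearson product-moment formula gives for those two columns, so phi squared can be read like the squared correlation of the 0/1 codes. Note that the denominator is zero if either variable never changes, in which case phi is undefined.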
Thank you very much for your reply! I am getting pretty familiar with JSL, but I am not sure how I could collect the RSquare value.
Hi again Mark,
Thanks for this script; it's really great code! It is running well, and I spent some time learning how you wrote it (I think I learned some more about scripting, too)!
In your code, I understand how you got the kappa value or R Square (U) from the report. If I manually do a Fit Y by X, I just don't see where to find the kappa coefficient. I would like to learn more about this coefficient and how it is calculated. Some of my data might have an entire column full of 1s, and when it is included in the contingency report, the report isn't made. From the log and from my debugging, I think this is because the kappa coefficient isn't being calculated or can't be. Why is this?
Ha, great question!
OK, the answer is that there must be variation to learn. If you always get the same answer no matter what the conditions are, then you can't determine what causes it to vary. The statistics can't provide answers that aren't there.
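To make that concrete, here is a sketch of the usual definition of Cohen's kappa for two binary variables. The notation is mine, and the remark about how JMP behaves is my understanding of the platform rather than something established in this thread.

```latex
% p_o: observed proportion of agreement;  p_e: agreement expected by chance from the marginal totals
\[
\kappa \;=\; \frac{p_o - p_e}{1 - p_e},
\qquad
p_o = \frac{n_{11} + n_{00}}{n},
\qquad
p_e = \frac{n_{1\cdot}\,n_{\cdot 1} + n_{0\cdot}\,n_{\cdot 0}}{n^2}
\]
% Both columns all 1s:  p_o = 1 and p_e = 1, so \kappa = 0/0 (undefined).
% Only one column all 1s:  p_o = p_e = n_{1\cdot}/n, so the numerator is 0
% and the table is 2x1 rather than 2x2.
```

If both columns are constant, the formula reduces to 0/0 and there is nothing to compute. If only one column is constant, the table is no longer square; as far as I know, JMP offers the agreement statistic only when both variables have the same set of levels, which would explain why the "Kappa Coefficient" outline is missing when the script looks for it.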