
How to test differences in F1 scores?


Community Trekker


Mar 15, 2016

Hi all,

I've evaluated the performance of several classification models (e.g., BayesNet, Random Forest) on several datasets by measuring the F1 score (see "F1 score" on Wikipedia) achieved in ten-fold cross-validation. My data therefore has the following columns: classification model, dataset, F1 score.

Now I want to test if:

1) There is a statistically significant difference among the predictors.

2) There is a statistically significant difference between the best predictor and each of the others.

My approach would be to do:

- Fit Y by X, with Y = F1 and X = classification model

- Nonparametric Wilcoxon test: this will answer point 1.

- Nonparametric multiple comparisons (Wilcoxon test): this will answer point 2.
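
For readers who want to see the first step outside JMP, here is a rough Python sketch using only the standard library. It assumes that a Wilcoxon test across more than two groups corresponds to the Kruskal-Wallis rank test, and it swaps the usual chi-square approximation for a Monte Carlo permutation p-value. The F1 values, the model name "NaiveBayes", and all function names are made up for illustration; they are not my real data or any library's API.

```python
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction applied)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum * rank_sum / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def permutation_p_value(groups, n_perm=5000, seed=0):
    """Return (H, p), with p estimated by shuffling the group labels."""
    rng = random.Random(seed)
    observed = kruskal_wallis_h(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if kruskal_wallis_h(shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical F1 scores: one value per dataset for each model.
f1_scores = {
    "BayesNet": [0.70, 0.72, 0.68, 0.71, 0.69],
    "RandomForest": [0.91, 0.88, 0.93, 0.90, 0.92],
    "NaiveBayes": [0.80, 0.82, 0.79, 0.81, 0.83],
}

h_stat, p_value = permutation_p_value(list(f1_scores.values()))
print(f"H = {h_stat:.2f}, permutation p = {p_value:.4f}")
```

Note that this sketch treats every F1 score as an independent observation, so it ignores the fact that all models were run on the same datasets.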

However, I see tests other than Wilcoxon, and I wonder whether what I am doing is correct.

Thanks for your help,





Jun 23, 2011

Hello Davide,

I think your approach is reasonable.

Another method you might look at is Oneway > Compare Means > With Best, Hsu MCB. This is a multiple comparison procedure that tests whether each level of the X variable is significantly different from the "best" level. The output shows p-values for comparing all levels with the max as well as with the min.

You can find some information on the option in the Fit Y by X platform in JMP here: Compare Means


Michael Crotty
Sr Statistical Writer
JMP Development